ContentPartitionIndexer: Do not index the same content multiple times at once.
Closed, Public

Authored by vlorentz on Feb 1 2021, 2:42 PM.

Details

Summary

self._index_contents was called repeatedly in a loop, each time with the same
arguments except for the set of hashes to exclude.

As a result, if there were N pages of hashes to exclude, each content was
indexed N times, and the first N-1 iterations did not even exclude all the
hashes they were supposed to exclude.
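
To make the problem concrete, here is a minimal sketch of the two behaviors. It is not the indexer's actual code: CONTENTS, EXCLUDE_PAGES, index_contents, index_per_page and index_once are hypothetical stand-ins, and the sketch assumes the exclusion set grew by one page per loop iteration, which is one reading of the description above.

```python
from collections import Counter
from typing import Iterable, List, Set

# Hypothetical in-memory stand-ins for data the real indexer reads from storage.
CONTENTS = ["a", "b", "c", "d"]
EXCLUDE_PAGES = [["a"], ["b"]]  # N = 2 pages of hashes to exclude


def index_contents(exclude: Set[str]) -> List[str]:
    """Stand-in for self._index_contents: index every content not excluded."""
    return [h for h in CONTENTS if h not in exclude]


def index_per_page(pages: Iterable[Iterable[str]]) -> List[str]:
    """Assumed old behavior: one full indexing pass per page of exclusions."""
    exclude: Set[str] = set()
    indexed: List[str] = []
    for page in pages:
        exclude.update(page)
        # Early passes run with an incomplete exclusion set.
        indexed.extend(index_contents(set(exclude)))
    return indexed


def index_once(pages: Iterable[Iterable[str]]) -> List[str]:
    """Collect every page of exclusions first, then index a single time."""
    exclude: Set[str] = set()
    for page in pages:
        exclude.update(page)
    return index_contents(exclude)


before = Counter(index_per_page(EXCLUDE_PAGES))
after = Counter(index_once(EXCLUDE_PAGES))
print("per page:", dict(before))
print("once:    ", dict(after))
```

In this toy run, the per-page version indexes "c" and "d" twice and also indexes "b", which should have been skipped; the single-pass version indexes "c" and "d" exactly once.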

Resolves SWH-INDEXER-93 and SWH-INDEXER-7R

(w/ @ardumont)

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4982 (id=17773)

Rebasing onto 3baf8bb919...

Current branch diff-target is up to date.
Changes applied before test
commit 4080b9ee931fe914a91addf1df2d160e56a2d8bb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 1 14:41:23 2021 +0100

    ContentPartitionIndexer: Do not index the same content multiple times at once.
    
    self._index_contents was called multiple times in a loop with the same arguments,
    except for the set of hashes to exclude.
    
    It means that, if there were N pages of hashes to exclude, each content was
    indexed N times; and the N-1 first iterations didn't even exclude all the
    hashes they had to exclude.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/143/ for more details.

This revision is now accepted and ready to land. Feb 1 2021, 2:57 PM