indexer: Remove pagination logic using stream_results() instead.
Closed, Public

Authored by vlorentz on Feb 1 2021, 2:57 PM.

Details

Summary

Simpler code and less error-prone.
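
To illustrate what the summary means by "simpler", here is a minimal sketch comparing a hand-written pagination loop with a streaming helper. It is not the code of this diff: `Page`, `list_items`, and `stream` are hypothetical stand-ins defined locally, and `stream` only approximates what a `stream_results()`-style helper is assumed to do (call a paginated function repeatedly and yield the items of each page).

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Hypothetical paged result: some items plus a continuation token."""
    results: List[int]
    next_page_token: Optional[int]


def list_items(page_token: Optional[int] = None, limit: int = 2) -> Page:
    """Hypothetical paginated endpoint serving the integers 0..4."""
    data = [0, 1, 2, 3, 4]
    start = page_token or 0
    end = start + limit
    # The token is None once the last page has been served.
    return Page(data[start:end], end if end < len(data) else None)


def stream(f: Callable[..., Page], *args, **kwargs) -> Iterator[int]:
    """Local sketch of a stream_results()-like helper (an assumption,
    not the library's implementation): it owns the token loop."""
    page_token = None
    while True:
        page = f(*args, page_token=page_token, **kwargs)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break


def collect_manually() -> List[int]:
    """Caller-side pagination: every call site repeats the token handling."""
    items: List[int] = []
    page_token = None
    while True:
        page = list_items(page_token=page_token)
        items.extend(page.results)
        page_token = page.next_page_token
        if page_token is None:
            break
    return items


def collect_streaming() -> List[int]:
    """Same result, with the token handling delegated to the helper."""
    return list(stream(list_items))
```

With the helper, each call site shrinks to a single iteration over `stream(...)`, which leaves fewer places to mishandle the page token.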

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4983 (id=17775)

Could not rebase; Attempt merge onto 3baf8bb919...

Updating 3baf8bb..cd42c66
Fast-forward
 swh/indexer/fossology_license.py  | 21 ++++++++--------
 swh/indexer/indexer.py            | 46 +++++++++++++++++-----------------
 swh/indexer/mimetype.py           | 23 +++++++++--------
 swh/indexer/tests/test_indexer.py | 52 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 94 insertions(+), 48 deletions(-)
Changes applied before test
commit cd42c667212a8a37a080fb3aed915ade93704ca4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 1 14:57:20 2021 +0100

    indexer: Remove pagination logic using stream_results() instead.
    
    Simpler code and less error-prone.

commit 4080b9ee931fe914a91addf1df2d160e56a2d8bb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 1 14:41:23 2021 +0100

    ContentPartitionIndexer: Do not index the same content multiple times at once.
    
    self._index_contents was called multiple times in a loop with the same arguments,
    except for the set of hashes to exclude.
    
    It means that, if there were N pages of hashes to exclude, each content was
    indexed N times; and the N-1 first iterations didn't even exclude all the
    hashes they had to exclude.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/144/ for more details.
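
The bug described in commit 4080b9ee can be pictured with a toy model. This is a sketch under stated assumptions, not the actual ContentPartitionIndexer code: the content names, the page layout, and the functions `index_contents`, `buggy`, and `fixed` are all made up for illustration, and the exclusion set is assumed to grow as each page of hashes arrives.

```python
from typing import Iterable, List, Set

# Hypothetical data: four contents in a partition, and the hashes to
# exclude arriving in N = 2 pages.
CONTENTS = ["a", "b", "c", "d"]
EXCLUDE_PAGES = [["a"], ["b"]]


def index_contents(contents: Iterable[str], exclude: Set[str]) -> List[str]:
    """Toy indexer: returns the contents it indexed, skipping excluded ones."""
    return [c for c in contents if c not in exclude]


def buggy() -> List[str]:
    """Indexes once per page of exclusions, with a partial exclusion set
    on every iteration except the last."""
    indexed: List[str] = []
    exclude: Set[str] = set()
    for page in EXCLUDE_PAGES:
        exclude.update(page)
        indexed.extend(index_contents(CONTENTS, exclude))
    return indexed


def fixed() -> List[str]:
    """Gathers every exclusion first, then indexes exactly once."""
    exclude = {h for page in EXCLUDE_PAGES for h in page}
    return index_contents(CONTENTS, exclude)
```

In this model `buggy()` indexes "c" and "d" twice (once per page) and also indexes "b" during the first iteration, before its hash has been seen, while `fixed()` indexes only "c" and "d", once each.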

This revision is now accepted and ready to land. Feb 1 2021, 3:01 PM