Updating 3baf8bb..cd42c66
Fast-forward
 swh/indexer/fossology_license.py  | 21 ++++++++--------
 swh/indexer/indexer.py            | 46 +++++++++++++++++-----------------
 swh/indexer/mimetype.py           | 23 +++++++++--------
 swh/indexer/tests/test_indexer.py | 52 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 94 insertions(+), 48 deletions(-)

Changes applied before test

commit cd42c667212a8a37a080fb3aed915ade93704ca4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 1 14:57:20 2021 +0100

    indexer: Remove pagination logic using stream_results() instead.
    
    Simpler code and less error-prone.

commit 4080b9ee931fe914a91addf1df2d160e56a2d8bb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Feb 1 14:41:23 2021 +0100

    ContentPartitionIndexer: Do not index the same content multiple times at once.
    
    self._index_contents was called multiple times in a loop with the same arguments,
    except for the set of hashes to exclude.
    
    It means that, if there were N pages of hashes to exclude, each content was
    indexed N times; and the N-1 first iterations didn't even exclude all the
    hashes they had to exclude.
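
The duplicate-indexing problem described in commit 4080b9ee can be illustrated with a small self-contained sketch. This is not the swh.indexer code: the names make_fetch_page, stream_all, index_per_page and index_once are hypothetical, and the shape of the flawed loop (an exclusion set that grows page by page) is an assumption inferred from the commit message. stream_all is only a stand-in for the idea behind stream_results(), not its actual implementation.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical paginated endpoint: returns one page of already-indexed hashes
# and a token for the next page (None once the last page has been returned).
Page = Tuple[List[str], Optional[int]]
FetchPage = Callable[[Optional[int]], Page]


def make_fetch_page(indexed: List[str], page_size: int) -> FetchPage:
    """Build a fake paginated endpoint over a fixed list of hashes."""

    def fetch_page(page_token: Optional[int] = None) -> Page:
        start = page_token or 0
        end = start + page_size
        next_token = end if end < len(indexed) else None
        return indexed[start:end], next_token

    return fetch_page


def stream_all(fetch_page: FetchPage) -> Iterator[str]:
    """Follow page tokens until exhaustion, yielding every result once."""
    token: Optional[int] = None
    while True:
        results, token = fetch_page(token)
        yield from results
        if token is None:
            return


def index_per_page(contents: List[str], fetch_page: FetchPage) -> List[str]:
    """Flawed pattern (assumed): one indexing pass per page of exclusions."""
    indexed_calls: List[str] = []
    exclude = set()
    token: Optional[int] = None
    while True:
        page, token = fetch_page(token)
        exclude.update(page)
        # Same contents every iteration; only the exclusion set differs,
        # and it is incomplete on every pass except the last one.
        indexed_calls.extend(c for c in contents if c not in exclude)
        if token is None:
            return indexed_calls


def index_once(contents: List[str], fetch_page: FetchPage) -> List[str]:
    """Fixed pattern: collect all exclusions first, then index a single time."""
    exclude = set(stream_all(fetch_page))
    return [c for c in contents if c not in exclude]


contents = ["a", "b", "c", "d", "e"]
fetch_page = make_fetch_page(["a", "b", "c", "d"], page_size=2)  # N = 2 pages

flawed = index_per_page(contents, fetch_page)
fixed = index_once(contents, fetch_page)
```

With two pages of exclusions, the flawed loop processes "e" twice and also processes "c" and "d" on the first pass, although both are already indexed. The single-pass version processes only "e", once.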

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/144/ for more details.

[Feb 1 2021, 3:00 PM] Harbormaster completed remote builds in B18916: Diff 17775.

[Feb 1 2021, 3:00 PM] vlorentz requested review of this revision.

[Feb 1 2021, 3:01 PM] ardumont accepted this revision.

[Feb 1 2021, 3:01 PM] This revision is now accepted and ready to land.

[Feb 1 2021, 3:02 PM] Closed by commit rDCIDXcd42c667212a: indexer: Remove pagination logic using stream_results() instead. (authored by vlorentz).

This revision was automatically updated to reflect the committed changes.

vlorentz added a commit: rDCIDXcd42c667212a: indexer: Remove pagination logic using stream_results() instead.