pattern: Ensure accurate origin counts returned by run method
Closed, Public

Authored by anlambert on Sep 28 2022, 4:06 PM.

Details

Summary

Previously, the run method returned the total count of ListedOrigin
objects sent to the scheduler database.

However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).

This change ensures an accurate count of listed origins by maintaining
a set of origin URLs associated with the sent ListedOrigin objects.
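
The idea can be illustrated with a minimal sketch. This is not the code from the diff: the ListedOrigin dataclass is a simplified stand-in for the scheduler model, and send_origins and run here are hypothetical helpers written only to show how counting unique URLs differs from counting sent objects.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ListedOrigin:
    """Simplified stand-in for the scheduler's ListedOrigin model (assumption)."""

    url: str
    visit_type: str = "git"


def send_origins(
    origins: Iterable[ListedOrigin], recorded: List[ListedOrigin]
) -> List[str]:
    """Hypothetical sender: record each origin and return the URLs sent."""
    urls = []
    for origin in origins:
        recorded.append(origin)  # stands in for a write to the scheduler database
        urls.append(origin.url)
    return urls


def run(pages: Iterable[List[ListedOrigin]]) -> dict:
    """Process every page and report both counts for comparison."""
    recorded: List[ListedOrigin] = []
    sent_urls: Set[str] = set()  # unique origin URLs seen across all pages
    objects_sent = 0  # old-style count: every ListedOrigin object sent
    for page in pages:
        urls = send_origins(page, recorded)
        objects_sent += len(urls)
        sent_urls.update(urls)
    return {"objects_sent": objects_sent, "origins": len(sent_urls)}
```

With two pages that both contain the same origin URL, objects_sent counts that origin twice while origins counts it once, which is the discrepancy described above.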

Diff Detail

Repository
rDLS Listers
Branch
lister-accurate-origin-counts
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31869
Build 49874: Phabricator diff pipeline on jenkins
Build 49873: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8565 (id=30896)

Rebasing onto 3928fc9ee9...

Current branch diff-target is up to date.
Changes applied before test
commit 6a68734e29c3f35461721956cd32f9944f876079
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Sep 28 15:58:00 2022 +0200

    pattern: Ensure accurate origin counts returned by run method
    

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/714/ for more details.

ardumont added a subscriber: ardumont.

lgtm

One question inline, where I got confused by the old code ¯\_(ツ)_/¯.

swh/lister/debian/lister.py:303

Are you sure about removing those?

Usually, afair, when a lister defines this, it's because it's dealing with huge pages of origins that it wants to regularly flush to the scheduler db.
Without this, if the lister somehow breaks, it can lose those "intermediary" results...

Although... now reading back pattern.Lister a bit, it looks like this made a redundant call after the main send_origins had already been called...
so picture me confused.

This revision is now accepted and ready to land. Sep 29 2022, 9:55 AM

swh/lister/debian/lister.py:303

This code was a hack to get an accurate count of Debian origins after listing, but it is no longer needed.

I forgot to create a second commit for these changes; I will do it before landing this.

Separate diff into two commits

Build is green

Patch application report for D8565 (id=30940)

Rebasing onto 3928fc9ee9...

Current branch diff-target is up to date.
Changes applied before test
commit 5426883c49ce0fc4a442b705ae573f868b9f7a62
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Sep 29 11:14:35 2022 +0200

    debian: Remove no longer needed code to get accurate origins count
    
    The base lister class now ensures the count of listed origins will
    be accurate.

commit 8d85b2e4e8d58278f4fb94ec6b056f62c66b7f06
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Sep 29 11:14:08 2022 +0200

    pattern: Ensure accurate origin counts returned by run method
    

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/717/ for more details.