Make the SourceForge lister incremental
Closed, Public

Authored by Alphare on Apr 30 2021, 11:10 PM.

Details

Summary

SourceForge's sitemaps (1 main one + many sharded) give us a "last
modified" date for every subsitemap and project, allowing us to perform
an incremental listing.

We store the subsitemaps' "last modified" dates in the lister state, as
well as those of the empty projects (projects which don't have any VCS
registered); the dates for the remaining projects come from the
already-visited origins stored in the database.

The tests try to cover the possible cases of a subsitemap that has
changed, one that hasn't, a project that has changed, one that hasn't,
and the same for an empty project.
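
The incremental check described in this summary can be sketched as follows. This is a minimal illustration, not the lister's actual code: changed_entries and both dictionaries are hypothetical names, and the sketch assumes an entry is relisted whenever its "last modified" date differs from the stored one (or when no date is stored yet).

```python
from datetime import date
from typing import Dict, Iterator, Tuple


def changed_entries(
    current: Dict[str, date], previously_seen: Dict[str, date]
) -> Iterator[Tuple[str, date]]:
    """Yield (url, last_modified) for entries that are new or have changed.

    `current` maps a subsitemap or project URL to the "last modified" date
    read from the sitemap; `previously_seen` holds the dates stored during
    the previous listing (hypothetical layout).
    """
    for url, last_modified in current.items():
        # An unknown URL returns None from .get(), so it is always yielded.
        if previously_seen.get(url) != last_modified:
            yield url, last_modified
```

Entries whose date matches the stored one are skipped, which is what makes the listing incremental.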

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5659 (id=20212)

Rebasing onto 6f8dd5d3f2...

Current branch diff-target is up to date.
Changes applied before test
commit 9aaa7f7795b20aacb62f2b196231629468926d7c
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Fri Apr 30 21:46:29 2021 +0200

    Make the SourceForge lister incremental
    
    SourceForge's sitemaps (1 main one + many sharded) give us a "last
    modified" date for every subsitemap and project, allowing us to perform
    an incremental listing.
    
    We store the subsitemaps' "last modified" dates in the lister state, as
    well as those of the empty projects (projects which don't have any VCS
    registered), and the rest comes from the already visited origins from
    the database.
    
    The tests try to cover the possible cases of a subsitemap that has
    changed, one that hasn't, a project that has changed, one that hasn't,
    and same for an empty project.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/272/ for more details.

Nice trick.

Jenkins/Phabricator reports that most of the new code in listed_origins and _get_pages_for_project is not covered by tests, could you check? (The report is sometimes wrong; we don't understand why.)

swh/lister/sourceforge/lister.py
57–67

Could you make these comments docstrings, so they show up in the docs?

And don't you mean they "have no VCS for us"?

158–159

:D

If it's an issue, wouldn't a dict with (namespace, project) as key and URLs as values perform better?
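
A minimal sketch of that dict-based layout, for illustration only: index_urls, ProjectKey and the (namespace, project, url) row shape are assumptions, not names taken from the lister.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, List, Tuple

# Hypothetical alias for the dictionary key: (namespace, project).
ProjectKey = Tuple[str, str]


def index_urls(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[ProjectKey, List[str]]:
    """Group URLs by (namespace, project) so that lookups are O(1)."""
    index: DefaultDict[ProjectKey, List[str]] = defaultdict(list)
    for namespace, project, url in rows:
        index[(namespace, project)].append(url)
    # Return a plain dict so that missing keys raise KeyError.
    return dict(index)
```

Looking up a project is then a single hash lookup instead of a scan over every known URL.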

296–301

Could you repeat here what "empty project" means?

317–318

Couldn't this happen if this is a new project, or if the project added a VCS since the last listing?

swh/lister/sourceforge/tests/test_lister.py
253–256

Can you compare the values?

Alphare marked 3 inline comments as done.

Fix incremental testing + incorporate suggestions

Build is green

Patch application report for D5659 (id=20214)

Rebasing onto 6f8dd5d3f2...

Current branch diff-target is up to date.
Changes applied before test
commit 934301cec325b4f0cef1ae0b77a64f36363bb94c
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Fri Apr 30 21:46:29 2021 +0200


See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/273/ for more details.

Jenkins/Phabricator reports that most of the new code in listed_origins and _get_pages_for_project is not covered by tests, could you check? (The report is sometimes wrong; we don't understand why.)

That would be because I forgot to add incremental=True to my lister 👼 . I'll see with the next refresh what the coverage looks like.

I've made it so it would break if that happened again.

swh/lister/sourceforge/lister.py
158–159

This is clearly a case of trying to be too clever at 10pm on a Friday, I think a dict will work fine, hehe.

317–318

Yep, it could. I logged a less scary message which would still allow us to debug in case of a mistake.

swh/lister/sourceforge/tests/test_lister.py
253–256

I'm not sure which values you're referring to.

Build is green

Patch application report for D5659 (id=20215)

Rebasing onto 6f8dd5d3f2...

Current branch diff-target is up to date.
Changes applied before test
commit f41424e13ea16b6aa6b295a2a99d00c3b628904f
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Fri Apr 30 21:46:29 2021 +0200


See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/274/ for more details.

swh/lister/sourceforge/tests/test_lister.py
253–256

nvm, my comment doesn't make sense. I must have read your code as assert len(stats.pages) == 1

douardda added a subscriber: douardda.

Besides the type aliasing statements, which I find a bit confusing, LGTM.

swh/lister/sourceforge/lister.py
51

At first I thought these had better be type annotations (instead of assignments), but I was wrong. Maybe postfix them with a T (e.g. ProjectNameT) to make it clearer these are actually type aliases?

Also, LastModified is really only a date (with no time)? (edit: looks so, according to the tests below)

This revision is now accepted and ready to land. May 4 2021, 5:17 PM
swh/lister/sourceforge/lister.py
51

Why would these "better be type annotations"? I'm not overly familiar with Python explicit typing, so I'm happy to learn.
I have no problem with postfixing them with a T if that's clearer.
LastModified is indeed a date since SourceForge only provides us with date granularity.
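
For readers following the type alias discussion, the difference between an alias and an annotation can be sketched like this. ProjectNameT is the name suggested in the review; LastModifiedT, empty_projects and record are assumed here for illustration and may not match the lister's code.

```python
import datetime
from typing import Dict

# Type aliases: plain assignments whose right-hand side is a type.
ProjectNameT = str
# Assumed name; a date with no time component, since only day
# granularity is available.
LastModifiedT = datetime.date

# Type annotation: declares the type of a variable. The aliases are used
# inside the annotation, while the value itself is an ordinary dict.
empty_projects: Dict[ProjectNameT, LastModifiedT] = {}


def record(project: ProjectNameT, last_modified: LastModifiedT) -> None:
    """Remember the "last modified" date of an empty project."""
    empty_projects[project] = last_modified
```

The T postfix signals that a name stands for a type rather than a value, which is the confusion the review comment points at.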

Postfix the type aliases with "T"

Build has FAILED

Patch application report for D5659 (id=20327)

Rebasing onto 6f8dd5d3f2...

Current branch diff-target is up to date.
Changes applied before test
commit 3baf1d0999406492611df4db5c0774cb72850dc3
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Fri Apr 30 21:46:29 2021 +0200


Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/275/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/275/console

This revision was landed with ongoing or failed builds. May 6 2021, 10:32 AM
This revision was automatically updated to reflect the committed changes.