
packagist: Reimplement lister using new Lister API
Closed, Public

Authored by anlambert on Feb 2 2021, 11:04 AM.

Details

Summary

The previous implementation generated tasks for a Packagist loader
that has not been implemented.

The new implementation extracts the source repository URL, VCS type and
last update date for each package referenced by Packagist and sends
that information to the scheduler.
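
As an illustration of the extraction step, here is a minimal sketch that is
not taken from the diff itself. It assumes each package's metadata maps
version strings to dicts carrying a "source" entry (with "url" and "type")
and an ISO 8601 "time" field; the helper name extract_origin is hypothetical,
and the rules the actual lister uses to skip a package may differ.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


def extract_origin(
    versions: Dict[str, Dict[str, Any]],
) -> Optional[Tuple[str, str, Optional[datetime]]]:
    """Return (origin_url, visit_type, last_update), or None if no source is found.

    The input layout is an assumption made for this sketch.
    """
    origin_url: Optional[str] = None
    visit_type: Optional[str] = None
    last_update: Optional[datetime] = None

    for version_info in versions.values():
        # Use the first version that declares a source repository.
        source = version_info.get("source") or {}
        if origin_url is None and source.get("url"):
            origin_url = source["url"]
            visit_type = source.get("type")

        # Keep the most recent release time across all versions.
        time_str = version_info.get("time")
        if time_str:
            released = datetime.fromisoformat(time_str)
            if released.tzinfo is None:
                # Assumption: naive timestamps are treated as UTC.
                released = released.replace(tzinfo=timezone.utc)
            if last_update is None or released > last_update:
                last_update = released

    if origin_url is None or visit_type is None:
        return None
    return origin_url, visit_type, last_update
```

In this sketch a package without any source URL or VCS type yields None, so
the caller can leave it out of what is sent to the scheduler.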

Package metadata is retrieved using Packagist API endpoints whose
responses are served from static files, which are guaranteed to be
efficient on the Packagist side (no dynamic queries).
Furthermore, subsequent listings send the If-Modified-Since HTTP
header to retrieve only the package metadata updated since the previous
listing operation, which saves bandwidth and returns only origins
that might have newly released versions.
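
The conditional request can be sketched as follows. This uses only the
Python standard library and is not the lister's actual code; the helper
names conditional_headers and fetch_if_modified, as well as the User-Agent
value, are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_listing: Optional[datetime]) -> Dict[str, str]:
    """Build request headers, adding If-Modified-Since when a previous listing exists.

    last_listing is expected to be a timezone-aware datetime.
    """
    headers = {"User-Agent": "example-lister-sketch"}  # hypothetical value
    if last_listing is not None:
        # HTTP dates must be expressed in GMT, e.g. "Mon, 01 Feb 2021 16:34:10 GMT".
        headers["If-Modified-Since"] = format_datetime(
            last_listing.astimezone(timezone.utc), usegmt=True
        )
    return headers


def fetch_if_modified(url: str, last_listing: Optional[datetime]) -> Optional[bytes]:
    """Return the response body, or None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(last_listing))
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        if exc.code == 304:
            # Nothing changed since the previous listing: skip this package.
            return None
        raise
```

On a first listing there is no previous date, so no conditional header is
sent and every package is fetched in full.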

I tested the lister intensively yesterday and it worked without any
issues each time I executed it. The first execution took around 90 minutes
and listed 286510 origins with three different visit types: git, hg and
svn. Subsequent calls took less time thanks to the If-Modified-Since
HTTP header and only returned packages modified since the last listing.

Closes T2991

Diff Detail

Repository
rDLS Listers
Branch
packagist-lister-new-api
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18939
Build 29349: Phabricator diff pipeline on jenkins
Build 29348: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4990 (id=17798)

Rebasing onto 8e4dd178f1...

Current branch diff-target is up to date.
Changes applied before test
commit 478081c1513b240f85c78cc66e9a3109eff91608
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Feb 1 17:34:10 2021 +0100

    packagist: Reimplement lister using new Lister API
    
    The previous implementation was generating tasks for a non implemented
    Packagist loader.
    
    The new implementation extracts source repository URL, VCS type and
    last update date for each package referenced by Packagist and send
    those info to the scheduler.
    
    Packages metadata are retrieved using Packagist API endpoints whose
    responses are served from static files, which are guaranteed to be
    efficient on the Packagist side (no dymamic queries).
    Furthermore, subsequent listing will send the "If-Modified-Since" HTTP
    header to only retrieve packages metadata updated since the previous
    listing operation in order to save bandwidth and return only origins
    which might have new released versions.
    
    Closes T2991

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/234/ for more details.

lgtm

But it's missing some coverage on conditionals (according to jenkins).

Maybe simply enrich the current test dataset with some of
those skipped packages in the current new dataset you added
(one bitbucket entry, another with missing origin_url, another
with missing time, etc...)

swh/lister/packagist/lister.py
3

lol

(a regexp change gone rogue ;)

30

when*

> lgtm
>
> But it's missing some coverage on conditionals (according to jenkins).
>
> Maybe simply enrich the current test dataset with some of
> those skipped packages in the current new dataset you added
> (one bitbucket entry, another with missing origin_url, another
> with missing time, etc...)

Ack, will improve coverage then.

swh/lister/packagist/lister.py
3

lol, thanks for spotting!

Rebase and improve coverage

Build is green

Patch application report for D4990 (id=17810)

Rebasing onto 82ab96ad06...

Current branch diff-target is up to date.
Changes applied before test
commit ff05191b7db7b217c8682e9888338b8813e2df6a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Feb 1 17:34:10 2021 +0100

    packagist: Reimplement lister using new Lister API
    
    The previous implementation was generating tasks for a non implemented
    Packagist loader.
    
    The new implementation extracts source repository URL, VCS type and
    last update date for each package referenced by Packagist and send
    those info to the scheduler.
    
    Packages metadata are retrieved using Packagist API endpoints whose
    responses are served from static files, which are guaranteed to be
    efficient on the Packagist side (no dymamic queries).
    Furthermore, subsequent listing will send the "If-Modified-Since" HTTP
    header to only retrieve packages metadata updated since the previous
    listing operation in order to save bandwidth and return only origins
    which might have new released versions.
    
    Closes T2991

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/238/ for more details.

Thanks.

(I had forgotten to actually validate it ¯\_(ツ)_/¯ )

This revision is now accepted and ready to land.
Feb 2 2021, 3:00 PM