cpan: Improve listing process by querying the metacpan release endpoint
ClosedPublic

Authored by anlambert on Oct 4 2022, 5:17 PM.

Details

Summary

Instead of querying the metacpan distribution endpoint to list origins,
use the release endpoint, which makes it possible to list all artifacts
associated with CPAN packages by scrolling through the results.

Compared to the previous implementation, this makes it possible to compute
a last_update date for every CPAN package and to obtain artifact sha256
checksums, which the CPAN loader will use to check download integrity.
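
As a rough illustration of the paragraph above, here is a minimal sketch of how release entries could be grouped per distribution to derive a last_update date and collect sha256 checksums. The function name `group_releases` and the field names (`distribution`, `version`, `date`, `download_url`, `checksum_sha256`) are assumptions made for this example, not necessarily what the lister uses.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, Iterable


def group_releases(hits: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group release entries by distribution (hypothetical helper).

    For each distribution, keep the list of its artifacts with their sha256
    checksum and the most recent release date as ``last_update``.
    """
    packages: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"last_update": None, "artifacts": []}
    )
    for hit in hits:
        # Assumed field names; dates are assumed to be ISO 8601 strings.
        date = datetime.fromisoformat(hit["date"])
        entry = packages[hit["distribution"]]
        entry["artifacts"].append(
            {
                "url": hit["download_url"],
                "version": hit["version"],
                "checksums": {"sha256": hit["checksum_sha256"]},
            }
        )
        # last_update is the date of the most recent release seen so far.
        if entry["last_update"] is None or date > entry["last_update"]:
            entry["last_update"] = date
    return dict(packages)
```

Because every release contributes both an artifact and a date, a single pass over the results is enough to produce both pieces of information.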

It also saves a call to the metacpan Web API in the cpan loader, as all
needed information about package artifacts is now provided as extra
loader arguments.
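
For context, a download integrity check based on a sha256 checksum received as a loader argument could look like the sketch below. It relies only on the standard `hashlib` module; the function name `sha256_matches` and its parameters are hypothetical and do not reflect the loader's actual code.

```python
import hashlib
from typing import BinaryIO


def sha256_matches(
    stream: BinaryIO, expected_hex: str, chunk_size: int = 65536
) -> bool:
    """Return True if the sha256 of the stream equals the expected hex digest.

    The stream is read in chunks so that large archives do not need to be
    loaded fully into memory.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

A loader could call such a helper right after downloading an artifact and reject the download when it returns False.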

Related to T2833

When testing this in docker, I could list all CPAN packages and their
artifacts in less than 4 minutes.

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8615 (id=31113)

Rebasing onto 5daead68ad...

Current branch diff-target is up to date.
Changes applied before test
commit fdfd876de96b91d4adc27f81f7145b3d74508eca
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/748/ for more details.

Rebase and update cpan lister:

  • process all pages from the CPAN API before sending origins to the scheduler
  • add an extrinsic metadata URL for each module version and send it to the cpan loader
  • miscellaneous code improvements
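
The first point above can be sketched as follows: all scrolled pages are consumed and versions are accumulated per distribution before anything is emitted. This is only an illustration. The `fetch_page` callable, the Elasticsearch-style response layout (`_scroll_id`, `hits.hits[]._source`) and the helper names are assumptions, and HTTP handling is deliberately left out.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = Dict[str, Any]


def iter_release_hits(
    fetch_page: Callable[[Optional[str]], Page]
) -> Iterator[Dict[str, Any]]:
    """Yield release documents from every scrolled page.

    ``fetch_page`` is a hypothetical callable taking the current scroll id
    (None for the first request) and returning the decoded JSON page.
    """
    scroll_id: Optional[str] = None
    while True:
        page = fetch_page(scroll_id)
        hits = page.get("hits", {}).get("hits", [])
        if not hits:
            break  # an empty page marks the end of the scroll
        for hit in hits:
            yield hit["_source"]
        scroll_id = page.get("_scroll_id")
        if scroll_id is None:
            break


def collect_versions(
    fetch_page: Callable[[Optional[str]], Page]
) -> Dict[str, List[str]]:
    """Accumulate versions per distribution across all pages."""
    versions: Dict[str, List[str]] = {}
    for release in iter_release_hits(fetch_page):
        versions.setdefault(release["distribution"], []).append(release["version"])
    return versions
```

Only after `collect_versions` returns would origins be built and sent to the scheduler, since a given distribution may have releases on several pages.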

Build is green

Patch application report for D8615 (id=31230)

Rebasing onto 108816f232...

Current branch diff-target is up to date.
Changes applied before test
commit 2177ac9f5a08c2bd276f494b2aa4c8f0d4239e65
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint
    
    Instead of querying the metacpan distribution endpoint to list origins,
    use the release endpoint, which makes it possible to list all artifacts
    associated with CPAN packages by scrolling through the results.
    
    Compared to the previous implementation, this makes it possible to
    compute a last_update date for every CPAN package and to obtain artifact
    sha256 checksums, which the CPAN loader will use to check download
    integrity.
    
    As the versions of a module are spread across multiple pages of the
    CPAN API, origins are sent to the scheduler only once all pages have
    been processed; it is also faster to proceed that way.
    
    Also compute the extrinsic metadata URL for each Perl module version so
    that the cpan loader can query it.
    
    Related to T2833

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/774/ for more details.

vlorentz added inline comments.
swh/lister/cpan/lister.py
84

and move it outside the class

87–88

easier to read, IMO

90
137–139

I'd rather pass the BASE_URL and let the loader build this URL; it will allow changing loader behavior without changing the lister too.

swh/lister/cpan/lister.py
137–139

OK, but this means I have to add the release name to the module metadata; not a big deal though.
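
To illustrate the suggestion discussed here, the sketch below shows how a loader could build a metadata URL itself from a base URL and a release name passed by the lister. The function name `extrinsic_metadata_url` and the `/release/<name>` path layout are hypothetical and chosen only for this example; the real URL scheme may differ.

```python
from urllib.parse import quote


def extrinsic_metadata_url(api_base_url: str, release_name: str) -> str:
    """Build a metadata URL from a base URL and a release name.

    The "/release/<name>" layout is an assumption made for illustration.
    """
    base = api_base_url.rstrip("/")
    # Percent-encode the release name so it is safe as a single path segment.
    return f"{base}/release/{quote(release_name, safe='')}"
```

With this split, the lister only needs to provide the release name, and a change to the URL scheme stays local to the loader.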

Build has FAILED

Patch application report for D8615 (id=31252)

Rebasing onto 108816f232...

Current branch diff-target is up to date.
Changes applied before test
commit 5042a43e31c091d186a7e38c36df0235f6cd65e7
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint
    
    Instead of querying the metacpan distribution endpoint to list origins,
    use the release endpoint, which makes it possible to list all artifacts
    associated with CPAN packages by scrolling through the results.
    
    Compared to the previous implementation, this makes it possible to
    compute a last_update date for every CPAN package and to obtain artifact
    sha256 checksums, which the CPAN loader will use to check download
    integrity.
    
    As the versions of a module are spread across multiple pages of the
    CPAN API, origins are sent to the scheduler only once all pages have
    been processed; it is also faster to proceed that way.
    
    Related to T2833

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/777/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/777/console

Build has FAILED

Patch application report for D8615 (id=31258)

Rebasing onto 108816f232...

Current branch diff-target is up to date.
Changes applied before test
commit 5121157ce326d32411e32f9f984f9a1f6e8710ae
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/779/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/779/console

This revision is now accepted and ready to land. Oct 11 2022, 3:16 PM

Build is green

Patch application report for D8615 (id=31260)

Rebasing onto 108816f232...

Current branch diff-target is up to date.
Changes applied before test
commit e09a31c4c0072ff93453215aa772a7cfcabec5f1
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/781/ for more details.

remove no longer needed variable

Build is green

Patch application report for D8615 (id=31262)

Rebasing onto 108816f232...

Current branch diff-target is up to date.
Changes applied before test
commit f57b8f3a2c49080ae9bc11217b8d6ef4ed8c564e
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/783/ for more details.