
pubdev: Modify origin URL and retrieve package last update
Closed, Public

Authored by anlambert on Aug 31 2022, 11:38 AM.

Details

Summary

Use https://pub.dev/packages/{pkgname} as origin URL for a package
instead of https://pub.dev/api/packages/{pkgname}.

Fetch package versions info directly in the lister in order to
compute a last update date to send to the scheduler database.

Related to T4465
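
The two changes in the summary can be sketched as follows. This is a minimal illustration, not the lister's actual code: the helper names (`origin_url`, `parse_published`, `last_update`) are hypothetical, and the response shape (a `versions` list whose entries carry an ISO 8601 `published` timestamp) is an assumption about what `/api/packages/{pkgname}` returns.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def origin_url(pkgname: str) -> str:
    """Build the origin URL from the package page, not the API endpoint."""
    return f"https://pub.dev/packages/{pkgname}"


def parse_published(value: str) -> datetime:
    """Parse an ISO 8601 timestamp (hypothetical helper).

    datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11
    onwards, so it is rewritten as an explicit UTC offset first.
    """
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def last_update(metadata: Dict[str, Any]) -> Optional[datetime]:
    """Return the most recent publication date among all versions.

    Assumes metadata["versions"] is a list of dicts with a "published" key;
    returns None when no publication date is available.
    """
    dates = [
        parse_published(version["published"])
        for version in metadata.get("versions", [])
        if "published" in version
    ]
    return max(dates) if dates else None


# Usage with a hand-written payload (not real pub.dev data).
sample = {
    "versions": [
        {"version": "1.0.0", "published": "2021-01-02T03:04:05.000000Z"},
        {"version": "1.1.0", "published": "2022-08-31T09:31:09.123456Z"},
    ]
}
print(origin_url("example_pkg"), last_update(sample))
```

Taking the maximum over all versions, rather than trusting the order of the list, keeps the sketch independent of how the API sorts its entries.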

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8354 (id=30160)

Rebasing onto c6ce862d32...

Current branch diff-target is up to date.
Changes applied before test
commit 605f4991447f7e97d9a1d18b372e6416ebb52620
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Aug 31 11:31:09 2022 +0200

    pubdev: Improve lister implementation
    
    Use https://pub.dev/packages/{pkgname} as origin URL for a package
    instead of https://pub.dev/api/packages/{pkgname}.
    
    Fetch package versions info directly in the lister in order to
    compute a last update date to send to scheduler datatabase.
    
    Pass package versions info as loader extra arguments to avoid
    fetching it again.
    
    Related to T4465

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/635/ for more details.

Check last_update date extraction for a package.

Use https://pub.dev/packages/{pkgname} as origin URL for a package
instead of https://pub.dev/api/packages/{pkgname}.

I'm fine with that

Fetch package versions info directly in the lister in order to
compute a last update date to send to the scheduler database.

What is the advantage over making the loader do it?

Pass package versions info as loader extra arguments to avoid
fetching it again.

I would rather avoid that, extra_loader_arguments will bloat the scheduler database.

Build is green

Patch application report for D8354 (id=30163)

Rebasing onto c6ce862d32...

Current branch diff-target is up to date.
Changes applied before test
commit 4b72f907de9c61316467214d3e250db69e3b4b8b
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Wed Aug 31 11:31:09 2022 +0200

    pubdev: Improve lister implementation
    
    Use https://pub.dev/packages/{pkgname} as origin URL for a package
    instead of https://pub.dev/api/packages/{pkgname}.
    
    Fetch package versions info directly in the lister in order to
    compute a last update date to send to scheduler datatabase.
    
    Pass package versions info as loader extra arguments to avoid
    fetching it again.
    
    Related to T4465

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/636/ for more details.

Instead of using the /api/package-names endpoint to list packages, we could use the /api/packages one, as it
returns info about the latest version of each package.

However, the publication date for a package is missing from the /api/packages responses. I have created an issue
in the pub.dev repository on the subject. Let's wait and see if that feature request is accepted before updating this diff.

Below is a pub.dev developer's answer to this issue:

/api/packages is not an officially supported endpoint (not used by the pub client, and not listed on the https://pub.dev/help/api page).
While we don't have an immediate plan to discontinue it, we wouldn't really want to encourage people using it, as there is a chance that in the future it is going away.

/api/packages/<package> is part of the official endpoints, it is heavily used and it is cached. 
We don't mind if you access these endpoints for all of the packages periodically.

For package name discovery, there is also /api/package-names, which is a more lightweight API than /api/packages. 
While it is not official either, I'd rather encourage using /api/package-names instead of the /api/packages.

So the approach implemented in this diff to get package last update dates, one request to
/api/packages/<package> per listed package, seems viable and can be landed.
Nevertheless, package metadata should not be set as extra loader arguments, to avoid bloating the
scheduler database. I will update the diff accordingly.

Update:

  • Split changes into two commits:
    • pubdev: Modify origin URL for listed packages
    • pubdev: Retrieve last publication date for each listed package
  • Do not set package_metadata as extra loader arguments
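
A rough sketch of the resulting listing flow, under stated assumptions: `Origin` is a hypothetical stand-in for the record sent to the scheduler, `fetch_json` is an injected callable (any HTTP client could back it), and the `packages` key of the `/api/package-names` response and the `published` field of each version are assumed rather than taken from this diff.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Callable, Dict, Iterator, Optional

API = "https://pub.dev/api"


@dataclass
class Origin:
    """Hypothetical stand-in for the record sent to the scheduler."""

    url: str
    last_update: Optional[datetime]
    extra_loader_arguments: Dict[str, Any] = field(default_factory=dict)


def list_origins(fetch_json: Callable[[str], Dict[str, Any]]) -> Iterator[Origin]:
    """Yield one lightweight origin per package name."""
    # Discover package names with the lightweight endpoint
    # (the "packages" key is an assumption).
    names = fetch_json(f"{API}/package-names")["packages"]
    for name in names:
        # One extra request per package, against the official cached endpoint.
        metadata = fetch_json(f"{API}/packages/{name}")
        dates = [
            # Rewrite a trailing "Z" so fromisoformat() accepts it before 3.11.
            datetime.fromisoformat(version["published"].replace("Z", "+00:00"))
            for version in metadata.get("versions", [])
            if "published" in version
        ]
        # Only the URL and the date are kept; the fetched metadata is
        # deliberately not stored in extra_loader_arguments.
        yield Origin(
            url=f"https://pub.dev/packages/{name}",
            last_update=max(dates, default=None),
        )


# Usage with canned responses instead of real HTTP calls.
canned = {
    f"{API}/package-names": {"packages": ["pkg_a", "pkg_b"]},
    f"{API}/packages/pkg_a": {
        "versions": [
            {"published": "2022-01-01T00:00:00.000000Z"},
            {"published": "2022-09-02T14:18:12.000000Z"},
        ]
    },
    f"{API}/packages/pkg_b": {"versions": []},
}
origins = list(list_origins(canned.__getitem__))
```

Injecting the fetcher keeps the sketch testable offline; the trade-off discussed above is visible in the loop, which costs one request per package but leaves nothing bulky in the scheduler record.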
anlambert retitled this revision from pubdev: Improve lister implementation to pubdev: Modify origin URL and retrieve package last update. Sep 2 2022, 4:53 PM
anlambert edited the summary of this revision.

Build is green

Patch application report for D8354 (id=30269)

Rebasing onto b6c69e5075...

Current branch diff-target is up to date.
Changes applied before test
commit 44560c2383bd5170a3ed9c02de115d460fd514d3
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Sep 2 16:18:12 2022 +0200

    pubdev: Retrieve last publication date for each listed package
    
    In order to get a last_update for each ListedOrigin sent to scheduler
    database, send an extra HTTP request for each listed package to the
    /api/packages/<package_name> endpoint of pub.dev API.
    
    A pub.dev developer inform us that endpoint is heavily used and cached
    so there is no particular issues to query that endpoint for each package
    in a row periodically.

commit 49b79b07593637862d34c22163159bee8116da48
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Sep 2 16:12:13 2022 +0200

    pubdev: Modify origin URL for listed packages
    
    Use https://pub.dev/packages/<package_name> instead of
    https://pub.dev/api/packages/<package_name>

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/642/ for more details.

Hi, looks good to me too. Will test in Docker once it's merged.

This revision is now accepted and ready to land. Sep 5 2022, 3:52 PM