Page MenuHomeSoftware Heritage

crates: create one origin per package instead of per version
ClosedPublic

Authored by franckbret on Apr 26 2022, 10:53 AM.

Details

Summary

Previously we had as many origins as version for a crate package, url was a link to a specific crate version package.

Refactor to have one origin per package name and add an 'artifacts' entry to extra_loader_arguments that list all versions, package url and checksum.
Origin url is now a link to the related http api endpoint for a package name.

Related to T4104

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 28934
Build 45230: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 45229: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D7654 (id=27696)

Rebasing onto d0924f39d0...

First, rewinding head to replay your work on top of it...
Applying: Change 'exxtra_loarder_arguments' arg 'name' to 'package_name', and rename 'visit_type' from 'rust-crate' to 'crates' in order to be consistent with what have been decided for the crate loader, see https://forge.softwareheritage.org/D7501.
Changes applied before test
commit 7cd75d3ccb2eb9f6d6d3b6b6cabe8095deb1709c
Author: Franck Bret <franck.bret@octobus.net>
Date:   Tue Apr 26 10:48:02 2022 +0200

    Change 'exxtra_loarder_arguments' arg 'name' to 'package_name', and rename 'visit_type' from 'rust-crate' to 'crates' in order to be consistent with what have been decided for the crate loader, see https://forge.softwareheritage.org/D7501.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/501/ for more details.

wait... we are creating one origin for each version of the crate?

Once @vlorentz's comment is adressed (we should list one origin, the package i guess, and
gather all crate versions under that same origin).

Make sure your commit message (one to one with the diff description) is one short
summary (in imperative form) [1]. For some extra description, use multiple lines.

[1] https://docs.softwareheritage.org/devel/contributing/git-style-guide.html

TIA

wait... we are creating one origin for each version of the crate?

Yes it is.

I guess Instead of returning an url for a version (i.e: https://static.crates.io/crates/{package_name}/{package_name}-{version}.crate ) lister should return an http api url (i.e : https://crates.io/api/v1/crates/{package_name} ) ?

Another way would be to return an url corresponding to the latest released version (i.e: https://static.crates.io/crates/{package_name}/{package_name}-{latest_version}.crate ).

What do you prefer?

Once @vlorentz's comment is adressed (we should list one origin, the package i guess, and
gather all crate versions under that same origin).

Make sure your commit message (one to one with the diff description) is one short
summary (in imperative form) [1]. For some extra description, use multiple lines.

[1] https://docs.softwareheritage.org/devel/contributing/git-style-guide.html

TIA

Sure, sorry for that.

wait... we are creating one origin for each version of the crate?

Yes it is.

I guess Instead of returning an url for a version (i.e: https://static.crates.io/crates/{package_name}/{package_name}-{version}.crate ) lister should return an http api url (i.e : https://crates.io/api/v1/crates/{package_name} ) ?

Yes, this one.

And i'd say, return the list of associated versioned artifacts as extra_loader_arguments.

@vlorentz thoughts?

wait... we are creating one origin for each version of the crate?

Yes it is.

I guess Instead of returning an url for a version (i.e: https://static.crates.io/crates/{package_name}/{package_name}-{version}.crate ) lister should return an http api url (i.e : https://crates.io/api/v1/crates/{package_name} ) ?

Yes, this one.

And i'd say, return the list of associated versioned artifacts as extra_loader_arguments.

@vlorentz thoughts?

Ok, but what will be the usage of versioned artifacts as extra_loader_arguments? (The http api endpoint, now the origin url, will return in any case all related released versions of the crate..)

One thing we should consider, is related to checksum.
The http api endpoint do not provide checksum, but the lister can provide checksum for each version. I we consider checksum utils for loader we can pass as extra_loader_arguments a list of dict like this :

artifacts = [ { "version": {version}, "download_url": https://static.crates.io/crates/{package_name}/{package_name}-{version}.crate, checksum: {cksum} }]

And one last possible thing if we go this way, would be to totally drop the http api call in the loader.
The loader can get all the needed fields for Release via intrinsic metadata (Cargo.toml).
The only thing we will miss from http api endpoint is "updated_at" info that represents the date a released package version has been updated at and is actualy used by the loader to fill the date field of a release.
Note that if do that the loader will receive an url it will never use, it can be confusing...

Ok, but what will be the usage of versioned artifacts as extra_loader_arguments?

We don't want to create one origin per version of a package.
We want all versions of a packages seen under the same origin (well, at the time of the ingestion).

See this detailed thread where we explicit the mistake for the maven ingestion [1]

[1] https://sympa.inria.fr/sympa/arc/swh-devel/2022-04/msg00043.html

Is that possible at all for the rust ingestion?

Cheers,

Ok, but what will be the usage of versioned artifacts as extra_loader_arguments?

We don't want to create one origin per version of a package.
We want all versions of a packages seen under the same origin (well, at the time of the ingestion).

See this thread where we made the mistake for the maven ingestion [1]

[1] https://sympa.inria.fr/sympa/arc/swh-devel/2022-04/msg00043.html

Is that possible at all for the rust ingestion?

Cheers,

Ok, got it for that point. Yes it's possible. Will go as you suggest, one origin with list of versions as extra_loader_arguments.

One thing we should consider, is related to checksum.
The http api endpoint do not provide checksum, but the lister can provide checksum for each version. I we consider checksum utils for loader we can pass as extra_loader_arguments a list of dict like this :

artifacts = [ { "version": {version}, "download_url": https://static.crates.io/crates/{package_name}/{package_name}-{version}.crate, checksum: {cksum} }]

Almost. Use this format: https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#original-artifacts-json

(Then in a future version of the loader, we can dump this in the right table of swh-storage for archival too)

And one last possible thing if we go this way, would be to totally drop the http api call in the loader.
The loader can get all the needed fields for Release via intrinsic metadata (Cargo.toml).
The only thing we will miss from http api endpoint is "updated_at" info that represents the date a released package version has been updated at and is actualy used by the loader to fill the date field of a release.

That seems to be valuable metadata; let's keep the request.

Note that if do that the loader will receive an url it will never use, it can be confusing...

That shouldn't be an issue

Refactor crates.io lister

Updating D7654: Refactor crates.io lister

Previously we had as many origins as version for a crate package, url was a link to a specific crate version package.

Refactor to have one origin per package name and add an 'artifacts' entry to extra_loader_arguments that list all versions, package url and checksum.
Origin url is now a link to the related http api endpoint for a package name.

franckbret retitled this revision from Change 'extra_loarder_arguments' arg 'name' to 'package_name', and rename 'visit_type' from 'rust-crate' to 'crates' in order to be consistent with what have been decided for the crate loader, see https://forge.softwareheritage.org/D7501. to Refactor crates.io lister.Apr 27 2022, 12:52 PM
franckbret edited the summary of this revision. (Show Details)

Build is green

Patch application report for D7654 (id=27815)

Rebasing onto c251594a1f...

First, rewinding head to replay your work on top of it...
Applying: Refactor crates.io lister
Changes applied before test
commit a29e5c05dd0fefa75055144677f78a4ff0ccbcc5
Author: Franck Bret <franck.bret@octobus.net>
Date:   Tue Apr 26 10:48:02 2022 +0200

    Refactor crates.io lister
    
    Previously we had as many origins as version for a crate package, url was a link
    to a specific crate version package.
    
    Refactor to have one origin per package name and add an 'artifacts' entry to
    extra_loader_arguments that list all versions, package url and checksum.
    Origin url is now a link to the related http api endpoint for a package name.
    
    Related to T4104

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/506/ for more details.

Looks good!

Could you change the commit/diff title to explain the change (it's not a refactoring)? eg. "Create one origin per package instead of per version"

franckbret edited the summary of this revision. (Show Details)

Updating D7654: Crates.io lister, create one origin per package instead of per version

franckbret retitled this revision from Refactor crates.io lister to Crates.io lister, create one origin per package instead of per version.Apr 27 2022, 2:34 PM

Build is green

Patch application report for D7654 (id=27825)

Rebasing onto c251594a1f...

First, rewinding head to replay your work on top of it...
Applying: Crates.io lister, create one origin per package instead of per version
Changes applied before test
commit 111f6fda9f1eb6df0256e23bdcc3f87df9c0bfe0
Author: Franck Bret <franck.bret@octobus.net>
Date:   Tue Apr 26 10:48:02 2022 +0200

    Crates.io lister, create one origin per package instead of per version
    
    Previously we had as many origins as version for a crate package, url was a link
    to a specific crate version package.
    
    Refactor to have one origin per package name and add an 'artifacts' entry to
    extra_loader_arguments that list all versions, package url and checksum.
    Origin url is now a link to the related http api endpoint for a package name.
    
    Related to T4104

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/507/ for more details.

I've run the lister in the docker environment, it looks good.

ardumont added 1 blocking reviewer(s): Reviewers.

lgtm

@vlorentz any more comments?

Added 'version' key to 'artifacts' dict.

This is handy for the loader to get the artifact line depending on a version.

Build is green

Patch application report for D7654 (id=27849)

Rebasing onto c251594a1f...

First, rewinding head to replay your work on top of it...
Applying: Crates.io lister, create one origin per package instead of per version
Changes applied before test
commit 40449d1d51f59db2abb6c4d2d068cbd04c3ed7a1
Author: Franck Bret <franck.bret@octobus.net>
Date:   Tue Apr 26 10:48:02 2022 +0200

    Crates.io lister, create one origin per package instead of per version
    
    Previously we had as many origins as version for a crate package, url was a link
    to a specific crate version package.
    
    Refactor to have one origin per package name and add an 'artifacts' entry to
    extra_loader_arguments that list all versions, package url and checksum.
    Origin url is now a link to the related http api endpoint for a package name.
    
    Related to T4104

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/508/ for more details.

Add missing "v1" in http api url

Build is green

Patch application report for D7654 (id=27851)

Rebasing onto c251594a1f...

First, rewinding head to replay your work on top of it...
Applying: Crates.io lister, create one origin per package instead of per version
Changes applied before test
commit f475875629ef05846f6f0d7d70b4184538152c4d
Author: Franck Bret <franck.bret@octobus.net>
Date:   Tue Apr 26 10:48:02 2022 +0200

    Crates.io lister, create one origin per package instead of per version
    
    Previously we had as many origins as version for a crate package, url was a link
    to a specific crate version package.
    
    Refactor to have one origin per package name and add an 'artifacts' entry to
    extra_loader_arguments that list all versions, package url and checksum.
    Origin url is now a link to the related http api endpoint for a package name.
    
    Related to T4104

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/509/ for more details.

Still looks good, but the first line commit message should be: crates: Create one origin per package instead of per version

("lister" is redundant given the repository it's in; and with the long prefix and capitalization it would look like the commit added the lister when skimming the history)

Updating D7654: crates, create one origin per package instead of per version

Amend commit message.

franckbret retitled this revision from Crates.io lister, create one origin per package instead of per version to crates: create one origin per package instead of per version.Apr 28 2022, 9:06 AM

Build is green

Patch application report for D7654 (id=27872)

Rebasing onto c251594a1f...

First, rewinding head to replay your work on top of it...
Applying: crates: Create one origin per package instead of per version
Changes applied before test
commit 1ada257230096c46eaf90a748e5d0a505f49fc41
Author: Franck Bret <franck.bret@octobus.net>
Date:   Tue Apr 26 10:48:02 2022 +0200

    crates: Create one origin per package instead of per version
    
    Previously we had as many origins as version for a crate package, url was a link
    to a specific crate version package.
    
    Refactor to have one origin per package name and add an 'artifacts' entry to
    extra_loader_arguments that list all versions, package url and checksum.
    Origin url is now a link to the related http api endpoint for a package name.
    
    Related to T4104

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/510/ for more details.

Thanks!

Don't forget to add yourself to the /CONTRIBUTORS file as well (same goes for the loader).
(I don't recall you did that already).

Cheers,

This revision is now accepted and ready to land.Apr 28 2022, 11:01 AM
This revision was landed with ongoing or failed builds.Apr 28 2022, 4:11 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7654 (id=27888)

Rebasing onto 985b71e80c...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-512-D7654.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/512/ for more details.