Page MenuHomeSoftware Heritage

Rust lang, Crates loader
ClosedPublic

Authored by franckbret on Apr 5 2022, 10:34 AM.

Details

Summary

First implementation of Crates loader.

The implementation extract both intrinsic and extrinsic metadata :

Extrinsic metadata is used to get versions of all released package. It's also used, for a specific version, to get the update date, and optionally, authors and description in case we can't get those from Intrinsic Metadata.

Related to T4104

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Looks good, thanks. I have a bunch of minor comments/questions below

Can you remove types from docstrings (both params and returns)? We don't use them on modules with type annotations.

swh/loader/package/crates/loader.py
20–21

They don't seem to be used yet; remove them for now.

24

attributes on a Dict are ignored

92

ditto

154–155

I don't understand; why does the CratesPackageInfo, which is version-specific, have a list of all versions?

160–167
181–183

Equivalent but simpler

187–188

What is the extrinsic metadata based on?

Also, this function is missing two tests for missing intrinsic metadata (with and without the extrinsic metadata)

Same comment for extract_description below

238

For consistency with other loaders, I'd go with visit_type = "crates" (to reference the website and the code's directory)

267–269

checksum and version are unused; I assume a future diff will use them?

273–274

simpler

298–299

to format as a code block

311–312
337–339

You should keep all of them. See what the PyPI loader does when there is more than one tarball for a given version. (in summary, one yield for each of the file names)

367
374–378

Seems technically more accurate, but is less understandable.

What do you think?

swh/loader/package/crates/tests/test_crates.py
37–39

why .get() instead of []? (Same question for below)

This revision now requires changes to proceed.Apr 20 2022, 10:00 AM
ardumont added inline comments.
swh/loader/package/crates/loader.py
53
176

you mention its type directly so no need to repeat it.

195
223
314–315
321
324
326
328

I'm for dropping that untested conditional (that should not happen given how package loader is implemented... [1]).

[1] unless crate declaration gets inconsistent (given the current implem)? Does that happen?

Some fixes after @Alphare review. Mainly add 'name' and 'version' to loader args as those are given through the 'extra_loader_arguments' from lister.

One important thing to note after adding 'name' and 'version' args is that it does not work with the CLI in the docker environment..

franck@debian-franck:~/workspace/swh-environment/docker$ docker-compose exec swh-loader swh loader run crates "https://static.crates.io/crates/micro-timer/micro-timer-0.4.0.crate" version="0.4.0" name="micro-timer"

Traceback (most recent call last):
...
  File "/src/swh-loader-core/swh/loader/cli.py", line 104, in run
    loader = get_loader(type, url=url, storage=conf["storage"], **kw)
TypeError: get_loader() got multiple values for argument 'name'

What do you think about that ?
Better renaming 'name' to 'package_name' in the crate lister and loader or should the cli arg evolve to something like 'loader_name'?

swh/loader/package/crates/loader.py
24

Ok, I've changed to TypedDict, but it forced me to add 3 # type: ignore[misc]
I' ve not found another way for mypy to be happy, mainly because unpacking..

see https://github.com/python/mypy/issues/11753

154–155

You're right, its not used. If needed versions can be retrieved from extrinsic package metadata

187–188

The extrinsic metadata is fetched from http api.
I've added some fake crate with partial intrinsic and/or extrinsic metadata related to author and message + 3 tests to check every case.

267–269

'checksum' and 'version' are provided by the crate lister through 'extra_loader_arguments'.

'checksum' can be used when going incremental.

Regarding 'version', you are right, we don't use self.version anywhere. I'm confused about what to do, if we remove version here we must remove it from the crate lister as it become useless.

What do you think?

328

Ok, i've dropped it. It should not happen in the loader context. The only case I can imagine is via the CLI, if you manually enter an url with an unexisting version number.

337–339

Not sure it's exactly the same case.
There can't be more than one .crate file for a given version.

swh/loader/package/crates/tests/test_crates.py
37–39

You're right no reason to do that here..

franckbret marked 5 inline comments as done.

Fixes after ardumont and vlorentz review

Build has FAILED

Patch application report for D7501 (id=27647)

Rebasing onto c4c0a9b5aa...

First, rewinding head to replay your work on top of it...
Applying: Rust lang, Crates loader
Using index info to reconstruct a base tree...
M	.pre-commit-config.yaml
Falling back to patching base and 3-way merge...
Auto-merging .pre-commit-config.yaml
CONFLICT (content): Merge conflict in .pre-commit-config.yaml
Patch failed at 0001 Rust lang, Crates loader

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto c4c0a9b5aa...

Already up to date.
Changes applied before test
commit ca5895a2521234c00b859a382aeccdd68d625178
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Apr 4 17:18:51 2022 +0200

    Rust lang, Crates loader

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/730/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/730/console

Build has FAILED

Patch application report for D7501 (id=27650)

Rebasing onto c4c0a9b5aa...

Current branch diff-target is up to date.
Changes applied before test
commit 14244c2a8395187973887c465acd28c80d67552f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Apr 4 17:18:51 2022 +0200

    Rust lang, Crates loader

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/732/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/732/console

Trying another alternative to make mypy happy (don't get why got no error in dev env running tox -e mypy or mypy -c mypy.ini but fail on CI..)

Build is green

Patch application report for D7501 (id=27654)

Rebasing onto c4c0a9b5aa...

Current branch diff-target is up to date.
Changes applied before test
commit ccad689df6df1b06181480a4adbb27a6459c1ddd
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Apr 4 17:18:51 2022 +0200

    Rust lang, Crates loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/734/ for more details.

Not sure you have seen this comment before, and because its an important point to go further, I repost :

One important thing to note after adding 'name' and 'version' args is that it does not work with the CLI in the docker environment..

franck@debian-franck:~/workspace/swh-environment/docker$ docker-compose exec swh-loader swh loader run crates "https://static.crates.io/crates/micro-timer/micro-timer-0.4.0.crate" version="0.4.0" name="micro-timer"

Traceback (most recent call last):
...

File "/src/swh-loader-core/swh/loader/cli.py", line 104, in run
  loader = get_loader(type, url=url, storage=conf["storage"], **kw)

TypeError: get_loader() got multiple values for argument 'name'

What do you think about that ?
Better renaming 'name' to 'package_name' in the crate lister and loader or should the cli arg evolve to something like 'loader_name'?

@vlorentz @ardumont Any thoughts about this?

Add crates package loader specification

Build has FAILED

Patch application report for D7501 (id=27671)

Rebasing onto c4c0a9b5aa...

Current branch diff-target is up to date.
Changes applied before test
commit 69b3adc253ad85ea7ab7abe6e62799143bf1bb36
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Apr 4 17:18:51 2022 +0200

    Rust lang, Crates loader

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/736/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/736/console

fix missing whitespace in loader specification documentation

swh/loader/package/crates/loader.py
187–188

but the information from the http api is based only on the content of the crate, right?

267–269

I see. Let's just not set the argument, then. It will be easier to re-add it later if we ever need it.

337–339

ok; then could you either make it an assertion, or use unpacking?

(crate_version,) = [
    crate for crate in e_metadata["versions"] if crate["num"] == version
]

Build is green

Patch application report for D7501 (id=27672)

Rebasing onto c4c0a9b5aa...

Current branch diff-target is up to date.
Changes applied before test
commit 133f6045088b3b70d636f5eb3c7a5a5a3e927ce9
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Apr 4 17:18:51 2022 +0200

    Rust lang, Crates loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/737/ for more details.

swh/loader/package/crates/loader.py
337–339

what @vlorentz suggested for the implementation but also add your comment about "one .crate file for a given version"

Better renaming 'name' to 'package_name' in the crate lister and loader or should the cli arg evolve to something like 'loader_name'?

The latter. It keeps names used in the loader consistent.

Not sure you have seen this comment before, and because its an important point to go further, I repost :

One important thing to note after adding 'name' and 'version' args is that it does not work with the CLI in the docker environment..

franck@debian-franck:~/workspace/swh-environment/docker$ docker-compose exec swh-loader swh loader run crates "https://static.crates.io/crates/micro-timer/micro-timer-0.4.0.crate" version="0.4.0" name="micro-timer"

Traceback (most recent call last):
...
File "/src/swh-loader-core/swh/loader/cli.py", line 104, in run
  loader = get_loader(type, url=url, storage=conf["storage"], **kw)
  TypeError: get_loader() got multiple values for argument 'name'

What do you think about that ?
Better renaming 'name' to 'package_name' in the crate lister and loader or should the cli arg evolve to something like 'loader_name'?

@vlorentz @ardumont Any thoughts about this?

yes, rename to package_name, the other change would make for a much wider impact (than you probably want to deal with ;).

yes, rename to package_name, the other change would make for a much wider impact (than you probably want to deal with ;).

It would not. The get_loader function is not used anywhere outside of this cli.py

.pre-commit-config.yaml
22–25

please remove this, it loses commit messages it doesn't like

yes, rename to package_name, the other change would make for a much wider impact (than you probably want to deal with ;).

It would not. The get_loader function is not used anywhere outside of this cli.py

right, nvm then.

I was thinking of something else (iirc, there is something similar in the lister or something)

i'd prefer that renaming nonetheless though, it's more explicit.

franckbret marked 7 inline comments as done.

Refactor according to @vlorentz & @ardumont comments.

Remove fallback logic from intrinsic to extrinsic metadata, it was useless...

Rename 'name' loader arg to 'package_name' in order to avoid arg name collision when running the loader via CLI. Note that to be consistent we need to do the same change for the crate lister.

swh/loader/package/crates/loader.py
187–188

Yeah I've just double checked that and you're right. I was thinking that one could change or add data through crates.io but its not the case, authors and description info from api comes from Cargo.toml, so I suppose the logic of saying if can't find it in intrinsic data then look at extrinsic data is quite useless. Will remove some code and rephrase docstring.

yes, rename to package_name, the other change would make for a much wider impact (than you probably want to deal with ;).

It would not. The get_loader function is not used anywhere outside of this cli.py

right, nvm then.

I was thinking of something else (iirc, there is something similar in the lister or something)

i'd prefer that renaming nonetheless though, it's more explicit.

I've renamed it to 'package_name', will do the same on the lister.

Build is green

Patch application report for D7501 (id=27693)

Rebasing onto c4c0a9b5aa...

Current branch diff-target is up to date.
Changes applied before test
commit a72c01731fe867ad951d504a81b4801f6ca97ddf
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Apr 4 17:18:51 2022 +0200

    Rust lang, Crates loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/739/ for more details.

Proposed evolution on crate lister to be consistent with the loader here :

https://forge.softwareheritage.org/D7654

Proposed evolution on crate lister to be consistent with the loader here :

https://forge.softwareheritage.org/D7654

Nice, and now the loader needs some rework to match that diff's adaptations ;)

Proposed evolution on crate lister to be consistent with the loader here :

https://forge.softwareheritage.org/D7654

Nice, and now the loader needs some rework to match that diff's adaptations ;)

Sure ;-) I'm on it.

Maybe we can close or cancel this and go on a new one?

Proposed evolution on crate lister to be consistent with the loader here :

https://forge.softwareheritage.org/D7654

Nice, and now the loader needs some rework to match that diff's adaptations ;)

Sure ;-) I'm on it.

Maybe we can close or cancel this and go on a new one?

Or, we accept it, you land it and then you propose your new version wich deals with one origin, multiple artifacts in another shorter diff.

@vlorentz does that sound good to you as well?
(I prefer this over a new diff which loses all those comments here).

That's a good start 👍

Let's land this and iterate in future diffs

This revision is now accepted and ready to land.Apr 28 2022, 10:48 AM

Proposed evolution on crate lister to be consistent with the loader here :

https://forge.softwareheritage.org/D7654

Nice, and now the loader needs some rework to match that diff's adaptations ;)

Sure ;-) I'm on it.

Maybe we can close or cancel this and go on a new one?

Or, we accept it, you land it and then you propose your new version wich deals with one origin, multiple artifacts in another shorter diff.

@vlorentz does that sound good to you as well?
(I prefer this over a new diff which loses all those comments here).

@vlorentz is fine with that plan as well.
So i'll accept it and then you can push it (make sure you rebase and update the diff first).

Updating D7501: Rust lang, Crates loader

prepare for push

This revision was landed with ongoing or failed builds.Apr 28 2022, 4:09 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7501 (id=27886)

Rebasing onto 2e27f7c7a6...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-753-D7501.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/753/ for more details.