swh/lister/crates/__init__.py
Show All 14 Lines | |||||
a specific package version. | a specific package version. | ||||
As of August 2022 `Crates.io`_ lists 89013 package names for a total of 588215 released | As of August 2022 `Crates.io`_ lists 89013 package names for a total of 588215 released | ||||
versions. | versions. | ||||
Origins retrieving strategy | Origins retrieving strategy | ||||
--------------------------- | --------------------------- | ||||
A json http api to list packages from crates.io but we choose a `different strategy`_ | A json http api to list packages from crates.io exists but we choose a | ||||
in order to reduce to its bare minimum the amount of http call and bandwidth. | `different strategy`_ in order to reduce to its bare minimum the amount | ||||
We clone a git repository which contains a tree of directories whose last child folder | of http call and bandwidth. | ||||
name corresponds to the package name and contains a Cargo.toml file with some json data | |||||
to describe all existing versions of the package. | We download a `db-dump.tar.gz`_ archive which contains csv files as an export of | ||||
It takes a few seconds to clone the repository and browse it to build a full index of | the crates.io database. The crates.csv file lists package names, versions.csv lists versions | ||||
existing package and related versions. | related to package names. | ||||
The lister is incremental, so the first time it clones and browses the repository as | It takes a few seconds to download the archive and parse csv files to build a | ||||
previously described then stores the last seen commit id. | full index of existing packages and related versions. | ||||
Next time, it retrieves the list of new and changed files since last commit id and | |||||
returns new or changed package with all of their related versions. | The archive also contains a metadata.json file with a timestamp corresponding to | ||||
the date the database dump started. The database dump is automatically generated | |||||
Note that all Git related operations are done with `Dulwich`_, a Python | every 24 hours, around 02:00:00 UTC. | ||||
implementation of the Git file formats and protocols. | |||||
The lister is incremental, so the first time it downloads the db-dump.tar.gz archive as | |||||
previously described and stores the last seen database dump timestamp. | |||||
Next time, it downloads the db-dump.tar.gz again but retrieves only the list of new and | |||||
changed packages since the last seen timestamp, with all of their related versions. | |||||
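
The incremental strategy described in the new docstring could be sketched roughly as follows. This is only an illustration, not the lister's actual code: the function name ``build_index`` is hypothetical, and the column names (``id``, ``name``, ``crate_id``, ``num``, ``updated_at``) are assumptions about the csv layout. It works on csv text already extracted from the archive, so downloading and untarring are left out.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def build_index(crates_csv, versions_csv, since=None):
    """Map each package name to its version rows.

    When `since` is given, keep only packages that have at least one
    version updated after it, but return all of their versions.
    Column names are assumptions, not taken from the real dump.
    """
    # crate id -> package name
    names = {
        row["id"]: row["name"]
        for row in csv.DictReader(io.StringIO(crates_csv))
    }

    # package name -> list of version rows
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(versions_csv)):
        index[names[row["crate_id"]]].append(row)

    if since is None:
        # First run: full listing.
        return dict(index)

    # Later runs: only new or changed packages.
    return {
        name: rows
        for name, rows in index.items()
        if any(datetime.fromisoformat(r["updated_at"]) > since for r in rows)
    }
```

On the first run ``since`` would be ``None``; on later runs it would be the timestamp stored from the previous database dump.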
Page listing | Page listing | ||||
------------ | ------------ | ||||
Each page is related to one package. | Each page is related to one package. | ||||
Each line of a page corresponds to different versions of this package. | Each line of a page corresponds to different versions of this package. | ||||
The data schema for each line is: | The data schema for each line is: | ||||
* **name**: Package name | * **name**: Package name | ||||
* **version**: Package version | * **version**: Package version | ||||
* **crate_file**: Package download url | * **crate_file**: Package download url | ||||
* **checksum**: Package download checksum | * **checksum**: Package download checksum | ||||
* **yanked**: Whether the package is yanked or not | * **yanked**: Whether the package is yanked or not | ||||
* **last_update**: Iso8601 last update date computed upon git commit date of the | * **last_update**: Iso8601 last update | ||||
related Cargo.toml file | |||||
Origins from page | Origins from page | ||||
----------------- | ----------------- | ||||
The lister yields one origin per page. | The lister yields one origin per page. | ||||
The origin url corresponds to the http api url for a package, for example | The origin url corresponds to the http api url for a package, for example | ||||
"https://crates.io/api/v1/crates/{package}". | "https://crates.io/api/v1/crates/{package}". | ||||
anlambert: This needs to be updated. | |||||
Additionally we add some data set to "extra_loader_arguments": | Additionally we add some data for each version, set to "extra_loader_arguments": | ||||
* **artifacts**: Represent data about the Crates to download, following | * **artifacts**: Represent data about the Crates to download, following | ||||
:ref:`original-artifacts-json specification <extrinsic-metadata-original-artifacts-json>` | :ref:`original-artifacts-json specification <extrinsic-metadata-original-artifacts-json>` | ||||
* **crates_metadata**: To store all other interesting attributes that do not belong | * **crates_metadata**: To store all other interesting attributes that do not belong | ||||
to artifacts. For now it mainly indicates when a version is `yanked`_. | to artifacts. For now it mainly indicates when a version is `yanked`_, and the version | ||||
last_update timestamp. | |||||
Origin data example:: | Origin data example:: | ||||
{ | { | ||||
"url": "https://crates.io/api/v1/crates/rand", | "url": "https://crates.io/api/v1/crates/regex-syntax", | ||||
"artifacts": [ | "artifacts": [ | ||||
{ | { | ||||
"0.1.0": { | |||||
"checksums": { | "checksums": { | ||||
"sha256": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d", # noqa: B950 | "sha256": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944", # noqa: B950 | ||||
}, | |||||
"filename": "rand-0.1.1.crate", | |||||
"url": "https://static.crates.io/crates/rand/rand-0.1.1.crate", | |||||
"version": "0.1.1", | |||||
}, | }, | ||||
{ | "filename": "regex-syntax-0.1.0.crate", | ||||
"checksums": { | "url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate", # noqa: B950 | ||||
"sha256": "6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7", # noqa: B950 | |||||
}, | }, | ||||
"filename": "rand-0.1.2.crate", | |||||
"url": "https://static.crates.io/crates/rand/rand-0.1.2.crate", | |||||
"version": "0.1.2", | |||||
}, | }, | ||||
], | ], | ||||
"crates_metadata": [ | "crates_metadata": [ | ||||
{ | { | ||||
"version": "0.1.1", | "0.1.0": { | ||||
"last_update": "2017-11-30 03:37:17.449539", | |||||
"yanked": False, | "yanked": False, | ||||
}, | }, | ||||
{ | |||||
"version": "0.1.2", | |||||
"yanked": False, | |||||
}, | }, | ||||
], | ], | ||||
anlambert: ditto as we switched back to list. | |||||
} | }, | ||||
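
For reference while reviewing the example, here is a minimal sketch of how a page could be turned into an origin using the list-based layout. The helper name ``origin_from_page`` is hypothetical and the real lister may differ. The url format is the one quoted in the docstring, which the inline comment flags as needing an update.

```python
def origin_from_page(page):
    """Build one origin from a page, i.e. all version lines of one package.

    Each line is expected to follow the documented schema: name, version,
    crate_file, checksum, yanked, last_update.
    """
    name = page[0]["name"]
    return {
        "url": f"https://crates.io/api/v1/crates/{name}",
        "artifacts": [
            {
                "checksums": {"sha256": line["checksum"]},
                # File name is the last path segment of the download url.
                "filename": line["crate_file"].rsplit("/", 1)[-1],
                "url": line["crate_file"],
                "version": line["version"],
            }
            for line in page
        ],
        "crates_metadata": [
            {
                "version": line["version"],
                "last_update": line["last_update"],
                "yanked": line["yanked"],
            }
            for line in page
        ],
    }
```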
Running tests | Running tests | ||||
------------- | ------------- | ||||
Activate the virtualenv and run from within swh-lister directory: | Activate the virtualenv and run from within swh-lister directory: | ||||
pytest -s -vv --log-cli-level=DEBUG swh/lister/crates/tests | pytest -s -vv --log-cli-level=DEBUG swh/lister/crates/tests | ||||
Show All 14 Lines | |||||
.. _Crates.io: https://crates.io | .. _Crates.io: https://crates.io | ||||
.. _packages: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html | .. _packages: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html | ||||
.. _Rust language: https://www.rust-lang.org/ | .. _Rust language: https://www.rust-lang.org/ | ||||
.. _layout specifications: https://doc.rust-lang.org/cargo/guide/project-layout.html | .. _layout specifications: https://doc.rust-lang.org/cargo/guide/project-layout.html | ||||
.. _Cargo: https://doc.rust-lang.org/cargo/guide/why-cargo-exists.html#enter-cargo | .. _Cargo: https://doc.rust-lang.org/cargo/guide/why-cargo-exists.html#enter-cargo | ||||
.. _Cargo.toml: https://doc.rust-lang.org/cargo/reference/manifest.html | .. _Cargo.toml: https://doc.rust-lang.org/cargo/reference/manifest.html | ||||
.. _different strategy: https://crates.io/data-access | .. _different strategy: https://crates.io/data-access | ||||
.. _Dulwich: https://www.dulwich.io/ | |||||
.. _yanked: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-yank | .. _yanked: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-yank | ||||
.. _db-dump.tar.gz: https://static.crates.io/db-dump.tar.gz | |||||
""" | """ | ||||
def register(): | def register(): | ||||
from .lister import CratesLister | from .lister import CratesLister | ||||
return { | return { | ||||
"lister": CratesLister, | "lister": CratesLister, | ||||
"task_modules": ["%s.tasks" % __name__], | "task_modules": ["%s.tasks" % __name__], | ||||
} | } |