
Crates.io: Add last_update for each version of a crate
Needs RevisionPublic

Authored by franckbret on Tue, Sep 13, 8:28 AM.

Details

Reviewers
anlambert
Group Reviewers
Reviewers
Summary

To reduce the number of HTTP API calls made by the loader, download a
crates.io database dump and parse its CSV files to get a last_update
value for each version of a crate.
Those values are sent to the loader through the extra_loader_arguments
entry 'crates_metadata'.

'artifacts' and 'crates_metadata' now use "version" as key.

Related T4104, D8171

Diff Detail

Repository
rDLS Listers
Branch
crates-incremental
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31580
Build 49393: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
Build 49392: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8454 (id=30473)

Rebasing onto 67211adb60...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit f11694d26bbac1d969848a0a2d0686692f9034e7
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

    Crates.io: Add last_update for each version of a crate
    
    In order to reduce http api call amount made by the loader, download a
    crates.io database dump, and parse its csv files to get a last_update
    value for each versions of a Crate.
    Those values are sent to the loader through extra_loader_arguments
    'crates_metadata'.
    
    'artifacts' and 'crates_metadata' now uses "version" as key.
    
    Related T4104, D8171

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/656/ for more details.

@vlorentz @ardumont Here is a new version of the crates.io lister, which loads CSV files from the crates.io database dump.

Before I adapt the loader, can you tell me if you are OK with this one?
I still need to update the documentation and docstring of the lister too.

Also, I think it is doable to remove the Git part of the lister entirely. The CSV files have everything we need. For the incremental part, a metadata.json file at the root of the archive contains a date and a commit hash that identify when the DB dump was made.
In the incremental case, the lister can compare that date to the last update of the crate.
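
A minimal sketch of that incremental comparison, assuming crate rows carry an `updated_at` string in the format seen in the dump and that the date of the previously processed dump is already available as a `datetime`. The function name and the idea of storing the previous dump date are illustrative assumptions, not the lister's actual code:

```python
from datetime import datetime
from typing import Dict, Iterable, List

# Timestamp layout of the CSV rows, e.g. "2022-09-11 12:08:17.012383".
# %f accepts one to six fractional digits, so "16:02:07.17775" parses too.
CSV_DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def crates_updated_since(
    crates: Iterable[Dict[str, str]], previous_dump: datetime
) -> List[Dict[str, str]]:
    """Keep only the crates updated after the previously processed dump."""
    return [
        crate
        for crate in crates
        if datetime.strptime(crate["updated_at"], CSV_DATE_FORMAT) > previous_dump
    ]
```

With this shape, a full listing is just the case where no previous dump date exists and every crate is kept.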

Some comments, assuming ardumont is fine with the design:

swh/lister/crates/lister.py
120–121

you should actually stream the bytes; this causes a full copy to be allocated in memory before writing
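
For instance, something along these lines would do. It is only a sketch: it assumes the download is exposed as a binary file-like object (such as the object returned by `urllib.request.urlopen`), and the helper name and chunk size are made up for illustration:

```python
import shutil
from pathlib import Path
from typing import BinaryIO


def stream_to_file(
    source: BinaryIO, destination: Path, chunk_size: int = 64 * 1024
) -> int:
    """Copy source to destination chunk by chunk and return the bytes written."""
    with destination.open("wb") as fd:
        # copyfileobj reads at most chunk_size bytes at a time, so the whole
        # archive is never held in memory at once.
        shutil.copyfileobj(source, fd, chunk_size)
    return destination.stat().st_size
```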

123
137–138

doesn't need to be recursive + doesn't hurt to assert there is only one file matching each pattern.

(if there is more than one, it's a bug and should be addressed)
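
A sketch of what I have in mind, assuming the archive extracts to a single dated directory containing `data/crates.csv` (that layout and the helper name are assumptions on my side):

```python
from pathlib import Path


def find_single_file(directory: Path, pattern: str) -> Path:
    """Return the only path under directory matching a non-recursive pattern."""
    # A pattern without "**" only descends the levels it spells out.
    matches = list(directory.glob(pattern))
    assert len(matches) == 1, f"expected one match for {pattern!r}, got {matches}"
    return matches[0]
```

Called as `find_single_file(dump_dir, "*/data/crates.csv")`, it fails loudly if the dump layout ever changes instead of silently picking one of several files.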

140–141

Use with crates_csv_path.open() as fd etc. so we don't rely on CPython-specific behavior to avoid leaking FDs. (Not a big deal since we currently use only CPython, I just want to be safe)
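
Roughly like this (a sketch only; the function name is hypothetical and the real code may want to iterate lazily rather than build a list):

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_csv_rows(csv_path: Path) -> List[Dict[str, str]]:
    """Load all rows of a CSV file, closing the file deterministically."""
    # The with block closes the descriptor on exit, whatever the interpreter.
    # newline="" is what the csv module recommends for files it reads.
    with csv_path.open(newline="") as fd:
        return list(csv.DictReader(fd))
```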

swh/lister/crates/lister.py
137–138

(Oops, I wrote next(...) where I meant list(...).)

Also, I think it is doable to remove the Git part of the lister entirely. The CSV files have everything we need. For the incremental part, a metadata.json file at the root of the archive contains a date and a commit hash that identify when the DB dump was made.
In the incremental case, the lister can compare that date to the last update of the crate.

I also think getting rid of the git part would be a good idea.

When testing that diff in Docker, I quickly got an error because the Git repository contains more recent crate versions
than those extracted from the DB dump, see below (FTR, I added some debug logs):

docker-swh-lister-1  | [2022-09-13 14:48:59,849: INFO/MainProcess] Task swh.lister.crates.tasks.CratesListerTask[d88bb21b-1613-4230-b4ec-5bdd5092982c] received
docker-swh-lister-1  | [2022-09-13 14:48:59,851: DEBUG/ForkPoolWorker-1] Loading config file /lister.yml
docker-swh-lister-1  | Enumerating objects: 158660, done.
Counting objects: 100% (1216/1216), done.
Compressing objects: 100% (601/601), done.
docker-swh-lister-1  | Total 158660 (delta 715), reused 1090 (delta 589), pack-reused 157444
docker-swh-lister-1  | [2022-09-13 14:51:14,659: DEBUG/ForkPoolWorker-1] Found 25 crates in crates_index
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.0', 'deps': [{'name': 'hex', 'req': '^0.4.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'rand', 'req': '^0.8.5', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}], 'cksum': '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.1', 'deps': [{'name': 'hex', 'req': '^0.4.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'rand', 'req': '^0.8.5', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}], 'cksum': '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.2', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.2', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': 'c25097f191e32ad6550e402f6c5e6fbae7115a60bfedea2a4f5351c16a286229', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,662: DEBUG/ForkPoolWorker-1] Listing crates origin completed with last commit id 81cd3beb5d62f3b898607ab5b266a856b0e9fab8
docker-swh-lister-1  | [2022-09-13 14:51:17,965: DEBUG/ForkPoolWorker-1] Successfully removed /tmp/crates.io-index directory
docker-swh-lister-1  | [2022-09-13 14:51:18,058: DEBUG/ForkPoolWorker-1] Successfully removed /tmp/crates.io-db_dump directory
docker-swh-lister-1  | [2022-09-13 14:51:18,066: ERROR/ForkPoolWorker-1] Task swh.lister.crates.tasks.CratesListerTask[d88bb21b-1613-4230-b4ec-5bdd5092982c] raised unexpected: KeyError('0.0.2')
docker-swh-lister-1  | Traceback (most recent call last):
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 451, in trace_task
docker-swh-lister-1  |     R = retval = fun(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 61, in __call__
docker-swh-lister-1  |     result = super().__call__(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 734, in __protected_call__
docker-swh-lister-1  |     return self.run(*args, **kwargs)
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/crates/tasks.py", line 14, in list_crates
docker-swh-lister-1  |     return CratesLister.from_configfile(**lister_args).run().dict()
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/pattern.py", line 127, in run
docker-swh-lister-1  |     for page in self.get_pages():
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/crates/lister.py", line 245, in get_pages
docker-swh-lister-1  |     entry["version"]
docker-swh-lister-1  | KeyError: '0.0.2'

Working only with the CSV files should guarantee crates data are consistent.
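
To illustrate, here is a sketch that derives every version and its last update from the dump alone, so a version absent from the dump is never looked up (which is what raised the KeyError above). The columns `crate_id`, `num`, `checksum`, `updated_at` and `name` appear in the debug output; the `id` column of crates.csv and the function name are assumptions:

```python
import csv
from collections import defaultdict
from typing import Dict, TextIO

VersionInfo = Dict[str, str]


def versions_by_crate(
    crates_csv: TextIO, versions_csv: TextIO
) -> Dict[str, Dict[str, VersionInfo]]:
    """Map crate name -> version number -> checksum and last_update."""
    # Assumed: crates.csv has an "id" column matching versions.csv "crate_id".
    names = {row["id"]: row["name"] for row in csv.DictReader(crates_csv)}
    result: Dict[str, Dict[str, VersionInfo]] = defaultdict(dict)
    for row in csv.DictReader(versions_csv):
        crate_name = names[row["crate_id"]]
        result[crate_name][row["num"]] = {
            "checksum": row["checksum"],
            "last_update": row["updated_at"],
        }
    return dict(result)
```

Since both files come from the same dump, the set of versions and their metadata cannot drift apart the way the Git index and the dump did here.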

I didn't test this one on Docker yet, but I expected this situation. The backup is generated every day or so, while the Git repo changes every day.
One other point in favor of getting rid of git is that they frequently squash the history.

I didn't test this one on Docker yet, but I expected this situation. The backup is generated every day or so, while the Git repo changes every day.

Do they document at what time of day it is generated? Would be nice to run the lister right after to minimize lag

Use csv listing only

Stop relying on Git repository https://github.com/rust-lang/crates.io-index for discovering origins
State is now based on index_last_update

Build is green

Patch application report for D8454 (id=30583)

Rebasing onto f1a1b30fd1...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit aba7f646eb94c5df5a97c7e6dc0b1be96b0455ce
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200


See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/658/ for more details.

franckbret added inline comments.
swh/lister/crates/lister.py
123

Here I want to extract to this path:

PosixPath('/tmp/crates.io-db_dump/db-dump')

archive_path.stem will return "db-dump.tar"

137–138

I used rglob because the top directory of the tar.gz extracted archive is date based so it is different each time we download a new archive.

ipdb> tar.getmembers()
[<TarInfo '.' at 0x7f144f378e58>, <TarInfo './2022-08-08-020027' at 0x7f144d379f20>, <TarInfo './2022-08-08-020027/data' at 0x7f1446695048>, <TarInfo './2022-08-08-020027/data/crates.csv' at 0x7f14466952a0>, <TarInfo './2022-08-08-020027/data/versions.csv' at 0x7f1446695368>]

I should add a comment about that.

I didn't test this one on Docker yet, but I expected this situation. The backup is generated every day or so, while the Git repo changes every day.

Do they document at what time of day it is generated? Would be nice to run the lister right after to minimize lag

It's generated every day.
The timestamp in metadata.json at the root of the archive is the time at which the database backup started.

It looks like it's automated and generated at the same hour; here are two different timestamps, for example:

"timestamp": "2022-08-08T02:00:27.645191645Z",

"timestamp": "2022-09-05T02:00:27.687167108Z",

Also, I think it is doable to remove the Git part of the lister entirely. The CSV files have everything we need. For the incremental part, the metadata.json file at the root of the archive contains a commit hash and a date, the date being that of the db dump.
In the incremental case, the lister can compare that date to the last update of each crate.
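To make the incremental comparison concrete, here is a minimal sketch of how the two timestamp formats seen in this thread could be parsed and compared. The function names are hypothetical (they are not the lister's actual methods), and treating the CSV `updated_at` values as UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_dump_timestamp(value: str) -> datetime:
    """Parse a metadata.json style timestamp such as
    '2022-09-05T02:00:27.687167108Z' (nanosecond precision)."""
    head, _, frac = value.rstrip("Z").partition(".")
    # strptime's %f accepts at most 6 digits, so truncate to microseconds
    frac = (frac or "0")[:6]
    parsed = datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def parse_csv_timestamp(value: str) -> datetime:
    """Parse a CSV style timestamp such as '2022-08-31 16:02:07.17775'.

    Assumption: the CSV values are expressed in UTC.
    """
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in value else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def is_new(updated_at: str, last_update: Optional[datetime]) -> bool:
    """Return True when a crate changed after the previous listing.

    With no previous state (first, full listing) everything is new.
    """
    return last_update is None or parse_csv_timestamp(updated_at) > last_update


previous_dump = parse_dump_timestamp("2022-09-05T02:00:27.687167108Z")
print(is_new("2022-09-11 12:08:17.012383", previous_dump))
print(is_new("2022-08-31 16:02:07.17775", previous_dump))
```

Using `strptime` rather than `datetime.fromisoformat` is deliberate here: the CSV values do not always carry six fractional digits, which older Python versions reject in `fromisoformat`.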

I also think getting rid of the git part would be a good idea.

By testing that diff in Docker, I quickly got an error because the Git repository contains more recent crate versions
than those extracted from the db dump, see below (FTR, I added some debug logs):

Working only with the CSV files should guarantee crates data are consistent.

The last commit should fix the errors.

I've tried in docker environment:

swh-lister_1                        | [2022-09-15 16:17:47,866: INFO/ForkPoolWorker-1] Task swh.lister.crates.tasks.CratesListerTask[4ec8663f-e304-4500-97e8-ca2b9ba88e21] succeeded in 1523.850461518974s: {'pages': 92039, 'origins': 92039}

swh-scheduler=# select count(*) from listed_origins where visit_type='crates';
 count 
-------
 92039
(1 row)

After adapting the loader, a complete run:

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates';
 count 
-------
 92039
(1 row)

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates' and last_visit_status='successful';
 count 
-------
 92017
(1 row)

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates' and last_visit_status='not_found';
 count 
-------
     0
(1 row)

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates' and last_visit_status='failed';
 count 
-------
     6
(1 row)

The failed ones are related to a missing authors entry in some toml files, which is easy to fix.

I think we can go on with this patch.

Build is green

Patch application report for D8454 (id=30588)

Rebasing onto f1a1b30fd1...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit f992594620bfaf85f5702225982259f6b291f0f0
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200


See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/659/ for more details.

swh/lister/crates/lister.py
123

Then extract_to = archive_path.with_suffix("") should do it. Forget this comment if it doesn't either
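For reference, a small sketch of how pathlib treats a double extension such as `.tar.gz`. The path is the one quoted in this thread; `strip_archive_suffix` is a hypothetical helper, not something in the lister.

```python
from pathlib import PurePosixPath

archive_path = PurePosixPath("/tmp/crates.io-db_dump/db-dump.tar.gz")

# Both .stem and .with_suffix("") only drop the last extension (".gz")
print(archive_path.stem)             # name without ".gz"
print(archive_path.with_suffix(""))  # full path without ".gz"


def strip_archive_suffix(path: PurePosixPath, suffix: str = ".tar.gz") -> PurePosixPath:
    """Hypothetical helper: drop a known multi-part suffix from the file name."""
    if path.name.endswith(suffix):
        return path.with_name(path.name[: -len(suffix)])
    return path


extract_to = strip_archive_suffix(archive_path)
print(extract_to)
```

Stripping an explicit `.tar.gz` rather than looping over every suffix avoids mangling file names that contain other dots.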

137–138

My bad, I misunderstood rglob. Anyway, you can use this:

(crates_csv_path,) = db_dump_path.glob("*/data/crates.csv")
(versions_csv_path,) = db_dump_path.glob("*/data/versions.csv")
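As an illustration of the single-element unpacking idea, here is a self-contained sketch against a fake extracted dump. The directory layout mirrors the `tar.getmembers()` output quoted earlier in this thread; `make_fake_dump` and `find_dump_files` are hypothetical names used only for this example.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def make_fake_dump(root: Path, stamp: str) -> None:
    """Create <root>/<stamp>/data/{crates,versions}.csv, like an extracted dump."""
    data_dir = root / stamp / "data"
    data_dir.mkdir(parents=True)
    (data_dir / "crates.csv").write_text("id,name\n")
    (data_dir / "versions.csv").write_text("crate_id,num\n")


def find_dump_files(db_dump_path: Path) -> Tuple[Path, Path]:
    """Locate the CSV files under the date-based top-level directory.

    Unpacking into a one-element tuple raises ValueError when the glob
    matches zero or several files, instead of silently picking one.
    """
    (crates_csv_path,) = db_dump_path.glob("*/data/crates.csv")
    (versions_csv_path,) = db_dump_path.glob("*/data/versions.csv")
    return crates_csv_path, versions_csv_path


with tempfile.TemporaryDirectory() as tmpdir:
    make_fake_dump(Path(tmpdir), "2022-08-08-020027")
    crates_csv, versions_csv = find_dump_files(Path(tmpdir))
    print(crates_csv.relative_to(tmpdir))
    print(versions_csv.relative_to(tmpdir))
```

Compared with `list(...rglob(...))[0]`, this form fails loudly if the archive layout ever changes or a stale extraction is left next to the new one.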

@franckbret, I added the improvements for the crates lister we discussed last week as inline comments.

I also think you could merge the get_db_dump and parse_db_dump methods into a single one so you
could use a temporary directory and let Python automatically delete it, see below:

def get_and_parse_db_dump(self) -> Dict[str, Any]:
    """Download and parse csv files from db_dump_path.

    Returns a dict where each entry corresponds to a package name with its related versions.
    """

    with tempfile.TemporaryDirectory() as tmpdir:

        file_name = self.DB_DUMP_URL.split("/")[-1]
        archive_path = Path(tmpdir) / file_name

        # Download the Db dump
        with self.http_request(self.DB_DUMP_URL, stream=True) as res:
            with open(archive_path, "wb") as out_file:
                for chunk in res.iter_content(chunk_size=1024):
                    out_file.write(chunk)

        # Extract the Db dump
        db_dump_path = Path(str(archive_path).split(".tar.gz")[0])
        with tarfile.open(archive_path) as tar:
            tar.extractall(path=db_dump_path)

        csv.field_size_limit(1000000)

        crates_csv_path = list(db_dump_path.rglob("*crates.csv"))[0]
        versions_csv_path = list(db_dump_path.rglob("*versions.csv"))[0]
        index_metadata_json_path = list(db_dump_path.rglob("*metadata.json"))[0]

        with index_metadata_json_path.open("rb") as index_metadata_json:
            self.index_metadata = json.load(index_metadata_json)

        crates: Dict[str, Any] = {}
        with crates_csv_path.open() as crates_fd:
            crates_csv = csv.DictReader(crates_fd)
            for item in crates_csv:
                if self.is_new(item["updated_at"]):
                    # crate 'id' as key
                    crates[item["id"]] = {
                        "name": item["name"],
                        "updated_at": item["updated_at"],
                        "versions": {},
                    }

        data: Dict[str, Any] = {}
        with versions_csv_path.open() as versions_fd:
            versions_csv = csv.DictReader(versions_fd)
            for version in versions_csv:
                if version["crate_id"] in crates:
                    crate: Dict[str, Any] = crates[version["crate_id"]]
                    crate["versions"][version["num"]] = version
                    # crate 'name' as key
                    data[crate["name"]] = crate
        return data
swh/lister/crates/lister.py
4

Please add a new line between license header and imports.

64

We should use the HTML page of a crate as origin URL:

CRATE_ORIGIN_URL_PATTERN = "https://crates.io/crates/{crate}"
77

to remove

100–103
return not last or (last is not None and last < dt)
116–120

Use this instead:

with self.http_request(self.DB_DUMP_URL, stream=True) as res:
    with open(archive_path, "wb") as out_file:
        for chunk in res.iter_content(chunk_size=1024):
            out_file.write(chunk)
220
url = self.CRATE_ORIGIN_URL_PATTERN.format(crate=page[0]["name"])
223–224

Use dicts instead of lists here in order to simplify crates loader processing.

artifacts = {}
crates_metadata = {}
229–245
artifacts[f"{entry['version']}"] = {
    "filename": entry["filename"],
    "url": entry["crate_file"],
    "checksums": {
        "sha256": entry["checksum"],
    },
}

crates_metadata[f"{entry['version']}"] = {
    "yanked": entry["yanked"],
    "last_update": entry["last_update"],
}
This revision now requires changes to proceed. Tue, Sep 27, 12:18 PM
swh/lister/crates/lister.py
223–224

Ignore this comment, I was not aware that we should use this format

229–245

Ignore this comment, I was not aware that we should use this format