Crates.io: Add last_update for each version of a crate
Closed, Public

Authored by franckbret on Sep 13 2022, 8:28 AM.

Details

Summary

To reduce the number of HTTP API calls made by the loader, download a
crates.io database dump and parse its CSV files to get a last_update
value for each version of a crate.
Those values are sent to the loader through extra_loader_arguments
'crates_metadata'.

'artifacts' and 'crates_metadata' now use "version" as key.

Related T4104, D8171
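
A minimal sketch of the parsing step described in the summary, not the lister's actual code. It assumes the dump provides a crates.csv with `id` and `name` columns and a versions.csv with `crate_id`, `num` and `updated_at` columns. The version columns match the debug output quoted in this thread; the `id` column of crates.csv and the helper name `last_update_by_version` are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def last_update_by_version(
    crates_csv: TextIO, versions_csv: TextIO
) -> Dict[str, Dict[str, str]]:
    """Map crate name -> {version number -> updated_at}."""
    # crates.csv is assumed to expose the numeric crate id and the crate name.
    names = {row["id"]: row["name"] for row in csv.DictReader(crates_csv)}

    result: Dict[str, Dict[str, str]] = defaultdict(dict)
    for row in csv.DictReader(versions_csv):
        name = names.get(row["crate_id"])
        if name is None:
            continue  # version row without a matching crate: skip it
        result[name][row["num"]] = row["updated_at"]
    return dict(result)


# In-memory stand-ins for the two CSV files, using values from the thread.
crates = io.StringIO("id,name\n653834,colorid\n")
versions = io.StringIO(
    "crate_id,num,updated_at\n"
    "653834,0.0.1-dev.0,2022-08-28 11:43:09.333693\n"
    "653834,0.0.1,2022-09-11 12:08:17.012383\n"
)
metadata = last_update_by_version(crates, versions)
print(metadata["colorid"]["0.0.1"])
```

Keying the inner dictionary by version number mirrors the change to use "version" as key for 'crates_metadata'.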

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8454 (id=30473)

Rebasing onto 67211adb60...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit f11694d26bbac1d969848a0a2d0686692f9034e7
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

    Crates.io: Add last_update for each version of a crate
    
    In order to reduce http api call amount made by the loader, download a
    crates.io database dump, and parse its csv files to get a last_update
    value for each versions of a Crate.
    Those values are sent to the loader through extra_loader_arguments
    'crates_metadata'.
    
    'artifacts' and 'crates_metadata' now uses "version" as key.
    
    Related T4104, D8171

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/656/ for more details.

@vlorentz @ardumont Here is a new version of the crates.io lister, which loads CSV files from the crates.io database dump.

Before I adapt the loader, can you tell me if you are OK with this one?
I also need to update the lister's documentation and docstrings.

Also, I think it is doable to remove the git part of the lister entirely. The CSV files have everything we need. For the incremental part, a metadata.json file at the root of the archive contains a date and a commit hash that represent the date of the db dump.
In the incremental case, the lister can compare that date to the last update of each crate.
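
One way the incremental comparison could look, as a sketch under assumptions: the lister is assumed to keep the date of the previously processed dump, and the helper name `crates_to_list` and the second crate below are hypothetical. The timestamp format is the one visible in the CSV-derived debug output in this thread.

```python
from datetime import datetime
from typing import Dict, List

# Format of 'updated_at' values as they appear in the dump-derived debug output.
CSV_DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def crates_to_list(
    crates_updated_at: Dict[str, str], previous_dump_date: datetime
) -> List[str]:
    """Return the names of crates updated after the previously processed dump."""
    return sorted(
        name
        for name, updated_at in crates_updated_at.items()
        if datetime.strptime(updated_at, CSV_DATE_FORMAT) > previous_dump_date
    )


# Hypothetical state: the last dump that was fully listed dates from Sep 1 2022.
previous_dump_date = datetime(2022, 9, 1)
selected = crates_to_list(
    {
        "colorid": "2022-09-11 12:08:17.012383",
        "hypothetical-old-crate": "2022-08-29 15:32:43.066959",  # made-up entry
    },
    previous_dump_date,
)
print(selected)
```

`strptime` with `%f` is used rather than `datetime.fromisoformat` because some values in the dump have fewer than six fractional digits (for example `16:02:07.17775`), which older Python versions reject in `fromisoformat`.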

Some comments, assuming ardumont is fine with the design:

swh/lister/crates/lister.py
105–106

You should stream the bytes instead; as written, this allocates a full copy in memory before writing.
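
A sketch of what streaming could look like. The helper `write_chunks` is hypothetical, and the commented usage assumes the download is done with requests, which the thread does not confirm.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: Path) -> int:
    """Write an iterable of byte chunks to destination; return bytes written."""
    written = 0
    with destination.open("wb") as fd:
        for chunk in chunks:
            # Only the current chunk is held in memory by this function.
            written += fd.write(chunk)
    return written


# With requests, the chunks could come from a streamed response, e.g.:
#   with requests.get(url, stream=True) as response:
#       response.raise_for_status()
#       write_chunks(response.iter_content(chunk_size=1024 * 1024), archive_path)

# Self-contained demonstration with in-memory chunks.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "db-dump.tar.gz"
    size = write_chunks([b"abc", b"defg"], archive_path)
    content = archive_path.read_bytes()
print(size)
```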

108
122–123

doesn't need to be recursive + doesn't hurt to assert there is only one file matching each pattern.

(if there is more than one, it's a bug and should be addressed)
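
The suggestion could be sketched as follows. The helper name `find_single`, the glob pattern, and the dated directory layout are assumptions for illustration, not taken from the diff.

```python
import tempfile
from pathlib import Path


def find_single(directory: Path, pattern: str) -> Path:
    """Return the only path matching pattern, or fail loudly."""
    matches = list(directory.glob(pattern))  # glob, not rglob: no recursion
    assert len(matches) == 1, f"expected one match for {pattern!r}, got {matches}"
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    # Assumed layout: <extract dir>/<dump date>/data/crates.csv
    data_dir = Path(tmp) / "2022-09-13-020000" / "data"
    data_dir.mkdir(parents=True)
    (data_dir / "crates.csv").write_text("id,name\n")

    found = find_single(Path(tmp), "*/data/crates.csv")
    relative = found.relative_to(tmp).as_posix()
print(relative)
```

Failing on zero or several matches surfaces an unexpected archive layout immediately instead of silently picking an arbitrary file.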

125–126

Use with crates_csv_path.open() as fd etc. so we don't rely on CPython-specific behavior to avoid leaking FDs. (Not a big deal since we currently use only CPython, I just want to be safe)
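
For illustration, a small sketch of that pattern with a throwaway file; the file content is made up, and only the `with crates_csv_path.open() as fd` form comes from the comment above.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    crates_csv_path = Path(tmp) / "crates.csv"
    crates_csv_path.write_text("id,name\n653834,colorid\n")

    # The file descriptor is closed when the block exits, even on error,
    # without relying on reference counting to finalize the file object.
    with crates_csv_path.open(newline="") as fd:
        rows = list(csv.DictReader(fd))

print(fd.closed, rows[0]["name"])
```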

swh/lister/crates/lister.py
122–123

(oops, I didn't mean to write next( instead of list()

Also, I think it is doable to remove the git part of the lister entirely. The CSV files have everything we need. For the incremental part, a metadata.json file at the root of the archive contains a date and a commit hash that represent the date of the db dump.
In the incremental case, the lister can compare that date to the last update of each crate.

I also think getting rid of the git part would be a good idea.

When testing that diff in docker, I quickly got an error because the git repository contains more recent crate versions
than those extracted from the db dump; see below (for the record, I added some debug logs):

docker-swh-lister-1  | [2022-09-13 14:48:59,849: INFO/MainProcess] Task swh.lister.crates.tasks.CratesListerTask[d88bb21b-1613-4230-b4ec-5bdd5092982c] received
docker-swh-lister-1  | [2022-09-13 14:48:59,851: DEBUG/ForkPoolWorker-1] Loading config file /lister.yml
docker-swh-lister-1  | Enumerating objects: 158660, done.
Counting objects: 100% (1216/1216), done.
Compressing objects: 100% (601/601), done.
docker-swh-lister-1  | Total 158660 (delta 715), reused 1090 (delta 589), pack-reused 157444
docker-swh-lister-1  | [2022-09-13 14:51:14,659: DEBUG/ForkPoolWorker-1] Found 25 crates in crates_index
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.0', 'deps': [{'name': 'hex', 'req': '^0.4.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'rand', 'req': '^0.8.5', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}], 'cksum': '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.1', 'deps': [{'name': 'hex', 'req': '^0.4.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'rand', 'req': '^0.8.5', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}], 'cksum': '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.2', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.2', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': 'c25097f191e32ad6550e402f6c5e6fbae7115a60bfedea2a4f5351c16a286229', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,662: DEBUG/ForkPoolWorker-1] Listing crates origin completed with last commit id 81cd3beb5d62f3b898607ab5b266a856b0e9fab8
docker-swh-lister-1  | [2022-09-13 14:51:17,965: DEBUG/ForkPoolWorker-1] Successfully removed /tmp/crates.io-index directory
docker-swh-lister-1  | [2022-09-13 14:51:18,058: DEBUG/ForkPoolWorker-1] Successfully removed /tmp/crates.io-db_dump directory
docker-swh-lister-1  | [2022-09-13 14:51:18,066: ERROR/ForkPoolWorker-1] Task swh.lister.crates.tasks.CratesListerTask[d88bb21b-1613-4230-b4ec-5bdd5092982c] raised unexpected: KeyError('0.0.2')
docker-swh-lister-1  | Traceback (most recent call last):
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 451, in trace_task
docker-swh-lister-1  |     R = retval = fun(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 61, in __call__
docker-swh-lister-1  |     result = super().__call__(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 734, in __protected_call__
docker-swh-lister-1  |     return self.run(*args, **kwargs)
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/crates/tasks.py", line 14, in list_crates
docker-swh-lister-1  |     return CratesLister.from_configfile(**lister_args).run().dict()
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/pattern.py", line 127, in run
docker-swh-lister-1  |     for page in self.get_pages():
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/crates/lister.py", line 245, in get_pages
docker-swh-lister-1  |     entry["version"]
docker-swh-lister-1  | KeyError: '0.0.2'

Working only with the CSV files should guarantee crates data are consistent.
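
The mismatch behind the KeyError can be reproduced from the log above: the git index lists one more version of 'colorid' than the database dump does. A minimal illustration, with the variable names chosen here for the example:

```python
# Versions of 'colorid' listed in the git index, per the log above.
index_versions = {"0.0.1-dev.0", "0.0.1-dev.1", "0.0.1-dev.2", "0.0.1", "0.0.2"}
# Versions of 'colorid' present in the database dump, per the same log.
dump_versions = {"0.0.1-dev.0", "0.0.1-dev.1", "0.0.1-dev.2", "0.0.1"}

# Any version in the index but not in the dump triggers the failed lookup.
missing_from_dump = index_versions - dump_versions
print(missing_from_dump)  # {'0.0.2'}
```

Reading both the version list and the timestamps from the same dump removes this class of inconsistency, since there is no second source that can be ahead of the first.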

Also I think it can be doable to totally remove the GIT part of the lister. The csv files have everything we need. For the incremental part a metadata.json file at the root of the archive a date and a commit hash that represents the date of the db dump.
In incremental case the lister can compare that date to the last update of the crate.

I also think getting rid of the git part would be a good idea.

By testing that diff in docker, I quickly got an error as the git repository contains more recent crate versions
as those extracted from the db dump, see below (ftr, I added some debug logs):

docker-swh-lister-1  | [2022-09-13 14:48:59,849: INFO/MainProcess] Task swh.lister.crates.tasks.CratesListerTask[d88bb21b-1613-4230-b4ec-5bdd5092982c] received
docker-swh-lister-1  | [2022-09-13 14:48:59,851: DEBUG/ForkPoolWorker-1] Loading config file /lister.yml
docker-swh-lister-1  | Enumerating objects: 158660, done.
Counting objects: 100% (1216/1216), done.  0% (1/1216)
Compressing objects: 100% (601/601), done.:   0% (1/601)
docker-swh-lister-1  | Total 158660 (delta 715), reused 1090 (delta 589), pack-reused 157444
docker-swh-lister-1  | [2022-09-13 14:51:14,659: DEBUG/ForkPoolWorker-1] Found 25 crates in crates_index
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.0', 'deps': [{'name': 'hex', 'req': '^0.4.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'rand', 'req': '^0.8.5', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}], 'cksum': '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.1', 'deps': [{'name': 'hex', 'req': '^0.4.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'rand', 'req': '^0.8.5', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}], 'cksum': '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,660: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1-dev.2', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.3', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.1', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'vers': '0.0.2', 'deps': [{'name': 'getrandom', 'req': '^0.2', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'normal'}, {'name': 'criterion', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'nanoid', 'req': '^0.4.0', 'features': [], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}, {'name': 'uuid', 'req': '^1.1.2', 'features': ['v4', 'rng'], 'optional': False, 'default_features': True, 'target': None, 'kind': 'dev'}], 'cksum': 'c25097f191e32ad6550e402f6c5e6fbae7115a60bfedea2a4f5351c16a286229', 'features': {}, 'yanked': False, 'links': None}
docker-swh-lister-1  | [2022-09-13 14:51:14,661: DEBUG/ForkPoolWorker-1] {'name': 'colorid', 'updated_at': '2022-09-11 12:08:17.012383', 'versions': {'0.0.1-dev.0': OrderedDict([('checksum', '878b6701a5ab722ef3c30f2af1a25539c50d83c97da98941998c684b0f5c52cd'), ('crate_id', '653834'), ('crate_size', '2821'), ('created_at', '2022-08-28 11:43:09.333693'), ('downloads', '22'), ('features', '{}'), ('id', '610177'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.0'), ('published_by', '163342'), ('updated_at', '2022-08-28 11:43:09.333693'), ('yanked', 'f')]), '0.0.1': OrderedDict([('checksum', '1cb7dccc5e4128b4ebe8c46ca29e440e52bbca4daad5dcea864a74f25dcee0ce'), ('crate_id', '653834'), ('crate_size', '7999'), ('created_at', '2022-09-11 12:08:17.012383'), ('downloads', '12'), ('features', '{}'), ('id', '618675'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1'), ('published_by', '163342'), ('updated_at', '2022-09-11 12:08:17.012383'), ('yanked', 'f')]), '0.0.1-dev.2': OrderedDict([('checksum', '215f42225dffe2a135d1480662d379620445628b4bfe17aee56a20cf0d4590ce'), ('crate_id', '653834'), ('crate_size', '7547'), ('created_at', '2022-08-31 16:02:07.17775'), ('downloads', '23'), ('features', '{}'), ('id', '611995'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.2'), ('published_by', '163342'), ('updated_at', '2022-08-31 16:02:07.17775'), ('yanked', 'f')]), '0.0.1-dev.1': OrderedDict([('checksum', '2d5fb208766898bb8dcf9c2e270143b5b71b6271698d45ac86dc4d3e97ef178e'), ('crate_id', '653834'), ('crate_size', '2940'), ('created_at', '2022-08-29 15:32:43.066959'), ('downloads', '23'), ('features', '{}'), ('id', '610858'), ('license', 'MIT'), ('links', ''), ('num', '0.0.1-dev.1'), ('published_by', '163342'), ('updated_at', '2022-08-29 15:32:43.066959'), ('yanked', 'f')])}}
docker-swh-lister-1  | [2022-09-13 14:51:14,662: DEBUG/ForkPoolWorker-1] Listing crates origin completed with last commit id 81cd3beb5d62f3b898607ab5b266a856b0e9fab8
docker-swh-lister-1  | [2022-09-13 14:51:17,965: DEBUG/ForkPoolWorker-1] Successfully removed /tmp/crates.io-index directory
docker-swh-lister-1  | [2022-09-13 14:51:18,058: DEBUG/ForkPoolWorker-1] Successfully removed /tmp/crates.io-db_dump directory
docker-swh-lister-1  | [2022-09-13 14:51:18,066: ERROR/ForkPoolWorker-1] Task swh.lister.crates.tasks.CratesListerTask[d88bb21b-1613-4230-b4ec-5bdd5092982c] raised unexpected: KeyError('0.0.2')
docker-swh-lister-1  | Traceback (most recent call last):
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 451, in trace_task
docker-swh-lister-1  |     R = retval = fun(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 61, in __call__
docker-swh-lister-1  |     result = super().__call__(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 734, in __protected_call__
docker-swh-lister-1  |     return self.run(*args, **kwargs)
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/crates/tasks.py", line 14, in list_crates
docker-swh-lister-1  |     return CratesLister.from_configfile(**lister_args).run().dict()
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/pattern.py", line 127, in run
docker-swh-lister-1  |     for page in self.get_pages():
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/crates/lister.py", line 245, in get_pages
docker-swh-lister-1  |     entry["version"]
docker-swh-lister-1  | KeyError: '0.0.2'

Working only with the CSV files should guarantee crates data are consistent.
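For illustration, the KeyError above comes down to a version-keyed lookup into db dump data for a version that only the git index knows about. A minimal sketch of that mismatch, using values from the log; the helper name `last_update_for` and the variable `dump_versions` are hypothetical and not part of the lister:

```python
from typing import Dict, Optional

# Versions known to the db dump for 'colorid' (values taken from the log above).
dump_versions: Dict[str, Dict[str, str]] = {
    "0.0.1-dev.0": {"updated_at": "2022-08-28 11:43:09.333693"},
    "0.0.1-dev.1": {"updated_at": "2022-08-29 15:32:43.066959"},
    "0.0.1-dev.2": {"updated_at": "2022-08-31 16:02:07.17775"},
    "0.0.1": {"updated_at": "2022-09-11 12:08:17.012383"},
}

# The git index is more recent and already lists 0.0.2.
index_entry = {"name": "colorid", "vers": "0.0.2"}


def last_update_for(versions: Dict[str, Dict[str, str]], version: str) -> Optional[str]:
    """Return 'updated_at' for a version, or None when the dump lags behind."""
    entry = versions.get(version)  # versions[version] would raise KeyError here
    return entry["updated_at"] if entry is not None else None


missing = last_update_for(dump_versions, index_entry["vers"])
known = last_update_for(dump_versions, "0.0.1")
```

Reading both the crate list and the versions from the same dump avoids the mismatch altogether, which is the approach discussed in this thread.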

Didn't test this one on docker yet but anticipated this situation. The backup is generated every day or so, while the git repo changes every day.
One other point in favor of getting rid of git is that they frequently squash the history.

Didn't test this one on docker yet but anticipated this situation. The backup is generated every day or so, while the git repo changes every day.

Do they document at what time of day it is generated? Would be nice to run the lister right after to minimize lag

Use csv listing only

Stop relying on Git repository https://github.com/rust-lang/crates.io-index for discovering origins
State is now based on index_last_update

Build is green

Patch application report for D8454 (id=30583)

Rebasing onto f1a1b30fd1...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit aba7f646eb94c5df5a97c7e6dc0b1be96b0455ce
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/658/ for more details.

franckbret added inline comments.
swh/lister/crates/lister.py
108

Here I want to extract to this path:

PosixPath('/tmp/crates.io-db_dump/db-dump')

archive_path.stem will return "db-dump.tar"

122–123

I used rglob because the top directory of the extracted tar.gz archive is date-based, so it is different each time we download a new archive.

ipdb> tar.getmembers()
[<TarInfo '.' at 0x7f144f378e58>, <TarInfo './2022-08-08-020027' at 0x7f144d379f20>, <TarInfo './2022-08-08-020027/data' at 0x7f1446695048>, <TarInfo './2022-08-08-020027/data/crates.csv' at 0x7f14466952a0>, <TarInfo './2022-08-08-020027/data/versions.csv' at 0x7f1446695368>]

I should add a comment about that.
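As an illustration of the date-based layout, here is a small self-contained sketch that builds a fake archive with the same structure as the `tar.getmembers()` output and locates a csv file without knowing the top directory name. The helpers `build_fake_dump` and `find_csv` are hypothetical names, and the csv payload is a placeholder:

```python
import io
import tarfile
import tempfile
from pathlib import Path


def build_fake_dump(archive_path: Path, top_dir: str) -> None:
    """Write a tiny tar.gz whose layout mimics ./<date>/data/*.csv."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in ("crates.csv", "versions.csv"):
            payload = b"id\n"  # placeholder content, not real dump data
            info = tarfile.TarInfo(f"./{top_dir}/data/{name}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def find_csv(db_dump_path: Path, name: str) -> Path:
    """Locate data/<name> below the single, date-named top directory."""
    # Unpacking into a 1-tuple fails loudly if zero or several paths match.
    (path,) = db_dump_path.glob(f"*/data/{name}")
    return path


with tempfile.TemporaryDirectory() as tmpdir:
    archive_path = Path(tmpdir) / "db-dump.tar.gz"
    build_fake_dump(archive_path, "2022-08-08-020027")

    db_dump_path = Path(tmpdir) / "db-dump"
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=db_dump_path)

    crates_csv_path = find_csv(db_dump_path, "crates.csv")
    relative = crates_csv_path.relative_to(db_dump_path).as_posix()
```

A pattern anchored on `*/data/` is stricter than `rglob("*crates.csv")`, which would also match any other file whose name merely ends in `crates.csv`.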

Didn't test this one on docker yet but anticipated this situation. The backup is generated every day or so, while the git repo changes every day.

Do they document at what time of day it is generated? Would be nice to run the lister right after to minimize lag

It's generated every day.
The timestamp in metadata.json at the root of the archive is the one at which the database backup started.

It looks like it's automated and generated at the same hour; here are two different timestamps, for example:

"timestamp": "2022-08-08T02:00:27.645191645Z",

"timestamp": "2022-09-05T02:00:27.687167108Z",
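These timestamps carry nine fractional digits, which `datetime.strptime` does not accept with `%f` (it takes at most six). A possible way to parse them, sketched here with a hypothetical helper name `parse_dump_timestamp` and assuming the value always ends in `Z`:

```python
from datetime import datetime, timezone


def parse_dump_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp with up to nanosecond precision."""
    if not value.endswith("Z"):
        raise ValueError(f"expected a UTC timestamp ending in 'Z': {value!r}")
    base, _, fraction = value[:-1].partition(".")
    # datetime only stores microseconds: pad or truncate to 6 digits.
    micros = (fraction + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


first = parse_dump_timestamp("2022-08-08T02:00:27.645191645Z")
second = parse_dump_timestamp("2022-09-05T02:00:27.687167108Z")
same_start_time = (first.hour, first.minute) == (second.hour, second.minute)
```

Both sample values parse to 02:00 UTC, which is consistent with the dump being produced by a scheduled job.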

Also, I think it is doable to totally remove the Git part of the lister. The csv files have everything we need. For the incremental part, a metadata.json file at the root of the archive has a date and a commit hash that represent the date of the db dump.
In the incremental case, the lister can compare that date to the last update of the crate.

I also think getting rid of the git part would be a good idea.

By testing that diff in docker, I quickly got an error as the git repository contains more recent crate versions
than those extracted from the db dump, see below (ftr, I added some debug logs):

Working only with the CSV files should guarantee crates data are consistent.

The last commit should fix the errors.

I've tried it in a docker environment:

swh-lister_1                        | [2022-09-15 16:17:47,866: INFO/ForkPoolWorker-1] Task swh.lister.crates.tasks.CratesListerTask[4ec8663f-e304-4500-97e8-ca2b9ba88e21] succeeded in 1523.850461518974s: {'pages': 92039, 'origins': 92039}

swh-scheduler=# select count(*) from listed_origins where visit_type='crates';
 count 
-------
 92039
(1 row)

After adapting the loader, a complete run:

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates';
 count 
-------
 92039
(1 row)

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates' and last_visit_status='successful';
 count 
-------
 92017
(1 row)

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates' and last_visit_status='not_found';
 count 
-------
     0
(1 row)

swh-scheduler=# select count(*) from origin_visit_stats where visit_type='crates' and last_visit_status='failed';
 count 
-------
     6
(1 row)

The failed ones are related to missing authors entries in some toml files, which is easy to fix.

I think we can go on with this patch.

Build is green

Patch application report for D8454 (id=30588)

Rebasing onto f1a1b30fd1...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit f992594620bfaf85f5702225982259f6b291f0f0
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/659/ for more details.

swh/lister/crates/lister.py
108

Then extract_to = archive_path.with_suffix("") should do it. Forget this comment if it doesn't work either.
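For reference, a quick check of how pathlib treats the double `.tar.gz` suffix. The path below is an assumption built from the target directory mentioned earlier in this thread plus the archive file name; the loop over `Path.suffixes` is one possible way to strip both suffixes, not necessarily what the lister ends up doing:

```python
from pathlib import Path

# Assumed archive location, for illustration only.
archive_path = Path("/tmp/crates.io-db_dump/db-dump.tar.gz")

stem = archive_path.stem  # only the last suffix (".gz") is dropped
one_suffix_removed = archive_path.with_suffix("")  # still ends in ".tar"

# Strip every suffix reported by Path.suffixes (here [".tar", ".gz"]).
extract_to = archive_path
for _ in archive_path.suffixes:
    extract_to = extract_to.with_suffix("")
```

Note that this loop would strip too much from a file name that contains other dots, such as a version number, so it only suits a fixed, known archive name.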

122–123

My bad, I misunderstood rglob. Anyway, you can use this:

(crates_csv_path,) = list(db_dump_path.glob("*/data/crates.csv"))
(versions_csv_path,) = list(db_dump_path.glob("*/data/versions.csv"))

@franckbret, I added the improvements for the crates lister we discussed last week as inline comments.

I also think you could merge get_db_dumb and parse_db_dumb into a single method so you
could use a temporary directory and let Python automatically delete it, see below:

def get_and_parse_db_dump(self) -> Dict[str, Any]:
    """Download and parse csv files from db_dump_path.

    Returns a dict where each entry corresponds to a package name with its related versions.
    """

    with tempfile.TemporaryDirectory() as tmpdir:

        file_name = self.DB_DUMP_URL.split("/")[-1]
        archive_path = Path(tmpdir) / file_name

        # Download the Db dump
        with self.http_request(self.DB_DUMP_URL, stream=True) as res:
            with open(archive_path, "wb") as out_file:
                for chunk in res.iter_content(chunk_size=1024):
                    out_file.write(chunk)

        # Extract the Db dump
        db_dump_path = Path(str(archive_path).split(".tar.gz")[0])
        with tarfile.open(archive_path) as tar:
            tar.extractall(path=db_dump_path)

        csv.field_size_limit(1000000)

        crates_csv_path = list(db_dump_path.rglob("*crates.csv"))[0]
        versions_csv_path = list(db_dump_path.rglob("*versions.csv"))[0]
        index_metadata_json_path = list(db_dump_path.rglob("*metadata.json"))[0]

        with index_metadata_json_path.open("rb") as index_metadata_json:
            self.index_metadata = json.load(index_metadata_json)

        crates: Dict[str, Any] = {}
        with crates_csv_path.open() as crates_fd:
            crates_csv = csv.DictReader(crates_fd)
            for item in crates_csv:
                if self.is_new(item["updated_at"]):
                    # crate 'id' as key
                    crates[item["id"]] = {
                        "name": item["name"],
                        "updated_at": item["updated_at"],
                        "versions": {},
                    }

        data: Dict[str, Any] = {}
        with versions_csv_path.open() as versions_fd:
            versions_csv = csv.DictReader(versions_fd)
            for version in versions_csv:
                if version["crate_id"] in crates:
                    crate: Dict[str, Any] = crates[version["crate_id"]]
                    crate["versions"][version["num"]] = version
                    # crate 'name' as key
                    data[crate["name"]] = crate
        return data
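To make the join between the two csv files easier to follow, here is a reduced, self-contained sketch that runs on in-memory data. The csv columns are trimmed to a few of those visible in the debug log earlier in this thread, the incremental `is_new` filter is left out, and `join_crates_and_versions` is a hypothetical name:

```python
import csv
import io
from typing import Any, Dict, TextIO

# Trimmed sample rows; the real csv files have more columns.
CRATES_CSV = """id,name,updated_at
653834,colorid,2022-09-11 12:08:17.012383
"""

VERSIONS_CSV = """crate_id,num,updated_at,yanked
653834,0.0.1-dev.0,2022-08-28 11:43:09.333693,f
653834,0.0.1,2022-09-11 12:08:17.012383,f
"""


def join_crates_and_versions(crates_fd: TextIO, versions_fd: TextIO) -> Dict[str, Any]:
    """Group version rows under their crate, keyed by crate name."""
    crates: Dict[str, Any] = {}
    for item in csv.DictReader(crates_fd):
        # First pass: index crates by their numeric 'id'.
        crates[item["id"]] = {
            "name": item["name"],
            "updated_at": item["updated_at"],
            "versions": {},
        }

    data: Dict[str, Any] = {}
    for version in csv.DictReader(versions_fd):
        # Second pass: attach each version to its crate through 'crate_id'.
        crate = crates.get(version["crate_id"])
        if crate is None:
            continue  # version of a crate that was not selected
        crate["versions"][version["num"]] = version
        data[crate["name"]] = crate
    return data


data = join_crates_and_versions(io.StringIO(CRATES_CSV), io.StringIO(VERSIONS_CSV))
versions = data["colorid"]["versions"]
```

Since both files come from the same dump, every version that is kept has a matching crate entry, so the lookup keyed by version cannot miss.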
swh/lister/crates/lister.py
4–5

Please add a new line between license header and imports.

63

We should use the HTML page of a crate as origin URL:

CRATE_ORIGIN_URL_PATTERN = "https://crates.io/crates/{crate}"
75

to remove

84–85
return not last or last < dt
107–111

Use this instead:

with self.http_request(self.DB_DUMP_URL, stream=True) as res:
    with open(archive_path, "wb") as out_file:
        for chunk in res.iter_content(chunk_size=1024):
            out_file.write(chunk)
206
url = self.CRATE_ORIGIN_URL_PATTERN.format(crate=page[0]["name"])
209–210

Use dicts instead of lists here in order to simplify crates loader processing.

artifacts = {}
crates_metadata = {}
214–232
artifacts[entry["version"]] = {
    "filename": entry["filename"],
    "url": entry["crate_file"],
    "checksums": {
        "sha256": entry["checksum"],
    },
}

crates_metadata[entry["version"]] = {
    "yanked": entry["yanked"],
    "last_update": entry["last_update"],
}
This revision now requires changes to proceed. Sep 27 2022, 12:18 PM
swh/lister/crates/lister.py
209–210

Ignore this comment, I was not aware that we should use this format

214–232

Ignore this comment, I was not aware that we should use this format

franckbret marked 13 inline comments as done.

extra_loader_arguments "artifacts" and "crates_metadata" are now lists + some code improvements

swh/lister/crates/lister.py
209–210

switched to lists

214–232

switched to lists

extra_loader_arguments "artifacts" and "crates_metadata" are now lists + some code improvements

@anlambert will adapt to get_and_parse_db_dump in next commit

Build is green

Patch application report for D8454 (id=31145)

Rebasing onto f2377c283a...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit 1f4faed67dd7e1c4d0c26f9707f1768b56c72753
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/759/ for more details.

Merge get_db_dumb and parse_db_dumb into get_and_parse_db_dump

Build is green

Patch application report for D8454 (id=31152)

Rebasing onto f2377c283a...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit c36ee0aecfafdc933743b3013fa720d6e56d52b9
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/760/ for more details.

Build is green

Patch application report for D8454 (id=31155)

Rebasing onto f2377c283a...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit 1c23445b70816e2aa0568980fd143cab2685bec2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/761/ for more details.

Looks good to me, some minor changes to handle before I can accept it though.

swh/lister/crates/__init__.py
60–61

This needs to be updated.

74–91

Ditto, as we switched back to lists.

swh/lister/crates/lister.py
59

You can remove that variable now.

247–250

This can be removed now.

This revision now requires changes to proceed. Oct 5 2022, 3:53 PM
franckbret marked 4 inline comments as done.

Fix documentation and remove finalize cleanup and related test now that we use tempdir

Build is green

Patch application report for D8454 (id=31167)

Rebasing onto f2377c283a...

First, rewinding head to replay your work on top of it...
Applying: Crates.io: Add last_update for each version of a crate
Changes applied before test
commit 546c9c6a3e72c7b1674f485529bb1e5b7fc4d38f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/764/ for more details.

This revision is now accepted and ready to land. Oct 5 2022, 4:52 PM
This revision was landed with ongoing or failed builds. Oct 5 2022, 5:13 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D8454 (id=31171)

Rebasing onto 2e6e282d44...

Current branch diff-target is up to date.
Changes applied before test
commit 4a09f660b35aa77744d9ed7b0ded84ba253305f3
Author: Franck Bret <franck.bret@octobus.net>
Date:   Mon Sep 12 21:33:07 2022 +0200

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/766/ for more details.

LGTM, thanks !

Thanks for the review!