Copy metadata on revisions to the extrinsic metadata storage
Closed, Migrated (Edits Locked)

Description

Previous runs of the loaders didn't write to the extrinsic metadata storage; they only stored metadata on the revision objects.

Event Timeline

vlorentz changed the task status from Open to Work in Progress. Aug 28 2020, 1:20 PM
vlorentz moved this task from Backlog to Work in progress on the Roadmap 2020 board.

This task will take us one step towards a searchable archive :-)

We should take a very conservative approach: I would suggest keeping the metadata on the revisions and just copying it.
This way, you don't need to distinguish between the fields that require it and those that do not.

Finally, it will be less stressful to run a script that doesn't change the archive but is very useful for the search mechanisms we want to implement on the ERMDS (Extrinsic Raw MetaData Storage).

olasd added a subscriber: olasd.

The script is now running on getty.

Tail of log:

Processed 0.46M rows (~0.2%, last revision: 0095624edf008b754fb1ed5bd656d22c63f984ff)
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1204, in <module>
    main(storage_dbconn, storage_url, deposit_dbconn, bytes.fromhex(first_id), True)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1165, in main
    handle_row(row, storage, deposit_cur, dry_run)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 975, in handle_row
    storage, row["id"], metadata["original_artifact"][0]["filename"]
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 261, in pypi_origin_from_filename
    project_name = pypi_project_from_filename(filename)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 252, in pypi_project_from_filename
    assert match, original_filename
AssertionError: pypops-201408-r4.tar.gz

I'll make the script log the revisions it's unable to process, rather than uselessly fall flat on its face.

(I've also noticed dry_run was = True, so I fixed that as well :P)
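For illustration, the "log it and carry on" behaviour described above could look roughly like the sketch below. This is not the actual code of migrate_extrinsic_metadata.py: process_rows and its arguments are hypothetical names, and only the log message format is taken from the output pasted in this task.

```python
import logging

logger = logging.getLogger(__name__)


def process_rows(rows, handle_row):
    """Call handle_row on each row; log failures instead of aborting.

    Hypothetical helper: `rows` is an iterable of dicts with a bytes "id",
    and `handle_row` is whatever callable processes a single row.
    Returns the ids of the rows that could not be processed.
    """
    failed = []
    for row in rows:
        try:
            handle_row(row)
        except Exception:
            # logger.exception logs at ERROR level and appends the traceback,
            # so the failing revision can be investigated later.
            logger.exception(
                "Could not parse revision metadata %s", row["id"].hex()
            )
            failed.append(row["id"])
    return failed
```

With something along these lines, a single unparseable filename or an unexpected metadata shape only costs one log entry, and the run continues with the next row.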

2021-04-06 20:19:19,898 __main__     ERROR    Could not parse revision metadata 00959a167bd98452c98ce73382f4b42179d53d32
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1161, in main
    handle_row(row, storage, deposit_cur, dry_run)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 979, in handle_row
    storage, row["id"], metadata["original_artifact"][0]["filename"]
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 265, in pypi_origin_from_filename
    project_name = pypi_project_from_filename(filename)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 256, in pypi_project_from_filename
    assert match, original_filename
AssertionError: pypops-201408-r4.tar.gz
2021-04-06 20:54:44,962 __main__     ERROR    Could not parse revision metadata 00c6e2fe046dee3b5ef629f74f4801345840e70a
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1161, in main
    handle_row(row, storage, deposit_cur, dry_run)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 843, in handle_row
    assert "id" in actual_metadata or "title" in actual_metadata
AssertionError

I've relaunched the latest version of the migrate_extrinsic_metadata script on getty...

Thanks @olasd for persevering. Is there an ETA for the relaunch from October 21st?