Support disordered directory entries in git
Closed, Migrated

Description

Some git directories don't have their entries in the right order, which is an issue for us because:

  • swh-model reorders them before checksumming
  • the postgresql storage does not preserve the relative order of file and dir entries
  • the cassandra storage does not preserve order at all (they are sorted by null-padded name)
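To illustrate why reordering is a problem, here is a minimal sketch of how a git tree id depends on entry order. It assumes the standard git tree serialization (a `tree <size>\0` header followed by `<mode> <name>\0<20-byte sha1>` per entry) and git's sort rule where directory names compare as if they ended in `/`. The helper names (`tree_manifest`, `tree_id`, `canonical_key`) are made up for this example and are not swh-model APIs.

```python
import hashlib


def tree_manifest(entries):
    """Serialize (mode, name, sha1) triples in the order given."""
    body = b"".join(
        mode + b" " + name + b"\x00" + sha1 for mode, name, sha1 in entries
    )
    return b"tree " + str(len(body)).encode() + b"\x00" + body


def tree_id(entries):
    """Hex sha1 of the serialized tree, i.e. the git object id."""
    return hashlib.sha1(tree_manifest(entries)).hexdigest()


def canonical_key(entry):
    """git compares names bytewise; directories sort as if suffixed by '/'."""
    mode, name, _ = entry
    return name + b"/" if mode == b"40000" else name


# Well-known ids of the empty blob and the empty tree.
blob = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
subtree = bytes.fromhex("4b825dc642cb6eb9a060e54bf8d69288fbee4904")

# Hypothetical tree whose entries are not in canonical order.
disordered = [
    (b"100644", b"b.txt", blob),
    (b"100644", b"a.txt", blob),
    (b"40000", b"lib", subtree),
]
canonical = sorted(disordered, key=canonical_key)

id_as_stored = tree_id(disordered)       # what the git repository claims
id_after_reorder = tree_id(canonical)    # what we get after sorting entries

print(id_as_stored)
print(id_after_reorder)
```

The two printed ids differ, which is the kind of mismatch that shows up when entries are sorted before hashing.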

Relevant issue to consider regarding Cassandra: T3582

Sentry: SWH-LOADER-GIT-QF

Event Timeline

vlorentz triaged this task as Normal priority. Sep 22 2021, 1:34 PM
vlorentz created this task.
vlorentz updated the task description.

Possible solution: store a rank along with each directory entry, but ignore it unless we are reconstructing a git object or computing a SWHID (v1?)

Complete proposal to implement the above solution:

  1. Add this to swh.model.model.DirectoryEntry (Directory.from_dict would set the rank if it's missing):

         rank = attr.ib(type=int, validator=type_validator())
         """Zero-based index of this entry in a directory."""

  2. In postgresql, change the type of dir_entries/file_entries/rev_entries from bigint[] to bigint[2][]: pairs of (directory_entry object id, rank). The migration would be:
       a. duplicate the columns and make Python write to both but read only the old ones
       b. fill the new columns (looooong)
       c. drop the old columns, rename the new ones, and make Python use them
  3. In cassandra, add a rank column to the directory_entry table. We can initialize it to 0 and make the Python code fill it in when reading (a script could also fill it, but that's not mandatory).
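The rank idea above can be sketched roughly as follows. This is only an illustration using plain dataclasses, not the actual swh.model classes: `Entry`, `with_ranks`, `storage_roundtrip` and `original_order` are hypothetical names, and the "storage" here is just a function that sorts by name to mimic a backend that loses the original order.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Stand-in for a directory entry (hypothetical, not swh.model)."""
    name: bytes
    type: str
    rank: Optional[int] = None  # zero-based index in the original directory


def with_ranks(entries: List[Entry]) -> List[Entry]:
    """Fill in missing ranks from list position, as from_dict might do."""
    return [
        e if e.rank is not None else replace(e, rank=i)
        for i, e in enumerate(entries)
    ]


def storage_roundtrip(entries: List[Entry]) -> List[Entry]:
    """Mimic a backend that does not preserve order (sorts by name)."""
    return sorted(entries, key=lambda e: e.name)


def original_order(entries: List[Entry]) -> List[Entry]:
    """Restore the order seen at load time; only needed for hashing."""
    return sorted(entries, key=lambda e: e.rank)


loaded = with_ranks([
    Entry(b"b.txt", "file"),
    Entry(b"a.txt", "file"),
    Entry(b"lib", "dir"),
])
stored = storage_roundtrip(loaded)
restored = original_order(stored)

print([e.name for e in stored])
print([e.name for e in restored])
```

Most readers would ignore `rank` entirely; it would only matter when rebuilding the git object or computing the identifier.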

I came across a rather small repository [1] which I believe raises the same issue.
So it may help to keep a reference to it, to ease testing of the improvement discussed here.
Feel free to dismiss if not that useful.

swh-loader_1                     | [2021-10-22 11:47:39,586: INFO/MainProcess] Task swh.loader.git.tasks.UpdateGitRepository[3b8b9037-f344-44e5-ab97-a3a310d9214f] received
swh-loader_1                     | [2021-10-22 11:47:39,632: INFO/ForkPoolWorker-1] Load origin 'https://github.com/technoweenie/attachment_fu' with type 'git'
swh-loader_1                     | Enumerating objects: 3549, done.
swh-loader_1                     | Total 3549 (delta 0), reused 0 (delta 0), pack-reused 3549
swh-loader_1                     | [2021-10-22 11:47:41,486: INFO/ForkPoolWorker-1] Listed 23 refs for repo https://github.com/technoweenie/attachment_fu
swh-loader_1                     | [2021-10-22 11:47:42,063: ERROR/ForkPoolWorker-1] Loading failure, updating to `failed` status
swh-loader_1                     | Traceback (most recent call last):
swh-loader_1                     |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 339, in load
swh-loader_1                     |     self.store_data()
swh-loader_1                     |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 457, in store_data
swh-loader_1                     |     for directory in self.get_directories():
swh-loader_1                     |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/git/loader.py", line 376, in get_directories
swh-loader_1                     |     yield converters.dulwich_tree_to_directory(raw_obj)
swh-loader_1                     |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/git/converters.py", line 104, in dulwich_tree_to_directory
swh-loader_1                     |     check_id(dir_)
swh-loader_1                     |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/git/converters.py", line 39, in check_id
swh-loader_1                     |     f"Expected {type(obj).__name__} hash to be {obj.id.hex()}, "
swh-loader_1                     | swh.loader.git.converters.HashMismatch: Expected Directory hash to be 6ac72a0858a5d5028d7f502de8777fbd5bdb8cae, got e23127f28dd0e1cf6e92a7b81cb9dbc53b44aa37
$ time git clone https://github.com/technoweenie/attachment_fu
Cloning into 'attachment_fu'...
remote: Enumerating objects: 1737, done.
remote: Total 1737 (delta 0), reused 0 (delta 0), pack-reused 1737
Receiving objects: 100% (1737/1737), 377.44 KiB | 3.85 MiB/s, done.
Resolving deltas: 100% (740/740), done.
git clone https://github.com/technoweenie/attachment_fu  0.05s user 0.02s system 8% cpu 0.837 total
$ du -sh attachment_fu
964K    attachment_fu
vlorentz claimed this task.

We decided to store manifests instead; see T3594#74385.
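For readers unfamiliar with the idea, one way "storing manifests" could work is sketched below. This is a guess at the general shape and not the design agreed in T3594: it assumes the raw serialized tree is kept only when re-serializing the sorted entries does not reproduce it. `StoredDirectory`, `serialize`, `sort_key` and `to_stored` are hypothetical names.

```python
import hashlib
from typing import List, NamedTuple, Optional, Tuple

Entry = Tuple[bytes, bytes, bytes]  # (mode, name, 20-byte sha1)


class StoredDirectory(NamedTuple):
    id: str
    entries: List[Entry]
    raw_manifest: Optional[bytes]  # set only for non-canonical trees


def serialize(entries: List[Entry]) -> bytes:
    """git tree serialization of the entries, in the order given."""
    body = b"".join(m + b" " + n + b"\x00" + s for m, n, s in entries)
    return b"tree " + str(len(body)).encode() + b"\x00" + body


def sort_key(entry: Entry) -> bytes:
    """git's ordering: directories compare as if their name ended in '/'."""
    mode, name, _ = entry
    return name + b"/" if mode == b"40000" else name


def to_stored(raw: bytes, entries: List[Entry]) -> StoredDirectory:
    """Keep the raw bytes only if the sorted entries cannot reproduce them."""
    canonical = sorted(entries, key=sort_key)
    keep = None if serialize(canonical) == raw else raw
    return StoredDirectory(hashlib.sha1(raw).hexdigest(), canonical, keep)


blob = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")  # empty blob
ordered = [(b"100644", b"a.txt", blob), (b"100644", b"b.txt", blob)]
disordered = list(reversed(ordered))

normal = to_stored(serialize(ordered), ordered)
odd = to_stored(serialize(disordered), disordered)

print(normal.raw_manifest is None)
print(odd.raw_manifest is not None)
```

With this shape, the id can always be recomputed: from the sorted entries in the common case, and from `raw_manifest` for the disordered directories.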