
Use a hash as id/unicity key for MetadataFetcher and MetadataAuthority
Open, High, Public

Description

These two model objects currently play two different roles: when their metadata attribute is set to None they are used as IDs (e.g. as attributes in RawExtrinsicMetadata objects), and otherwise they are actual objects.

So we should move to a different scheme. The obvious solution is to use a 2-tuple, but I think instead we should use a hash of all their fields (resp. name+version+metadata and type+url+metadata), because:

  • it would be consistent with other model objects, which use a hash of (almost) all their fields
  • it allows having different metadata for the same authority or fetcher. I don't yet see a use for authorities, but for fetchers it would allow multiple fetcher configurations with the same fetcher name+version.
  • (as a consequence of the previous point) it would be consistent with indexer "tools", which include the indexer config in the unicity key
  • it would solve T2686 as a side-effect

Event Timeline

vlorentz triaged this task as High priority. Wed, Oct 14, 2:07 PM
vlorentz created this task.
olasd added a comment. Wed, Oct 14, 3:03 PM

This line of reasoning makes sense to me.

I think we can go with an approach similar to the way we're computing the intrinsic identifiers of revisions and releases:

  • serialize the object to a line-based key/value manifest
  • hash the manifest using git-like salted sha1s
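As a rough illustration of those two steps, here is a minimal sketch using only the standard library. The manifest layout, the `metadata.` key prefix, the object type names `metadata_fetcher` and `metadata_authority`, and all function names are assumptions made for this example, not the actual swh.model format. It also assumes the metadata values are already strings.

```python
import hashlib
from typing import Mapping, Optional, Sequence, Tuple


def git_like_sha1(object_type: str, payload: bytes) -> str:
    """Return the sha1 of '<type> <length>' + NUL + payload, as git does."""
    header = f"{object_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def build_manifest(
    fields: Sequence[Tuple[str, str]], metadata: Optional[Mapping[str, str]]
) -> bytes:
    """Serialize fields, then sorted metadata entries, one 'key value' per line."""
    entries = list(fields)
    # Sort metadata keys so that dict ordering never changes the hash.
    entries += [
        (f"metadata.{key}", value) for key, value in sorted((metadata or {}).items())
    ]
    for key, value in entries:
        # This sketch does not escape line breaks, so it rejects them outright.
        if "\n" in key or "\n" in value:
            raise ValueError(f"newline in manifest entry {key!r}")
    return "".join(f"{key} {value}\n" for key, value in entries).encode("utf-8")


def fetcher_id(
    name: str, version: str, metadata: Optional[Mapping[str, str]] = None
) -> str:
    """Hypothetical id of a fetcher: hash of name + version + metadata."""
    manifest = build_manifest([("name", name), ("version", version)], metadata)
    return git_like_sha1("metadata_fetcher", manifest)


def authority_id(
    type_: str, url: str, metadata: Optional[Mapping[str, str]] = None
) -> str:
    """Hypothetical id of an authority: hash of type + url + metadata."""
    manifest = build_manifest([("type", type_), ("url", url)], metadata)
    return git_like_sha1("metadata_authority", manifest)
```

With this sketch, `fetcher_id("swh-deposit", "0.2.0", {"sword_version": "2"})` and `fetcher_id("swh-deposit", "0.2.0", {})` give different ids, which is the "multiple configurations for the same name+version" property from the description. Note that it hashes `None` and `{}` identically; a real scheme would have to decide whether those should differ.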

My main concern is the type of the metadata attribute (Optional[ImmutableDict[str, Any]]), which makes the generation of the manifest a bit tricky. The current data would allow us to turn it into an ImmutableDict[str, str]. The one exception is a metadata_fetcher entry with an integer value for sword_version, but the newer entries use strings, so we could convert that one without losing any information.

Would that work with the current metadata migration script?

15:00 guest@softwareheritage => select * from metadata_authority;
    id    │      type      │                                url                                │   metadata   
──────────┼────────────────┼───────────────────────────────────────────────────────────────────┼──────────────
        1 │ deposit_client │ https://hal.archives-ouvertes.fr/                                 │ {}
        2 │ deposit_client │ https://www.softwareheritage.org                                  │ {}
      433 │ forge          │ https://npmjs.com/                                                │ {}
   673900 │ forge          │ https://nix-community.github.io/nixpkgs-swh/sources-unstable.json │ {}
        5 │ deposit_client │ https://software.intel.com                                        │ {}
 36775450 │ registry       │ https://softwareheritage.org/                                     │ {}
 50652232 │ forge          │ https://guix.gnu.org/sources.json                                 │ {}
 60991205 │ deposit_client │ https://inria.halpreprod.archives-ouvertes.fr/                    │ {"name": ""}
      449 │ forge          │ https://pypi.org/                                                 │ {}
      206 │ deposit_client │ https://www.ipol.im/                                              │ {}
      250 │ deposit_client │ https://doi.org/10.5201/                                          │ {}
(11 rows)

Time: 27.088 ms
15:00 guest@softwareheritage => select * from metadata_fetcher;
    id    │                      name                       │ version │        metadata        
──────────┼─────────────────────────────────────────────────┼─────────┼────────────────────────
        1 │ swh-deposit                                     │ 0.0.1   │ {"sword_version": 2}
      713 │ swh.loader.package.npm.loader.NpmLoader         │ 0.8.0   │ {}
      730 │ swh.loader.package.pypi.loader.PyPILoader       │ 0.8.0   │ {}
   674180 │ swh.loader.package.nixguix.loader.NixGuixLoader │ 0.8.0   │ {}
  1461597 │ swh.loader.package.npm.loader.NpmLoader         │ 0.8.1   │ {}
  1461642 │ swh.loader.package.pypi.loader.PyPILoader       │ 0.8.1   │ {}
  2008789 │ swh.loader.package.nixguix.loader.NixGuixLoader │ 0.8.1   │ {}
 25956428 │ swh.loader.package.pypi.loader.PyPILoader       │ 0.9.1   │ {}
 25956432 │ swh.loader.package.npm.loader.NpmLoader         │ 0.9.1   │ {}
 25991606 │ swh.loader.package.nixguix.loader.NixGuixLoader │ 0.9.1   │ {}
 36775341 │ swh.loader.package.npm.loader.NpmLoader         │ 0.10.0  │ {}
 36775439 │ swh.loader.package.pypi.loader.PyPILoader       │ 0.10.0  │ {}
 37597840 │ swh.loader.package.cran.loader.CRANLoader       │ 0.10.0  │ {}
 37929744 │ swh.loader.package.nixguix.loader.NixGuixLoader │ 0.10.0  │ {}
 61029681 │ swh-deposit                                     │ 0.0.90  │ {"sword_version": "2"}
 61122976 │ swh-deposit                                     │ 0.1.0   │ {"sword_version": "2"}
 61190593 │ swh.loader.package.npm.loader.NpmLoader         │ 0.11.0  │ {}
 65980869 │ swh-deposit                                     │ 0.2.0   │ {"sword_version": "2"}
 66559381 │ swh.loader.package.npm.loader.NpmLoader         │ 0.13.1  │ {}
(19 rows)

Time: 25.172 ms
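
Given the rows above, the conversion to string-only metadata mentioned earlier could look roughly like the sketch below. The function name `normalize_metadata` is hypothetical, and accepting only str and int values is an assumption based on the data currently in these two tables. This is not the existing migration script.

```python
from typing import Any, Dict, Mapping, Optional


def normalize_metadata(metadata: Optional[Mapping[str, Any]]) -> Dict[str, str]:
    """Turn a metadata dict with str or int values into a str-to-str dict."""
    normalized: Dict[str, str] = {}
    for key, value in (metadata or {}).items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"unsupported metadata value for {key!r}: {value!r}")
        normalized[key] = str(value)
    return normalized


# The swh-deposit 0.0.1 row and the later rows end up with the same metadata.
old_row = normalize_metadata({"sword_version": 2})
new_row = normalize_metadata({"sword_version": "2"})
print(old_row == new_row)
```

Anything other than a string or an integer raises an error rather than being silently stringified, so a nested or otherwise unexpected value would be caught during migration.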