Currently, Nix and Guix are each loaded as a single origin with one huge snapshot in which each branch name is a URL. This is wrong.
We need to replace the Nixguix loader with a lister that creates as many origins as are referenced by the Nix and Guix public manifests.
This would be closer to what we do with Debian/Ubuntu.
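
As a rough illustration of the lister approach, the sketch below turns a manifest into one origin per referenced source. The manifest layout (`sources`, `type`, `urls`, `git_url`), the `ListedOrigin` tuple and the `tarball-or-file` visit type are assumptions made for the example; they are not the actual manifest schema or the swh.lister API.

```python
from typing import Iterator, NamedTuple


class ListedOrigin(NamedTuple):
    """Hypothetical stand-in for an origin record produced by the lister."""

    url: str
    visit_type: str


def list_origins(manifest: dict) -> Iterator[ListedOrigin]:
    """Yield one origin per source entry instead of one origin per manifest."""
    for source in manifest.get("sources", []):
        kind = source.get("type")
        if kind == "url":
            urls = source.get("urls") or []
            if urls:
                # Assumption: the first URL is the main one, the rest are mirrors.
                yield ListedOrigin(url=urls[0], visit_type="tarball-or-file")
        elif kind in ("git", "hg", "svn"):
            # Assumed field naming: "<vcs>_url", e.g. "git_url".
            url = source.get(f"{kind}_url")
            if url:
                yield ListedOrigin(url=url, visit_type=kind)


# Made-up manifest used only to exercise the sketch.
manifest = {
    "sources": [
        {"type": "url", "urls": ["https://example.org/foo-1.0.tar.gz"]},
        {"type": "git", "git_url": "https://example.org/bar.git"},
        {"type": "url", "urls": []},  # entry without URL: nothing to list
    ]
}
origins = list(list_origins(manifest))
```

With this shape, each tarball, file or repository becomes its own origin with its own visits, rather than a branch in a single snapshot.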
Define the following (see the HedgeDoc pad [1], which details a proposal):
- [x] target structure sketch of the data in the archive
- [x] define origin urls
- [x] what kind of extrinsic metadata and/or extids are we storing
- [x] what kind of snapshots we're generating
Plan:
- [x] D8341: Implement lister
~~- [ ] D8406, ...: Adapt archive loader (package loader) to accept tarball from nixguix manifests~~ (cannot work [2])
- [x] D8581: Implement ContentLoader (possibly as a ~~package~~ [2] core loader) to deal with content files with intrinsic metadata (out of nixguix manifests)
- [x] D8584: Implement DirectoryLoader (possibly as a ~~package~~ [2] core loader) to deal with tarballs with intrinsic metadata (out of nixguix manifests)
- [x] D8587: Update implementations ^ dealing with unsupported integrity hash (sha512)
- [x] T3781#92605: lister run through docker
- [x] D8601, T3781#92610: loaders run through docker (directory ok, contents ok too but they are creating mismatches due to faulty manifest integrity references)
- [x] D8605: lister: Randomize origins order to ingest
- [x] D8606: lister: Deal with mistyped origins
- [x] D8607: lister: Fix expired ssl certificate
- [x] D8611: lister: Fix connection error
- [x] D8612: lister: Deal with pseudo url with missing schema
- [x] D8619: lister: Deal with exotic urls so tarballs are recognized
- [x] D8620: lister: Deal with misplaced git urls
- [x] D8624: nixguix: Improve content type detection (those with charset were off)
- [x] D8623: swh.core.tarball: Add missing mimetype application/x-gzip
- [x] D8626: lister: Refactor to simplify some computations
- [x] D8627: Make Jenkins build with nix-store inside so future builds that need it run correctly
- [x] T3781#92684: Fix mismatched computations for nixpkgs manifests -> nar hash support (impacts both lister and loader)
- [x] D8614: lister adaptation to provide the correct information to the loaders
- [x] D8618: {Content|Directory}Loader adaptation to be able to check this ^
- [x] D8630: Adapt standard/nar hash mismatch computation behavior (so they fail loading)
- [x] D8636: Content "nar" checksum computation. Files with "recursive" outputHashMode exist
~~- [ ] T3781#92850: P1489: P1490: hash mismatch edge cases (so far) we cannot do anything about (yet?!),~~ see next point
- [x] T4608: D8637: lister: Exclude faulty origins
- [x] T4608: Notify upstream nixpkgs community about the missing information on "faulty" origins
- [x] T4609: Notify upstream nixpkgs community about the misqualified "git" repositories as urls
- [x] P1482: ContentLoader run in docker
- [x] P1483: DirectoryLoader run in docker
- [x] D8621, D8622: Deploy in docker
- [x] P1486: Fix misqualified repositories detected as file (see pastes)
- [x] P1487: Contents
- [x] P1488: Directories
- [x] D8757: Add support for more tarball/zip extensions
- [x] D8758: swh.core: Wire war support (and check other tarballs are already supported)
- [x] D8761: Harden tarball support test dataset
- [x] D8763: lister: Add another diff to filter out irrelevant origins (.iso, .bin, ...)
- [x] T3781#97852: Status -> further fixes (/me *sighs*)
- [ ] D8773: nixguix: Deal with edge case url with version instead of extension
- [ ] D8774: Use content-disposition
- [ ] Deploy in staging
- [ ] Drop no longer relevant nixguix loader
- [ ] Call for public review
- [ ] Deploy in production when ok ^
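
Several items in the plan above revolve around integrity checks (D8587, D8630). The sketch below shows one way a loader could verify a downloaded artifact against a manifest integrity value. It assumes the value is an SRI-style string (`<algo>-<base64 digest>`) and that only sha256 and sha512 are accepted; the function names are hypothetical, and "nar" checksums (D8636), which hash a Nix archive serialization rather than the raw bytes, are not covered.

```python
import base64
import hashlib
from typing import Tuple

# Assumption for this sketch: only these algorithms are accepted.
SUPPORTED_ALGOS = {"sha256", "sha512"}


def parse_integrity(integrity: str) -> Tuple[str, str]:
    """Split an SRI-style '<algo>-<base64 digest>' string into (algo, hex digest)."""
    algo, sep, b64_digest = integrity.partition("-")
    if not sep or algo not in SUPPORTED_ALGOS:
        raise ValueError(f"Unsupported integrity value: {integrity!r}")
    return algo, base64.b64decode(b64_digest).hex()


def check_integrity(data: bytes, integrity: str) -> None:
    """Raise ValueError when the data does not match the expected checksum."""
    algo, expected = parse_integrity(integrity)
    actual = hashlib.new(algo, data).hexdigest()
    if actual != expected:
        raise ValueError(f"{algo} mismatch: expected {expected}, got {actual}")


# Usage: build a valid integrity value for some bytes, then verify it.
data = b"hello"
integrity = "sha512-" + base64.b64encode(hashlib.sha512(data).digest()).decode()
check_integrity(data, integrity)  # passes silently
```

Raising on mismatch matches the intent of D8630, where a hash mismatch makes the load fail instead of silently archiving unverified data.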
[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw
[2] It cannot. The manifests may not provide any version, and the package loader currently relies on that particular data for its main ingestion algorithm.
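
For reference, the URL-related lister fixes listed in the plan (D8612, D8624, D8763) can be pictured with the following sketch. It is illustrative only: the extension lists, the `https://` default for URLs without a scheme, and the function names are assumptions, and it classifies by extension alone without querying the server.

```python
from urllib.parse import urlparse

# Illustrative extension lists, not the ones used by the actual lister.
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip", ".war")
IGNORED_EXTENSIONS = (".iso", ".bin")


def normalize_url(url: str) -> str:
    """Prepend a scheme to pseudo URLs such as 'example.org/foo.tar.gz'."""
    # Assumption: https is a reasonable default when the scheme is missing.
    return url if "://" in url else f"https://{url}"


def main_mimetype(content_type: str) -> str:
    """Drop parameters such as '; charset=utf-8' from a Content-Type value."""
    return content_type.split(";", 1)[0].strip().lower()


def classify(url: str) -> str:
    """Return 'skip', 'directory' (tarball) or 'content' (single file)."""
    path = urlparse(normalize_url(url)).path.lower()
    if path.endswith(IGNORED_EXTENSIONS):
        return "skip"
    if path.endswith(TARBALL_EXTENSIONS):
        return "directory"
    return "content"


kinds = [
    classify(u)
    for u in (
        "example.org/pkg-1.0.tar.gz",
        "https://example.org/disk.iso",
        "https://example.org/fix.patch",
    )
]
```

Here "directory" and "content" stand for the origins handled by the DirectoryLoader and the ContentLoader respectively. URLs ending with a version rather than an extension (D8773) would fall through to "content" in this sketch and need extra handling.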