Currently, Nix and Guix are each loaded as a single origin with a huge snapshot in which each branch name is a URL. This is wrong.
We need to replace the Nixguix loader with a lister that creates one origin per source referenced by the Nix and Guix public manifests.
This would be closer to what we do with Debian/Ubuntu.
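As a rough illustration of the one-origin-per-source idea, here is a minimal Python sketch. The manifest layout (a top-level `sources` list whose entries carry `type`, `urls` and `integrity`) and the `list_origins` helper are assumptions made for this example; they are not the lister's actual API.

```python
from typing import Any, Dict, Iterator


def list_origins(manifest: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one origin per source entry instead of one origin per manifest.

    Assumed (hypothetical) manifest layout:
    {"sources": [{"type": "url", "urls": [...], "integrity": "sha256-..."}]}
    """
    for source in manifest.get("sources", []):
        urls = source.get("urls") or []
        if not urls:
            # Nothing to ingest for this entry
            continue
        yield {
            # The first URL becomes the origin URL, the rest are kept as mirrors
            "url": urls[0],
            "fallback_urls": urls[1:],
            "type": source.get("type"),
            "integrity": source.get("integrity"),
        }


# Made-up manifest, for illustration only
manifest = {
    "sources": [
        {
            "type": "url",
            "urls": ["https://example.org/foo-1.0.tar.gz"],
            "integrity": "sha256-abc",
        },
        {"type": "git", "urls": ["https://example.org/bar.git"]},
        {"type": "url", "urls": []},
    ]
}
origins = list(list_origins(manifest))
```

Each yielded dictionary would then become its own origin, with its own visits and snapshots, instead of a branch in one shared snapshot.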
Define the following (see the hedgedoc [1], which details a proposal):
- target structure sketch of the data in the archive
- origin URLs
- what kind of extrinsic metadata and/or extids we store
- what kind of snapshots we generate
Plan:
- D8341: Implement lister
- [ ] D8406, ...: Adapt archive loader (package loader) to accept tarball from nixguix manifests (cannot work [2])
- D8581: Implement ContentLoader (possibly as a core loader rather than a package loader [2]) to deal with content files with intrinsic metadata (out of nixguix manifests)
- D8584: Implement DirectoryLoader (possibly as a core loader rather than a package loader [2]) to deal with tarballs with intrinsic metadata (out of nixguix manifests)
- D8587: Update implementations ^ dealing with unsupported integrity hash (sha512)
- T3781#92605: lister run through docker
- D8601, T3781#92610: loaders run through docker (directory ok, contents ok too but they are creating mismatches due to faulty manifest integrity references)
- D8605: lister: Randomize origins order to ingest
- D8606: lister: Deal with mistyped origins
- D8607: lister: Fix expired ssl certificate
- D8611: lister: Fix connection error
- D8612: lister: Deal with pseudo url with missing schema
- D8619: lister: Deal with exotic urls so tarballs are recognized
- D8620: lister: Deal with misplaced git urls
- D8624: nixguix: Improve content type detection (those with charset were off)
- D8623: swh.core.tarball: Add missing mimetype application/x-gzip
- D8626: lister: Refactor to simplify some computations
- D8627: Make jenkins build with nix-store inside so future builds that need it run correctly
- T3781#92684: Fix mismatched computations for nixpkgs manifests -> nar hash support (impacts both lister and loader)
- D8614: lister adaptation to provide the correct information to the loaders
- D8618: {Content|Directory}Loader adaptation to be able to check this ^
- D8630: Adapt standard/nar hash mismatch computation behavior (so they fail loading)
- D8636: Content "nar" checksum computation. files with "recursive" hashOutputMode exist
- [ ] T3781#92850: P1489: P1490: hash mismatch edge cases (so far) we cannot do anything about (yet?!), see next point
- T4608: D8637: lister: Exclude faulty origins
- T4608: Notify upstream nixpkgs community about the missing information on "faulty" origins
- T4609: Notify upstream nixpkgs community about the misqualified "git" repositories as urls
- P1482: ContentLoader run in docker
- P1483: DirectoryLoader run in docker
- D8621, D8622: Deploy in docker
- P1486: Fix misqualified repositories detected as file (see pastes)
- D8757: Add support for more tarball/zip extensions
- D8758: swh.core: Wire war support (and check other tarballs are already supported)
- D8761: Harden tarball support test dataset
- D8763: lister: Add another diff to filter out irrelevant origins (.iso, .bin, ...)
- T3781#97852: Status -> further fixes (/me *sighs*)
- D8773: nixguix: Deal with edge case url with version instead of extension
- D8774: Use content-disposition
- infra/sysadm-environment#4655 Deploy in staging
- Drop no longer relevant nixguix loader
- Call for public review
- Deploy in production when ok ^
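
Several of the lister fixes above concern how a URL is normalized and routed to the right loader. The following Python sketch shows the general shape of such logic, under stated assumptions: the extension sets, the `https` default scheme, and the `normalize_url`, `classify` and `randomized` helpers are all hypothetical and much simpler than the actual lister, which also relies on content type detection and the `Content-Disposition` header.

```python
import random
from pathlib import PurePosixPath
from typing import Iterable, List, Optional
from urllib.parse import urlparse

# Illustrative extension sets, not the lister's actual configuration
TARBALL_EXTENSIONS = {".tar", ".gz", ".tgz", ".bz2", ".xz", ".zip", ".war"}
IGNORED_EXTENSIONS = {".iso", ".bin"}


def normalize_url(url: str) -> str:
    """Add a scheme to pseudo URLs such as 'example.org/foo.tar.gz'.

    Defaulting to https is an assumption made for this sketch.
    """
    return url if "://" in url else f"https://{url}"


def classify(url: str) -> Optional[str]:
    """Return 'git', 'directory' (tarball), 'content' (file) or None (skip)."""
    path = urlparse(normalize_url(url)).path
    suffix = PurePosixPath(path).suffix.lower()
    if suffix == ".git":
        # Git repository misdeclared as a plain URL
        return "git"
    if suffix in IGNORED_EXTENSIONS:
        # Irrelevant origin, filtered out
        return None
    if suffix in TARBALL_EXTENSIONS:
        return "directory"
    return "content"


def randomized(urls: Iterable[str], seed: Optional[int] = None) -> List[str]:
    """Return the URLs in random order, leaving the input untouched."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

For example, `classify("example.org/bar.git")` returns `"git"` and `classify("https://example.org/image.iso")` returns `None`. A purely extension-based check like this one cannot handle URLs that end with a version number instead of an extension, which is why the plan also includes content type and `Content-Disposition` handling.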
[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw
[2] It cannot. We may not receive any versions, and the package loader currently relies on that particular data for its main ingestion algorithm.