Currently, Nix and Guix are each loaded as a single origin with one huge snapshot in which every branch name is a URL. This is wrong.
We need to replace the Nixguix loader with a lister that creates one origin per source referenced by the Nix and Guix public manifests.
This would be closer to what we do with Debian/Ubuntu.
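As a rough illustration of the "one origin per manifest source" idea, here is a minimal sketch. The manifest shape (a `sources` list whose entries carry `urls` and `integrity`), the helper name `iter_origins`, and the example URLs and integrity placeholders are assumptions made for this example, not the actual lister implementation.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_origins(manifest: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield one (origin_url, integrity) pair per source in the manifest.

    Hypothetical helper: the first URL of each source is used as the
    origin URL, and sources without any URL are skipped.
    """
    for source in manifest.get("sources", []):
        urls = source.get("urls") or []
        if not urls:
            continue  # nothing to ingest for this entry
        yield urls[0], source.get("integrity", "")


# Assumed manifest shape; the integrity values are placeholders, not real digests.
manifest = {
    "sources": [
        {
            "type": "url",
            "urls": ["https://example.org/foo-1.0.tar.gz"],
            "integrity": "sha256-AAAA",
        },
        {"type": "url", "urls": [], "integrity": "sha256-BBBB"},
        {
            "type": "url",
            "urls": [
                "https://example.org/bar.patch",
                "https://mirror.example.org/bar.patch",
            ],
            "integrity": "sha256-CCCC",
        },
    ]
}

origins = list(iter_origins(manifest))
```

Each yielded pair would then become its own origin, instead of a branch in a single Nix or Guix snapshot.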
Define the following (see the HedgeDoc pad [1], which details a proposal):
- [x] target structure sketch of the data in the archive
- [x] define origin urls
- [x] what kind of extrinsic metadata and/or extids are we storing
- [x] what kind of snapshots we're generating
Plan:
- [x] D8341: Implement lister
~~- [ ] D8406, ...: Adapt archive loader (package loader) to accept tarball from nixguix manifests~~ (cannot work [2])
- [x] D8581: Implement ContentLoader (possibly as a ~~package~~ [2] core loader) to deal with content files with intrinsic metadata (out of nixguix manifests)
- [x] D8584: Implement DirectoryLoader (possibly as a ~~package~~ [2] core loader) to deal with tarballs with intrinsic metadata (out of nixguix manifests)
- [x] D8587: Update implementations ^ to deal with the unsupported integrity hash (sha512)
- [x] T3781#92605: lister run through docker
- [x] D8601, T3781#92610: loaders run through docker (directories ok; contents ok too, but they create mismatches due to faulty integrity references in the manifests)
- [x] D8605: lister: Randomize origins order to ingest
- [x] D8606: lister: Deal with mistyped origins
- [x] D8607: lister: Fix expired ssl certificate
- [x] D8611: lister: Fix connection error
- [x] D8612: lister: Deal with pseudo url with missing schema
- [ ] D8619: Deal with exotic urls so tarballs are recognized
- [ ] D8620: Deal with misplaced git urls
- [ ] T3781#92684: Fix mismatched computations for nixpkgs manifests -> nar hash support (impacts both lister and loader)
- [x] D8614: lister adaptation to provide the correct information to the loaders
- [ ] D8618: {Content|Directory}Loader adaptation to be able to check this ^
- [ ] Deploy in staging
- [ ] Drop no longer relevant nixguix loader
- [ ] Call for public review
- [ ] Deploy in production when ok ^
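Several plan items above concern checking manifest integrity fields (sha256, sha512). Below is a minimal sketch of such a check, assuming the integrity field uses the `<algo>-<base64 digest>` form. The helper names `parse_integrity` and `check_integrity` are hypothetical and are not the loaders' actual API. The sketch only hashes the raw bytes, so it does not cover the NAR hash case mentioned in T3781#92684.

```python
import base64
import hashlib
from typing import Tuple


def parse_integrity(integrity: str) -> Tuple[str, str]:
    """Split an '<algo>-<base64 digest>' string into (algo, hex digest)."""
    algo, sep, b64_digest = integrity.partition("-")
    if not sep or algo not in hashlib.algorithms_guaranteed:
        raise ValueError(f"Unsupported integrity field: {integrity!r}")
    return algo, base64.b64decode(b64_digest).hex()


def check_integrity(data: bytes, integrity: str) -> bool:
    """Return True if the hash of `data` matches the integrity field."""
    algo, expected_hex = parse_integrity(integrity)
    return hashlib.new(algo, data).hexdigest() == expected_hex


# Build a sha512 integrity string for some sample bytes, then verify it.
sample = b"hello"
sample_integrity = "sha512-" + base64.b64encode(
    hashlib.sha512(sample).digest()
).decode("ascii")

matches = check_integrity(sample, sample_integrity)
mismatches = check_integrity(b"other", sample_integrity)
```

Resolving the algorithm by name through `hashlib.new` keeps sha256 and sha512 on the same code path, so supporting another hash would not need a separate branch.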
[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw
[2] It cannot work: we may not receive any version information, and the package loader currently relies on that data for its main ingestion algorithm.