Replace the Nixguix loader with a lister
Closed, Migrated

Description

Currently, Nix and Guix are each loaded as a single origin with a huge snapshot in which each branch name is a URL. This is wrong.
We need to replace the Nixguix loader with a lister, which creates one origin for each origin referenced by the Nix and Guix public manifests.
This would be closer to what we do with Debian/Ubuntu.

Define the following (see the HedgeDoc pad [1], which details a proposal):

  • a sketch of the target structure of the data in the archive
  • the origin URLs
  • which extrinsic metadata and/or extids we store
  • which kind of snapshots we generate
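
As a rough illustration of the intended shift (one origin per manifest entry instead of one huge snapshot), here is a minimal sketch. The manifest layout (`sources`, `type`, `urls`, `integrity`) and the `ListedOrigin` tuple are assumptions made for this example; they are not the actual swh.lister data model.

```python
from typing import Dict, Iterator, List, NamedTuple, Optional


class ListedOrigin(NamedTuple):
    """Hypothetical stand-in for one origin produced by the lister."""

    url: str
    visit_type: str
    integrity: Optional[str]


def list_origins(manifest: Dict) -> Iterator[ListedOrigin]:
    """Yield one origin per source entry found in the manifest."""
    for source in manifest.get("sources", []):
        urls: List[str] = source.get("urls", [])
        if not urls:
            continue  # nothing to ingest for this entry
        # Assumption: the first URL is the canonical one, the others are mirrors.
        yield ListedOrigin(
            url=urls[0],
            visit_type=source.get("type", "url"),
            integrity=source.get("integrity"),
        )


# Made-up manifest used only to exercise the sketch.
manifest = {
    "sources": [
        {"type": "url", "urls": ["https://example.org/foo-1.0.tar.gz"]},
        {"type": "git", "urls": ["https://example.org/bar.git"]},
        {"type": "url", "urls": []},
    ]
}
origins = list(list_origins(manifest))
```

Each manifest entry becomes its own origin, so every visit yields a small snapshot instead of one snapshot holding every URL as a branch name.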

Plan:

  • D8341: Implement lister

- [ ] D8406, ...: Adapt archive loader (package loader) to accept tarball from nixguix manifests (cannot work [2])

  • D8581: Implement ContentLoader (possibly as a package [2] core loader) to deal with content file with intrinsic metadata (out of nixguix manifests)
  • D8584: Implement DirectoryLoader (possibly as a package [2] core loader) to deal with tarball with intrinsic metadata (out of nixguix manifests)
  • D8587: Update implementations ^ dealing with unsupported integrity hash (sha512)
  • T3781#92605: lister run through docker
  • D8601, T3781#92610: loaders run through docker (directory ok, contents ok too but they are creating mismatches due to faulty manifest integrity references)
  • D8605: lister: Randomize origins order to ingest
  • D8606: lister: Deal with mistyped origins
  • D8607: lister: Fix expired ssl certificate
  • D8611: lister: Fix connection error
  • D8612: lister: Deal with pseudo url with missing schema
  • D8619: lister: Deal with exotic urls so tarballs are recognized
  • D8620: lister: Deal with misplaced git urls
  • D8624: nixguix: Improve content type detection (those with charset were off)
  • D8623: swh.core.tarball: Add missing mimetype application/x-gzip
  • D8626: lister: Refactor to simplify some computations
  • D8627: Make jenkins build with nix-store inside so future builds that need it run correctly
  • T3781#92684: Fix mismatched computations for nixpkgs manifests -> nar hash support (impacts both lister and loader)
    • D8614: lister adaptation to provide the correct information to the loaders
    • D8618: {Content|Directory}Loader adaptation to be able to check this ^
    • D8630: Adapt standard/nar hash mismatch computation behavior (so they fail loading)
    • D8636: Content "nar" checksum computation. files with "recursive" hashOutputMode exist

- [ ] T3781#92850: P1489: P1490: hash mismatch edge cases (so far) we cannot do anything about (yet?!), see next point

  • T4608: D8637: lister: Exclude faulty origins
  • T4608: Notify upstream nixpkgs community about the missing information on "faulty" origins
  • T4609: Notify upstream nixpkgs community about the misqualified "git" repositories as urls
  • P1482: ContentLoader run in docker
  • P1483: DirectoryLoader run in docker
  • D8621, D8622: Deploy in docker
  • P1486: Fix misqualified repositories detected as file (see pastes)
  • D8757: Add support for more tarball/zip extensions
  • D8758: swh.core: Wire war support (and check other tarballs are already supported)
  • D8761: Harden tarball support test dataset
  • D8763: lister: Add another diff to filter out irrelevant origins (.iso, .bin, ...)
  • T3781#97852: Status -> further fixes (/me *sighs*)
  • D8773: nixguix: Deal with edge case url with version instead of extension
  • D8774: Use content-disposition
  • infra/sysadm-environment#4655 Deploy in staging
  • Drop no longer relevant nixguix loader
  • Call for public review
  • Deploy in production when ok ^
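
Several plan items above revolve around checking downloaded artifacts against the integrity value from the manifest (D8587, D8630). Below is a minimal sketch of such a check for the flat (non-NAR) case. It assumes the integrity value has the `<algo>-<base64 digest>` form; `check_integrity` and the `SUPPORTED` set are hypothetical names for this example, not the loaders' code.

```python
import base64
import hashlib

# Assumption: the set of hash algorithms accepted by this sketch.
SUPPORTED = ("sha256", "sha512")


def check_integrity(data: bytes, integrity: str) -> bool:
    """Return True when data matches an '<algo>-<base64 digest>' integrity string."""
    algo, _, b64_digest = integrity.partition("-")
    if algo not in SUPPORTED:
        raise ValueError(f"Unsupported integrity hash: {algo!r}")
    expected = base64.b64decode(b64_digest)
    actual = hashlib.new(algo, data).digest()
    return actual == expected


# Made-up payload and integrity strings used only to exercise the sketch.
payload = b"hello nixguix\n"
good = "sha512-" + base64.b64encode(hashlib.sha512(payload).digest()).decode()
bad = "sha256-" + base64.b64encode(hashlib.sha256(b"other").digest()).decode()

print(check_integrity(payload, good))
print(check_integrity(payload, bad))
```

Failing loudly on an unknown algorithm keeps an unsupported hash from being silently counted as a mismatch.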

[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw

[2] It cannot. We may not receive any version information, and package loaders currently rely on that particular data for their main ingestion algorithm.
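
For the plan item D8774 ("Use content-disposition"), the idea is presumably to recover a file name from the HTTP response when the URL itself carries no usable extension. Here is a minimal sketch using only the standard library. `filename_from_content_disposition` is a hypothetical helper name, and how the lister actually consumes the header is not described here.

```python
from email.message import Message
from typing import Optional


def filename_from_content_disposition(header_value: str) -> Optional[str]:
    """Extract the filename parameter from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() returns None when no filename parameter is present.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="foo-1.0.tar.gz"'))
print(filename_from_content_disposition("inline"))
```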

Revisions and Commits

rCDFJ Dockerfiles for Jenkins
D8627
rDENV Development environment
Abandoned
D8625
rDLDBASE Generic VCS/Package Loader
Abandoned
D8636
D8630
D8618
D8601
D8587
D8584
D8581
rDLS Listers
D8774
D8773
D8763
D8761
D8757
D8637
D8632
D8631
D8626
D8624
D8620
D8619
D8614
D8612
D8611
D8610
D8607
D8606
D8605
D8341
rDMOD Data model
Abandoned
rDCORE Foundations and core functionalities
D8758
D8623
D8603

Event Timeline

Out of the paste [1] (CSV extract from the swh-scheduler dev db after 3 lister runs on
docker), here are the file extensions detected so far [2] (computed with [3]):

[1] P1487

[2]

In [29]: extensions
Out[29]:
['.0',
 '.04',
 '.1',
 '.10',
 '.11',
 '.13',
 '.15',
 '.16',
 '.2',
 '.24',
 '.3',
 '.4',
 '.4&id=9808325853ba9eb035115e5b056305a1c9d362a0',
 '.4-stable&id=71776c73a6f04b6f671430f702bcd40b29d48399',
 '.5',
 '.6',
 '.7',
 '.7-stable&id=31bd4a8c2dc00ae79a821f6fe0ad2f23e1534f50',
 '.8',
 '.9',
 '.9-assembly',
 '.AppImage',
 '.L',
 '.M',
 '.S',
 '.VSIXPackage',
 '.aff',
 '.assoc',
 '.at',
 '.beta5',
 '.bin',
 '.c',
 '.c?format=diff',
 '.cab',
 '.cgi?id=1389687',
 '.cgi?id=240935',
 '.cgi?id=359589',
 '.cgi?id=361056',
 '.cgi?id=364774',
 '.cgi?id=535944',
 '.cgi?id=612792',
 '.cgi?id=79507',
 '.cgi?id=830',
 '.deb',
 '.def',
 '.desktop',
 '.dic',
 '.diff',
 '.diff;att=2;bug=665779',
 '.diff?inline=false',
 '.dtd',
 '.edict',
 '.exe',
 '.fref',
 '.git',
 '.git;a=commitdiff_plain;h=59032494e81a1a65c0b960aaae7ec4c2cc9db35a',
 '.git;a=commitdiff_plain;h=d57c99458933a21fdf94f508191f145ad8d5ec58',
 '.git;a=commitdiff_plain;h=ec1cc0263f1',
 '.git;a=patch;h=049e14870c13235cd066758f29c42dc96c1ccdf8',
 '.git;a=patch;h=24a461715d5bce47f63cb0097606fc336230589f',
 '.git;a=patch;h=32e4e8b4bcbacbf92af7c88337efae21986d9603',
 '.git;a=patch;h=91c6387e69c09beaa9b9ca1e28471751a834fc24',
 '.git;a=patch;h=9c2585c58b49815a0eab8d683f0a94f75cbbe64e',
 '.git;a=patch;h=a507b139adf37d2c742e039815601cdc2aa00a84',
 '.git;a=patch;h=a9bd3dec9fde',
 '.git;a=patch;h=b82e9b6d6b46877e5c3763cc3bc641c66fa7eb54',
 '.git;a=patch;h=c3f7414;hp=4c4fce51072c9189cfb11b52aa54fed79f5741bd',
 '.git;a=patch;h=cee7cefc610d42fd383b3c80c12cbc675443176a',
 '.git;a=patch;h=f5712c9949d026e4b891b25837edd2edc166151f',
 '.h',
 '.hpp',
 '.hs',
 '.ht',
 '.img',
 '.ini',
 '.iso',
 '.js',
 '.json',
 '.kak',
 '.linux64',
 '.lock',
 '.love',
 '.lua',
 '.md',
 '.menu',
 '.msi',
 '.nvim',
 '.obj',
 '.org',
 '.otf',
 '.oxt',
 '.pak',
 '.patch',
 '.patch?full_index=1',
 '.patch?h=btanks',
 '.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c',
 '.patch?h=gfm',
 '.patch?h=gprbuild&id=1d4e8a5cb982e79135a0aaa3ef87654bed1fe4f0',
 '.patch?h=hugs',
 '.patch?h=icon-slicer',
 '.patch?h=ike&id=3a56735ddc26f750df4720f4baba0728bb4cb458',
 '.patch?h=palm-novacom-git',
 '.patch?h=perl-www-curl&id=261d84887d736cc097abef61164339216fb79180',
 '.patch?h=qlandkartegt',
 '.patch?h=qt5-styleplugins',
 '.patch?h=tilp',
 '.patch?h=ventoy-bin&id=ce4c26c67a1de4b761f9448bf92e94ffae1c8148',
 '.patch?h=w3m-mouse&id=5b5f0fbb59f674575e87dd368fed834641c35f03',
 '.patch?id=04a3a7b1bd88c2d5502292fad27e0e02d084698d',
 '.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba',
 '.patch?id=11f6b4d4206b0648182e7b41cd57dcc9ccea0728',
 '.patch?id=1615f58890e8f9881c4228c78a6b39b9aab1303a',
 '.patch?id=192ac7421ddd4093125f4997898fb62e8a140a44',
 '.patch?id=1bbcfc9ae3dfdfcbdd35151cb7b6050776215e4d',
 '.patch?id=1e7bef484f96e7647f5f0911d3c8caa48131c33b',
 '.patch?id=21ba7540d385a9864b44850d6987893dfa16bfc0',
 '.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9',
 '.patch?id=3400945dbbb8a87065360963e4caa0e17d3dcc61',
 '.patch?id=36f8689f7903548f5d89827a6e7bdf70a9882cee',
 '.patch?id=3fe8e9910002b6523d995512a646b063565d0447',
 '.patch?id=409d0e2a9c9c899fb1fb04cc808fe0aff3f745ca',
 '.patch?id=426002bfe2789fb6213fba832c8bfee634d68d02',
 '.patch?id=4569a839f070a1a38d5dbce2a4d19233d25aeed2',
 '.patch?id=4aa4fca00611b0b3a4007870da43cc5fd63f76c4',
 '.patch?id=4d35c076ce77bfac7655f60c4c3e4c86933ab7dd',
 '.patch?id=55fcb515620a8f7d3bb77eba938aa0fcf0d67c96',
 '.patch?id=56bd759df1d0c750a065b8c845e93d5dfa6b549d',
 '.patch?id=6751a93dca26b0b3ceec9eb151272253a2fe497e',
 '.patch?id=688d9675782dfc162d4e6cff04c668f7516118d0',
 '.patch?id=730cdcef6901750f4029d4c3b8639ce02ee3ead1',
 '.patch?id=7553a3c8dfa7bcec07241a07e6a4e7dcf5bb4f26',
 '.patch?id=7f371172f5c',
 '.patch?id=b510df361241e8f16314b1f14642305f0111dac6',
 '.patch?id=b6ea17ef8e4d652de0a85047bac8d41e90b25555',
 '.patch?id=c4256f68d3589570443075eccbbafacf661f785f',
 '.patch?id=c6d0ed89ad5653421f21cbf3b3d40fd9a1361828',
 '.patch?id=cec727ad614986ca1e6b9468eea7f1a5a9183382',
 '.patch?id=eab07e78b691ae7866267fc04d31c7c3ad6b0eeb',
 '.patch?id=f25d3fb08341b60b6ccef424399f060dfcf3f1a5',
 '.patch?id=f28a947813dbc0a1fd1a8d4a712d58a64c48ca01',
 '.patch?inline=false',
 '.patch?rev=2',
 '.patch?revision=1447925&view=co&pathrev=1484457',
 '.phar',
 '.php',
 '.php?4',
 '.php?id=194',
 '.php?s=file_download&id=25',
 '.pl',
 '.png',
 '.pom',
 '.py',
 '.py?id=07d6c37d18811e2b377a9689790a7c5e24da16ba',
 '.rb',
 '.rpm',
 '.rules',
 '.scm',
 '.sh',
 '.shar',
 '.svg',
 '.tcl',
 '.ttc',
 '.ttf',
 '.ttf?raw=true',
 '.txt',
 '.uqm',
 '.vsix',
 '.war',
 '.whl',
 '.xml',
 '.zsh']

[3]

In [6]: from pathlib import Path

In [15]: with open(filepath, "r") as f: data=[line.rstrip() for line in f.readlines()]

In [25]:  extensions = set([Path(url).suffixes[-1] for url in data if Path(url).suffixes])

In [26]:  extensions = list(set([Path(url).suffixes[-1] for url in data if Path(url).suffixes]))

In [28]: extensions.sort()

It is probably more informative to read it with frequencies [1]:

[1]

In [31]: from collections import defaultdict

In [32]: extensions = defaultdict(int)

In [33]: for url in data:
    ...:     suffixes = Path(url).suffixes
    ...:     if suffixes:
    ...:         extensions[Path(url).suffixes[-1]] += 1
    ...:


In [35]: dict(extensions)
Out[35]:
{'.pom': 279,
 '.patch': 1099,
 '.VSIXPackage': 127,
 '.8': 1,
 '.txt': 14,
 '.deb': 40,
 '.git;a=commitdiff_plain;h=59032494e81a1a65c0b960aaae7ec4c2cc9db35a': 1,
 '.oxt': 3,
 '.AppImage': 5,
 '.3': 1,
 '.rb': 2,
 '.whl': 18,
 '.py': 6,
 '.kak': 1,
 '.patch?h=tilp': 1,
 '.patch?id=b510df361241e8f16314b1f14642305f0111dac6': 10,
 '.diff': 84,
 '.0': 10,
 '.patch?id=730cdcef6901750f4029d4c3b8639ce02ee3ead1': 2,
 '.4-stable&id=71776c73a6f04b6f671430f702bcd40b29d48399': 1,
 '.patch?id=21ba7540d385a9864b44850d6987893dfa16bfc0': 1,
 '.phar': 9,
 '.pl': 4,
 '.def': 1,
 '.patch?id=55fcb515620a8f7d3bb77eba938aa0fcf0d67c96': 1,
 '.php?4': 1,
 '.at': 4,
 '.1': 12,
 '.git;a=patch;h=91c6387e69c09beaa9b9ca1e28471751a834fc24': 1,
 '.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c': 8,
 '.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9': 7,
 '.rules': 1,
 '.patch?id=c4256f68d3589570443075eccbbafacf661f785f': 1,
 '.patch?h=gprbuild&id=1d4e8a5cb982e79135a0aaa3ef87654bed1fe4f0': 1,
 '.git;a=patch;h=a507b139adf37d2c742e039815601cdc2aa00a84': 1,
 '.ini': 1,
 '.S': 1,
 '.ttf': 38,
 '.otf': 7,
 '.png': 10,
 '.linux64': 1,
 '.patch?id=192ac7421ddd4093125f4997898fb62e8a140a44': 2,
 '.lua': 1,
 '.vsix': 1,
 '.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba': 15,
 '.13': 1,
 '.c': 11,
 '.war': 3,
 '.sh': 3,
 '.4': 3,
 '.rpm': 5,
 '.php': 1,
 '.patch?rev=2': 1,
 '.diff?inline=false': 3,
 '.patch?h=qt5-styleplugins': 2,
 '.pak': 1,
 '.cgi?id=364774': 1,
 '.2': 4,
 '.7': 3,
 '.16': 1,
 '.4&id=9808325853ba9eb035115e5b056305a1c9d362a0': 1,
 '.patch?h=perl-www-curl&id=261d84887d736cc097abef61164339216fb79180': 1,
 '.9': 2,
 '.git;a=patch;h=049e14870c13235cd066758f29c42dc96c1ccdf8': 1,
 '.json': 2,
 '.ttf?raw=true': 1,
 '.patch?id=56bd759df1d0c750a065b8c845e93d5dfa6b549d': 2,
 '.love': 4,
 '.patch?id=1615f58890e8f9881c4228c78a6b39b9aab1303a': 1,
 '.scm': 1,
 '.hs': 1,
 '.fref': 1,
 '.iso': 3,
 '.cgi?id=830': 1,
 '.msi': 4,
 '.patch?id=7553a3c8dfa7bcec07241a07e6a4e7dcf5bb4f26': 1,
 '.git;a=patch;h=b82e9b6d6b46877e5c3763cc3bc641c66fa7eb54': 1,
 '.patch?id=4aa4fca00611b0b3a4007870da43cc5fd63f76c4': 4,
 '.patch?id=f28a947813dbc0a1fd1a8d4a712d58a64c48ca01': 1,
 '.patch?id=4d35c076ce77bfac7655f60c4c3e4c86933ab7dd': 1,
 '.uqm': 3,
 '.patch?id=eab07e78b691ae7866267fc04d31c7c3ad6b0eeb': 1,
 '.exe': 1,
 '.cab': 1,
 '.ht': 2,
 '.04': 1,
 '.aff': 1,
 '.7-stable&id=31bd4a8c2dc00ae79a821f6fe0ad2f23e1534f50': 1,
 '.patch?h=btanks': 1,
 '.git;a=patch;h=24a461715d5bce47f63cb0097606fc336230589f': 1,
 '.diff;att=2;bug=665779': 1,
 '.js': 1,
 '.git;a=patch;h=f5712c9949d026e4b891b25837edd2edc166151f': 1,
 '.git;a=patch;h=c3f7414;hp=4c4fce51072c9189cfb11b52aa54fed79f5741bd': 1,
 '.zsh': 1,
 '.svg': 1,
 '.patch?h=qlandkartegt': 9,
 '.git;a=patch;h=32e4e8b4bcbacbf92af7c88337efae21986d9603': 1,
 '.L': 1,
 '.M': 1,
 '.php?id=194': 1,
 '.patch?id=7f371172f5c': 2,
 '.5': 1,
 '.ttc': 2,
 '.patch?id=3fe8e9910002b6523d995512a646b063565d0447': 1,
 '.patch?id=04a3a7b1bd88c2d5502292fad27e0e02d084698d': 1,
 '.patch?id=36f8689f7903548f5d89827a6e7bdf70a9882cee': 1,
 '.hpp': 1,
 '.patch?h=ike&id=3a56735ddc26f750df4720f4baba0728bb4cb458': 1,
 '.obj': 1,
 '.php?s=file_download&id=25': 1,
 '.patch?id=688d9675782dfc162d4e6cff04c668f7516118d0': 1,
 '.assoc': 1,
 '.menu': 1,
 '.py?id=07d6c37d18811e2b377a9689790a7c5e24da16ba': 2,
 '.bin': 1,
 '.patch?id=11f6b4d4206b0648182e7b41cd57dcc9ccea0728': 1,
 '.patch?id=409d0e2a9c9c899fb1fb04cc808fe0aff3f745ca': 2,
 '.cgi?id=535944': 1,
 '.patch?h=icon-slicer': 1,
 '.patch?id=4569a839f070a1a38d5dbce2a4d19233d25aeed2': 1,
 '.cgi?id=79507': 1,
 '.patch?id=b6ea17ef8e4d652de0a85047bac8d41e90b25555': 1,
 '.patch?id=c6d0ed89ad5653421f21cbf3b3d40fd9a1361828': 2,
 '.org': 1,
 '.dic': 2,
 '.tcl': 1,
 '.24': 1,
 '.patch?id=1e7bef484f96e7647f5f0911d3c8caa48131c33b': 1,
 '.patch?h=palm-novacom-git': 1,
 '.patch?id=1bbcfc9ae3dfdfcbdd35151cb7b6050776215e4d': 1,
 '.patch?id=6751a93dca26b0b3ceec9eb151272253a2fe497e': 1,
 '.patch?id=f25d3fb08341b60b6ccef424399f060dfcf3f1a5': 2,
 '.lock': 2,
 '.c?format=diff': 1,
 '.git;a=patch;h=a9bd3dec9fde': 1,
 '.patch?h=w3m-mouse&id=5b5f0fbb59f674575e87dd368fed834641c35f03': 1,
 '.15': 1,
 '.git': 2,
 '.patch?h=hugs': 1,
 '.xml': 1,
 '.git;a=patch;h=9c2585c58b49815a0eab8d683f0a94f75cbbe64e': 1,
 '.patch?id=426002bfe2789fb6213fba832c8bfee634d68d02': 1,
 '.md': 1,
 '.desktop': 1,
 '.git;a=patch;h=cee7cefc610d42fd383b3c80c12cbc675443176a': 1,
 '.patch?h=ventoy-bin&id=ce4c26c67a1de4b761f9448bf92e94ffae1c8148': 1,
 '.6': 1,
 '.10': 1,
 '.nvim': 2,
 '.patch?full_index=1': 1,
 '.9-assembly': 1,
 '.beta5': 1,
 '.edict': 1,
 '.h': 2,
 '.dtd': 1,
 '.11': 1,
 '.cgi?id=240935': 1,
 '.cgi?id=361056': 1,
 '.shar': 1,
 '.cgi?id=359589': 1,
 '.cgi?id=612792': 1,
 '.git;a=commitdiff_plain;h=ec1cc0263f1': 1,
 '.patch?id=cec727ad614986ca1e6b9468eea7f1a5a9183382': 1,
 '.img': 1,
 '.cgi?id=1389687': 1,
 '.patch?revision=1447925&view=co&pathrev=1484457': 1,
 '.patch?id=3400945dbbb8a87065360963e4caa0e17d3dcc61': 1,
 '.git;a=commitdiff_plain;h=d57c99458933a21fdf94f508191f145ad8d5ec58': 1,
 '.patch?inline=false': 1,
 '.patch?h=gfm': 1}

Finally, a more condensed frequency dict:

In [46]: extensions = defaultdict(int)

In [47]: for url in data:
    ...:     suffixes = Path(url).suffixes
    ...:     if suffixes:
    ...:         if ".patch" in suffixes or ".patch" in suffixes[-1]:
    ...:             key = ".patch"
    ...:         elif ".git" in suffixes or ".git" in suffixes[-1]:
    ...:             key = ".git"
    ...:         elif ".cgi" in suffixes or ".cgi" in suffixes[-1]:
    ...:             key = ".cgi"
    ...:         else:
    ...:             key = suffixes[-1]
    ...:         extensions[key] += 1
    ...:

In [48]: dict(extensions)
Out[48]:
{'.pom': 279,
 '.patch': 1204,
 '.VSIXPackage': 127,
 '.8': 1,
 '.txt': 14,
 '.deb': 40,
 '.git': 16,
 '.oxt': 3,
 '.AppImage': 5,
 '.3': 1,
 '.rb': 2,
 '.whl': 18,
 '.py': 6,
 '.kak': 1,
 '.diff': 84,
 '.0': 10,
 '.4-stable&id=71776c73a6f04b6f671430f702bcd40b29d48399': 1,
 '.phar': 9,
 '.pl': 4,
 '.def': 1,
 '.php?4': 1,
 '.at': 4,
 '.1': 12,
 '.rules': 1,
 '.ini': 1,
 '.S': 1,
 '.ttf': 38,
 '.otf': 7,
 '.png': 10,
 '.linux64': 1,
 '.lua': 1,
 '.vsix': 1,
 '.13': 1,
 '.c': 11,
 '.war': 3,
 '.sh': 3,
 '.4': 3,
 '.rpm': 5,
 '.php': 1,
 '.diff?inline=false': 3,
 '.pak': 1,
 '.cgi': 9,
 '.2': 4,
 '.7': 3,
 '.16': 1,
 '.4&id=9808325853ba9eb035115e5b056305a1c9d362a0': 1,
 '.9': 2,
 '.json': 2,
 '.ttf?raw=true': 1,
 '.love': 4,
 '.scm': 1,
 '.hs': 1,
 '.fref': 1,
 '.iso': 3,
 '.msi': 4,
 '.uqm': 3,
 '.exe': 1,
 '.cab': 1,
 '.ht': 2,
 '.04': 1,
 '.aff': 1,
 '.7-stable&id=31bd4a8c2dc00ae79a821f6fe0ad2f23e1534f50': 1,
 '.diff;att=2;bug=665779': 1,
 '.js': 1,
 '.zsh': 1,
 '.svg': 1,
 '.L': 1,
 '.M': 1,
 '.php?id=194': 1,
 '.5': 1,
 '.ttc': 2,
 '.hpp': 1,
 '.obj': 1,
 '.php?s=file_download&id=25': 1,
 '.assoc': 1,
 '.menu': 1,
 '.py?id=07d6c37d18811e2b377a9689790a7c5e24da16ba': 2,
 '.bin': 1,
 '.org': 1,
 '.dic': 2,
 '.tcl': 1,
 '.24': 1,
 '.lock': 2,
 '.c?format=diff': 1,
 '.15': 1,
 '.xml': 1,
 '.md': 1,
 '.desktop': 1,
 '.6': 1,
 '.10': 1,
 '.nvim': 2,
 '.9-assembly': 1,
 '.beta5': 1,
 '.edict': 1,
 '.h': 2,
 '.dtd': 1,
 '.11': 1,
 '.shar': 1,
 '.img': 1}
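
Many keys above are polluted by query strings (`.patch?id=...`, `.php?4`) because `Path(url).suffixes` is applied to the whole URL. A possible refinement, sketched below, splits the URL first, on the assumption that only the URL path should determine the extension. `extension_frequencies` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable
from urllib.parse import urlsplit


def extension_frequencies(urls: Iterable[str]) -> Counter:
    """Count extensions using only the path component of each URL."""
    counts: Counter = Counter()
    for url in urls:
        path = urlsplit(url).path  # drops the query string and fragment
        suffix = PurePosixPath(path).suffix
        if suffix:
            counts[suffix] += 1
    return counts


# A few URLs taken from the listings in this task.
sample = [
    "https://aur.archlinux.org/cgit/aur.git/plain/lua52.patch?h=btanks",
    "https://bugzilla.redhat.com/attachment.cgi?id=1389687",
    "https://aur.archlinux.org/cgit/aur.git/plain/hotspotfix.patch?h=icon-slicer",
    "http://openarena.ws/request.php?4",
    "https://git.samba.org/?p=rsync.git;a=patch;h=c3f7414;hp=4c4fce51072c9189cfb11b52aa54fed79f5741bd",
]
frequencies = extension_frequencies(sample)
print(dict(frequencies))
```

URLs whose path has no extension at all (such as the gitweb one in the sample) are simply skipped here, so they would still need separate handling.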

The current nixpkgs manifests are either not built properly or not yet complete. They
sometimes reference hashes that we cannot recompute, because only the derivation
contains the required information [1] [2].

In [1], the filesystem layout is required to rebuild the same hash.

In [2], the executable bit permission is required on the file to compute the proper hash.

There may be many more discrepancies. On top of that, the nixpkgs manifest has not
been updated for a while: since Oct 11, 2021 (from [3] to [4]).

[1] P1489#10067

[2] P1490

[3] https://nix-community.github.io/nixpkgs-swh/

[4] https://github.com/NixOS/nixpkgs/tree/e4ef597edfd8a0ba5f12362932fc9b1dd01a0aef
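
To illustrate why the executable bit matters for case [2], here is a minimal sketch of a NAR serialization for a single regular file. It follows the NAR layout as I understand it from the Nix documentation (length-prefixed strings padded to 8 bytes), so treat it as an approximation. The helper names are made up for this example, and the real loaders may well rely on nix-store (cf. D8627) instead.

```python
import hashlib
import struct


def nar_str(data: bytes) -> bytes:
    """Serialize a byte string: 64-bit little-endian length, data, zero padding to 8 bytes."""
    padding = (8 - len(data) % 8) % 8
    return struct.pack("<Q", len(data)) + data + b"\0" * padding


def nar_of_regular_file(contents: bytes, executable: bool = False) -> bytes:
    """Build the NAR archive of one regular file (no directory, no symlink)."""
    parts = [nar_str(b"nix-archive-1"), nar_str(b"("),
             nar_str(b"type"), nar_str(b"regular")]
    if executable:
        # The executable flag is part of the archive, hence part of the hash.
        parts += [nar_str(b"executable"), nar_str(b"")]
    parts += [nar_str(b"contents"), nar_str(contents), nar_str(b")")]
    return b"".join(parts)


def nar_sha256(contents: bytes, executable: bool = False) -> str:
    return hashlib.sha256(nar_of_regular_file(contents, executable)).hexdigest()


# Made-up file contents used only to exercise the sketch.
script = b"#!/bin/sh\necho hello\n"
flat_hash = hashlib.sha256(script).hexdigest()
nar_plain = nar_sha256(script)
nar_exec = nar_sha256(script, executable=True)
print(flat_hash != nar_plain, nar_plain != nar_exec)
```

The same bytes give three different digests (flat, NAR, NAR with the executable flag), so a manifest that records neither the hash mode nor the permission bit cannot be checked reliably.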

With D8637, listing is less noisy [1] (code [2]):

[1]

dataset: guix

{'.1': 1,
 '.10': 1,
 '.14': 1,
 '.15': 1,
 '.19': 2,
 '.2': 2,
 '.4': 1,
 '.5': 1,
 '.9': 1,
 '.c': 3,
 '.cfg?revision=59745': 1,
 '.el': 45,
 '.el?id=dcc9ba03252ee5d39e03bba31b420e0708c3ba0c': 1,
 '.lisp': 1,
 '.love': 1,
 '.map': 1,
 '.py': 1,
 '.sf3': 1,
 '.ttf': 1}

dataset: nixpkgs

{'.0': 10,
 '.1': 12,
 '.10': 1,
 '.11': 1,
 '.13': 1,
 '.15': 1,
 '.16': 1,
 '.2': 4,
 '.24': 1,
 '.3': 1,
 '.4': 3,
 '.4&id=9808325853ba9eb035115e5b056305a1c9d362a0': 1,
 '.4-stable&id=71776c73a6f04b6f671430f702bcd40b29d48399': 1,
 '.5': 1,
 '.6': 1,
 '.7': 3,
 '.7-stable&id=31bd4a8c2dc00ae79a821f6fe0ad2f23e1534f50': 1,
 '.8': 1,
 '.9': 2,
 '.9-assembly': 1,
 '.AppImage': 5,
 '.L': 1,
 '.M': 1,
 '.S': 1,
 '.VSIXPackage': 127,
 '.aff': 1,
 '.assoc': 1,
 '.at': 4,
 '.beta5': 1,
 '.bin': 1,
 '.c': 11,
 '.c?format=diff': 1,
 '.cab': 1,
 '.cgi': 9,
 '.deb': 40,
 '.def': 1,
 '.desktop': 1,
 '.dic': 2,
 '.diff': 84,
 '.diff;att=2;bug=665779': 1,
 '.diff?inline=false': 3,
 '.dtd': 1,
 '.edict': 1,
 '.exe': 1,
 '.fref': 1,
 '.git': 14,
 '.h': 2,
 '.hpp': 1,
 '.hs': 1,
 '.img': 1,
 '.ini': 1,
 '.iso': 3,
 '.js': 1,
 '.json': 2,
 '.linux64': 1,
 '.lock': 2,
 '.love': 4,
 '.menu': 1,
 '.msi': 4,
 '.obj': 1,
 '.otf': 6,
 '.oxt': 3,
 '.pak': 1,
 '.patch': 1204,
 '.phar': 9,
 '.php?4': 1,
 '.php?id=194': 1,
 '.php?s=file_download&id=25': 1,
 '.pl': 4,
 '.png': 10,
 '.pom': 279,
 '.py': 6,
 '.py?id=07d6c37d18811e2b377a9689790a7c5e24da16ba': 2,
 '.rb': 1,
 '.rpm': 4,
 '.rules': 1,
 '.scm': 1,
 '.sh': 1,
 '.shar': 1,
 '.svg': 1,
 '.tcl': 1,
 '.ttf': 29,
 '.txt': 9,
 '.uqm': 3,
 '.vsix': 1,
 '.war': 3,
 '.whl': 18,
 '.xml': 1,
 '.zsh': 1}

[2]

from collections import defaultdict
from pathlib import Path
from pprint import pprint
from typing import Dict, List


def read_dataset(dataset_name: str) -> List[str]:
    """Read one URL per line from the CSV extract of the given dataset."""
    filepath = f"/var/tmp/nixguix/dataset/20221007/list-contents-{dataset_name}.csv"
    with open(filepath, "r") as f:
        data = [line.rstrip() for line in f]
    return data


def group_by_extensions(data: List[str]) -> Dict[str, int]:
    """Count URLs per extension, folding .patch, .git and .cgi variants together."""
    extensions: Dict[str, int] = defaultdict(int)
    for url in data:
        suffixes = Path(url).suffixes
        if suffixes:
            # Query strings end up in the last suffix, hence the substring checks.
            if ".patch" in suffixes or ".patch" in suffixes[-1]:
                key = ".patch"
            elif ".git" in suffixes or ".git" in suffixes[-1]:
                key = ".git"
            elif ".cgi" in suffixes or ".cgi" in suffixes[-1]:
                key = ".cgi"
            else:
                key = suffixes[-1]
            extensions[key] += 1
    return dict(extensions)


for dataset_name in ["guix", "nixpkgs"]:
    data = read_dataset(dataset_name)
    print(f"dataset: {dataset_name}\n")
    extensions = group_by_extensions(data)
    pprint(extensions)
    print()

Improved version, with the noisy URLs printed alongside the frequency dict [1] [2]:

[1]

dataset: guix

https://downloads.mariadb.org/f/connector-c-3.1.13/mariadb-connector-c-3.1.13-src.tar.gz/from/https%3A//mirrors.ukfast.co.uk/sites/mariadb/?serve
http://git.savannah.gnu.org/cgit/emacs/elpa.git/plain/packages/pinentry/pinentry.el?id=dcc9ba03252ee5d39e03bba31b420e0708c3ba0c
https://tug.org/svn/texlive/tags/texlive-2021.3/Master/texmf-dist/web2c/updmap.cfg?revision=59745
http://apps.fz-juelich.de/jsc/jube/jube2/download.php?version=2.2.2
{'.1': 1,
 '.10': 1,
 '.14': 1,
 '.15': 1,
 '.19': 2,
 '.2': 1,
 '.4': 1,
 '.5': 1,
 '.9': 1,
 '.c': 3,
 '.cfg': 1,
 '.el': 46,
 '.lisp': 1,
 '.love': 1,
 '.map': 1,
 '.php': 1,
 '.py': 1,
 '.sf3': 1,
 '.ttf': 1}

dataset: nixpkgs

https://gitweb.gentoo.org/repo/gentoo.git/plain/sys-fs/cryfs/files/cryfs-0.10.2-unbundle-libs.patch?id=192ac7421ddd4093125f4997898fb62e8a140a44
https://aur.archlinux.org/cgit/aur.git/plain/remove-broken-kde-support.patch?h=tilp
https://salsa.debian.org/debian/autogen/-/raw/debian/1%255.18.16-4/debian/patches/20_no_Werror.diff?inline=false
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=10;filename=fix_window_resizing.diff;att=2;bug=665779
https://git.samba.org/?p=rsync.git;a=patch;h=c3f7414;hp=4c4fce51072c9189cfb11b52aa54fed79f5741bd
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=65438a7ec0f4cddccf810136da6f280bd148af71
https://git.alpinelinux.org/aports/plain/main/net-snmp/fix-includes.patch?id=f25d3fb08341b60b6ccef424399f060dfcf3f1a5
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/patch/?id=cc087b11462af9f971a2c090d07e8d780a867b50
https://aur.archlinux.org/cgit/aur.git/plain/improve-gpx-name.patch?h=qlandkartegt
https://bazaar.launchpad.net/~arnouten/pastebinit/python38/diff/264?context=3
https://git.sagemath.org/sage.git/plain/build/pkgs/lcalc/patches/pari-2.7.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://gitweb.gentoo.org/repo/gentoo.git/plain/sys-fs/cryfs/files/cryfs-0.10.2-install-targets.patch?id=192ac7421ddd4093125f4997898fb62e8a140a44
https://git.sagemath.org/sage.git/plain/build/pkgs/glpk/patches/error_recovery.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://git.sagemath.org/sage.git/plain/build/pkgs/pynac/patches/realpartloop.patch?h=9.4.beta5
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.1-bluray_pow_freespace.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://aur.archlinux.org/cgit/aur.git/plain/fix_operator_ambiguity.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://aur.archlinux.org/cgit/aur.git/plain/libical3.patch?h=orage-4.10
https://src.fedoraproject.org/cgit/rpms/dbus-c++.git/plain/dbus-c++-writechar.patch?id=7f371172f5c
https://git.savannah.gnu.org/cgit/gsl.git/patch/?id=9cc12d
http://openarena.ws/request.php?4
https://aur.archlinux.org/cgit/aur.git/plain/sanitize.patch?h=ventoy-bin&id=ce4c26c67a1de4b761f9448bf92e94ffae1c8148
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.0-glibc2.6.90.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://git.alpinelinux.org/aports/plain/main/libexecinfo/20-define-gnu-source.patch?id=730cdcef6901750f4029d4c3b8639ce02ee3ead1
https://aur.archlinux.org/cgit/aur.git/plain/fix_deprecated_boost_api.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://git.videolan.org/?p=ffmpeg.git;a=commitdiff_plain;h=59032494e81a1a65c0b960aaae7ec4c2cc9db35a
https://bugzilla.redhat.com/attachment.cgi?id=1389687
https://git.sagemath.org/sage.git/plain/build/pkgs/maxima/patches/undoing_true_false_printing_patch.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=74dfb854b8199ddb0a27e89296fa565f4706cb9d
https://git.savannah.gnu.org/cgit/cpio.git/patch/?id=dd96882877721703e19272fe25034560b794061b
https://git.ghostscript.com/?p=ghostpdl.git;a=patch;h=a9bd3dec9fde
https://git.savannah.gnu.org/cgit/cpio.git/patch/?id=641d3f489cf6238bb916368d4ba0d9325a235afb
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.0-reload.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://git.sagemath.org/sage.git/plain/build/pkgs/maxima/patches/infodir.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://aur.archlinux.org/cgit/aur.git/plain/0002-fix-gtk2-background.patch?h=qt5-styleplugins
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/libafterimage/files/libafterimage-giflib5-v2.patch?id=4aa4fca00611b0b3a4007870da43cc5fd63f76c4
http://git.ghostscript.com/?p=mupdf.git;a=patch;h=32e4e8b4bcbacbf92af7c88337efae21986d9603
http://bashburn.dose.se/index.php?s=file_download&id=25
https://git.sagemath.org/sage.git/plain/build/pkgs/maxima/patches/matrixexp.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=fde47bb227b8fa817c88d7e10a8eb771c46de1df
https://aur.archlinux.org/cgit/aur.git/plain/hotspotfix.patch?h=icon-slicer
https://git.alpinelinux.org/aports/plain/community/vte3/fix-W_EXITCODE.patch?id=4d35c076ce77bfac7655f60c4c3e4c86933ab7dd
https://git.sagemath.org/sage.git/plain/build/pkgs/elliptic_curves/spkg-install.py?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://git.alpinelinux.org/aports/plain/main/lvm2/mallinfo.patch?h=3.7-stable&id=31bd4a8c2dc00ae79a821f6fe0ad2f23e1534f50
https://cgit.freedesktop.org/xorg/driver/xf86-video-xgi/patch/?id=bd94c475035739b42294477cff108e0c5f15ef67
https://aur.archlinux.org/cgit/aur.git/plain/fix-incomplete-type.patch?h=qlandkartegt
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/libafterimage/files/libafterimage-libpng15.patch?id=4aa4fca00611b0b3a4007870da43cc5fd63f76c4
https://aur.archlinux.org/cgit/aur.git/plain/improve-gpx-creator.patch?h=qlandkartegt
https://src.fedoraproject.org/cgit/rpms/SDL.git/plain/SDL-1.2.15-x11-Bypass-SetGammaRamp-when-changing-gamma.patch?id=04a3a7b1bd88c2d5502292fad27e0e02d084698d
https://aur.archlinux.org/cgit/aur.git/plain/fix_throw_specifications.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.1-noevent.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://aur.archlinux.org/cgit/aur.git/plain/fix_ffmpeg30.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://cgit.freedesktop.org/poppler/poppler/patch/?id=004e3c10df0abda214f0c293f9e269fdd979c5ee
https://git.savannah.gnu.org/cgit/cpio.git/patch/?id=236684f6deb3178043fe72a8e2faca538fa2aae1
https://gitweb.gentoo.org/repo/gentoo.git/plain/sys-libs/argp-standalone/files/argp-standalone-1.3-shared.patch?id=409d0e2a9c9c899fb1fb04cc808fe0aff3f745ca
https://git.sagemath.org/sage.git/plain/build/pkgs/giac/patches/pari_2_11.patch?id=21ba7540d385a9864b44850d6987893dfa16bfc0
https://git.claws-mail.org/?p=claws.git;a=patch;h=9c2585c58b49815a0eab8d683f0a94f75cbbe64e
https://gitlab.haskell.org/ghc/head.hackage/-/raw/e48738ee1be774507887a90a0d67ad1319456afc/patches/language-haskell-extract-0.2.4.patch?inline=false
https://git.savannah.gnu.org/cgit/guile.git/patch/?id=2fbde7f02adb8c6585e9baf6e293ee49cd23d4c4
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-text/opensp/files/opensp-1.5.2-c11-using.patch?id=688d9675782dfc162d4e6cff04c668f7516118d0
https://aur.archlinux.org/cgit/aur.git/plain/https.patch?h=w3m-mouse&id=5b5f0fbb59f674575e87dd368fed834641c35f03
https://git.sagemath.org/sage.git/plain/build/pkgs/cython/patches/trashcan.patch?id=4569a839f070a1a38d5dbce2a4d19233d25aeed2
http://git.ghostscript.com/?p=mupdf.git;a=patch;h=f5712c9949d026e4b891b25837edd2edc166151f
https://git.kernel.org/pub/scm/network/wireless/iwd.git/patch/?id=ed10b00afa3f4c087b46d7ba0b60a47bd05d8b39
https://gitweb.gentoo.org/repo/gentoo.git/plain/sys-libs/argp-standalone/files/argp-standalone-1.3-throw-in-funcdef.patch?id=409d0e2a9c9c899fb1fb04cc808fe0aff3f745ca
https://aur.archlinux.org/cgit/aur.git/plain/fix_ptr2bool_cast.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://aur.archlinux.org/cgit/aur.git/plain/remove-broken-kde-support.patch?h=gfm
https://code.qt.io/cgit/qt/qtwebengine-chromium.git/patch/?id=193c5bed1cff123e21b7e6d12f464d6709ace2e3
https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/patch/?id=a3272b93725a406bc98b67373da67a4bdf6fcdb0
https://git.sagemath.org/sage.git/patch?id2=9.4&id=9808325853ba9eb035115e5b056305a1c9d362a0
https://aur.archlinux.org/cgit/aur.git/plain/fix_ffmpeg_codecid.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://git.sagemath.org/sage.git/plain/build/pkgs/rubiks/patches/dietz-cu2-Makefile.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.0-wctomb-r1.patch?id=b510df361241e8f16314b1f14642305f0111dac6
http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/app-arch/unzip/files/unzip-6.0-natspec.patch?revision=1.1
https://git.alpinelinux.org/aports/plain/main/net-snmp/netsnmp-swinst-crash.patch?id=f25d3fb08341b60b6ccef424399f060dfcf3f1a5
https://gitweb.gentoo.org/repo/gentoo.git/plain/dev-tcltk/tix/files/tix-8.4.3-tcl8.5.patch?id=56bd759df1d0c750a065b8c845e93d5dfa6b549d
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/imlib/files/imlib-1.9.15-giflib51-1.patch?id=c6d0ed89ad5653421f21cbf3b3d40fd9a1361828
https://git.sagemath.org/sage.git/plain/build/pkgs/ppl/patches/clang5-support.patch?h=9.2
https://code.qt.io/cgit/qt/qtwebengine-chromium.git/patch/?id=1a53f599
https://git.sagemath.org/sage.git/plain/build/pkgs/ratpoints/patches/sturm_and_rp_private.patch?id=1615f58890e8f9881c4228c78a6b39b9aab1303a
https://git.sagemath.org/sage.git/plain/build/pkgs/lcalc/patches/lcalc-1.23_default_parameters_1.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/libfpx/files/libfpx-1.3.1_p6-gcc6.patch?id=f28a947813dbc0a1fd1a8d4a712d58a64c48ca01
http://git.fluxbox.org/fluxbox.git/patch/?id=22866c4d30f5b289c429c5ca88d800200db4fc4f
https://github.com/bcpierce00/unison/commit/14b885316e0a4b41cb80fe3daef7950f88be5c8f.patch?full_index=1
http://git.marmaro.de/?p=mmh;a=snapshot;h=431604647f89d5aac7b199a7883e98e56e4ccf9e;sf=tgz
https://git.savannah.gnu.org/cgit/emacs.git/patch/?id=a88f63500e475f842e5fbdd9abba4ce122cdb082
http://git.0pointer.net/libcanberra.git/patch/?id=c0620e432650e81062c1967cc669829dbd29b310
https://bug787443.bugzilla-attachments.gnome.org/attachment.cgi?id=359589
https://git.savannah.gnu.org/cgit/guix.git/plain/gnu/packages/patches/glibc-reinstate-prlimit64-fallback.patch?id=eab07e78b691ae7866267fc04d31c7c3ad6b0eeb
https://bugzilla.gnome.org/attachment.cgi?id=364774
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-sound/mp3gain/files/mp3gain-1.6.2-CVE-2019-18359-plus.patch?id=36f8689f7903548f5d89827a6e7bdf70a9882cee
https://gitweb.gentoo.org/repo/gentoo.git/plain/dev-perl/Crypt-Curve25519/files/Crypt-Curve25519-0.60.0-fmul-fixedvar.patch?id=cec727ad614986ca1e6b9468eea7f1a5a9183382
https://www.earthbyte.org/download/8421/?uid=b89bb31428
https://git.strongswan.org/?p=strongswan.git;a=patch;h=91c6387e69c09beaa9b9ca1e28471751a834fc24
https://git.sagemath.org/sage.git/plain/build/pkgs/conway_polynomials/spkg-install.py?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://gitweb.gentoo.org/proj/gcc-patches.git/plain/4.9.4/gentoo/100_all_avoid-ustat-glibc-2.28.patch?id=55fcb515620a8f7d3bb77eba938aa0fcf0d67c96
https://git.sagemath.org/sage.git/plain/build/pkgs/rubiks/patches/dietz-solver-Makefile.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
http://sources.gentoo.org/viewcvs.py/*checkout*/gentoo-x86/sys-apps/fxload/files/fxload-20020411-linux-headers-2.6.21.patch?rev=1.1
https://bugzilla-attachments.libsdl.org/attachment.cgi?id=830
https://bug787443.bugzilla-attachments.gnome.org/attachment.cgi?id=361056
https://bug697543.bugzilla-attachments.gnome.org/attachment.cgi?id=240935
https://git.sagemath.org/sage.git/plain/build/pkgs/cypari/patches/trashcan.patch?id=b6ea17ef8e4d652de0a85047bac8d41e90b25555
https://git.sagemath.org/sage.git/plain/build/pkgs/ecl/patches/write_error.patch?h=9.2
https://git.alpinelinux.org/aports/plain/main/elfutils/musl-macros.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
https://aur.archlinux.org/cgit/aur.git/plain/0001-Use-usb_bulk_-read-write-instead-of-homemade-handler.patch?h=palm-novacom-git
https://aur.archlinux.org/cgit/aur.git/plain/openssl-1.1.0.patch?h=ike&id=3a56735ddc26f750df4720f4baba0728bb4cb458
https://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/app-misc/bfr/files/bfr-1.6-perl.patch?revision=1.1
https://aur.archlinux.org/cgit/aur.git/plain/fix-timespec.patch?h=qlandkartegt
https://marc.info/?l=grub-devel&m=146193404929072&q=mbox
https://git.sagemath.org/sage.git/plain/build/pkgs/lcalc/patches/time.h.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://git.alpinelinux.org/aports/plain/main/elfutils/musl-fts-obstack.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
https://aur.archlinux.org/cgit/aur.git/plain/fix-ver_str.patch?h=qlandkartegt
https://bugs.gentoo.org/attachment.cgi?id=612792
https://aur.archlinux.org/cgit/aur.git/plain/0001-fix-build-against-Qt-5.15.patch?h=qt5-styleplugins
http://git.ghostscript.com/?p=mupdf.git;a=patch;h=a507b139adf37d2c742e039815601cdc2aa00a84
https://aur.archlinux.org/cgit/aur.git/plain/lua52.patch?h=btanks
https://git.sagemath.org/sage.git/plain/build/pkgs/ecl/patches/16.1.2-getcwd.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
http://git.annexia.org/?p=virt-top.git;a=patch;h=24a461715d5bce47f63cb0097606fc336230589f
https://git.alpinelinux.org/aports/plain/main/elfutils/musl-qsort_r.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
http://git.ghostscript.com/?p=mupdf.git;a=patch;h=b82e9b6d6b46877e5c3763cc3bc641c66fa7eb54
https://aur.archlinux.org/cgit/aur.git/plain/fix_c++11_literal_warnings.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://w1.fi/cgit/hostap/patch/?id=7800725afb27397f7d6033d4969e2aeb61af4737
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.1-lastshort.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://bazaar.launchpad.net/~arnouten/pastebinit/pastebin-com-https/diff/264?context=3
https://aur.archlinux.org/cgit/aur.git/plain/autoptr2uniqueptr.patch?h=e6cc6bc80c672aaa1a2260abfe8823da299a192c
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=ec1cc0263f1
https://aur.archlinux.org/cgit/aur.git/plain/curl-7.71.0.patch?h=perl-www-curl&id=261d84887d736cc097abef61164339216fb79180
https://w1.fi/cgit/hostap/patch/?id=0388992905a5c2be5cba9497504eaea346474754
https://src.fedoraproject.org/cgit/rpms/dbus-c++.git/plain/dbus-c++-threading.patch?id=7f371172f5c
https://bugzilla.redhat.com/attachment.cgi?id=79507
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=fd08479625b5845e4d725ab628628f7ebfccc407
https://git.savannah.gnu.org/cgit/cpio.git/patch/?id=dfc801c44a93bed7b3951905b188823d6a0432c8
https://git.sagemath.org/sage.git/plain/build/pkgs/lcalc/patches/Lcommon.h.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.0-dvddl-r1.patch?id=b510df361241e8f16314b1f14642305f0111dac6
http://www.linux-phc.org/forum/download/file.php?id=194
https://git.sagemath.org/sage.git/plain/build/pkgs/rubiks/patches/dietz-mcube-Makefile.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://salsa.debian.org/debian/autogen/-/raw/debian/1%255.18.16-4/debian/patches/30_ag_macros.m4_syntax_error.diff?inline=false
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=e12ec26e19e02281d3e7258c3aabb88a5cf5ec1d
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-video/rtmpdump/files/rtmpdump-openssl-1.1.patch?id=1e7bef484f96e7647f5f0911d3c8caa48131c33b
https://aur.archlinux.org/cgit/aur.git/plain/hsbase_inline.patch?h=hugs
https://bugs.gentoo.org/attachment.cgi?id=535944
https://w1.fi/cgit/hostap/patch/?id=a0541334a6394f8237a4393b7372693cd7e96f15
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.1-bluray_srm+pow.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://git.alpinelinux.org/aports/plain/main/libexecinfo/10-execinfo.patch?id=730cdcef6901750f4029d4c3b8639ce02ee3ead1
https://aur.archlinux.org/cgit/aur.git/plain/relocatable-build.patch?h=gprbuild&id=1d4e8a5cb982e79135a0aaa3ef87654bed1fe4f0
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=557c3c373a9992d45d4358a6a2ccf53b03276f39
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.0-sysmacros.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=5bb7eb173b72256f70c6b3f3916d7a444be93340
https://git.kernel.org/pub/scm/utils/dash/dash.git/patch/?id=6f6d1f2da03468c0e131fdcbdcfa9771ffca2614
https://git.alpinelinux.org/aports/plain/main/figlet/musl-fix-cplusplus-decls.patch?h=3.4-stable&id=71776c73a6f04b6f671430f702bcd40b29d48399
https://gitweb.gentoo.org/repo/gentoo.git/plain/dev-tcltk/tix/files/tix-8.4.3-tcl8.6.patch?id=56bd759df1d0c750a065b8c845e93d5dfa6b549d
https://gitweb.gentoo.org/repo/gentoo.git/plain/app-cdr/dvd+rw-tools/files/dvd+rw-tools-7.0-wexit.patch?id=b510df361241e8f16314b1f14642305f0111dac6
https://git.sagemath.org/sage.git/plain/build/pkgs/maxima/patches/maxima.system.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
http://git.savannah.gnu.org/cgit/src-highlite.git/patch/?id=904949c9026cb772dc93fbe0947a252ef47127f4
https://aur.archlinux.org/cgit/aur.git/plain/fix-qt5-build.patch?h=qlandkartegt
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/imlib/files/imlib-1.9.15-giflib51-2.patch?id=c6d0ed89ad5653421f21cbf3b3d40fd9a1361828
https://git.kernel.org/pub/scm/utils/dash/dash.git/patch/?id=29d6f2148f10213de4e904d515e792d2cf8c968e
https://git.alpinelinux.org/aports/plain/main/elfutils/fix-aarch64_fregs.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/libafterimage/files/libafterimage-gif.patch?id=4aa4fca00611b0b3a4007870da43cc5fd63f76c4
https://salsa.debian.org/debian/autogen/-/raw/debian/1%255.18.16-4/debian/patches/31_allow_overriding_AGexe_for_crossbuild.diff?inline=false
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=d57c99458933a21fdf94f508191f145ad8d5ec58
https://gitweb.gentoo.org/repo/gentoo.git/plain/games-strategy/scorched3d/files/scorched3d-44-fix-c++14.patch?id=1bbcfc9ae3dfdfcbdd35151cb7b6050776215e4d
https://svnweb.mageia.org/packages/cauldron/bombono-dvd/current/SOURCES/bombono-dvd-1.2.4-scons-python3.patch?revision=1447925&view=co&pathrev=1484457
https://gitweb.gentoo.org/repo/gentoo.git/plain/sys-block/partimage/files/partimage-0.6.9-openssl-1.1-compatibility.patch?id=3fe8e9910002b6523d995512a646b063565d0447
https://gitweb.gentoo.org/repo/gentoo.git/plain/sci-libs/vtk/files/vtk-8.2.0-gcc-10.patch?id=c4256f68d3589570443075eccbbafacf661f785f
https://git.sagemath.org/sage.git/plain/build/pkgs/rubiks/patches/reid-Makefile.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://git.alpinelinux.org/aports/plain/testing/mapbox-gl-native/0002-skip-license-check.patch?id=6751a93dca26b0b3ceec9eb151272253a2fe497e
https://gitweb.gentoo.org/repo/gentoo.git/plain/media-libs/libafterimage/files/libafterimage-makefile.in.patch?id=4aa4fca00611b0b3a4007870da43cc5fd63f76c4
https://aur.archlinux.org/cgit/aur.git/plain/fix-qtgui-include.patch?h=qlandkartegt
https://cgit.freedesktop.org/xorg/driver/xf86-video-xgi/patch/?id=78d1138dd6e214a200ca66fa9e439ee3c9270ec8
https://git.alpinelinux.org/aports/plain/community/date/538-output-date-pc-for-pkg-config.patch?id=11f6b4d4206b0648182e7b41cd57dcc9ccea0728
https://git.alpinelinux.org/aports/plain/main/elfutils/musl-asm-ptrace-h.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
http://git.ghostscript.com/?p=mupdf.git;a=patch;h=cee7cefc610d42fd383b3c80c12cbc675443176a
http://bugs.icu-project.org/trac/changeset/39484?format=diff
https://code.qt.io/cgit/qt/qtwebengine-chromium.git/patch/?id=fad3e27bfb50d1e23a07577f087a826b5e00bb1d
https://git.alpinelinux.org/aports/plain/main/elfutils/musl-strndupa.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
https://aur.archlinux.org/cgit/aur.git/plain/fix-proj_api.patch?h=qlandkartegt
https://aur.archlinux.org/cgit/aur.git/plain/fix-gps_read.patch?h=qlandkartegt
https://git.alpinelinux.org/aports/plain/main/lynx/CVE-2021-38165.patch?id=3400945dbbb8a87065360963e4caa0e17d3dcc61
https://git.savannah.gnu.org/cgit/dmidecode.git/patch/?id=1d0db85949a5bdd96375f6131d393a11204302a6
https://git.sagemath.org/sage.git/plain/build/pkgs/lcalc/patches/lcalc-1.23_default_parameters_2.patch?id=07d6c37d18811e2b377a9689790a7c5e24da16ba
https://gitweb.gentoo.org/repo/gentoo.git/plain/sys-apps/xinetd/files/xinetd-2.3.15-creds.patch?id=426002bfe2789fb6213fba832c8bfee634d68d02
https://build.opensuse.org/public/source/openSUSE:Factory/btar/btar-librsync.patch?rev=2
https://projects.duckcorp.org/projects/bip/repository/revisions/39414f8ff9df63c8bc2e4eee34f09f829a5bf8f5/diff/src/connection.c?format=diff
https://git.sagemath.org/sage.git/plain/build/pkgs/giac/patches/nofltk-check.patch?id=7553a3c8dfa7bcec07241a07e6a4e7dcf5bb4f26
http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/dev-db/xbase/files/xbase-3.1.2-gcc47.patch?revision=1.1
https://cgit.freedesktop.org/libreoffice/libcdr/patch/?id=bf3e7f3bbc414d4341cf1420c99293debf1bd894
https://git.alpinelinux.org/aports/plain/main/elfutils/musl-strerror_r.patch?id=2e3d4976eeffb4704cf83e2cc3306293b7c7b2e9
{'.0': 10,
 '.1': 8,
 '.11': 1,
 '.13': 1,
 '.15': 1,
 '.16': 1,
 '.2': 2,
 '.24': 1,
 '.3': 1,
 '.4': 3,
 '.5': 1,
 '.6': 1,
 '.7': 3,
 '.8': 1,
 '.9': 2,
 '.9-assembly': 1,
 '.AppImage': 5,
 '.L': 1,
 '.M': 1,
 '.S': 1,
 '.VSIXPackage': 127,
 '.aff': 1,
 '.assoc': 1,
 '.at': 4,
 '.bin': 1,
 '.c': 12,
 '.cab': 1,
 '.cgi': 10,
 '.deb': 40,
 '.def': 1,
 '.desktop': 1,
 '.dic': 2,
 '.diff': 87,
 '.dtd': 1,
 '.edict': 1,
 '.exe': 1,
 '.fref': 1,
 '.git': 1,
 '.h': 2,
 '.hpp': 1,
 '.hs': 1,
 '.img': 1,
 '.ini': 1,
 '.iso': 3,
 '.js': 1,
 '.json': 2,
 '.linux64': 1,
 '.lock': 2,
 '.love': 4,
 '.menu': 1,
 '.msi': 4,
 '.obj': 1,
 '.otf': 6,
 '.oxt': 3,
 '.pak': 1,
 '.patch': 1214,
 '.phar': 9,
 '.php': 3,
 '.pl': 4,
 '.png': 10,
 '.pom': 279,
 '.py': 8,
 '.rb': 1,
 '.rpm': 4,
 '.rules': 1,
 '.scm': 1,
 '.sh': 1,
 '.shar': 1,
 '.svg': 1,
 '.tcl': 1,
 '.ttf': 29,
 '.txt': 9,
 '.uqm': 3,
 '.vsix': 1,
 '.war': 3,
 '.whl': 18,
 '.xml': 1,
 '.zsh': 1}

[2]

from typing import Dict, List
from pathlib import Path
from collections import defaultdict


def read_dataset(dataset_name: str) -> List[str]:
    filepath = f'/var/tmp/nixguix/dataset/20221007/list-contents-{dataset_name}.csv'
    with open(filepath, "r") as f:
        data = [line.rstrip() for line in f]
    return data


def group_by_extensions(data: List[str]) -> Dict[str, int]:
    extensions = defaultdict(int)
    for url in data:
        path = Path(url)
        filename = path.name
        if "?" in filename:
            print(url)
            path, _ = filename.split('?', 1)
            suffixes = Path(path).suffixes
        else:
            suffixes = path.suffixes

        if suffixes:
            if ".patch" in suffixes or ".patch" in suffixes[-1]:
                key = ".patch"
            elif ".git" in suffixes or ".git" in suffixes[-1]:
                key = ".git"
            elif ".cgi" in suffixes or ".cgi" in suffixes[-1]:
                key = ".cgi"
            else:
                key = suffixes[-1]
            extensions[key] += 1
    return dict(extensions)

from pprint import pprint

for dataset_name in ["guix", "nixpkgs"]:
    data = read_dataset(dataset_name)
    print(f"dataset: {dataset_name}\n")
    extensions = group_by_extensions(data)
    pprint(extensions)
    print()
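As a side note, the query-string handling in the script above could probably be delegated to the standard library. The following is only a minimal sketch of that idea, not the script that produced the numbers above: `url_extension` and `count_extensions` are hypothetical names, and it does not reproduce the special-casing of `.patch`, `.git` and `.cgi`.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable
from urllib.parse import urlparse


def url_extension(url: str) -> str:
    """Return the last suffix of the URL path, ignoring query and fragment."""
    # urlparse separates the path from "?query" and "#fragment", so the
    # suffix is computed on the path component only.
    path = urlparse(url).path
    return PurePosixPath(path).suffix


def count_extensions(urls: Iterable[str]) -> Counter:
    """Count URLs per extension, skipping URLs whose path has no suffix."""
    counts: Counter = Counter()
    for url in urls:
        ext = url_extension(url)
        if ext:
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "https://aur.archlinux.org/cgit/aur.git/plain/lua52.patch?h=btanks",
        "https://bugs.gentoo.org/attachment.cgi?id=612792",
        "https://git.savannah.gnu.org/cgit/emacs.git/patch/?id=a88f63500e475f842e5fbdd9abba4ce122cdb082",
    ]
    print(dict(count_extensions(sample)))
```

URLs such as the cgit `/patch/?id=...` ones have no suffix in their path, so they are skipped here, as they are by the script above.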

I did a pass over the extensions to further check what is a tarball and what is not [1]

By the way, still no news from the upstream issues opened... [2] [3]

[1] P1503

[2] T4608

[3] T4609

Checked that the newly detected extensions are actually already supported.
Summary [1] and the actual checks [2]:

[1]

|-------------+----|
| war         | ok |
| whl         | ok |
| oxt         | ok |
| pak         | ok |
| love        | ok |
| vsix        | ok |
| VSIXPackage | ok |
|-------------+----|

[2]

$ ipython
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pathlib import Path

In [5]: dir_ = Path('./manifest-files-output/')

In [12]: workdir = Path('/tmp/workdir')

In [13]: import shutil

In [14]: shutil.rmtree(workdir)

In [15]: workdir.mkdir()

In [16]: all_tarballs = list(dir_.iterdir())

In [17]: all_tarballs
Out[17]:
[PosixPath('manifest-files-output/51-trezor.rules'),
 PosixPath('manifest-files-output/Microsoft.VisualStudio.Services.VSIXPackage'),
 PosixPath('manifest-files-output/android_amd64.img'),
 PosixPath('manifest-files-output/wp-cli-2.5.0.phar'),
 PosixPath('manifest-files-output/tempora-lgc-unicode.otf'),
 PosixPath('manifest-files-output/openprinting-ppds-postscript-lexmark-20160218-1lsb3.2.noarch.rpm'),
 PosixPath('manifest-files-output/android-udev-rules'),
 PosixPath('manifest-files-output/attachment.obj'),
 PosixPath('manifest-files-output/0.9.9-assembly'),
 PosixPath('manifest-files-output/trilium.svg'),
 PosixPath('manifest-files-output/ckan.exe'),
 PosixPath('manifest-files-output/Wire-3.26.2941_amd64.deb'),
 PosixPath('manifest-files-output/1cd6a87c-623f-4407-a52d-c31be49e925c_e19f60808bdcbfbd3c3df6be3e71ffc52e43261e.cab'),
 PosixPath('manifest-files-output/da_DK-2.5.189.oxt'),
 PosixPath('manifest-files-output/virtio-win.iso'),
 PosixPath('manifest-files-output/superProductivity-7.5.1.AppImage'),
 PosixPath('manifest-files-output/cron_4.1.shar'),
 PosixPath('manifest-files-output/uqm-0.7.0-voice.uqm'),
 PosixPath('manifest-files-output/webtorrent-desktop.desktop'),
 PosixPath('manifest-files-output/nethack.def'),
 PosixPath('manifest-files-output/wine-mono-6.4.0-x86.msi'),
 PosixPath('manifest-files-output/ms-python-release.vsix'),
 PosixPath('manifest-files-output/Lightning.scm'),
 PosixPath('manifest-files-output/TeensyduinoInstall.linux64'),
 PosixPath('manifest-files-output/unifont-14.0.01.ttf'),
 PosixPath('manifest-files-output/py-yajl.git'),
 PosixPath('manifest-files-output/streamlit-0.50.2-py2.py3-none-any.whl'),
 PosixPath('manifest-files-output/NotoFonts.pak'),
 PosixPath('manifest-files-output/gorilla1537_64.bin'),
 PosixPath('manifest-files-output/2048.tcl'),
 PosixPath('manifest-files-output/0561ddcedcd12ea1f98b7ddedb93686ed8a5ffa4.patch'),
 PosixPath('manifest-files-output/jenkins.war')]

In [18]: tarball = [entry for entry in all_tarballs if entry.name.endswith('.whl')][0]

In [19]: tarball
Out[19]: PosixPath('manifest-files-output/streamlit-0.50.2-py2.py3-none-any.whl')

In [20]: archive = [entry for entry in all_tarballs if entry.name.endswith('.whl')][0]

In [21]: from swh.core import tarball

In [23]: workdir.mkdir(exist_ok=True)

In [25]: list(workdir.iterdir())
Out[25]: []

In [26]: tarball._unpack_zip(archive, workdir)
Out[26]: PosixPath('/tmp/workdir')

In [27]: list(workdir.iterdir())
Out[27]:
[PosixPath('/tmp/workdir/streamlit-0.50.2.dist-info'),
 PosixPath('/tmp/workdir/streamlit-0.50.2.data'),
 PosixPath('/tmp/workdir/streamlit')]

In [28]: shutil.rmtree(workdir)

In [30]: workdir.mkdir()

In [31]: archive = [entry for entry in all_tarballs if entry.name.endswith('.oxt')][0]

In [32]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.oxt')][0])

In [34]: archive.exists()
Out[34]: True

In [35]: tarball._unpack_zip(archive, workdir)
Out[35]: PosixPath('/tmp/workdir')

In [36]: list(workdir.iterdir())
Out[36]:
[PosixPath('/tmp/workdir/description'),
 PosixPath('/tmp/workdir/da_DK.aff'),
 PosixPath('/tmp/workdir/hyph_da_DK.dic'),
 PosixPath('/tmp/workdir/README_da_DK.txt'),
 PosixPath('/tmp/workdir/description.xml'),
 PosixPath('/tmp/workdir/th_da_DK.dat'),
 PosixPath('/tmp/workdir/Images'),
 PosixPath('/tmp/workdir/META-INF'),
 PosixPath('/tmp/workdir/help'),
 PosixPath('/tmp/workdir/da_DK.dic'),
 PosixPath('/tmp/workdir/dictionaries.xcu'),
 PosixPath('/tmp/workdir/th_da_DK.idx'),
 PosixPath('/tmp/workdir/HYPH_da_DK_README.txt')]

In [37]: shutil.rmtree(workdir)

In [38]: workdir.mkdir()

In [39]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.pak')][0])

In [40]: archive
Out[40]: PosixPath('manifest-files-output/NotoFonts.pak')

In [41]: archive.exists()
Out[41]: True

In [42]: tarball._unpack_zip(archive, workdir)
Out[42]: PosixPath('/tmp/workdir')

In [43]: list(workdir.iterdir())
Out[43]: [PosixPath('/tmp/workdir/Fonts')]

In [44]: shutil.rmtree(workdir)

In [45]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.war')][0])

In [46]: archive
Out[46]: PosixPath('manifest-files-output/jenkins.war')

In [47]: archive.exists()
Out[47]: True

In [48]: tarball._unpack_jar(archive, workdir)
Out[48]: PosixPath('/tmp/workdir')

In [49]: list(workdir.iterdir())
Out[49]:
[PosixPath('/tmp/workdir/Main$FileAndDescription.class'),
 PosixPath('/tmp/workdir/LogFileOutputStream$1.class'),
 PosixPath('/tmp/workdir/JNLPMain.class'),
 PosixPath('/tmp/workdir/robots.txt'),
 PosixPath('/tmp/workdir/scripts'),
 PosixPath('/tmp/workdir/MainDialog$1$1.class'),
 PosixPath('/tmp/workdir/images'),
 PosixPath('/tmp/workdir/jsbundles'),
 PosixPath('/tmp/workdir/favicon.ico'),
 PosixPath('/tmp/workdir/MainDialog.class'),
 PosixPath('/tmp/workdir/LogFileOutputStream.class'),
 PosixPath('/tmp/workdir/WEB-INF'),
 PosixPath('/tmp/workdir/LogFileOutputStream$2.class'),
 PosixPath('/tmp/workdir/css'),
 PosixPath('/tmp/workdir/bootstrap'),
 PosixPath('/tmp/workdir/executable'),
 PosixPath('/tmp/workdir/Main.class'),
 PosixPath('/tmp/workdir/winstone.jar'),
 PosixPath('/tmp/workdir/META-INF'),
 PosixPath('/tmp/workdir/help'),
 PosixPath('/tmp/workdir/MainDialog$1.class'),
 PosixPath('/tmp/workdir/ColorFormatter.class')]

In [50]: shutil.rmtree(workdir)

In [51]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.whl')][0])

In [52]: archive
Out[52]: PosixPath('manifest-files-output/streamlit-0.50.2-py2.py3-none-any.whl')

In [53]: archive.exists()
Out[53]: True

In [55]: workdir.mkdir()

In [56]: list(workdir.iterdir())
Out[56]: []

In [57]: tarball._unpack_zip(archive, workdir)
Out[57]: PosixPath('/tmp/workdir')

In [58]: list(workdir.iterdir())
Out[58]:
[PosixPath('/tmp/workdir/streamlit-0.50.2.dist-info'),
 PosixPath('/tmp/workdir/streamlit-0.50.2.data'),
 PosixPath('/tmp/workdir/streamlit')]

In [59]: archive
Out[59]: PosixPath('manifest-files-output/streamlit-0.50.2-py2.py3-none-any.whl')

In [60]: shutil.rmtree(workdir)

In [62]: all_tarballs = list(dir_.iterdir())

In [63]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.love')][0])

In [64]: archive
Out[64]: PosixPath('manifest-files-output/vapor_dbf509f.love')

In [65]: archive.exists()
Out[65]: True

In [67]: workdir.mkdir()

In [68]: list(workdir.iterdir())
Out[68]: []

In [69]: tarball._unpack_zip(archive, workdir)
Out[69]: PosixPath('/tmp/workdir')

In [70]: list(workdir.iterdir())
Out[70]:
[PosixPath('/tmp/workdir/assets'),
 PosixPath('/tmp/workdir/state_vapor.lua'),
 PosixPath('/tmp/workdir/lib'),
 PosixPath('/tmp/workdir/git.lua'),
 PosixPath('/tmp/workdir/core'),
 PosixPath('/tmp/workdir/main.lua'),
 PosixPath('/tmp/workdir/conf.lua'),
 PosixPath('/tmp/workdir/state_load.lua'),
 PosixPath('/tmp/workdir/games.json')]

In [71]: shutil.rmtree(workdir); workdir.mkdir(); list(workdir.iterdir())
Out[71]: []

In [73]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.vsix')][0])

In [74]: archive
Out[74]: PosixPath('manifest-files-output/ms-python-release.vsix')

In [75]: archive.exists()
Out[75]: True

In [76]: tarball._unpack_zip(archive, workdir)
Out[76]: PosixPath('/tmp/workdir')

In [77]: list(workdir.iterdir())
Out[77]:
[PosixPath('/tmp/workdir/[Content_Types].xml'),
 PosixPath('/tmp/workdir/extension'),
 PosixPath('/tmp/workdir/extension.vsixmanifest')]

In [78]: shutil.rmtree(workdir); workdir.mkdir(); list(workdir.iterdir())
Out[78]: []

In [79]: archive = Path([entry for entry in all_tarballs if entry.name.endswith('.VSIXPackage')][0])

In [80]: archive
Out[80]: PosixPath('manifest-files-output/Microsoft.VisualStudio.Services.VSIXPackage')

In [81]: archive.exists()
Out[81]: True

In [82]: tarball._unpack_zip(archive, workdir)
Out[82]: PosixPath('/tmp/workdir')

In [83]: shutil.rmtree(workdir); workdir.mkdir(); list(workdir.iterdir())
Out[83]: []

In [84]: tarball._unpack_zip(archive, workdir)
Out[84]: PosixPath('/tmp/workdir')

In [85]: list(workdir.iterdir())
Out[85]:
[PosixPath('/tmp/workdir/[Content_Types].xml'),
 PosixPath('/tmp/workdir/extension'),
 PosixPath('/tmp/workdir/extension.vsixmanifest')]
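The manual session above could be condensed into a small loop. This is a hedged sketch only: it uses the standard library `zipfile.is_zipfile` rather than the `swh.core.tarball` helpers called above, `check_zip_like` is a hypothetical helper name, and the extension set is simply copied from the summary table. It only tells whether a file looks like a zip container, not whether the loader would ingest it.

```python
import zipfile
from pathlib import Path
from typing import Dict, Iterable

# Extensions from the summary table; assumed here to be zip-based containers
# (the .war one was unpacked with _unpack_jar in the session above).
ZIP_LIKE = {".whl", ".oxt", ".pak", ".love", ".vsix", ".VSIXPackage", ".war"}


def check_zip_like(paths: Iterable[Path]) -> Dict[str, bool]:
    """Map each candidate file name to whether it is a readable zip archive."""
    results: Dict[str, bool] = {}
    for path in paths:
        # Only look at files whose last suffix is in the table above.
        if path.suffix in ZIP_LIKE:
            results[path.name] = zipfile.is_zipfile(path)
    return results


if __name__ == "__main__":
    # Same directory name as in the session above; adjust as needed.
    dir_ = Path("./manifest-files-output/")
    if dir_.is_dir():
        for name, ok in sorted(check_zip_like(dir_.iterdir()).items()):
            print(f"{name}: {'ok' if ok else 'not a zip'}")
```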

Latest analysis, run without [1] applied. That diff should fix the entries reported under the key 'only-version-should-be-tarball'.

@vlorentz @anlambert ^

The contents datasets are attached below [2] [3]

[1] D8773

$ python -m analyze-result --dataset guix --dataset nixpkgs --obj-type contents --dataset-date 20221025
dataset <guix> with type contents: /var/tmp/nixguix/dataset/20221025/list-contents-guix.csv

{'.c': 3,
 '.cfg': 1,
 '.el': 46,
 '.lisp': 1,
 '.map': 1,
 '.php': 1,
 '.py': 1,
 '.ttf': 1,
 'ending-version-ok': 8,
 'only-version-should-be-tarball': 3}

dataset <nixpkgs> with type contents: /var/tmp/nixguix/dataset/20221025/list-contents-nixpkgs.csv

{'.L': 1,
 '.M': 1,
 '.S': 1,
 '.aff': 1,
 '.assoc': 1,
 '.at': 4,
 '.c': 12,
 '.cab': 1,
 '.cgi': 10,
 '.def': 1,
 '.desktop': 1,
 '.diff': 90,
 '.dtd': 1,
 '.edict': 1,
 '.fref': 1,
 '.h': 2,
 '.hpp': 1,
 '.hs': 1,
 '.img': 1,
 '.ini': 1,
 '.js': 1,
 '.json': 2,
 '.lock': 2,
 '.menu': 1,
 '.obj': 1,
 '.otf': 6,
 '.patch': 1221,
 '.phar': 9,
 '.php': 3,
 '.pl': 4,
 '.pom': 279,
 '.py': 8,
 '.rb': 1,
 '.rules': 1,
 '.scm': 1,
 '.sh': 1,
 '.shar': 1,
 '.svg': 1,
 '.tcl': 1,
 '.ttf': 29,
 '.txt': 9,
 '.uqm': 3,
 '.xml': 1,
 '.zsh': 1,
 'ending-version-ok': 5,
 'only-version-should-be-tarball': 33}

[2] guix

[3] nixpkgs

With the latest diffs, the filtering seems to properly sort files and tarballs for the guix manifest:

$ python -m analyze-result --dataset guix --obj-type contents --obj-type directories --dataset-date 20221026
dataset <guix> with type contents: /var/tmp/nixguix/dataset/20221026/list-contents-guix.csv

{'.c': 3,
 '.cfg': 1,
 '.el': 46,
 '.lisp': 1,
 '.map': 1,
 '.php': 1,
 '.py': 1,
 '.ttf': 1,
 'ending-version-ok': 8}

dataset <guix> with type directories: /var/tmp/nixguix/dataset/20221026/list-directories-guix.csv

{'.7z': 3,
 '.Z': 3,
 '.crate': 2987,
 '.gem': 376,
 '.gz': 6991,
 '.jar': 60,
 '.love': 1,
 '.lz': 17,
 '.lzma': 2,
 '.php': 1,
 '.tar': 96,
 '.tbz': 30,
 '.tgz': 104,
 '.txz': 1,
 '.xz': 1211,
 '.z': 1,
 '.zip': 180,
 '.zst': 1,
 'ending-version-ok': 628,
 'only-version-should-be-tarball': 24}

[1] guix contents

[2] guix directories
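To make the change between the two guix contents runs easier to see, the two result dicts can be compared key by key. A minimal sketch follows, with `diff_counts` as a hypothetical helper; the two dicts are copied from the 20221025 and 20221026 outputs above.

```python
from typing import Dict


def diff_counts(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return after - before for each key, keeping only the keys that changed."""
    keys = set(before) | set(after)
    delta = {key: after.get(key, 0) - before.get(key, 0) for key in keys}
    return {key: value for key, value in sorted(delta.items()) if value != 0}


# guix contents, dataset date 20221025 (copied from the output above)
guix_20221025 = {
    '.c': 3, '.cfg': 1, '.el': 46, '.lisp': 1, '.map': 1, '.php': 1,
    '.py': 1, '.ttf': 1, 'ending-version-ok': 8,
    'only-version-should-be-tarball': 3,
}
# guix contents, dataset date 20221026 (copied from the output above)
guix_20221026 = {
    '.c': 3, '.cfg': 1, '.el': 46, '.lisp': 1, '.map': 1, '.php': 1,
    '.py': 1, '.ttf': 1, 'ending-version-ok': 8,
}

print(diff_counts(guix_20221025, guix_20221026))
# → {'only-version-should-be-tarball': -3}
```

The only difference for contents is the 3 'only-version-should-be-tarball' entries going away. The directories dataset above still reports 24 entries under that key.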