
Replace the Nixguix loader with a lister
Open, Normal, Public

Description

Currently, Nix and Guix are each loaded as a single origin with a huge snapshot in which each branch name is a URL; this is wrong.
We need to replace the Nixguix loader with a lister, which would create as many origins as are referenced by the Nix and Guix public manifests.
This would be closer to what we do with Debian/Ubuntu.

Define the following (see the hedgedoc [1], which details a proposal):

  • a sketch of the target structure of the data in the archive
  • what the origin URLs are
  • what kind of extrinsic metadata and/or extids we are storing
  • what kinds of snapshots we are generating

Plan:

  • D8341: Implement lister
  • D8406, ...: Adapt archive loader (package loader) to accept tarballs from nixguix manifests (cannot work [2])
  • D8581: Implement ContentLoader (possibly as a package [2] core loader) to deal with content files with intrinsic metadata (out of nixguix manifests)
  • D8584: Implement DirectoryLoader (possibly as a package [2] core loader) to deal with tarballs with intrinsic metadata (out of nixguix manifests)
  • Update the implementations above to deal with the unsupported integrity hash (sha512)
  • Run through docker
  • Deploy in staging
  • Call for public review
  • Deploy in production when the above is ok

[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw

[2] It cannot: we may not have any versions at all, and package loaders currently rely on that particular data for their main ingestion algorithm.
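
For context on [2]: the package loader's core ingestion loop is driven by an enumerable list of versions, which nixguix artifacts do not have. A rough paraphrase of that loop (class and method names are made up for this sketch, not the real swh.loader API):

from typing import Dict, Iterator, Tuple

class PackageLoaderSketch:
    # Illustrative paraphrase only; NOT the real swh.loader API.

    def get_versions(self) -> Iterator[str]:
        # Real package loaders (PyPI, Debian, ...) enumerate versions here.
        # A nixguix manifest entry is a bare artifact url with no version
        # list, so there is nothing meaningful to return.
        return iter(())

    def get_package_info(self, version: str) -> Iterator[Tuple[str, Dict]]:
        # Hypothetical branch name / artifact info pair for one version.
        yield f"releases/{version}", {"url": "..."}

    def load(self) -> None:
        # The whole algorithm pivots on versions, which is exactly what
        # nixguix artifacts lack.
        for version in self.get_versions():
            for branch_name, info in self.get_package_info(version):
                print("would ingest", branch_name, info)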

Event Timeline

vlorentz triaged this task as Normal priority. Dec 8 2021, 6:45 PM
vlorentz created this task.

Not saying no.

It also feels weird from the scheduler standpoint: today, it is definitely configured
like a lister (a recurring task, the old way).

It was initially discussed in the bootstrap diff [1] and was dismissed for some
reason I don't remember (one would have to dig into that diff/task discussion).

As that diff predates the scheduler/lister refactoring we started this year, it may
indeed be interesting to revisit this idea.

[1] T1991 D2025

Maybe another data point for the discussion: the nixguix loader currently only
shows 1 origin for Guix and 1 for NixOS [well, nixpkgs really] in the coverage part [1].
Which is somewhat true... but feels weird at the same time.

[1] https://archive.softwareheritage.org/

I'm growing fond of this idea.
It should take less time to refactor now that we have improved the lister scaffolding
and mostly know what the perimeter of the nixguix loader is.

As a note, we may have to write a "content" loader, as the manifests (currently read by
the loader) also reference single files (which are ignored/bypassed by the current
loader implementation).

Another argument: currently, there are always at least some failures when loading the real Nix and Guix repositories, so visits always have status "partial", which prevents them from being listed in https://archive.softwareheritage.org/browse/search/?q=&with_visit=true&with_content=true&visit_type=nixguix (but we do get results when unchecking "only show origins visited at least once")

(not that these origins are particularly useful anyway, which is also the point of this task)

indeed

So, taking a closer look at this possible new lister, we'd end up with the following
possible outputs (a dispatch sketch follows the list):

  • artifact URLs, which are mostly tarballs [1] and sometimes single files [2]
  • DVCS repositories, delegated to dedicated loaders for ingestion: svn [3], hg [4], git [5] (out of the Guix manifest)
  • other stuff, which can be ignored as we don't have anything relevant to ingest [6]
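
To make the dispatch concrete, here is a minimal sketch of how the lister could walk a manifest and route entries. It assumes the sources.json layouts quoted in [1]-[6] below; the visit-type strings are illustrative, not a settled design:

import requests

# Minimal dispatch sketch; assumes the sources.json layouts quoted in
# [1]-[6] below. Visit-type strings are illustrative only.
VCS_URL_KEYS = {"git": "git_url", "hg": "hg_url", "svn": "svn_url"}

def list_origins(manifest_url):
    sources = requests.get(manifest_url).json()["sources"]
    for entry in sources:
        entry_type = entry.get("type")
        if entry_type == "url" and entry.get("urls"):
            # Tarballs and single files [1][2]: one origin per artifact.
            yield entry["urls"][0], "nixguix-artifact", entry
        elif entry_type in VCS_URL_KEYS:
            # Delegate to the dedicated dvcs loaders [3][4][5].
            yield entry[VCS_URL_KEYS[entry_type]], entry_type, entry
        # 'no-origin', false, etc. are skipped: nothing to ingest [6].

# e.g.: for url, visit_type, extra in list_origins("https://guix.gnu.org/sources.json"): ...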

Regarding URLs, what do we do? Do we create as many origins as there are URLs (after an
existence check on each)? I do not see a proper way around it.

Note also that going forward, the most probable outcome regarding the nixguix loader
would be that it disappears altogether.

Thoughts?

[1]

{
  "outputHash": "1rzz7yhqq3lljyqxbg46jfzfd09qgpgx865lijr4sgc94riy1ypn",
  "outputHashAlgo": "sha256",
  "outputHashMode": "recursive",
  "type": "url",
  "urls": [
    "https://downloads.sourceforge.net/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://prdownloads.sourceforge.net/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://netcologne.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://versaweb.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://freefr.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://osdn.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://kent.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz"
  ],
  "integrity": "sha256-9vrgYyaJPU2yjLQY1N99OIHmvpOGvNWxl5QOjKE//+c=",
  "inferredFetcher": "unclassified"
},
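
As an aside, the integrity field above is an SRI hash: the algorithm name, a dash, then the base64-encoded digest. For flat (non-recursive) hashes, checking a downloaded artifact reduces to hashing its raw bytes; entries with "outputHashMode": "recursive" are hashed over Nix's NAR serialization instead, so the sketch below only covers the flat case:

import base64
import hashlib

def check_sri(data: bytes, integrity: str) -> bool:
    # SRI format: "<algo>-<base64 digest>", e.g. the sha256-... value above.
    # Only valid for flat hashes; "recursive" entries hash the NAR
    # serialization, not the raw bytes.
    algo, _, expected = integrity.partition("-")
    if algo not in ("sha256", "sha384", "sha512"):
        raise ValueError(f"unsupported integrity hash: {algo}")
    return hashlib.new(algo, data).digest() == base64.b64decode(expected)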

[2]

$ grep '\.py' guix-sources.json | grep -v pythonhosted | grep '.py'
        "https://downloads.python.org/pypy/pypy3.7-v7.3.5-src.tar.bz2"
        "https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tar.xz"
      "git_url": "https://github.com/bastikr/boolean.py",
        "https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tar.xz"
        "https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tar.xz"
      "git_url": "https://github.com/plotly/plotly.py",
      "git_url": "https://github.com/leohemsted/smartypants.py",
        "https://www.python.org/ftp/python/2.7.18/Python-2.7.18.tar.xz"
        "https://www.python.org/ftp/python/2.7.18/Python-2.7.18.tar.xz"
        "http://www.home.unix-ag.org/simon/woof-2012-05-31.py"

(it's also noisy apparently ^)

[3]

{
  "type": "svn",
  "svn_url": "svn://www.tug.org/texlive/tags/texlive-2021.3/Master/texmf-dist/",
  "svn_revision": 59745
},

[4]

{
  "type": "hg",
  "hg_url": "https://hg.sr.ht/~olly/yoyo",
  "hg_changeset": "v7.2.0-release"
},

[5]

{
  "type": "git",
  "git_url": "https://github.com/gdraheim/zziplib",
  "git_ref": "v0.13.72"
},

[6]

# the only possible type in nixpkgs is url
$ grep '"type"' nixpkgs-sources-unstable.json | sort | uniq
      "type": "url",
# all possible types in guix are a bit more involved
$ grep '"type"' guix-sources.json | sort | uniq
      "type": false
      "type": "git",
      "type": "hg",
      "type": "no-origin",
      "type": "svn",
      "type": "url",

# but 'no-origin' and 'false' are just noise
    {
      "type": "no-origin",
      "name": "xfs_repair-static"
    },
    {
      "type": false
    },
ardumont renamed this task from "Replace the Nixguix loader with a lister?" to "Replace the Nixguix loader with a lister". Jun 29 2022, 11:05 AM
ardumont updated the task description.

Some more information regarding extensions supported in nixpkgs and guix manifests:

In [32]: import requests; from pathlib import Path

In [33]: sources = "https://nix-community.github.io/nixpkgs-swh/sources-unstable.json"

In [34]: data = requests.get(sources).json()

In [35]: packages = data["sources"]

In [36]: set([Path(url[0]).suffixes[-1] for url in [package["urls"] for package in packages] if Path(url[0]).suffixes])
Out[36]:
{'.3;sf=tgz',
 '.bz2',
 '.git;a=snapshot;h=5ca4ca92f629d9d83e83544b9239abaaacf0a527;sf=tgz',
 '.git;a=snapshot;h=V7_5_1;sf=tgz',
 '.gz',
 '.tar',
 '.tbz',
 '.tgz',
 '.xz',
 '.zip'}

In [37]: sources = "https://guix.gnu.org/sources.json"

In [38]: data2 = requests.get(sources).json()

In [39]: packages2 = data2["sources"]

In [45]: set([Path(url[0]).suffixes[-1] for url in [package["urls"] for package in packages2 if package["type"] == "url"] if Path(url[0]).suffixes])
Out[45]:
{'.0',
 '.1',
 '.10',
 '.14',
 '.15',
 '.18',
 '.19',
 '.2',
 '.4',
 '.5',
 '.6',
 '.7z',
 '.9',
 '.Z',
 '.bz2',
 '.c',
 '.cfg?revision=59745',
 '.el',
 '.el?id=dcc9ba03252ee5d39e03bba31b420e0708c3ba0c',
 '.gem',
 '.gz',
 '.gz?uuid=tklib-0-6',
 '.jar',
 '.lisp',
 '.love',
 '.lz',
 '.lzma',
 '.map',
 '.py',
 '.sf3',
 '.tar',
 '.tbz',
 '.tgz',
 '.ttf',
 '.txz',
 '.xz',
 '.z',
 '.zip',
 '.zst'}
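
Several of the "suffixes" above ('.cfg?revision=59745', '.gz?uuid=tklib-0-6') are artifacts of query strings rather than real extensions; stripping the query component first gives a cleaner picture. A variant of the snippet above (clean_suffix is a hypothetical helper, stdlib only):

import requests
from pathlib import Path
from urllib.parse import urlsplit

def clean_suffix(url: str) -> str:
    # Drop the query component ('?revision=...', '?uuid=...') before
    # looking at the path's extension.
    path = Path(urlsplit(url).path)
    return path.suffixes[-1] if path.suffixes else ""

data2 = requests.get("https://guix.gnu.org/sources.json").json()
suffixes = {clean_suffix(p["urls"][0])
            for p in data2["sources"]
            if p.get("type") == "url" and p.get("urls")}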

Hi,

  • artifacts url which are mostly tarballs [1] and sometimes files [2]
  • dvcs repositories delegated to dedicated loader to ingestion: svn [3], hg [4], git [5] (out of guix manifest)
  • Other stuff can be ignored as we don't have anything relevant to ingest [6]

Just to be on the same wavelength. :-) Basically, a Guix package looks like [1], where we are mainly interested in the origin field. Guix defines various "fetchers", such as url-fetch, git-fetch, hg-fetch, svn-fetch, cvs-fetch, bzr-fetch, which are self-explanatory, aren't they? ;-)

The creation of sources.json is done by walking through all the packages, extracting the origin field and writing JSON depending on the "fetcher" method; see [2]. In other words, only the metadata of each package is considered.

About mirrors, we maintain a list [3] and it is expanded (resolved) when sources.json is created.

Nothing is filtered out and sources.json contains the raw URIs used by Guix to build all the packages. This means that sometimes the source is a script or a more convoluted thing; see the packages IceCat [4] or Linux-libre [5]. We can discuss these corner cases separately.

Last, some elements about "all possible types in guix are a bit more involved":

$ cat sources.json | jq | grep '"type"' | sort | uniq -c | sort -n
      7       "type": false
     36       "type": "hg",
     88       "type": "no-origin",
    392       "type": "svn",
   7351       "type": "git",
  13602       "type": "url",

Hum, for the 7 false, I have to check. For the 88 packages with no-origin, it is more annoying. Well, some are metapackages, such as gcc-toolchain, so they can be skipped. Is it ok for you if we keep this 'no-origin' type? For the others, I have to check whether they are covered elsewhere.

Thanks for working on this improvement. :-)

1: https://archive.softwareheritage.org/swh:1:cnt:4bdc3e77922fd7133c3e765dd0f62e299e1dfbb5;origin=https://git.savannah.gnu.org/git/guix.git;visit=swh:1:snp:045d2f1b56a871f43afce5eedafaae7f910ff5d7;anchor=swh:1:rev:eb52b240eb58627cc76a03244c45502c4ec3c50e;path=/gnu/packages/base.scm;lines=85-103
2: https://archive.softwareheritage.org/swh:1:cnt:b08ba2ea2f5b7823b3dfaf0b429e98d04e2eaebd;origin=https://git.savannah.gnu.org/git/guix/guix-artwork.git;visit=swh:1:snp:d5ca15e25a2fb9df3888dfbb0c89daacf29bcb04;anchor=swh:1:rev:71bfafb276096bcbe8bd84e2db27279e33451161;path=/website/apps/packages/builder.scm;lines=99
3: https://archive.softwareheritage.org/swh:1:cnt:d459ba8cf12dbd1b934f923a03288e4681df596d;origin=https://git.savannah.gnu.org/git/guix.git;visit=swh:1:snp:045d2f1b56a871f43afce5eedafaae7f910ff5d7;anchor=swh:1:rev:eb52b240eb58627cc76a03244c45502c4ec3c50e;path=/guix/download.scm;lines=53
4: https://archive.softwareheritage.org/browse/content/sha1_git:4c08ef609548f398f9fc14a622c4ccf0013fa839/?origin_url=https://git.savannah.gnu.org/git/guix.git&path=gnu/packages/gnuzilla.scm#L424-L425
5: https://archive.softwareheritage.org/swh:1:cnt:d8f1f6912e6273dff53e76e75f4e29b555e4dd90;origin=https://git.savannah.gnu.org/git/guix.git;visit=swh:1:snp:045d2f1b56a871f43afce5eedafaae7f910ff5d7;anchor=swh:1:rev:eb52b240eb58627cc76a03244c45502c4ec3c50e;path=/gnu/packages/linux.scm;lines=238-245

Thanks for all that ^! And great pointers!

About mirrors, we maintain a list [3] and it is expanded (resolved) when sources.json
is created.

Great, so @vlorentz, after discussing those mirror URLs ^ a bit more with zimoun,
we should be able to avoid storing all of those tarball URLs as origins. The first one
should be considered the canonical URL, and the others fallbacks to retrieve the same
tarball if the canonical one goes 404.

That's pretty good news: fewer origins to create.
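
Concretely, the loader could then try the urls in order instead of the lister registering each mirror as a separate origin. A rough sketch (the retry policy is made up; only the canonical-first ordering comes from the sources.json "urls" lists quoted in [1] above):

import requests

def fetch_with_fallbacks(urls, timeout=30):
    # urls is a manifest entry's "urls" list: the first is canonical,
    # the rest are mirrors serving the same tarball.
    last_error = None
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.content
        except requests.RequestException as exc:
            last_error = exc  # e.g. 404 on the canonical url: try a mirror
    raise last_error if last_error else ValueError("empty url list")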

Hum, for the 7 false, I have to check. For the 88 packages with no-origin, it is more
annoying. Well, some are metapackages, such as gcc-toolchain, so they can be skipped.
Is it ok for you if we keep this 'no-origin' type? For the others, I have to check
whether they are covered elsewhere.

We'll skip them during the listing, so it's fine if they stay in the json file.
At some point, we'll adapt if we find a more suitable way to deal with them.

ardumont updated the task description.