Page MenuHomeSoftware Heritage

Add nixguix lister
ClosedPublic

Authored by ardumont on Aug 30 2022, 11:19 AM.

Details

Summary

Implementation as per the notes in the linked task (up to the hedgedoc linked in the task).

Downstream loader adaptations required to be able to ingest the output of the lister are tracked in the following diffs:

  • D8581: Create ContentLoader(BaseLoader) to deal with ListedOrigins with "file" visit_type
  • D8584: Create DirectoryLoader(BaseLoader) to deal with "integrity" field (with or without version)
  • run through docker to lift papercuts [1] [2] [3]

[1] guix

swh-lister_1                         | [2022-10-03 08:05:49,405: INFO/ForkPoolWorker-1] Task swh.lister.nixguix.tasks.NixGuixListerTask[f58096ad-af9f-42fa-bc29-e4791f1a24e3] succeeded in 557.3408025280223s: {'pages': 21483, 'origins': 18936}

[2] P1467

[3] nixpkgs

swh-lister_1                         | [2022-10-03 15:36:38,225: INFO/ForkPoolWorker-1] Task swh.lister.nixguix.tasks.NixGuixListerTask[b442f750-797d-4df8-af0e-a5426a669462] succeeded in 177.8664992809645s: {'pages': 31285, 'origins': 31218}

Related to T3781

Diff Detail

Repository
rDLS Listers
Branch
add-nixguix-lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32041
Build 50160: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 50159: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Rebase (this includes a work around to fix master build)

Build has FAILED

Patch application report for D8341 (id=31028)

Rebasing onto 6f40d2c1a5...

Current branch diff-target is up to date.
Changes applied before test
commit 807f554342229ead766b35a13e438ffcad39ea0a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Add a small lister introduction
    
    Related to T3781

commit 89167cddfab2c5c5e2087a6b2f0a48ac6bb5f96a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 7157bb521cb56793dff7840e1df84f3583034e46
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:06:58 2022 +0200

    nixguix: Transform integrity field into a checksums dict
    
    This allows to make Content/Directory loader reusable for other listers. This also
    centralizes the integrity field computation lister side.
    
    Related to T3781

commit 9ffc9e840f89adff4b41cae2449f195d93155776
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 23 15:34:39 2022 +0200

    nixguix: Keep only one url as origin and other urls as fallback ones
    
    This is for the "url" typed artifacts which are either tarballs or plain text files.
    
    Related to T3781

commit 3b93228c90103b27a2601e3a244e3befa01f1a56
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Boostrap lister
    
    It's an incomplete implementation for now.
    
    Related to T3781

commit e832d74c9843fb363acff2a94c2ba38ed7707063
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 2 19:43:24 2022 +0200

    requirements: Work around build failure
    
    A new release made our job failing.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/723/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/723/console

Build is green

Patch application report for D8341 (id=31029)

Rebasing onto 6f40d2c1a5...

Current branch diff-target is up to date.
Changes applied before test
commit 6fc011b206aacc28bf8b0b01852b7ad1aae7fbbc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 3035fe57d0d229a1028bd7139c21f72a39c03c89
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 7157bb521cb56793dff7840e1df84f3583034e46
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:06:58 2022 +0200

    nixguix: Transform integrity field into a checksums dict
    
    This allows to make Content/Directory loader reusable for other listers. This also
    centralizes the integrity field computation lister side.
    
    Related to T3781

commit 9ffc9e840f89adff4b41cae2449f195d93155776
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 23 15:34:39 2022 +0200

    nixguix: Keep only one url as origin and other urls as fallback ones
    
    This is for the "url" typed artifacts which are either tarballs or plain text files.
    
    Related to T3781

commit 3b93228c90103b27a2601e3a244e3befa01f1a56
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Boostrap lister
    
    It's an incomplete implementation for now.
    
    Related to T3781

commit e832d74c9843fb363acff2a94c2ba38ed7707063
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Oct 2 19:43:24 2022 +0200

    requirements: Work around build failure
    
    A new release made our job failing.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/724/ for more details.

vlorentz requested changes to this revision.Oct 3 2022, 9:52 AM
vlorentz added inline comments.
requirements.txt
10 ↗(On Diff #31029)

could you add a comment to explain this?

swh/lister/__init__.py
31–32

please us docstrings to comment public constants

swh/lister/nixguix/__init__.py
13–15
swh/lister/nixguix/lister.py
74

use an enum instead of str. (or typing.Literal if it really needs to be a string)

77

svn is not a DVCS, rename it to SUPPORTED_VCS

127
141

Could you make it a required argument, to avoid accidental misconfiguration?

194

I know it's a noop on non-GitHub URLs, but I find it meh to call this unconditionally on non-GitHub URLs; because seemingly innocuous changes to GithubSession.get_canonical_url's code would make us leak our API tokens to random websites.

Heh, I guess that's fine.

210–212
233–234

vcs, too

This revision now requires changes to proceed.Oct 3 2022, 9:52 AM
ardumont added inline comments.
requirements.txt
10 ↗(On Diff #31029)

yes, it's workaround in a dedicated commit (so it's explained there).
Currently all our builds (using entrypoints) are broken due to this (including this diff [1]).

[1] I got lazy about doing another diff and build this one on it.

swh/lister/__init__.py
44

sure.

fwiw, i've noticed, it's missing crate and gem as well.

swh/lister/nixguix/lister.py
141

yes

194

It's up to reviews to block innocuous change that would do what you said.

ardumont planned changes to this revision.EditedOct 3 2022, 10:21 AM

Docker run made apparent that determining the urls' nature is not so simple...
I'll amend accordingly and take into account review suggestions.

Rebase and adapt according to:

  • reviews
  • docker runs [1]

[1] this added some convoluted contraptions to detect if an url is a tarball or not.
It's not well covered by tests yet. I'd like some feedback on how to better support
this. See [2] for an excerpt of the current runs.

[2] P1467

ardumont edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D8341 (id=31055)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit 0599051e4c8d842c32e92fbfccc9035fb6a4c8de
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 8485106fc40e199c04b51775acb3334a84b7acbd
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 697bdb777c8f877b478ade0ba10509c8c3a2625e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/725/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/725/console

Build is green

Patch application report for D8341 (id=31069)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit 31f51cde9b32067909d5dbf7154235a484b89108
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 6c63d7cf27b39609dbf8dc12c6ec894d4ecffe7e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 147fca73d0021ab71d526a47f38d5414c08f9f52
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/726/ for more details.

anlambert added inline comments.
swh/lister/nixguix/lister.py
95–97

Maybe you could update and reuse tarball mimetypes declared in swh.core.tarball instead ?

230

you can remove that line, session is already initialized by the base Lister class.

245

allow_redirects is set to True by default for GET requests, you can remove that parameter.

swh/lister/nixguix/lister.py
141

actually it got dropped as it was a misnamed "instance" equivalent...

ardumont added inline comments.
swh/lister/nixguix/lister.py
95–97

yeah, that'd make sense, can we iterate over this in another diff?

That'd let me the time to open those mimetypes in swh.core.tarball (and release) so we can use a non private variable.

217

fwiw, I've really opened this only for docker because i got rate-limited almost immediately otherwise...

230

ack

245

right

310

ok (missing coverage), i can add that part as well...

ardumont marked an inline comment as done.

Adapt according to review (except for reusing the mimetype from swh.core.tarballs, i'll
do that in another iteration if you don't mind)

Build is green

Patch application report for D8341 (id=31071)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit 0bf641449ef2b0f34ad80233d1dfa4ba67366b8b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit df4b9f033c5744485e0f4d51d6882f221844b697
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 0342f8f9bf3ae180410c79afb4e3f0d9c7316613
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/727/ for more details.

swh/lister/nixguix/lister.py
95–97

Opened [1] in swh.core

[1] D8603

  • Adapt last review comment (reuse swh.core's mimetype dictionary)
  • Format warning log
  • Decrease log verbosity for one debug instruction (it was in info for test purposes)
  • Skip artifacts with missing integrity field too (nixpkgs dataset has some)
ardumont added inline comments.
swh/lister/nixguix/lister.py
95–97

I've amended that lister diff to do that immediatly in the end (the dev/diff/review/package got fast ;)

Build is green

Patch application report for D8341 (id=31074)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit c37b02028beede7ed41302ea60bad7c7be7d735f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 25e63dce66ebe6c562a824ced18219a34fa2350c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit f35f40638e3690a34ef5db83befac8d0698f136c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/728/ for more details.

Fix missing integrity field (this makes for a redundant check in tests, adding
missing integrity to actual test correctly this entry)

Use proper range of commits to update the diff

Build has FAILED

Patch application report for D8341 (id=31076)

Could not rebase; Attempt merge onto fa1205c4df...

Updating fa1205c..9d74319
Fast-forward
 requirements-swh.txt                               |   2 +-
 setup.py                                           |   1 +
 swh/lister/__init__.py                             |  22 ++
 swh/lister/gnu/tree.py                             |  21 +-
 swh/lister/nixguix/__init__.py                     |  38 +++
 swh/lister/nixguix/lister.py                       | 370 +++++++++++++++++++++
 swh/lister/nixguix/tasks.py                        |  14 +
 swh/lister/nixguix/tests/__init__.py               |   0
 .../nixguix/tests/data/guix-swh_sources.json       |  19 ++
 .../nixguix/tests/data/nixpkgs-swh_sources.json    |  52 +++
 swh/lister/nixguix/tests/test_lister.py            | 238 +++++++++++++
 swh/lister/nixguix/tests/test_tasks.py             |  27 ++
 swh/lister/tests/test_cli.py                       |   4 +
 13 files changed, 790 insertions(+), 18 deletions(-)
 create mode 100644 swh/lister/nixguix/__init__.py
 create mode 100644 swh/lister/nixguix/lister.py
 create mode 100644 swh/lister/nixguix/tasks.py
 create mode 100644 swh/lister/nixguix/tests/__init__.py
 create mode 100644 swh/lister/nixguix/tests/data/guix-swh_sources.json
 create mode 100644 swh/lister/nixguix/tests/data/nixpkgs-swh_sources.json
 create mode 100644 swh/lister/nixguix/tests/test_lister.py
 create mode 100644 swh/lister/nixguix/tests/test_tasks.py
Changes applied before test
commit 9d74319009813f67ee283cc3ab0375f882e4434e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit c4eed155c8c8fb4a035ebe1a1c83b48a9ece4b41
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 385dfdcd9c4e9413249e906fc6c3a2900b63b9e9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/729/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/729/console

Build has FAILED

Patch application report for D8341 (id=31077)

Could not rebase; Attempt merge onto fa1205c4df...

Updating fa1205c..9d74319
Fast-forward
 requirements-swh.txt                               |   2 +-
 setup.py                                           |   1 +
 swh/lister/__init__.py                             |  22 ++
 swh/lister/gnu/tree.py                             |  21 +-
 swh/lister/nixguix/__init__.py                     |  38 +++
 swh/lister/nixguix/lister.py                       | 370 +++++++++++++++++++++
 swh/lister/nixguix/tasks.py                        |  14 +
 swh/lister/nixguix/tests/__init__.py               |   0
 .../nixguix/tests/data/guix-swh_sources.json       |  19 ++
 .../nixguix/tests/data/nixpkgs-swh_sources.json    |  52 +++
 swh/lister/nixguix/tests/test_lister.py            | 238 +++++++++++++
 swh/lister/nixguix/tests/test_tasks.py             |  27 ++
 swh/lister/tests/test_cli.py                       |   4 +
 13 files changed, 790 insertions(+), 18 deletions(-)
 create mode 100644 swh/lister/nixguix/__init__.py
 create mode 100644 swh/lister/nixguix/lister.py
 create mode 100644 swh/lister/nixguix/tasks.py
 create mode 100644 swh/lister/nixguix/tests/__init__.py
 create mode 100644 swh/lister/nixguix/tests/data/guix-swh_sources.json
 create mode 100644 swh/lister/nixguix/tests/data/nixpkgs-swh_sources.json
 create mode 100644 swh/lister/nixguix/tests/test_lister.py
 create mode 100644 swh/lister/nixguix/tests/test_tasks.py
Changes applied before test
commit 9d74319009813f67ee283cc3ab0375f882e4434e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit c4eed155c8c8fb4a035ebe1a1c83b48a9ece4b41
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 385dfdcd9c4e9413249e906fc6c3a2900b63b9e9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/730/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/730/console

zut, forgot to update the mock setup ¯\_(ツ)_/¯

Build is green

Patch application report for D8341 (id=31078)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit 87a0542dd086e161196a6e6a7358c9cf18faa549
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 2d6e216366b785ff7f21e986430d7ea3e9f3732c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit 3aa0387968c8d77ea63019c9e5e901899bb4d2b3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/731/ for more details.

Use correct visit_type {content, directory} in the listed origins outputted from the
lister.

So they get a chance to get sheduled for ingestion.

Build was aborted

Patch application report for D8341 (id=31079)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit c0ecaa8af78095be364f59dc927bab087e4abf77
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 17365e04f3cd4533c2dbabc8bc120427ca639489
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit cc96f58627ba5e9ef760f28e36cb4c1813a6d563
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/732/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/732/console

Fix documentation references and tests according to latest change in visit_type

Build is green

Patch application report for D8341 (id=31080)

Rebasing onto fa1205c4df...

Current branch diff-target is up to date.
Changes applied before test
commit 94b6dbea0a7f602be0711a3bb1f9bb9e16fc48ce
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:41:48 2022 +0200

    nixguix: Document lister
    
    Related to T3781

commit 6d2e7aa17808e39ba9f493b65d662d0ddef5796c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sat Oct 1 16:12:46 2022 +0200

    nixguix: Register task
    
    Related to T3781

commit fbfdf88ea4fe79c4846ecd48f2a1322f5d3995fc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Aug 30 11:17:33 2022 +0200

    nixguix: Add lister
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/733/ for more details.

Looks good to me, better fixing issues that might araise when testing the lister in new diffs.

This revision is now accepted and ready to land.Oct 4 2022, 2:26 PM
This revision was automatically updated to reflect the committed changes.