Page MenuHomeSoftware Heritage

nixguix: Filter out unsupported artifact extensions
ClosedPublic

Authored by ardumont on Aug 7 2020, 5:44 PM.

Details

Summary

Instead of trying to download artifacts and then fail on ingestion. Try to
prevent spurious downloads which cost too many wasted resources (both upstream
servers and our own infra) and probably a high number of noisy issues in
sentry.

Ideally, we should improve our archive support. In the mean time, this approach
allows more sensible treatment than the current approach (see previous points).

This will also allow to incrementally add support to currently unauthorized known
unsupported pattern by removing them (and update the test scenario about it).

Related to T2510

Test Plan

tox

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D3742 (id=13181)

Rebasing onto 57d3e372fa...

Current branch diff-target is up to date.
Changes applied before test
commit dd6ede38ee4bab346838b49a790c344b8f87dab0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Aug 7 17:39:42 2020 +0200

    nixguix: Filter out unsupported artifact extensions instead
    
    Related to T2510

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/264/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/264/console

ardumont retitled this revision from wip: nixguix: Filter out unsupported artifact extensions instead to wip: nixguix: Filter out unsupported artifact extensions.
vlorentz requested changes to this revision.Aug 7 2020, 5:55 PM
vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/loader/package/nixguix/loader.py
217

forgot tbz2

293–296

let's be consistent with string prefixes.

and spaces belong at the end of lines, not the beginning

swh/loader/package/nixguix/tests/test_nixguix.py
175

missing tbz2

185

meh, I think we should support .patch eventually so it's not a great example.

.whl and .iso would be better examples.

but that's not a big deal

This revision now requires changes to proceed.Aug 7 2020, 5:55 PM
swh/loader/package/nixguix/loader.py
217

and .7z and .Z, etc. Look at swh-lister/swh/lister/gnu/tree.py

But I take you are fine with the approach then?

(Thanks for the remarks, i'll fix those)

swh/loader/package/nixguix/loader.py
293–296

yeah, i have a tendency to split "a l'arrache", sorry and forget to check back...

swh/loader/package/nixguix/tests/test_nixguix.py
185

yes, agreed.
In the mean time, it's nice it's here.
That will fail when we'll add support and forgot to update it.

  • Adapt according to review
  • rework commit message to add description rationale
  • Remains loader integration tests to adapt (loader status and hashes most likely)

Build has FAILED

Patch application report for D3742 (id=13184)

Rebasing onto 57d3e372fa...

Current branch diff-target is up to date.
Changes applied before test
commit 1fb003ff2dce1b953bba653f9e1134ce6a2e4880
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Aug 7 17:39:42 2020 +0200

    nixguix: Filter out unsupported artifact extensions
    
    Instead of trying to download artifacts and then break on ingestion. Try to
    prevent spurious downloads which cost too many wasted resources (both upstream
    servers and our own infra) and many noisy issues in sentry.
    
    Ideally, we should improve our archive support. In the mean time, this approach
    allows more sensible treatment than the current approach (see previous points).
    This will also allows to incrementally add support, and demonstrating it by
    removing the new supported archive pattern in the current regexp in next
    commits.
    
    Related to T2510

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/265/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/265/console

But I take you are fine with the approach then?

(Thanks for the remarks, i'll fix those)

Yes of course.

Although... now I wonder if we shouldn't go with a blacklist rather than a whitelist. This way when new types appear, we have a chance to get an alert instead of silently ignoring them

Change the unsupported approach to unauthorize some known patterns of artifacts we do not support.

This makes the previous failed tests pass again because we don't have any
of those unauthorized pattern in our test dataset.

Build has FAILED

Patch application report for D3742 (id=13185)

Rebasing onto 57d3e372fa...

Current branch diff-target is up to date.
Changes applied before test
commit b2d716d6316a88f807483528a38ff23ba386d9ad
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Aug 7 17:39:42 2020 +0200

    nixguix: Filter out unsupported artifact extensions
    
    Instead of trying to download artifacts and then break on ingestion. Try to
    prevent spurious downloads which cost too many wasted resources (both upstream
    servers and our own infra) and many noisy issues in sentry.
    
    Ideally, we should improve our archive support. In the mean time, this approach
    allows more sensible treatment than the current approach (see previous points).
    This will also allows to incrementally add support, and demonstrating it by
    removing the new supported archive pattern in the current regexp in next
    commits.
    
    Related to T2510

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/266/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/266/console

ardumont edited the test plan for this revision. (Show Details)
ardumont retitled this revision from wip: nixguix: Filter out unsupported artifact extensions to nixguix: Filter out unsupported artifact extensions.Aug 7 2020, 9:37 PM
ardumont edited the summary of this revision. (Show Details)

Build has failed.

for unrelated reason, the main master build is broken

[1] https://jenkins.softwareheritage.org/job/DLDBASE/job/tests/1035/console

vlorentz added inline comments.
swh/loader/package/nixguix/loader.py
219

Could you just add a \. before the list of extensions?

This revision is now accepted and ready to land.Aug 7 2020, 9:40 PM
swh/loader/package/nixguix/loader.py
219

just before the parenthesis as in:

r".*\.(iso|whl|gem|pom|msi|pod|png|rock|ttf|jar|c|rpm|diff|patch)$", re.

right?

swh/loader/package/nixguix/loader.py
219

yes

Adapt according to review
Rebase on D3744 so build is green ;)

swh/loader/package/nixguix/loader.py
219

on its way ;)

Build is green

Patch application report for D3742 (id=13188)

Could not rebase; Attempt merge onto 57d3e372fa...

Updating 57d3e37..fc9fe58
Fast-forward
 requirements-swh.txt                             |  2 +-
 swh/loader/package/nixguix/loader.py             | 46 ++++++++++++---
 swh/loader/package/nixguix/tests/test_nixguix.py | 75 +++++++++++++++++++++---
 swh/loader/tests/__init__.py                     | 22 +++----
 4 files changed, 115 insertions(+), 30 deletions(-)
Changes applied before test
commit fc9fe58d6572071643308923781d1dab5b242e3f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Aug 7 17:39:42 2020 +0200

    nixguix: Filter out unsupported artifact extensions
    
    Instead of trying to download artifacts and then break on ingestion. Try to
    prevent spurious downloads which cost too many wasted resources (both upstream
    servers and our own infra) and many noisy issues in sentry.
    
    Ideally, we should improve our archive support. In the mean time, this approach
    allows more sensible treatment than the current approach (see previous points).
    This will also allows to incrementally add support, and demonstrating it by
    removing the new supported archive pattern in the current regexp in next
    commits.
    
    Related to T2510

commit d9e4365f6f2998354c3128f1a69e014da8976ed8
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Aug 7 21:45:15 2020 +0200

    swh.loader.tests: Use snapshot_get_all_branches in check_snapshot
    
    Fixes build [1]
    
    [1] https://jenkins.softwareheritage.org/job/DLDBASE/job/tests/1035/console

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/268/ for more details.

swh/loader/package/nixguix/loader.py
218

We can add 'el' extensions (emacs-lisp) to the list.
Well, i gather all kind of textual files for now ;)