nixguix: Filter out unsupported artifacts from ingestion
Closed, Resolved (Public)

Description

As [1] revealed, loading unsupported artifacts without any filtering increases
the ingestion time for no benefit.

This is wasteful from a resource standpoint, both for the mirror we download
the artifacts from and for our own infra. So for the time being, we might as
well filter out unsupported artifacts (most likely based on their extension).

The aim is to keep that list short and short-lived: it should shrink as
artifact support improves.

I recall @lewo should have a fair subset of those unsupported extensions, since
he filtered them out of the nix listing ([2] should help).

[1] https://forge.softwareheritage.org/T2485#46361

[2] https://nix-community.github.io/nixpkgs-swh/#by-file-types

Event Timeline

ardumont triaged this task as Normal priority. Sun, Jul 26, 6:23 AM
ardumont created this task.
ardumont updated the task description.
lewo added a comment. Sun, Aug 2, 9:27 PM

I'm currently using the following regex to filter the exposed URLs, but I'm pretty sure it could be improved: .tar.gz$|.zip$|tar.bz2$|.tbz$|.tar.xz$|.tgz$|.tar$
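
One possible improvement, as a minimal sketch rather than what the lister or loader actually does: in the regex quoted in this comment the dots are unescaped, so each one matches any character, and the tar.bz2 alternative has no leading dot at all. The Python snippet below escapes the dots and factors out the end anchor. It assumes the regex is meant to select URLs to keep. The names SUPPORTED_ARCHIVE_RE, is_supported_artifact and filter_supported are hypothetical, and the example.org URLs are placeholders.

```python
import re

# Hypothetical pattern: same extension list as the regex in the comment,
# with literal dots escaped and one shared end-of-string anchor.
SUPPORTED_ARCHIVE_RE = re.compile(
    r"\.(tar\.gz|zip|tar\.bz2|tbz|tar\.xz|tgz|tar)$"
)


def is_supported_artifact(url: str) -> bool:
    """Return True if the URL ends with one of the listed archive extensions."""
    return SUPPORTED_ARCHIVE_RE.search(url) is not None


def filter_supported(urls):
    """Keep only the URLs whose extension is in the list (order preserved)."""
    return [url for url in urls if is_supported_artifact(url)]


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    candidates = [
        "https://example.org/foo-1.0.tar.gz",
        "https://example.org/foo-1.0.patch",
        "https://example.org/bar.zip",
    ]
    for url in filter_supported(candidates):
        print(url)
```

Note that this sketch only looks at the end of the URL string, so a URL carrying a query string after the file name would not match; whether that case matters here is not something this thread settles.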

ardumont closed this task as Resolved. Fri, Aug 7, 11:51 PM
ardumont claimed this task.