scanner: add support for an exclusion list
Closed, Migrated

Description

When scanning local code, there are often files and directories that we want to exclude because scanning them makes no sense.
Common offenders are things like .git and .tox directories, but also node_modules/ and a bunch of others, depending on the case.

Hence we want to have:

  • the ability to specify exclusion patterns on the command line, e.g., with a -x/--exclude option; it should be possible to pass it multiple times
  • the ability to specify exclusion patterns in the configuration file of swh-scanner
  • some sensible defaults
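
As a rough illustration of how the three sources of patterns could fit together, here is a minimal sketch. It uses argparse purely for illustration; the parser name, DEFAULT_EXCLUDES, and effective_excludes are hypothetical names and not part of swh-scanner, and the default list is only a guess based on the offenders mentioned above.

```python
import argparse

# Hypothetical defaults; the task does not fix the actual list.
DEFAULT_EXCLUDES = [".git", ".tox", "node_modules"]


def build_parser() -> argparse.ArgumentParser:
    """Build a toy command-line parser with a repeatable -x/--exclude option."""
    parser = argparse.ArgumentParser(prog="scan-sketch")
    parser.add_argument("root_path")
    parser.add_argument(
        "-x", "--exclude",
        action="append",  # each occurrence appends one pattern to the list
        default=[],
        metavar="PATTERN",
        help="exclusion pattern; may be given multiple times",
    )
    return parser


def effective_excludes(cli_patterns, config_patterns=()):
    """Merge defaults, configuration-file patterns and command-line patterns.

    Order is preserved and duplicates are dropped.
    """
    merged = []
    for pattern in [*DEFAULT_EXCLUDES, *config_patterns, *cli_patterns]:
        if pattern not in merged:
            merged.append(pattern)
    return merged
```

Whether command-line patterns should extend or replace the defaults is a design choice this sketch does not settle; it simply extends them.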

The syntax and semantics of exclusion patterns are to be defined. Glob patterns might be an option. We need something that is expressive, common, and that we do not need to implement by hand.
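
If glob patterns were chosen, Python's standard fnmatch module would be one way to avoid implementing the matching by hand. The sketch below is only one possible semantics (a path is excluded if the whole path or any of its components matches a pattern); the function name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_excluded(rel_path: str, patterns) -> bool:
    """Return True if rel_path, or any single component of it, matches a pattern.

    rel_path is assumed to be a "/"-separated path relative to the scan root.
    """
    candidates = [rel_path, *PurePosixPath(rel_path).parts]
    return any(fnmatchcase(c, p) for c in candidates for p in patterns)


print(is_excluded("src/.git/config", [".git"]))   # component ".git" matches
print(is_excluded("src/main.py", [".git", "*.pyc"]))
```

Note that in fnmatch the `*` wildcard also matches the path separator, which differs from shell globbing and from .gitignore rules, so the exact semantics would still need to be decided.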

Event Timeline

zack triaged this task as Normal priority. Mar 25 2020, 11:07 AM
zack created this task.

The problem here is that swh-model is actually calculating the persistent identifier of the input path, so it also has to exclude the specified paths.

Yes, implementing this for the scanner will probably entail making the swh.model stuff you're using parametric on an optional exclusion list. And that's good, as it might be useful in contexts other than the scanner.
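
To make the idea of "parametric on an optional exclusion list" concrete, here is a minimal sketch of a directory traversal that takes such a list. It does not use or reproduce any swh.model API; iter_files and its exclude_patterns parameter are hypothetical, and the identifier computation itself is left out. The point is only that excluded entries are pruned during the walk, so whatever computes identifiers on top of it never sees them.

```python
import os
from fnmatch import fnmatchcase


def iter_files(root, exclude_patterns=()):
    """Yield file paths under root (relative to root), skipping excluded names.

    With the default empty exclusion list, every file is yielded.
    """
    def excluded(name):
        return any(fnmatchcase(name, p) for p in exclude_patterns)

    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if not excluded(d))
        for name in sorted(filenames):
            if not excluded(name):
                yield os.path.relpath(os.path.join(dirpath, name), root)
```

For example, `list(iter_files(".", [".git", "node_modules"]))` would list the files of the current directory while never entering .git or node_modules.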