Page MenuHomeSoftware Heritage

scanner: exclusion list through glob patterns
ClosedPublic

Authored by DanSeraf on Apr 9 2020, 1:05 PM.

Details

Summary

Implementation of a directory filter for the scanner

Related: T2336

Diff Detail

Repository
rDTSCN Code scanner
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

DanSeraf created this revision.Apr 9 2020, 1:05 PM

Build is green

Patch application report for D2989 (id=10628)

Rebasing onto 62be9f10d4...

Current branch diff-target is up to date.
Changes applied before test
commit e6adcb9642f9ca7609bbb607cd1331e5a164fd9a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Apr 9 12:50:19 2020 +0200

    exclusion list through glob patterns

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/5/ for more details.

vlorentz requested changes to this revision.Apr 9 2020, 3:46 PM
vlorentz added a subscriber: vlorentz.

Instead of building the set of all paths to exclude (possibly very large), I think you should pass patterns to directory_filter and use [[ https://docs.python.org/3/library/fnmatch.html | fnmatch ]] there

swh/scanner/cli.py
36

patterns: List[str]

46

elif root_path not in dirpath.parents:

swh/scanner/scanner.py
82

len(set(path.parents).intersection(exclude_paths)) != 0 is equivalent to set(path.parents).intersection(exclude_paths) when coerced to bool.

(also, shorter version if you prefer: set(path.parents) & exclude_paths)

101

Jenkins reports this code is never ran by the tests

101–102

Instead of duplicating the code to call Directory.from_disk, you could define dir_filter like this:

if exclude_paths:
    def dir_filter(dirpath, *args):
        return directory_filter(dirpath, exclude_paths)
else:
    dir_filter = None

(mypy might pout, but adding type annotations should work)

101–102

should just be else, to handle non-regular files.

102–104

same here

102–118

Does this really need to be a separate case? Doesn't Directory.from_disk handle roots that are a file?

126–129

exclude_paths: Set[PosixPath]

157

Avoid mutables as default value. It's fine here; but it's a bad habit (it's instantiated on function declaration, not on function call; so mutating it once affects all future calls)

swh/scanner/tests/test_scanner.py
57–65

this test doesn't fail if get_subpaths returns nothing, which is what happens here.

You should assign the return value of get_subpaths, check the length, then iterate on it.

This revision now requires changes to proceed.Apr 9 2020, 3:46 PM
DanSeraf added inline comments.Apr 9 2020, 4:18 PM
swh/scanner/scanner.py
102–118

Yes, swh-model is currently using Content.from_disk to handle files.

Instead of building the set of all paths to exclude (possibly very large), I think you should pass patterns to directory_filter and use [[ https://docs.python.org/3/library/fnmatch.html | fnmatch ]] there

Yeah, it seems better!

I did not follow at all the diff. Still, i think you need a bump on swh-model to actually use D2960.
If that's the case, update also the requirements-swh.txt to v0.0.65 (maybe not immediately though, it's currently building ;).

Cheers,

(maybe not immediately though, it's currently building ;).

it's built so it's fine now ;)

(maybe not immediately though, it's currently building ;).

it's built so it's fine now ;)

Perfect! thanks :)

DanSeraf updated this revision to Diff 10682.Apr 10 2020, 3:51 PM
  • requirements-swh: swh.model version >= 0.0.65
  • match pattern with fnmatch
  • tests adapted to accept exclude patterns

Build is green

Patch application report for D2989 (id=10682)

Rebasing onto 62be9f10d4...

Current branch diff-target is up to date.
Changes applied before test
commit 33e2334ea68e66d5150ea4204438bae1bc3eed00
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Apr 10 15:49:48 2020 +0200

    requirements-swh: swh.model version >= 0.0.65

commit fac0715e9d58117549e4e0cfcf72cbf6ed3d4a04
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Apr 9 12:50:19 2020 +0200

    exclusion list through glob patterns

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/6/ for more details.

vlorentz requested changes to this revision.Apr 14 2020, 3:08 PM
vlorentz added inline comments.
swh/scanner/cli.py
36–50

Could you write test(s) for this function?

92

You could compile patterns here, so fnmatch doesn't need to recompile for each call to directory_filter.

This revision now requires changes to proceed.Apr 14 2020, 3:08 PM
DanSeraf added inline comments.Apr 14 2020, 4:36 PM
swh/scanner/cli.py
92

Do you mean compiling them using the regex of each pattern? I could create a set of regex objects and use them in the scanner to match the paths.

DanSeraf updated this revision to Diff 10746.Apr 16 2020, 5:16 PM

requested changes

Build is green

Patch application report for D2989 (id=10746)

Rebasing onto 62be9f10d4...

Current branch diff-target is up to date.
Changes applied before test
commit f19bda6c2c1dcde706cca328b6d14ec996cacaad
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Apr 10 15:49:48 2020 +0200

    requirements-swh: swh.model version >= 0.0.64

commit e9b593d3472b52f314b87f961cb8c30edb2fefa3
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Apr 9 12:50:19 2020 +0200

    exclusion list through glob patterns

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/7/ for more details.

vlorentz accepted this revision.Apr 17 2020, 10:51 AM
vlorentz added inline comments.
swh/scanner/cli.py
92

yes (sorry for the late reply)

This revision is now accepted and ready to land.Apr 17 2020, 10:51 AM