Page MenuHomeSoftware Heritage

swh-model: exclude path support for the function from_disk in Directory
ClosedPublic

Authored by DanSeraf on Apr 6 2020, 4:10 PM.

Details

Summary

exclude specific path when generating Software Heritage objects

Since swh-scanner needs to implement an exclusion list, when it creates an object from the Directory class in swh.model.from_disk, it will pass a set of PosixPath to be excluded (this set will be generated from the cli with the glob module). The function from_disk through the function ignore_path will check if the parent of a directory is present in the set.

Related T2336

Diff Detail

Repository
rDMOD Data model
Branch
exclusion-list
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11658
Build 17684: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 17683: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D2960 (id=10542)

Rebasing onto c7c1a571e5...

Current branch diff-target is up to date.
Changes applied before test
commit 4e1f94ae209426b1e375576f397c23bc86665f54
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Mon Apr 6 15:55:07 2020 +0200

    exclude path support for the function from_disk in Directory

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/33/ for more details.

ardumont added inline comments.
swh/model/from_disk.py
245

type?

def ignore_path(path: PosixPath, Set[PosixPath]) -> bool:
248
if not exclude_paths:  # None or {} should be treated the same.
252

directly return the conditional

return set(path.parents).intersection...
270

maybe make that a default empty set instead, that'd avoid to have to define Optional[Set[PosixPath]]
and directly have Set[PosixPath].

277

change the method's signature to include types.

Since you are modifying it and you know the types now ;)

Regarding the docstring, you can then remove the defined types next to the variable names (since it's already in the signature of the method).

swh/model/from_disk.py
1

copyright update ;)

270

wondering if the ignore_path should not be implemented with the dir_filter callable now.

swh/model/tests/test_from_disk.py
867 ↗(On Diff #10542)

That's a simple case.
Adding another case where the path is deeper should be added.
To ensure it works as expected.

vlorentz added inline comments.
swh/model/from_disk.py
270

Yes, I think it should.

ardumont requested changes to this revision.Apr 6 2020, 5:40 PM

I think in the end, this should be implemented with the dir_filter parameter instead.

Please, check that you can do this. If you can, this will, there is no need to change the current implementation.
And instead, either:

  • not touch swh.model at all, then define your ignore_path function in swh-scanner and use it when calling from_disk.
  • if you think, this can be a shared behavior (i guess it can), define a function callable here in swh-model (like accept_all_directories and the other ones next to it).

In both cases, this won't touch Directory.from_disk though.

This revision now requires changes to proceed.Apr 6 2020, 5:40 PM

I think in the end, this should be implemented with the dir_filter parameter instead.

Please, check that you can do this. If you can, this will, there is no need to change the current implementation.
And instead, either:

  • not touch swh.model at all, then define your ignore_path function in swh-scanner and use it when calling from_disk.
  • if you think, this can be a shared behavior (i guess it can), define a function callable here in swh-model (like accept_all_directories and the other ones next to it).

In both cases, this won't touch Directory.from_disk though.

I can define a dir_filter function but how could i get the values i need, like the full path generated inside from_disk, without touching the implementation?

I can define a dir_filter function but how could i get the values i need, like the full path generated inside from_disk, without touching the implementation?

I'm not sure i understand the question.
from_disk has a save_path parameter. If set to True, the paths are stored alongside the output result.
Isn't that enough?

Or do you mean something else?
What are the values you need?

I can define a dir_filter function but how could i get the values i need, like the full path generated inside from_disk, without touching the implementation?

I'm not sure i understand the question.
from_disk has a save_path parameter. If set to True, the paths are stored alongside the output result.
Isn't that enough?

It seems that the save_path parameter is never used inside the implementation.

Or do you mean something else?
What are the values you need?

It would be nice to have the path created inside the for loop, the dir_filter function only handles at the moment the name of the directory and its relative entries.

Indeed. Then it would make sense to also give the full path to dir_filter, instead of adding a redundant attribute.

There aren't that many users of dir_filters (are there any at all outside this package?) so it shouldn't be too hard

Build has FAILED

Patch application report for D2960 (id=10571)

Rebasing onto c7c1a571e5...

Current branch diff-target is up to date.
Changes applied before test
commit ffdc6d58e44c8ab2a431c8982a7419b598bf2eaf
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Apr 7 17:28:09 2020 +0200

    from_disk: path parameter to dir_filter functions

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/34/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/34/console

I did not reply as val did (and i agreed with him ;)

Indeed. Then it would make sense to also give the full path to dir_filter,
instead of adding a redundant attribute.

Yes.

To clarify, we propose you to change the dir_filter signature to add the
full_path you need. That would impact the current code on the callee
(from_disk) to adapt the call to dir_filter with the right parameters.

from:

def dir_filter_callback_fn(dirname, entries):

to, for example:

def dif_filter_callback_fn(dirpath: str, dirname: str, entries: Iterable[Any]):

And the signature change only impacts the following module functions (which are
the potential other callbacks used in that context), see next point also:

def accept_all_directories(dirpath, dirname, entries):
def ignore_empty_directories(dirpath, dirname, entries):
# <- it returns a function, so its signature
# does not change, instead, its implementation must, check `def named_filter` within:
def ignore_named_directories(names, *, case_sensitive=True):

Modifying their signatures should not change the tests result for those (you
may have to change the call in their respective tests though).

There aren't that many users of dir_filters (are there any at all outside
this package?) so it shouldn't be too hard

Indeed.

Let's check:

$ workon swh
$ pwd
/home/tony/work/inria/repo/swh/swh-environment
$ grep -ri "dir_filter=" */swh/**
swh-model/swh/model/from_disk.py:                  dir_filter=accept_all_directories,
swh-model/swh/model/tests/test_from_disk.py:            dir_filter=from_disk.ignore_empty_directories
swh-model/swh/model/tests/test_from_disk.py:            dir_filter=from_disk.ignore_named_directories([b'symlinks'])
swh-model/swh/model/tests/test_from_disk.py:            dir_filter=from_disk.ignore_named_directories([b'symLiNks'],

No client whatsoever yet, so changing the definition is fine.

swh/model/from_disk.py
199

I would be explicit instead and name that:

def accept_all_directories(dirpath: str, dirname: str, entries: Iterable[Any]) -> bool:

and also with types

237

this one must change as well.

Getting there, see my previous comments ;)

This revision now requires changes to proceed.Apr 8 2020, 10:25 AM

Build is green

Patch application report for D2960 (id=10575)

Rebasing onto c7c1a571e5...

Current branch diff-target is up to date.
Changes applied before test
commit 5d6883b265947e95ff931a72e15e5fcfa822aa82
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Apr 7 17:28:09 2020 +0200

    from_disk: path parameter to dir_filter functions

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/35/ for more details.

Looks good to me.

Don't you want to implement your callable here (the one which actually exclude stuff)?

Because now, the diff title is no longer in sync with the code ;)

Or do you plan to add it in swh-scanner?

This revision is now accepted and ready to land.Apr 8 2020, 11:44 AM

Looks good to me.

Don't you want to implement your callable here (the one which actually exclude stuff)?

Because now, the diff title is no longer in sync with the code ;)

Or do you plan to add it in swh-scanner?

Yeah, i will add it in swh-scanner