Page MenuHomeSoftware Heritage

swh-model: exclude path support for the function from_disk in Directory
ClosedPublic

Authored by DanSeraf on Apr 6 2020, 4:10 PM.

Details

Summary

exclude specific path when generating Software Heritage objects

Since swh-scanner needs to implement an exclusion list, when it creates an object from the Directory class in swh.model.from_disk, it will pass a set of PosixPath to be excluded (this set will be generated from the cli with the glob module). The function from_disk through the function ignore_path will check if the parent of a directory is present in the set.

Related T2336

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

DanSeraf created this revision.Apr 6 2020, 4:10 PM
DanSeraf edited the summary of this revision. (Show Details)Apr 6 2020, 4:10 PM

Build is green

Patch application report for D2960 (id=10542)

Rebasing onto c7c1a571e5...

Current branch diff-target is up to date.
Changes applied before test
commit 4e1f94ae209426b1e375576f397c23bc86665f54
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Mon Apr 6 15:55:07 2020 +0200

    exclude path support for the function from_disk in Directory

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/33/ for more details.

ardumont added inline comments.
swh/model/from_disk.py
250

type?

def ignore_path(path: PosixPath, Set[PosixPath]) -> bool:
253
if not exclude_paths:  # None or {} should be treated the same.
257

directly return the conditional

return set(path.parents).intersection...
275

maybe make that a default empty set instead, that'd avoid to have to define Optional[Set[PosixPath]]
and directly have Set[PosixPath].

282

change the method's signature to include types.

Since you are modifying it and you know the types now ;)

Regarding the docstring, you can then remove the defined types next to the variable names (since it's already in the signature of the method).

ardumont added inline comments.Apr 6 2020, 4:44 PM
swh/model/from_disk.py
1

copyright update ;)

274

wondering if the ignore_path should not be implemented with the dir_filter callable now.

swh/model/tests/test_from_disk.py
867 ↗(On Diff #10542)

That's a simple case.
Adding another case where the path is deeper should be added.
To ensure it works as expected.

vlorentz added inline comments.
swh/model/from_disk.py
274

Yes, I think it should.

ardumont requested changes to this revision.Apr 6 2020, 5:40 PM

I think in the end, this should be implemented with the dir_filter parameter instead.

Please, check that you can do this. If you can, this will, there is no need to change the current implementation.
And instead, either:

  • not touch swh.model at all, then define your ignore_path function in swh-scanner and use it when calling from_disk.
  • if you think, this can be a shared behavior (i guess it can), define a function callable here in swh-model (like accept_all_directories and the other ones next to it).

In both cases, this won't touch Directory.from_disk though.

This revision now requires changes to proceed.Apr 6 2020, 5:40 PM

I think in the end, this should be implemented with the dir_filter parameter instead.

Please, check that you can do this. If you can, this will, there is no need to change the current implementation.
And instead, either:

  • not touch swh.model at all, then define your ignore_path function in swh-scanner and use it when calling from_disk.
  • if you think, this can be a shared behavior (i guess it can), define a function callable here in swh-model (like accept_all_directories and the other ones next to it).

In both cases, this won't touch Directory.from_disk though.

I can define a dir_filter function but how could i get the values i need, like the full path generated inside from_disk, without touching the implementation?

I can define a dir_filter function but how could i get the values i need, like the full path generated inside from_disk, without touching the implementation?

I'm not sure i understand the question.
from_disk has a save_path parameter. If set to True, the paths are stored alongside the output result.
Isn't that enough?

Or do you mean something else?
What are the values you need?

I can define a dir_filter function but how could i get the values i need, like the full path generated inside from_disk, without touching the implementation?

I'm not sure i understand the question.
from_disk has a save_path parameter. If set to True, the paths are stored alongside the output result.
Isn't that enough?

It seems that the save_path parameter is never used inside the implementation.

Or do you mean something else?
What are the values you need?

It would be nice to have the path created inside the for loop, the dir_filter function only handles at the moment the name of the directory and its relative entries.

Indeed. Then it would make sense to also give the full path to dir_filter, instead of adding a redundant attribute.

There aren't that many users of dir_filters (are there any at all outside this package?) so it shouldn't be too hard

DanSeraf updated this revision to Diff 10571.Apr 8 2020, 9:47 AM

requested changes

Build has FAILED

Patch application report for D2960 (id=10571)

Rebasing onto c7c1a571e5...

Current branch diff-target is up to date.
Changes applied before test
commit ffdc6d58e44c8ab2a431c8982a7419b598bf2eaf
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Apr 7 17:28:09 2020 +0200

    from_disk: path parameter to dir_filter functions

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/34/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/34/console

ardumont added a comment.EditedApr 8 2020, 9:56 AM

I did not reply as val did (and i agreed with him ;)

Indeed. Then it would make sense to also give the full path to dir_filter,
instead of adding a redundant attribute.

Yes.

To clarify, we propose you to change the dir_filter signature to add the
full_path you need. That would impact the current code on the callee
(from_disk) to adapt the call to dir_filter with the right parameters.

from:

def dir_filter_callback_fn(dirname, entries):

to, for example:

def dif_filter_callback_fn(dirpath: str, dirname: str, entries: Iterable[Any]):

And the signature change only impacts the following module functions (which are
the potential other callbacks used in that context), see next point also:

def accept_all_directories(dirpath, dirname, entries):
def ignore_empty_directories(dirpath, dirname, entries):
# <- it returns a function, so its signature
# does not change, instead, its implementation must, check `def named_filter` within:
def ignore_named_directories(names, *, case_sensitive=True):

Modifying their signatures should not change the tests result for those (you
may have to change the call in their respective tests though).

There aren't that many users of dir_filters (are there any at all outside
this package?) so it shouldn't be too hard

Indeed.

Let's check:

$ workon swh
$ pwd
/home/tony/work/inria/repo/swh/swh-environment
$ grep -ri "dir_filter=" */swh/**
swh-model/swh/model/from_disk.py:                  dir_filter=accept_all_directories,
swh-model/swh/model/tests/test_from_disk.py:            dir_filter=from_disk.ignore_empty_directories
swh-model/swh/model/tests/test_from_disk.py:            dir_filter=from_disk.ignore_named_directories([b'symlinks'])
swh-model/swh/model/tests/test_from_disk.py:            dir_filter=from_disk.ignore_named_directories([b'symLiNks'],

No client whatsoever yet, so changing the definition is fine.

ardumont added inline comments.Apr 8 2020, 9:58 AM
swh/model/from_disk.py
199–200

I would be explicit instead and name that:

def accept_all_directories(dirpath: str, dirname: str, entries: Iterable[Any]) -> bool:

and also with types

238–241

this one must change as well.

ardumont requested changes to this revision.Apr 8 2020, 10:25 AM

Getting there, see my previous comments ;)

This revision now requires changes to proceed.Apr 8 2020, 10:25 AM
DanSeraf updated this revision to Diff 10575.Apr 8 2020, 11:33 AM

requested changes

Build is green

Patch application report for D2960 (id=10575)

Rebasing onto c7c1a571e5...

Current branch diff-target is up to date.
Changes applied before test
commit 5d6883b265947e95ff931a72e15e5fcfa822aa82
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Apr 7 17:28:09 2020 +0200

    from_disk: path parameter to dir_filter functions

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/35/ for more details.

ardumont added a comment.EditedApr 8 2020, 11:36 AM

Looks good to me.

Don't you want to implement your callable here (the one which actually exclude stuff)?

Because now, the diff title is no longer in sync with the code ;)

Or do you plan to add it in swh-scanner?

ardumont accepted this revision.Apr 8 2020, 11:44 AM
This revision is now accepted and ready to land.Apr 8 2020, 11:44 AM

Looks good to me.

Don't you want to implement your callable here (the one which actually exclude stuff)?

Because now, the diff title is no longer in sync with the code ;)

Or do you plan to add it in swh-scanner?

Yeah, i will add it in swh-scanner

This revision was automatically updated to reflect the committed changes.