Page MenuHomeSoftware Heritage

swh identify: add --exclude
ClosedPublic

Authored by douardda on Oct 7 2020, 7:51 PM.

Diff Detail

Repository
rDMOD Data model
Branch
swh-identify.ignore-named-directory
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16516
Build 25448: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 25447: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
zack requested changes to this revision.Oct 8 2020, 9:09 AM
zack added a subscriber: zack.

Thanks, even though this is a little bit disturbing discrepancy wrt swh-scanner exclusion mechanism, it's probably for the best, and arguably we should have something this simple also there.

I've requested some changes about CLI and it's documentation.

swh/model/cli.py
178

sounds like a bit of a mouthful, can we just go for --ignore-dir instead?

Also, the help string should clarify:

  • whether the directory names take glob patterns or not (I think it doesn't)
  • the depth at which the directory names match (I have no idea)
  • the fact that the option can be passed multiple times
This revision now requires changes to proceed.Oct 8 2020, 9:09 AM

changed --ignore-named-directory to --ignore-dir
and updated it's documentation

marmoute added inline comments.
swh/model/cli.py
60–65

(nits: maybe we could have a default value for this new arguments since if was not necessary in most usecase before)

183

I have the same question as Zack, does this match at root level or at all level?

douardda requested changes to this revision.EditedOct 8 2020, 11:16 AM
douardda added a subscriber: douardda.

can you please remove the "noise" added by arc in the commit message? And update it (still the previous option name in there).

This revision now requires changes to proceed.Oct 8 2020, 11:16 AM
In D4193#103804, @zack wrote:

Thanks, even though this is a little bit disturbing discrepancy wrt swh-scanner exclusion mechanism,

I agree having the same cli interface than the scanner (aka --exclude with support for glob) would be much better. If the glob pattern matching can wait, using the same option name is a "must" for me.

it's probably for the best, and arguably we should have something this simple also there.

@zack Not sure what the problem with the scanner is. Can you elaborate?

swh/model/cli.py
182

Minor nit/typo: it should be "globbing", but all in all I'd rather just not mention globbing if it's not supported. I suggest ignore directories by path name (can be repeated). The point about depth remains though. If it does match at any depth, I propose: ignore directories by local path name, matching at any depth (can be repeated). Otherwise it should be: ignore directories by absolute path name, starting from root path (can be repeated)

acezar retitled this revision from swh identify: add --ignore-named-directory to swh identify: add --exclude.

Changed --ignore-dir by --exclude

It supports glob, but had to copy extract_regex_objs from swh.scan.

Also, despite extract_regex_objs saying `checks if the path is a subdirectory
or relative to the root path`, during my tests excluding ../foo does not raise
an error.

Fixed path resolution error.

swh-identify --exclude foo my-repo versus
cd my-repo; swh-identify --exclude foo .

Better fix for the previous bug

Changed --ignore-dir by --exclude

It supports glob, but had to copy extract_regex_objs from swh.scan.

As long as it can be in the future invoked from swh-scanner (which already has a dependency on swh-model anyway), that's fine. However, please add an "XXX/TODO" comment in the code noting where the code comes from and that it is a copy/paste.

Once this is landed, please also file a bug against swh-scanner asking to depend on this function rather than duplicating it.

LGTM otherwise !

Added a TODO about the extract_regex_objs copied from swh.scanner

Except for the small feedback about the exception, this looks fine.

swh/model/exceptions.py
133

This exception need a docstring. And if I understand it Correctly, it should subclass RuntimeError.

acezar added inline comments.
swh/model/exceptions.py
133

The exception has been copy pasted from swh.scanner like extract_regex_objs. I don't think changing it as it is just about moving code from one package to another.

@douardda @zack Note that this diff somehow did not trigger the ci tests, only the linters. No idea why. just a heads up.

@douardda @zack Note that this diff somehow did not trigger the ci tests, only the linters. No idea why. just a heads up.

because it's not related to swh.model repo I'd say (then the question is how can arcanist produce diff that's not related with any repo?)

@acezar Now that your arcanist issue is fixed, can you please try and update this diff to check if that could trigger properly the diff build?

If that does not work, i gather closing and opening a new one just to check if that builds fine would be a way forward here.

Thanks in advance.

Build is green

Patch application report for D4193 (id=15277)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 2e1539da6aa6ca56104504939ba911a107976293
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/153/ for more details.

Build is green

Awesome.

I know you are moving only the module from scanner to model.
Still, this should be covered.

Please, add at least one test around the exclude flag in the test_cli module.
The following test [1] is listing the content of a git repository.

You could another scenario based from [1], but this times, it would ignore the
.git folder, resulting in a snapshot (for listing [2]) with a different id.

That should be enough to cover it.

[1] https://forge.softwareheritage.org/source/swh-model/browse/master/swh/model/tests/test_cli.py$55-69

[2] Such listing would result in only the following to list, which:

drwxr-xr-x    zack/zack          0 sample-repo/
-rw-r--r--    zack/zack          9 sample-repo/foo
-rw-r--r--    zack/zack          5 sample-repo/bar
-rw-r--r--    zack/zack          4 sample-repo/baz

Thanks in advance.

Added --exclude test-case

Build is green

Patch application report for D4193 (id=15286)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 6f6d0eefe565f1ea300b65306e906de87eb00fec
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/154/ for more details.

This globally LGTM but there is this path encoding issue. The 2 new functions in from_disk.py should take a bytes argument instead of a str one.

swh/model/cli.py
69

isn't there some encoding risk here? This path.decode() frighten me a bit.

swh/model/tests/test_cli.py
150

At least one test with non-ascii path is needed I'd say

This revision now requires changes to proceed.Oct 22 2020, 11:25 AM
acezar marked 2 inline comments as done.

Use bytes path

Build is green

Patch application report for D4193 (id=15343)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 0e4fbb9d59bbdc6d777d7a3789a3f8fbbf836b30
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/155/ for more details.

ok but please properly document arguments in docstrings.

swh/model/from_disk.py
289

please document the arguments

314

lack the root description in this docstring (even if pretty abvious)

This revision is now accepted and ready to land.Oct 23 2020, 11:00 AM

Build is green

Patch application report for D4193 (id=15365)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 2b869aa7d30d099ed6146d9f8dc667cd7a8eefc3
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/156/ for more details.

douardda closed this revision.
douardda edited reviewers, added: acezar; removed: douardda.

closed by 2b869aa7d30d099ed6146d9f8dc667cd7a8eefc3