Page MenuHomeSoftware Heritage

swh identify: add --exclude
ClosedPublic

Authored by douardda on Oct 7 2020, 7:51 PM.

Diff Detail

Repository
rDMOD Data model
Branch
swh-identify.ignore-named-directory
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16591
Build 25572: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 25571: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
acezar set the repository for this revision to rDMOD Data model.Oct 7 2020, 7:55 PM
zack requested changes to this revision.Oct 8 2020, 9:09 AM
zack added a subscriber: zack.

Thanks, even though this is a little bit disturbing discrepancy wrt swh-scanner exclusion mechanism, it's probably for the best, and arguably we should have something this simple also there.

I've requested some changes about CLI and it's documentation.

swh/model/cli.py
182

sounds like a bit of a mouthful, can we just go for --ignore-dir instead?

Also, the help string should clarify:

  • whether the directory names take glob patterns or not (I think it doesn't)
  • the depth at which the directory names match (I have no idea)
  • the fact that the option can be passed multiple times
This revision now requires changes to proceed.Oct 8 2020, 9:09 AM
acezar updated this revision to Diff 14757.Oct 8 2020, 10:54 AM

changed --ignore-named-directory to --ignore-dir
and updated it's documentation

acezar marked an inline comment as done.Oct 8 2020, 10:54 AM
marmoute added inline comments.
swh/model/cli.py
61–66

(nits: maybe we could have a default value for this new arguments since if was not necessary in most usecase before)

187

I have the same question as Zack, does this match at root level or at all level?

douardda requested changes to this revision.EditedOct 8 2020, 11:16 AM
douardda added a subscriber: douardda.

can you please remove the "noise" added by arc in the commit message? And update it (still the previous option name in there).

This revision now requires changes to proceed.Oct 8 2020, 11:16 AM
In D4193#103804, @zack wrote:

Thanks, even though this is a little bit disturbing discrepancy wrt swh-scanner exclusion mechanism,

I agree having the same cli interface than the scanner (aka --exclude with support for glob) would be much better. If the glob pattern matching can wait, using the same option name is a "must" for me.

it's probably for the best, and arguably we should have something this simple also there.

@zack Not sure what the problem with the scanner is. Can you elaborate?

zack added inline comments.Oct 8 2020, 11:46 AM
swh/model/cli.py
186

Minor nit/typo: it should be "globbing", but all in all I'd rather just not mention globbing if it's not supported. I suggest ignore directories by path name (can be repeated). The point about depth remains though. If it does match at any depth, I propose: ignore directories by local path name, matching at any depth (can be repeated). Otherwise it should be: ignore directories by absolute path name, starting from root path (can be repeated)

zack resigned from this revision.Oct 8 2020, 11:46 AM
acezar updated this revision to Diff 14842.Oct 8 2020, 7:48 PM
acezar retitled this revision from swh identify: add --ignore-named-directory to swh identify: add --exclude.

Changed --ignore-dir by --exclude

It supports glob, but had to copy extract_regex_objs from swh.scan.

Also, despite extract_regex_objs saying `checks if the path is a subdirectory
or relative to the root path`, during my tests excluding ../foo does not raise
an error.

acezar marked 3 inline comments as done.Oct 8 2020, 7:49 PM
acezar updated this revision to Diff 14846.Oct 8 2020, 9:13 PM

Fixed path resolution error.

swh-identify --exclude foo my-repo versus
cd my-repo; swh-identify --exclude foo .

acezar updated this revision to Diff 14873.Oct 9 2020, 12:36 PM

Better fix for the previous bug

zack added a comment.Oct 9 2020, 12:55 PM

Changed --ignore-dir by --exclude

It supports glob, but had to copy extract_regex_objs from swh.scan.

As long as it can be in the future invoked from swh-scanner (which already has a dependency on swh-model anyway), that's fine. However, please add an "XXX/TODO" comment in the code noting where the code comes from and that it is a copy/paste.

Once this is landed, please also file a bug against swh-scanner asking to depend on this function rather than duplicating it.

LGTM otherwise !

acezar updated this revision to Diff 14888.Oct 9 2020, 2:38 PM

Added a TODO about the extract_regex_objs copied from swh.scanner

acezar updated this revision to Diff 14889.Oct 9 2020, 2:39 PM

Fixed typo

Except for the small feedback about the exception, this looks fine.

swh/model/exceptions.py
134

This exception need a docstring. And if I understand it Correctly, it should subclass RuntimeError.

acezar marked an inline comment as done.Oct 12 2020, 11:10 AM
acezar added inline comments.
swh/model/exceptions.py
134

The exception has been copy pasted from swh.scanner like extract_regex_objs. I don't think changing it as it is just about moving code from one package to another.

@douardda @zack Note that this diff somehow did not trigger the ci tests, only the linters. No idea why. just a heads up.

@douardda @zack Note that this diff somehow did not trigger the ci tests, only the linters. No idea why. just a heads up.

because it's not related to swh.model repo I'd say (then the question is how can arcanist produce diff that's not related with any repo?)

douardda set the repository for this revision to rDMOD Data model.Oct 13 2020, 1:04 PM
acezar marked an inline comment as done.Oct 16 2020, 12:38 PM
ardumont added a comment.EditedOct 21 2020, 12:08 PM

@acezar Now that your arcanist issue is fixed, can you please try and update this diff to check if that could trigger properly the diff build?

If that does not work, i gather closing and opening a new one just to check if that builds fine would be a way forward here.

Thanks in advance.

acezar updated this revision to Diff 15275.Oct 21 2020, 2:10 PM

Rebased on master

acezar updated this revision to Diff 15276.Oct 21 2020, 2:12 PM

Trying to trigger CI

Build is green

Patch application report for D4193 (id=15277)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 2e1539da6aa6ca56104504939ba911a107976293
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/153/ for more details.

ardumont added a comment.EditedOct 21 2020, 4:06 PM

Build is green

Awesome.

I know you are moving only the module from scanner to model.
Still, this should be covered.

Please, add at least one test around the exclude flag in the test_cli module.
The following test [1] is listing the content of a git repository.

You could another scenario based from [1], but this times, it would ignore the
.git folder, resulting in a snapshot (for listing [2]) with a different id.

That should be enough to cover it.

[1] https://forge.softwareheritage.org/source/swh-model/browse/master/swh/model/tests/test_cli.py$55-69

[2] Such listing would result in only the following to list, which:

drwxr-xr-x    zack/zack          0 sample-repo/
-rw-r--r--    zack/zack          9 sample-repo/foo
-rw-r--r--    zack/zack          5 sample-repo/bar
-rw-r--r--    zack/zack          4 sample-repo/baz

Thanks in advance.

acezar updated this revision to Diff 15286.Oct 21 2020, 4:58 PM

Added --exclude test-case

Build is green

Patch application report for D4193 (id=15286)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 6f6d0eefe565f1ea300b65306e906de87eb00fec
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/154/ for more details.

ardumont accepted this revision.Oct 21 2020, 5:12 PM

Thanks.

douardda requested changes to this revision.Oct 22 2020, 11:25 AM

This globally LGTM but there is this path encoding issue. The 2 new functions in from_disk.py should take a bytes argument instead of a str one.

swh/model/cli.py
70

isn't there some encoding risk here? This path.decode() frighten me a bit.

swh/model/tests/test_cli.py
150

At least one test with non-ascii path is needed I'd say

This revision now requires changes to proceed.Oct 22 2020, 11:25 AM
acezar updated this revision to Diff 15343.Oct 22 2020, 5:21 PM
acezar marked 2 inline comments as done.

Use bytes path

Build is green

Patch application report for D4193 (id=15343)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 0e4fbb9d59bbdc6d777d7a3789a3f8fbbf836b30
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/155/ for more details.

douardda accepted this revision.Oct 23 2020, 11:00 AM

ok but please properly document arguments in docstrings.

swh/model/from_disk.py
289

please document the arguments

314

lack the root description in this docstring (even if pretty abvious)

This revision is now accepted and ready to land.Oct 23 2020, 11:00 AM
acezar updated this revision to Diff 15365.Oct 23 2020, 11:27 AM

Better docstrings

acezar marked 2 inline comments as done.Oct 23 2020, 11:28 AM

Build is green

Patch application report for D4193 (id=15365)

Rebasing onto 9224c8ca6e...

Current branch diff-target is up to date.
Changes applied before test
commit 2b869aa7d30d099ed6146d9f8dc667cd7a8eefc3
Author: Antoine Cezar <antoine@cezar.fr>
Date:   Wed Oct 7 19:45:44 2020 +0200

    swh identify: add --exclude

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/156/ for more details.

douardda commandeered this revision.Oct 23 2020, 11:51 AM
douardda closed this revision.
douardda edited reviewers, added: acezar; removed: douardda.