Page MenuHomeSoftware Heritage

swh-model: add recursive option
ClosedPublic

Authored by DanSeraf on Jun 8 2021, 2:57 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Maniphest Tasks
T3160: swh identify: add a -R/--recursive flag
Summary

Recursively compute all the SWHID(s) for a given source code object

Diff Detail

Repository
rDMOD Data model
Branch
recursive-support
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21946
Build 34135: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 34134: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5825 (id=20843)

Rebasing onto 4808fc2ae5...

Current branch diff-target is up to date.
Changes applied before test
commit d9b5ba15208e4a3b8073dde8f268683a5a4e4839
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jun 8 13:51:47 2021 +0100

    cli: recursive option

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/339/ for more details.

vlorentz requested changes to this revision.Jun 8 2021, 3:26 PM
vlorentz added a subscriber: vlorentz.

Cool, thanks!

One question and two requests:

  1. What happens if one runs it on a directory and --recursive isn't given?
  1. If relevant, could you implement --verify too?
  1. could you add tests?
swh/model/cli.py
27

can you keep this import inside functions?

284

what about path = os.fsencode(objects[0]), for consistency with the other half the function?

288

ditto

This revision now requires changes to proceed.Jun 8 2021, 3:26 PM
  1. What happens if one runs it on a directory and --recursive isn't given?

It will show only the SWHID of the given directory, basically the same process as before is applied

  1. If relevant, could you implement --verify too?

Sure, if it is useful i could open another diff for it

  1. could you add tests?

I would add a cli test but it seems that the Content objects generated with the sample-folder uses the attribute data instead of path generating a key error during the test (since the same key is used to retrieve the path from both the Directory and the Content objects; from_disk.DiskBackedContent and model.Content have different attributes name). Would it be safe to change the attribute name?

It will show only the SWHID of the given directory, basically the same process as before is applied

But what was the process before? Did it ignore directory entries?

  1. could you add tests?

I would add a cli test but it seems that the Content objects generated with the sample-folder uses the attribute data instead of path generating a key error during the test (since the same key is used to retrieve the path from both the Directory and the Content objects; from_disk.DiskBackedContent and model.Content have different attributes name). Would it be safe to change the attribute name?

Changing the name won't work, they don't have the same values.

If mocks don't return the right type of data, you should change the mocks instead.

  1. If relevant, could you implement --verify too?

Sure, if it is useful i could open another diff for it

This is not going to be easy, because one would need to pass a set of SWHIDs, mapped to individual paths, which is much harder to do on a CLI than passing a single SWHID (as in the current implementation of --verify).
The only easy way to do this would be passing a filename pointing to, e.g., a JSON formatted manifest, to be compared with computed SWHIDs, but I don't see much the point of doing that.

I'm fine with --verify being incompatible with --recursive for the time being (maybe with an appropriate error message).

It will show only the SWHID of the given directory, basically the same process as before is applied

But what was the process before? Did it ignore directory entries?

It checks only the given directories generating a from_disk.Directory object for each directory. Should it uses the same logic used for the recursive option?

I would add a cli test but it seems that the Content objects generated with the sample-folder uses the attribute data instead of path generating a key error during the test (since the same key is used to retrieve the path from both the Directory and the Content objects; from_disk.DiskBackedContent and model.Content have different attributes name). Would it be safe to change the attribute name?

Changing the name won't work, they don't have the same values.

If mocks don't return the right type of data, you should change the mocks instead.

Ok, i'll change the mocks.

In D5825#148519, @zack wrote:
  1. If relevant, could you implement --verify too?

Sure, if it is useful i could open another diff for it

This is not going to be easy, because one would need to pass a set of SWHIDs, mapped to individual paths, which is much harder to do on a CLI than passing a single SWHID (as in the current implementation of --verify).
The only easy way to do this would be passing a filename pointing to, e.g., a JSON formatted manifest, to be compared with computed SWHIDs, but I don't see much the point of doing that.

I'm fine with --verify being incompatible with --recursive for the time being (maybe with an appropriate error message).

Ok, therefore I will check that the two options are mutually exclusive.

But what was the process before? Did it ignore directory entries?

It checks only the given directories generating a from_disk.Directory object for each directory. Should it uses the same logic used for the recursive option?

To elaborate on this for @vlorentz, swh identify did work on directories before, and did use swh.model.from_disk to compute the SWHID of directories, but due to how the code was organized the built model was not kept around until the printing phase, so it was impossible to decide at that time whether to output only the root SWHID or all inner SWHIDs recursively.

I see, thanks.

And --verify was just an example, it looks like --recursive causes all options but --exclude to be ignored. Considering this and the completely disjoint code-paths, wouldn't it make sense to have a separate command (eg. recursive-identify instead of identify --recursive)?

wouldn't it make sense to have a separate command (eg. recursive-identify instead of identify --recursive)?

I don't think so, given that a -r/--recursive is a very common pattern for CLI commands that work on files and that other options are much more seldomly used.
(However we should make sure if they are used in an incompatible way, the tool fails with a clear error message.)

zack requested changes to this revision.Jun 9 2021, 11:57 AM
zack added inline comments.
swh/model/cli.py
27

I guess this was here because "Directory" is used as the return type of the generate_dir_obj function.
Or is there a trick to make mypy happy about that and at the same type avoid the top-level import?

78–79

please rename this function to something like "model_of_dir"

297

Exit mid-way through the function isn't great. Can you instead put what follows under the "else" branch?
(besides, if the exit really has to remain, I believe it should be click.Context.exit instead)

swh/model/cli.py
27

in that case I can't use it as a returning type in generate_dir_obj.

swh/model/cli.py
297

ack, I will do it.

  • requested changes
  • get the object path based on the data contained in the node object

Build is green

Patch application report for D5825 (id=20909)

Rebasing onto 8540c6700c...

First, rewinding head to replay your work on top of it...
Applying: cli: add recursive option
Changes applied before test
commit 742f4cd5e32bddcac85a4f0196902ffb6112c9e3
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jun 8 13:51:47 2021 +0100

    cli: add recursive option

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/340/ for more details.

It seems it's still missing incompatibility checks with --no-dereference, --no-filename, and --type

It seems it's still missing incompatibility checks with --no-dereference, --no-filename, and --type

Actually, --no-filename could be compatible (and implemented in this diff); should i include it?

  • missing incompatibility checks
  • ignore --recursive if the input object is not a directory
  • print only the swhid if using --no-filename

Build is green

Patch application report for D5825 (id=20954)

Rebasing onto 8540c6700c...

First, rewinding head to replay your work on top of it...
Applying: cli: add recursive option
Changes applied before test
commit c12b1ebc09ddbdb0072b11ff4f102efd3ab22ca8
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jun 8 13:51:47 2021 +0100

    cli: add recursive option

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/341/ for more details.

zack requested changes to this revision.Jun 11 2021, 1:20 PM
zack added inline comments.
swh/model/cli.py
282

I like a lot this solution of printing a warning instead of failing. However, you should use "logging.warn" (and hence import logging) instead of a bare print.

286

better: "verification of recursive object identification is not supported"

288–289

I think here the test should be "not in ("auto", "directory")", no?

And here's a better error message: "recursive identification is supported only for directories".

291–292

I don't get this one. Why is it incompatible? Both --dereference and --no-dereference should work just fine with or without --recursive, no? (As long as the symlink resolves to a directory it should be fine IMO.)

This revision now requires changes to proceed.Jun 11 2021, 1:20 PM
swh/model/cli.py
291–292

yeah, maybe we can print a warning here or just ignore the option

DanSeraf marked an inline comment as not done.Jun 11 2021, 1:28 PM

Build is green

Patch application report for D5825 (id=20955)

Rebasing onto 8540c6700c...

First, rewinding head to replay your work on top of it...
Applying: cli: add recursive option
Changes applied before test
commit 87d41d1294d38be3e8969aa08aba116724636525
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jun 8 13:51:47 2021 +0100

    cli: add recursive option

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/342/ for more details.

This revision is now accepted and ready to land.Jun 11 2021, 2:54 PM

Build is green

Patch application report for D5825 (id=20966)

Rebasing onto ae50e43fe0...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-343-D5825.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/343/ for more details.