Page MenuHomeSoftware Heritage

swh.scanner: use model.from_disk instead of scanner.model to store a source code project
ClosedPublic

Authored by DanSeraf on Jun 25 2021, 1:57 PM.

Diff Detail

Repository
rDTSCN Code scanner
Branch
model-refactoring
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22529
Build 35109: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35108: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5926 (id=21272)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 329aa84dfdd8f4366a71f531761c4121ea7840e9
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/126/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/126/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 25 2021, 1:59 PM
Harbormaster failed remote builds in B22251: Diff 21272!

I like where this is going, very promising refactoring!

I've added some comments/suggestions here and there.

swh/scanner/data.py
21–25

why merkle node classes as keys?
shouldn't we use actual SWHIDs, which are meant to be IDs?

With SWHIDs we will have the ability to keep information about archive objects that maybe we do not have locally available (e.g., in the future, revision objects pointing to submodules that we have not retrieved locally). That seems valuable to me.

51

is there a constant/enum somewhere that we can use instead of "directory" as a string?

70

is there a constant/enum somewhere that we can use instead of "content" as a string?

73

remember to bump the dependency on swh.model in requirements-swh.txt to get access to the swhid() method

92

ditto: constant for "content", if it exists

swh/scanner/output.py
65

Looks like this is explicitly recursive, which might make the stack explode for particularly big trees (as Python still does not have tail call optimization). To stay on the safe side I think you should write this with an explicit queue/stack and possibly an auxiliary nested function.

swh/scanner/scanner.py
20

While we are at this, i think we need a better name for this algo. How about simply bfs? Or maybe lazy_bfs if you want to highlight the "stop" part more.

37

constant

swh/scanner/tests/test_scanner.py
60

We discussed how iter_tree should not be used to output stuff, because it deduplicates and might hence skip sub-trees. Can you hence add an explicit test case with sharing (ideally: of a sub-directory somewhere) and make the test verify that all nodes are present in the output even if they are deduplicated in the in-memory data model?

swh/scanner/data.py
21–25

Yeah, the comment and the exception here should be changed, the function is actually checking that the key is a valid SWHID.

51

There are some deprecated constants in swh.model.identifiers. Shouldn't the object_type attribute of from_disk.Directory/Content be consistent with the enum swh.model.identifiers.ObjectType?

swh/scanner/scanner.py
20

OK!

swh/scanner/tests/test_scanner.py
60

Yes, but we first need to change swh.model.merkle.iter_tree to output *all* the nodes.

  • use constant object type names in model.identifiers
  • adapt unit test for scanner.test.test_plot/dashboard
  • json and ndjson output
  • unit test for scanner.data
  • iterate over the merkle tree to output node information (text output)

TODO:

  • update requirements-swh to use the correct version of swh.model
DanSeraf retitled this revision from WIP: use model.from_disk instead of scanner.model to store a source code project to swh.scanner: use model.from_disk instead of scanner.model to store a source code project.Jul 2 2021, 9:33 AM
DanSeraf edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D5926 (id=21418)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit f2648e037cd03577f79507879e8423cfb66dc3da
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/127/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/127/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 2 2021, 9:34 AM
Harbormaster failed remote builds in B22397: Diff 21418!

Build has FAILED

Patch application report for D5926 (id=21472)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 07b9afb41395474bfd718a821e674a53302bdb12
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/128/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/128/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 2 2021, 7:07 PM
Harbormaster failed remote builds in B22449: Diff 21472!

Build has FAILED

Patch application report for D5926 (id=21472)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit b8eca8a5eab5bb1dc98718663297b2b4a5bc1b2a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/129/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/129/console

Build is green

Patch application report for D5926 (id=21474)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 8fbc28fc274f5204d142e1bb089b47e23b264b57
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/130/ for more details.

zack requested changes to this revision.Jul 6 2021, 10:12 AM

Thanks, both the general structure and implementation look OK.
I'm requesting changes to address two main issues:

  1. the keys in the MerkleNodeData should not be SWHIDs as strings, but rather SWHIDs as instances of CoreSWHID (see comments in the review for rationale)
  2. it looks like in various places you .decode() paths early, which is dangerous as we have no guarantee that path components are actually decodable (to UTF-8). You should delay decoding as much as possible, ideally until final output

After fixing (1), I suggest you grep the core for all instances of str(.*swhid()), or something similar, there shouldn't be many around.
Similarly, after fixing (2), grep for .decode(), ideally they should remain only in output-related modules.

swh/scanner/data.py
14

naming nitpick: I'd call this MerkleNodeInfo, as it's extra data w.r.t. the actual data in the node, which is stored in the from_disk structure already

15

better docstring: "store additional information about Merkle DAG nodes, using SWHIDs as keys"

17–18

given the constructor isn't doing anything else, I think you can just leave it out entirely and not define __init__ here

24–25

As dictionary keys for additional information we should use CoreSWHID instances, not strings.

That will avoid having to re-validate SWHID syntactical correctness every time. Also, it is now trivial to obtain CoreSWHID from from_disk instances, thanks to your change in T3393.

So, to recap: the scanner should instantiate from_disk objects from the filesystem, then obtain CoreSWHIDs from each node of it using the new swhid() method, and use those as keys for MerkleNodeData

57

this decode() looks dangerous, and Path objects work fine with bytes in general, please check if it is avoidable

78

here's an example of why we should use CoreSWHID instances as keys: here you are first rendering an existing CoreSWHID to string, than using it as a dictionary key, which means it will be hashed again. What a waste :)

105

here's another example of rendering/reparsing, this time even worse because it will also go through syntactic validation for SWHID correctness, which is pointless given it came from a CoreSWHID object already

109

Why are you decoding the path here? in general, there is no guarantee that we will be able to decode it at all, so it should be remain bytes as long as possible (until final output). I think this function should return a Dict[bytes,dict] instead, delaying decoding.

swh/scanner/scanner.py
30

better: while queue

This revision now requires changes to proceed.Jul 6 2021, 10:12 AM
swh/scanner/data.py
57

I think it is not possible to avoid decode() here, since Path wants strings instead of bytes.

swh/scanner/data.py
57

ouch, you're right (https://bugs.python.org/issue23904) and that sucks

oh well

Build has FAILED

Patch application report for D5926 (id=21540)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 761f9659660ced85223693d80f442982f6cc3798
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/131/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/131/console

Build is green

Patch application report for D5926 (id=21547)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 7e019da43cbc2462b44d802566fb9ab2efea0eee
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/132/ for more details.

LGTM, accepted, thanks!

Please note down on the sides two remaining TODOs about all this:

  • adding a test case with a deduplicated source tree, to make sure nodes that are deduplicated at the Merkle DAG level are present multiple times in the output
  • adding a test case for a path that is not decodable in utf-8, to make sure it can be handled propertly

(they can be both added later, but I do not want to forget about these)

This revision is now accepted and ready to land.Jul 8 2021, 9:43 AM

Build is green

Patch application report for D5926 (id=21555)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit ba339fa62fbab2ff287126ca756a09ad35ab9efc
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project
    
    Closes T3349
    Closes T2730
    Closes T2692

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/133/ for more details.

Build is green

Patch application report for D5926 (id=21556)

Rebasing onto d58bcb59a0...

Current branch diff-target is up to date.
Changes applied before test
commit 0d92c754c8dfa1b30910bf5b08c6df810276060f
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project
    
    Closes T3349
    Closes T2730
    Closes T2692

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/134/ for more details.