Details
- Reviewers
zack - Group Reviewers
Reviewers - Maniphest Tasks
- T2692: Move the output related functions to another (sub)module
T2730: scanner: should output the root SWHID as well
T3349: use swh.model.merkle/from_disk instead of swh.scanner.model - Commits
- rDTSCN0d92c754c8df: use model.from_disk instead of scanner.model to store a source code project
Diff Detail
- Repository
- rDTSCN Code scanner
- Branch
- model-refactoring
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 22522 Build 35096: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 35095: arc lint + arc unit
Event Timeline
Build has FAILED
Patch application report for D5926 (id=21272)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 329aa84dfdd8f4366a71f531761c4121ea7840e9 Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/126/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/126/console
I like where this is going, very promising refactoring!
I've added some comments/suggestions here and there.
swh/scanner/data.py | ||
---|---|---|
21–25 | why merkle node classes as keys? With SWHIDs we will have the ability to keep information about archive objects that maybe we do not have locally available (e.g., in the future, revision objects pointing to submodules that we have not retrieved locally). That seems valuable to me. | |
51 | is there a constant/enum somewhere that we can use instead of "directory" as a string? | |
70 | is there a constant/enum somewhere that we can use instead of "content" as a string? | |
73 | remember to bump the dependency on swh.model in requirements-swh.txt to get access to the swhid() method | |
92 | ditto: constant for "content", if it exists | |
swh/scanner/output.py | ||
65 | Looks like this is explicitly recursive, which might make the stack explode for particularly big trees (as Python still does not have tail call optimization). To stay on the safe side I think you should write this with an explicit queue/stack and possibly an auxiliary nested function. | |
swh/scanner/scanner.py | ||
20 | While we are at this, i think we need a better name for this algo. How about simply bfs? Or maybe lazy_bfs if you want to highlight the "stop" part more. | |
37 | constant | |
swh/scanner/tests/test_scanner.py | ||
60 | We discussed how iter_tree should not be used to output stuff, because it deduplicates and might hence skip sub-trees. Can you hence add an explicit test case with sharing (ideally: of a sub-directory somewhere) and make the test verify that all nodes are present in the output even if they are deduplicated in the in-memory data model? |
swh/scanner/data.py | ||
---|---|---|
21–25 | Yeah, the comment and the exception here should be changed, the function is actually checking that the key is a valid SWHID. | |
51 | There are some deprecated constants in swh.model.identifiers. Shouldn't the object_type attribute of from_disk.Directory/Content be consistent with the enum swh.model.identifiers.ObjectType? | |
swh/scanner/scanner.py | ||
20 | OK! | |
swh/scanner/tests/test_scanner.py | ||
60 | Yes, but we first need to change swh.model.merkle.iter_tree to output *all* the nodes. |
- use constant object type names in model.identifiers
- adapt unit test for scanner.test.test_plot/dashboard
- json and ndjson output
- unit test for scanner.data
- iterate over the merkle tree to output node information (text output)
TODO:
- update requirements-swh to use the correct version of swh.model
Build has FAILED
Patch application report for D5926 (id=21418)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit f2648e037cd03577f79507879e8423cfb66dc3da Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/127/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/127/console
Build has FAILED
Patch application report for D5926 (id=21472)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 07b9afb41395474bfd718a821e674a53302bdb12 Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/128/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/128/console
Build has FAILED
Patch application report for D5926 (id=21472)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit b8eca8a5eab5bb1dc98718663297b2b4a5bc1b2a Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/129/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/129/console
Build is green
Patch application report for D5926 (id=21474)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 8fbc28fc274f5204d142e1bb089b47e23b264b57 Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/130/ for more details.
Thanks, both the general structure and implementation look OK.
I'm requesting changes to address two main issues:
- the keys in the MerkleNodeData should not be SWHIDs as strings, but rather SWHIDs as instances of CoreSWHID (see comments in the review for rationale)
- it looks like in various places you .decode() paths early, which is dangerous as we have no guarantee that path components are actually decodable (to UTF-8). You should delay decoding as much as possible, ideally until final output
After fixing (1), I suggest you grep the core for all instances of str(.*swhid()), or something similar, there shouldn't be many around.
Similarly, after fixing (2), grep for .decode(), ideally they should remain only in output-related modules.
swh/scanner/data.py | ||
---|---|---|
14 | naming nitpick: I'd call this MerkleNodeInfo, as it's extra data w.r.t. the actual data in the node, which is stored in the from_disk structure already | |
15 | better docstring: "store additional information about Merkle DAG nodes, using SWHIDs as keys" | |
17–18 | given the constructor isn't doing anything else, I think you can just leave it out entirely and not define __init__ here | |
24–25 | As dictionary keys for additional information we should use CoreSWHID instances, not strings. That will avoid having to re-validate SWHID syntactical correctness every time. Also, it is now trivial to obtain CoreSWHID from from_disk instances, thanks to your change in T3393. So, to recap: the scanner should instantiate from_disk objects from the filesystem, then obtain CoreSWHIDs from each node of it using the new swhid() method, and use those as keys for MerkleNodeData | |
57 | this decode() looks dangerous, and Path objects work fine with bytes in general, please check if it is avoidable | |
78 | here's an example of why we should use CoreSWHID instances as keys: here you are first rendering an existing CoreSWHID to string, than using it as a dictionary key, which means it will be hashed again. What a waste :) | |
105 | here's another example of rendering/reparsing, this time even worse because it will also go through syntactic validation for SWHID correctness, which is pointless given it came from a CoreSWHID object already | |
109 | Why are you decoding the path here? in general, there is no guarantee that we will be able to decode it at all, so it should be remain bytes as long as possible (until final output). I think this function should return a Dict[bytes,dict] instead, delaying decoding. | |
swh/scanner/scanner.py | ||
30 | better: while queue |
swh/scanner/data.py | ||
---|---|---|
57 | I think it is not possible to avoid decode() here, since Path wants strings instead of bytes. |
swh/scanner/data.py | ||
---|---|---|
57 | ouch, you're right (https://bugs.python.org/issue23904) and that sucks oh well |
Build has FAILED
Patch application report for D5926 (id=21540)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 761f9659660ced85223693d80f442982f6cc3798 Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/131/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/131/console
Build is green
Patch application report for D5926 (id=21547)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit 7e019da43cbc2462b44d802566fb9ab2efea0eee Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project
See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/132/ for more details.
LGTM, accepted, thanks!
Please note down on the sides two remaining TODOs about all this:
- adding a test case with a deduplicated source tree, to make sure nodes that are deduplicated at the Merkle DAG level are present multiple times in the output
- adding a test case for a path that is not decodable in utf-8, to make sure it can be handled propertly
(they can be both added later, but I do not want to forget about these)
Build is green
Patch application report for D5926 (id=21555)
Rebasing onto d58bcb59a0...
First, rewinding head to replay your work on top of it... Applying: use model.from_disk instead of scanner.model to store a source code project
Changes applied before test
commit ba339fa62fbab2ff287126ca756a09ad35ab9efc Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project Closes T3349 Closes T2730 Closes T2692
See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/133/ for more details.
Build is green
Patch application report for D5926 (id=21556)
Rebasing onto d58bcb59a0...
Current branch diff-target is up to date.
Changes applied before test
commit 0d92c754c8dfa1b30910bf5b08c6df810276060f Author: Daniele Serafini <me@danieleserafini.eu> Date: Fri Jun 25 12:46:29 2021 +0100 use model.from_disk instead of scanner.model to store a source code project Closes T3349 Closes T2730 Closes T2692
See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/134/ for more details.