⚙ D5926 swh.scanner: use model.from_disk instead of scanner.model to store a source code project

DanSeraf created this revision.Jun 25 2021, 1:57 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptJun 25 2021, 1:57 PM

Build has FAILED

Patch application report for D5926 (id=21272)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit 329aa84dfdd8f4366a71f531761c4121ea7840e9
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/126/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/126/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 25 2021, 1:59 PM

Harbormaster failed remote builds in B22251: Diff 21272!

I like where this is going, very promising refactoring!

I've added some comments/suggestions here and there.

swh/scanner/data.py
21–25	why merkle node classes as keys? shouldn't we use actual SWHIDs, which are meant to be IDs? With SWHIDs we will have the ability to keep information about archive objects that maybe we do not have locally available (e.g., in the future, revision objects pointing to submodules that we have not retrieved locally). That seems valuable to me.
51	is there a constant/enum somewhere that we can use instead of `"directory"` as a string?
70	is there a constant/enum somewhere that we can use instead of `"content"` as a string?
73	remember to bump the dependency on swh.model in requirements-swh.txt to get access to the `swhid()` method
92	ditto: constant for `"content"`, if it exists
swh/scanner/output.py
65	Looks like this is explicitly recursive, which might make the stack explode for particularly big trees (as Python still does not have tail call optimization). To stay on the safe side I think you should write this with an explicit queue/stack and possibly an auxiliary nested function.
swh/scanner/scanner.py
21	While we are at this, i think we need a better name for this algo. How about simply `bfs`? Or maybe `lazy_bfs` if you want to highlight the "stop" part more.
38	constant
swh/scanner/tests/test_scanner.py
61	We discussed how iter_tree should not be used to output stuff, because it deduplicates and might hence skip sub-trees. Can you hence add an explicit test case with sharing (ideally: of a sub-directory somewhere) and make the test verify that all nodes are present in the output even if they are deduplicated in the in-memory data model?

zack edited the summary of this revision. (Show Details)Jun 28 2021, 9:10 AM

DanSeraf added inline comments.Jun 28 2021, 1:26 PM

swh/scanner/data.py
21–25	Yeah, the comment and the exception here should be changed, the function is actually checking that the key is a valid SWHID.
51	There are some deprecated constants in `swh.model.identifiers`. Shouldn't the `object_type` attribute of `from_disk.Directory/Content` be consistent with the enum `swh.model.identifiers.ObjectType`?
swh/scanner/scanner.py
21	OK!
swh/scanner/tests/test_scanner.py
61	Yes, but we first need to change `swh.model.merkle.iter_tree` to output all the nodes.

use constant object type names in model.identifiers
adapt unit test for scanner.test.test_plot/dashboard
json and ndjson output
unit test for scanner.data
iterate over the merkle tree to output node information (text output)

TODO:

update requirements-swh to use the correct version of swh.model

DanSeraf retitled this revision from WIP: use model.from_disk instead of scanner.model to store a source code project to swh.scanner: use model.from_disk instead of scanner.model to store a source code project.Jul 2 2021, 9:33 AM

DanSeraf edited the summary of this revision. (Show Details)

DanSeraf added a parent revision: D5951: model: make deduplication optional when iterating over the merkle tree.

Build has FAILED

Patch application report for D5926 (id=21418)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit f2648e037cd03577f79507879e8423cfb66dc3da
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/127/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/127/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 2 2021, 9:34 AM

Harbormaster failed remote builds in B22397: Diff 21418!

update requirements-swh

Build has FAILED

Patch application report for D5926 (id=21472)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit 07b9afb41395474bfd718a821e674a53302bdb12
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/128/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/128/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 2 2021, 7:07 PM

Harbormaster failed remote builds in B22449: Diff 21472!

Build has FAILED

Patch application report for D5926 (id=21472)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit b8eca8a5eab5bb1dc98718663297b2b4a5bc1b2a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/129/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/129/console

make pytest pass

Build is green

Patch application report for D5926 (id=21474)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit 8fbc28fc274f5204d142e1bb089b47e23b264b57
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/130/ for more details.

Harbormaster completed remote builds in B22451: Diff 21474.Jul 5 2021, 9:33 AM

DanSeraf requested review of this revision.Jul 5 2021, 9:33 AM

Thanks, both the general structure and implementation look OK.
I'm requesting changes to address two main issues:

the keys in the MerkleNodeData should not be SWHIDs as strings, but rather SWHIDs as instances of CoreSWHID (see comments in the review for rationale)
it looks like in various places you .decode() paths early, which is dangerous as we have no guarantee that path components are actually decodable (to UTF-8). You should delay decoding as much as possible, ideally until final output

After fixing (1), I suggest you grep the core for all instances of str(.*swhid()), or something similar, there shouldn't be many around.
Similarly, after fixing (2), grep for .decode(), ideally they should remain only in output-related modules.

swh/scanner/data.py
14	naming nitpick: I'd call this MerkleNodeInfo, as it's extra data w.r.t. the actual data in the node, which is stored in the `from_disk` structure already
15	better docstring: "store additional information about Merkle DAG nodes, using SWHIDs as keys"
17–18	given the constructor isn't doing anything else, I think you can just leave it out entirely and not define `__init__` here
24–25	As dictionary keys for additional information we should use CoreSWHID instances, not strings. That will avoid having to re-validate SWHID syntactical correctness every time. Also, it is now trivial to obtain CoreSWHID from from_disk instances, thanks to your change in T3393. So, to recap: the scanner should instantiate from_disk objects from the filesystem, then obtain CoreSWHIDs from each node of it using the new swhid() method, and use those as keys for MerkleNodeData
57	this `decode()` looks dangerous, and Path objects work fine with bytes in general, please check if it is avoidable
78	here's an example of why we should use CoreSWHID instances as keys: here you are first rendering an existing CoreSWHID to string, than using it as a dictionary key, which means it will be hashed again. What a waste :)
105	here's another example of rendering/reparsing, this time even worse because it will also go through syntactic validation for SWHID correctness, which is pointless given it came from a CoreSWHID object already
109	Why are you decoding the path here? in general, there is no guarantee that we will be able to decode it at all, so it should be remain bytes as long as possible (until final output). I think this function should return a `Dict[bytes,dict]` instead, delaying decoding.
swh/scanner/scanner.py
30	better: `while queue`

This revision now requires changes to proceed.Jul 6 2021, 10:12 AM

DanSeraf added inline comments.Jul 7 2021, 11:48 AM

swh/scanner/data.py
57	I think it is not possible to avoid `decode()` here, since `Path` wants strings instead of bytes.

zack added inline comments.Jul 7 2021, 11:55 AM

swh/scanner/data.py
57	ouch, you're right (https://bugs.python.org/issue23904) and that sucks oh well

requested changes

Build has FAILED

Patch application report for D5926 (id=21540)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit 761f9659660ced85223693d80f442982f6cc3798
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/131/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/131/console

Harbormaster failed remote builds in B22515: Diff 21540!Jul 7 2021, 5:07 PM

make pytest pass

Build is green

Patch application report for D5926 (id=21547)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit 7e019da43cbc2462b44d802566fb9ab2efea0eee
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/132/ for more details.

Harbormaster completed remote builds in B22522: Diff 21547.Jul 7 2021, 6:08 PM

LGTM, accepted, thanks!

Please note down on the sides two remaining TODOs about all this:

adding a test case with a deduplicated source tree, to make sure nodes that are deduplicated at the Merkle DAG level are present multiple times in the output
adding a test case for a path that is not decodable in utf-8, to make sure it can be handled propertly

(they can be both added later, but I do not want to forget about these)

This revision is now accepted and ready to land.Jul 8 2021, 9:43 AM

commit message

Build is green

Patch application report for D5926 (id=21555)

Rebasing onto d58bcb59a0...

First, rewinding head to replay your work on top of it...
Applying: use model.from_disk instead of scanner.model to store a source code project

Changes applied before test

commit ba339fa62fbab2ff287126ca756a09ad35ab9efc
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project
    
    Closes T3349
    Closes T2730
    Closes T2692

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/133/ for more details.

Harbormaster completed remote builds in B22529: Diff 21555.Jul 8 2021, 3:32 PM

rebase

Build is green

Patch application report for D5926 (id=21556)

Rebasing onto d58bcb59a0...

Current branch diff-target is up to date.

Changes applied before test

commit 0d92c754c8dfa1b30910bf5b08c6df810276060f
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Jun 25 12:46:29 2021 +0100

    use model.from_disk instead of scanner.model to store a source code project
    
    Closes T3349
    Closes T2730
    Closes T2692

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/134/ for more details.

Harbormaster completed remote builds in B22530: Diff 21556.Jul 8 2021, 3:38 PM

Closed by commit rDTSCN0d92c754c8df: use model.from_disk instead of scanner.model to store a source code project (authored by DanSeraf). · Explain WhyJul 8 2021, 3:42 PM

This revision was automatically updated to reflect the committed changes.

DanSeraf added a commit: rDTSCN0d92c754c8df: use model.from_disk instead of scanner.model to store a source code project.

swh.scanner: use model.from_disk instead of scanner.model to store a source code project
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D5926 (id=21272)

Changes applied before test

Patch application report for D5926 (id=21418)

Changes applied before test

Patch application report for D5926 (id=21472)

Changes applied before test

Patch application report for D5926 (id=21472)

Changes applied before test

Patch application report for D5926 (id=21474)

Changes applied before test

Patch application report for D5926 (id=21540)

Changes applied before test

Patch application report for D5926 (id=21547)

Changes applied before test

Patch application report for D5926 (id=21555)

Changes applied before test

Patch application report for D5926 (id=21556)

Changes applied before test

Revision Contents
Changeset List

Diff 21272

swh/scanner/dashboard/dashboard.py

swh/scanner/data.py

swh/scanner/model.py

swh/scanner/output.py

swh/scanner/scanner.py

swh/scanner/tests/data.py

swh/scanner/tests/flask_api.py

swh/scanner/tests/test_model.py

swh/scanner/tests/test_scanner.py

swh.scanner: use model.from_disk instead of scanner.model to store a source code projectClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D5926 (id=21272)

Changes applied before test

Patch application report for D5926 (id=21418)

Changes applied before test

Patch application report for D5926 (id=21472)

Changes applied before test

Patch application report for D5926 (id=21472)

Changes applied before test

Patch application report for D5926 (id=21474)

Changes applied before test

Patch application report for D5926 (id=21540)

Changes applied before test

Patch application report for D5926 (id=21547)

Changes applied before test

Patch application report for D5926 (id=21555)

Changes applied before test

Patch application report for D5926 (id=21556)

Changes applied before test

Revision ContentsChangeset List

Diff 21272

swh/scanner/dashboard/dashboard.py

swh/scanner/data.py

swh/scanner/model.py

swh/scanner/output.py

swh/scanner/scanner.py

swh/scanner/tests/data.py

swh/scanner/tests/flask_api.py

swh/scanner/tests/test_model.py

swh/scanner/tests/test_scanner.py

swh.scanner: use model.from_disk instead of scanner.model to store a source code project
ClosedPublic
Actions

Revision Contents
Changeset List