This triggers a bug in ListOriginContributors, causing it to include
"null" as a contributor.
A future commit will fix this.
The only meaningful changes in this diff are the .py scripts and the .dot file; everything else is generated.
Depends on D8908.
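The failure mode described in this summary can be illustrated with a minimal Python sketch. This is not the code of ListOriginContributors (which is written in Java); the `Release` class and both `collect_contributors_*` functions are hypothetical names used only for illustration. It shows how a release without an author ends up as "null" once the collected set is serialized, and how a simple guard avoids it.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Release:
    """Hypothetical stand-in for a release node; `author` may be missing."""
    name: str
    author: Optional[str]


def collect_contributors_naive(releases):
    # Adds every author as-is, so a missing author (None) ends up in the set.
    return {r.author for r in releases}


def collect_contributors_guarded(releases):
    # Skips releases that have no author.
    return {r.author for r in releases if r.author is not None}


releases = [Release("v1.0", "alice"), Release("v2.0-anonymous", None)]

naive = collect_contributors_naive(releases)
guarded = collect_contributors_guarded(releases)

# Serializing the naive result renders the missing author as null.
naive_json = json.dumps(sorted(naive, key=str))
guarded_json = json.dumps(sorted(guarded))
print(naive_json)
print(guarded_json)
```

Whether the eventual fix filters at collection time or at output time is not stated here; the guard above is only one plausible option.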
Differential D8910
Regenerate the test dataset to include a release with no author
Authored by vlorentz on Dec 1 2022, 11:31 AM.
Tags: None
Subscribers: None
Diff Detail
Event Timeline
Comment Actions
Build is green
Patch application report for D8910 (id=32113)
Could not rebase; Attempt merge onto ec7f568b13...
Updating ec7f568..67abec7
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 141 ++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/luigi.py | 474 ++++++++++++++++++++-
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../logs/example-1669888191558-extract_nodes.log | 31 ++
 .../compressed/logs/example-1669888192235-mph.log | 15 +
 .../compressed/logs/example-1669888192705-bv.log | 35 ++
 .../compressed/logs/example-1669888198778-bfs.log | 7 +
 .../logs/example-1669888199039-permute_bfs.log | 23 +
 .../logs/example-1669888199374-transpose_bfs.log | 19 +
 .../logs/example-1669888199720-simplify.log | 22 +
 .../compressed/logs/example-1669888199989-llp.log | 143 +++++++
 .../logs/example-1669888200352-permute_llp.log | 23 +
 .../compressed/logs/example-1669888200692-obl.log | 4 +
 .../logs/example-1669888200927-compose_orders.log | 4 +
 .../logs/example-1669888201039-stats.log | 7 +
 .../logs/example-1669888201272-transpose.log | 19 +
 .../logs/example-1669888201615-transpose_obl.log | 4 +
 .../compressed/logs/example-1669888201853-maps.log | 18 +
 .../logs/example-1669888202131-extract_persons.log | 11 +
 .../logs/example-1669888202702-mph_persons.log | 15 +
 .../logs/example-1669888203136-node_properties.log | 36 ++
 .../logs/example-1669888203831-mph_labels.log | 26 ++
 .../logs/example-1669888204319-fcl_labels.log | 6 +
 .../logs/example-1669888204581-edge_labels.log | 39 ++
 .../logs/example-1669888210521-edge_labels_obl.log | 4 +
 ...ple-1669888210788-edge_labels_transpose_obl.log | 4 +
 .../logs/example-1669888211035-clean_tmp.log | 3 +
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 +-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 4 +-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 4 +-
 swh/graph/tests/test_origin_contributors.py | 187 ++++++++
 swh/graph/tests/test_toposort.py | 67 +++
 106 files changed, 1701 insertions(+), 114 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888191558-extract_nodes.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888192235-mph.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888192705-bv.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888198778-bfs.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888199039-permute_bfs.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888199374-transpose_bfs.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888199720-simplify.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888199989-llp.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888200352-permute_llp.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888200692-obl.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888200927-compose_orders.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888201039-stats.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888201272-transpose.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888201615-transpose_obl.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888201853-maps.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888202131-extract_persons.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888202702-mph_persons.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888203136-node_properties.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888203831-mph_labels.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888204319-fcl_labels.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888204581-edge_labels.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888210521-edge_labels_obl.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888210788-edge_labels_transpose_obl.log
 create mode 100644 swh/graph/tests/dataset/compressed/logs/example-1669888211035-clean_tmp.log
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit 67abec7533eb586402a3b30ef3ce0c85f664f064
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100
Regenerate the test dataset to include a release with no author
This triggers a bug in ListOriginContributors, causing it to include
"null" as a contributor.
A future commit will fix this.
commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100
Add ListOriginContributors
This Java script (and related Luigi tasks) traverse the graph in
topological order, building up the set of all contributors to a
node and its ancestors, then dump the value of this set for every
origin node they encounter.
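The traversal described in this commit message can be sketched in Python as follows. This is a simplified in-memory illustration under stated assumptions, not the Java implementation: the function name `list_origin_contributors`, the `(ancestor, node)` edge representation, and the `contributor_of` mapping are all hypothetical, and Kahn's algorithm stands in for whatever topological order the real tool consumes.

```python
from collections import defaultdict, deque


def list_origin_contributors(edges, contributor_of, origins):
    """Return {origin: set of contributors} for a DAG.

    edges: iterable of (ancestor, node) pairs (assumed representation).
    contributor_of: maps a node to its contributor, if it has one.
    origins: set of nodes for which the accumulated set is reported.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for ancestor, node in edges:
        successors[ancestor].append(node)
        indegree[node] += 1
        nodes.update((ancestor, node))

    contributors = {n: set() for n in nodes}
    # Kahn's algorithm: start from nodes that have no ancestors.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    result = {}
    while queue:
        node = queue.popleft()
        person = contributor_of.get(node)
        if person is not None:  # skip nodes without a known contributor
            contributors[node].add(person)
        if node in origins:
            result[node] = set(contributors[node])
        for succ in successors[node]:
            # Propagate everything known so far to the descendant.
            contributors[succ] |= contributors[node]
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)
    return result


example_edges = [
    ("rev1", "rev2"), ("rev2", "snp1"), ("snp1", "ori1"),
    ("rev1", "snp2"), ("snp2", "ori2"),
]
example_result = list_origin_contributors(
    example_edges, {"rev1": "alice", "rev2": "bob"}, {"ori1", "ori2"}
)
print(example_result)
```

A sketch like this keeps every intermediate set in memory; an implementation working on a large graph would presumably need to discard sets once all descendants of a node have been processed.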
commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100
Add Luigi task TopoSort and add a simple test
commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100
Improve comments
commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100
Add a sample of two ancestor with each node
This allows readers to efficiently get ancestors of nodes with low indegree
(ie. most revisions), as it avoids a random access / API call.
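A hedged Python sketch of the idea behind this commit: each row of the topological order carries the total number of ancestors plus a sample of at most two of them, so a reader needs an extra lookup only when a node has more than two ancestors. The `TopoRow` fields, `make_row`, `ancestors_of`, and the `lookup` callback are hypothetical names and do not reflect the actual output format.

```python
from typing import Callable, List, NamedTuple, Optional


class TopoRow(NamedTuple):
    """Hypothetical row layout: a node plus a sample of two ancestors."""
    node: str
    ancestor_count: int
    ancestor1: Optional[str]
    ancestor2: Optional[str]


def make_row(node: str, ancestors: List[str]) -> TopoRow:
    # Pad the sample with None when there are fewer than two ancestors.
    sample = list(ancestors[:2]) + [None, None]
    return TopoRow(node, len(ancestors), sample[0], sample[1])


def ancestors_of(row: TopoRow, lookup: Callable[[str], List[str]]) -> List[str]:
    if row.ancestor_count <= 2:
        # The sample is complete: no random access or API call needed.
        return [a for a in (row.ancestor1, row.ancestor2) if a is not None]
    # Otherwise fall back to the (expensive) full lookup.
    return lookup(row.node)


lookups = []


def slow_lookup(node: str) -> List[str]:
    lookups.append(node)
    return ["p1", "p2", "p3"]


few = ancestors_of(make_row("rev", ["p1"]), slow_lookup)
many = ancestors_of(make_row("merge", ["p1", "p2", "p3"]), slow_lookup)
print(few, many, lookups)
```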
commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100
revert multithreading, it's actually twice as slow as singlethread
commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100
tentative multithread DFS
commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100
Implement a naive topological sort
commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100
luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/300/ for more details.
Comment Actions
Build was aborted
Patch application report for D8910 (id=32116)
Could not rebase; Attempt merge onto ec7f568b13...
Updating ec7f568..ee09b16
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 141 ++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/luigi.py | 474 ++++++++++++++++++++-
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 +-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 4 +-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 4 +-
 swh/graph/tests/test_origin_contributors.py | 187 ++++++++
 swh/graph/tests/test_toposort.py | 67 +++
 82 files changed, 1183 insertions(+), 114 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit ee09b16376dde6a033a4b6147237cdcfec3f081c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100
Regenerate the test dataset to include a release with no author
This triggers a bug in ListOriginContributors, causing it to include
"null" as a contributor.
A future commit will fix this.
commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100
Add ListOriginContributors
This Java script (and related Luigi tasks) traverse the graph in
topological order, building up the set of all contributors to a
node and its ancestors, then dump the value of this set for every
origin node they encounter.
commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100
Add Luigi task TopoSort and add a simple test
commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100
Improve comments
commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100
Add a sample of two ancestor with each node
This allows readers to efficiently get ancestors of nodes with low indegree
(ie. most revisions), as it avoids a random access / API call.
commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100
revert multithreading, it's actually twice as slow as singlethread
commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100
tentative multithread DFS
commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100
Implement a naive topological sort
commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100
luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3
Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/302/
Comment Actions
Build is green
Patch application report for D8910 (id=32116)
Could not rebase; Attempt merge onto ec7f568b13...
Updating ec7f568..ee09b16
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 141 ++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/luigi.py | 474 ++++++++++++++++++++-
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 +-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 4 +-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 4 +-
 swh/graph/tests/test_origin_contributors.py | 187 ++++++++
 swh/graph/tests/test_toposort.py | 67 +++
 82 files changed, 1183 insertions(+), 114 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit ee09b16376dde6a033a4b6147237cdcfec3f081c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100
Regenerate the test dataset to include a release with no author
This triggers a bug in ListOriginContributors, causing it to include
"null" as a contributor.
A future commit will fix this.
commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100
Add ListOriginContributors
This Java script (and related Luigi tasks) traverse the graph in
topological order, building up the set of all contributors to a
node and its ancestors, then dump the value of this set for every
origin node they encounter.
commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100
Add Luigi task TopoSort and add a simple test
commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100
Improve comments
commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100
Add a sample of two ancestor with each node
This allows readers to efficiently get ancestors of nodes with low indegree
(ie. most revisions), as it avoids a random access / API call.
commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100
revert multithreading, it's actually twice as slow as singlethread
commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100
tentative multithread DFS
commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100
Implement a naive topological sort
commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100
luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/304/ for more details.
Comment Actions
Build was aborted
Patch application report for D8910 (id=32160)
Could not rebase; Attempt merge onto 0a8ae5de6f...
Updating 0a8ae5d..559d406
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 141 ++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/luigi.py | 474 ++++++++++++++++++++-
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 +-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 4 +-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 4 +-
 swh/graph/tests/test_origin_contributors.py | 187 ++++++++
 swh/graph/tests/test_toposort.py | 67 +++
 82 files changed, 1183 insertions(+), 114 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit 559d4068bfe1dd50d57062192c0e22664ada03c8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100
Regenerate the test dataset to include a release with no author
This triggers a bug in ListOriginContributors, causing it to include
"null" as a contributor.
A future commit will fix this.
commit f3235e3184850b074b2a332686911688aafcdd84
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100
Add ListOriginContributors
This Java script (and related Luigi tasks) traverse the graph in
topological order, building up the set of all contributors to a
node and its ancestors, then dump the value of this set for every
origin node they encounter.
commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100
Add Luigi task TopoSort and add a simple test
commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100
Improve comments
commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100
Add a sample of two ancestor with each node
This allows readers to efficiently get ancestors of nodes with low indegree
(ie. most revisions), as it avoids a random access / API call.
commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100
revert multithreading, it's actually twice as slow as singlethread
commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100
tentative multithread DFS
commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100
Implement a naive topological sort
commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100
luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3
Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/313/