It is going to get large, with the future addition of tasks to generate
the license dataset and the citation dataset.
Depends on D8912.
Differential D8917
Split swh/graph/luigi.py into modules
Authored by vlorentz on Dec 5 2022, 2:46 PM.
Tags: None
Subscribers: None
Details
It is going to get large, with the future addition of tasks to generate
the license dataset and the citation dataset.
Depends on D8912.
Diff Detail
Event Timeline

Build is green

Patch application report for D8917 (id=32125)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..c8057ac
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 151 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/luigi.py | 186 ---------
 swh/graph/luigi/__init__.py | 75 ++++
 swh/graph/luigi/compressed_graph.py | 438 +++++++++++++++++++++
 swh/graph/luigi/misc_datasets.py | 70 ++++
 swh/graph/luigi/origin_contributors.py | 188 +++++++++
 swh/graph/luigi/utils.py | 34 ++
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 ++-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 4 +-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 6 +-
 swh/graph/tests/test_origin_contributors.py | 186 +++++++++
 swh/graph/tests/test_toposort.py | 67 ++++
 87 files changed, 1527 insertions(+), 298 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 delete mode 100644 swh/graph/luigi.py
 create mode 100644 swh/graph/luigi/__init__.py
 create mode 100644 swh/graph/luigi/compressed_graph.py
 create mode 100644 swh/graph/luigi/misc_datasets.py
 create mode 100644 swh/graph/luigi/origin_contributors.py
 create mode 100644 swh/graph/luigi/utils.py
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py

Changes applied before test

commit c8057accce23f72c763e0e7bee931568a45a3b7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules

    It is going to get large, with the future addition of tasks to generate
    the license dataset and the citation dataset.

commit 172ee6deae3102f904284533b657003daf8c0b21
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit ee09b16376dde6a033a4b6147237cdcfec3f081c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author

    This triggers a bug in ListOriginContributors, causing it to include
    "null" as a contributor. A future commit will fix this.

commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors

    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a node
    and its ancestors, then dump the value of this set for every origin
    node they encounter.

commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node

    This allows readers to efficiently get ancestors of nodes with low
    indegree (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/305/ for more details.

Build was aborted

Patch application report for D8917 (id=32162)

Could not rebase; Attempt merge onto 0a8ae5de6f...
Updating 0a8ae5d..e65858a
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 151 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/luigi.py | 186 ---------
 swh/graph/luigi/__init__.py | 75 ++++
 swh/graph/luigi/compressed_graph.py | 438 +++++++++++++++++++++
 swh/graph/luigi/misc_datasets.py | 70 ++++
 swh/graph/luigi/origin_contributors.py | 188 +++++++++
 swh/graph/luigi/utils.py | 34 ++
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 ++-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 4 +-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 6 +-
 swh/graph/tests/test_origin_contributors.py | 186 +++++++++
 swh/graph/tests/test_toposort.py | 67 ++++
 87 files changed, 1527 insertions(+), 298 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 delete mode 100644 swh/graph/luigi.py
 create mode 100644 swh/graph/luigi/__init__.py
 create mode 100644 swh/graph/luigi/compressed_graph.py
 create mode 100644 swh/graph/luigi/misc_datasets.py
 create mode 100644 swh/graph/luigi/origin_contributors.py
 create mode 100644 swh/graph/luigi/utils.py
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py

Changes applied before test

commit e65858a73918698996a8066d0df45c3de29b9105
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules

    It is going to get large, with the future addition of tasks to generate
    the license dataset and the citation dataset.

commit dfd4c1dc3b224477f9adb33c15f6c75bcdf78244
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit 559d4068bfe1dd50d57062192c0e22664ada03c8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author

    This triggers a bug in ListOriginContributors, causing it to include
    "null" as a contributor. A future commit will fix this.

commit f3235e3184850b074b2a332686911688aafcdd84
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors

    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a node
    and its ancestors, then dump the value of this set for every origin
    node they encounter.

commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node

    This allows readers to efficiently get ancestors of nodes with low
    indegree (ie. most revisions), as it avoids a random access / API call.

commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/315/
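
For readers unfamiliar with the approach named in the "Add ListOriginContributors" commit message (traverse the graph in topological order, accumulate the contributors of a node and its ancestors, and emit the set at each origin), here is a minimal Python sketch of that idea. It is not the Java code from this diff: the function names, the `ori:` prefix check, the edge direction (ancestor to descendant) and the toy graph are all assumptions made for illustration, and it uses Kahn's algorithm rather than whatever TopoSort.java actually does.

```python
from collections import deque


def topological_order(successors):
    """Return nodes so that each node comes after all its ancestors (Kahn's algorithm).

    `successors` maps every node to the list of nodes it points to;
    every node must appear as a key (hypothetical input format).
    """
    indegree = {node: 0 for node in successors}
    for targets in successors.values():
        for target in targets:
            indegree[target] += 1
    # Start from nodes that have no ancestors.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in successors[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(successors):
        raise ValueError("graph has a cycle")
    return order


def origin_contributors(successors, own_contributors):
    """Map each origin node to the contributors of itself and all its ancestors."""
    # Drop None entries, mirroring the "ignore null author/committer" fix.
    contributors = {
        node: {p for p in own_contributors.get(node, ()) if p is not None}
        for node in successors
    }
    result = {}
    for node in topological_order(successors):
        # All ancestors were already visited, so this node's set is complete.
        for target in successors[node]:
            contributors[target] |= contributors[node]
        if node.startswith("ori:"):  # assumed naming convention for origins
            result[node] = contributors[node]
    return result


# Toy graph (hypothetical): two revisions, a release without author, a snapshot, an origin.
example_successors = {
    "rev:1": ["rev:2"],
    "rev:2": ["rel:1", "snp:1"],
    "rel:1": ["snp:1"],
    "snp:1": ["ori:1"],
    "ori:1": [],
}
example_contributors = {
    "rev:1": ["alice"],
    "rev:2": ["bob", "alice"],
    "rel:1": [None],
}
```

Calling `origin_contributors(example_successors, example_contributors)` on the toy graph yields one entry for `ori:1` containing `alice` and `bob`. The sketch keeps one full set per node in memory, which is fine for a toy graph but would need a more frugal strategy at the scale of a real archive graph.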