This applies @anlambert's comments on D8908. (I have since renamed the file and wanted to avoid rebase hell, so I'm opening a new diff.)
Depends on D8919
Differential D8930
origin_contributors: Fix typo and improve readability
Authored by vlorentz on Dec 7 2022, 10:09 AM.
Event Timeline

Comment Actions
Build was aborted
Patch application report for D8930 (id=32164)
Could not rebase; Attempt merge onto 0a8ae5de6f...

Updating 0a8ae5d..b8ddd6c
Fast-forward
 conftest.py | 1 +
 .../graph/utils/ListOriginContributors.java | 151 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++
 mypy.ini | 6 +
 requirements-luigi.txt | 2 +
 requirements-swh-luigi.txt | 2 +-
 requirements-swh.txt | 1 +
 swh/graph/cli.py | 178 ++++++++-
 swh/graph/luigi.py | 186 ---------
 swh/graph/luigi/__init__.py | 75 ++++
 swh/graph/luigi/compressed_graph.py | 438 +++++++++++++++++++++
 swh/graph/luigi/misc_datasets.py | 70 ++++
 swh/graph/luigi/origin_contributors.py | 196 +++++++++
 swh/graph/luigi/utils.py | 35 ++
 .../dataset/compressed/example-labelled.labelobl | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets | 3 +-
 .../dataset/compressed/example-labelled.labels | 2 +-
 .../dataset/compressed/example-labelled.properties | 2 +-
 .../example-transposed-labelled.labelobl | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets | 3 +-
 .../compressed/example-transposed-labelled.labels | 3 +-
 .../example-transposed-labelled.properties | 2 +-
 .../dataset/compressed/example-transposed.graph | 2 +-
 .../dataset/compressed/example-transposed.obl | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets | 3 +-
 .../compressed/example-transposed.properties | 52 +--
 .../dataset/compressed/example.edges.count.txt | 2 +-
 .../dataset/compressed/example.edges.stats.txt | 8 +-
 swh/graph/tests/dataset/compressed/example.graph | 2 +-
 .../tests/dataset/compressed/example.indegree | 5 +-
 .../dataset/compressed/example.labels.count.txt | 2 +-
 .../dataset/compressed/example.labels.csv.zst | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties | 2 +-
 .../tests/dataset/compressed/example.labels.mph | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt | 2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt | 6 +-
 swh/graph/tests/dataset/compressed/example.obl | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets | 4 +-
 swh/graph/tests/dataset/compressed/example.order | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree | 4 +-
 .../tests/dataset/compressed/example.persons.mph | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties | 50 +--
 .../compressed/example.property.author_id.bin | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin | 2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin | 1 +
 .../example.property.tag_name.offset.bin | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats | 28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py | 46 ++-
 swh/graph/tests/dataset/img/example.dot | 13 +-
 .../tests/dataset/orc/content/content-all.orc | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py | 54 ++-
 swh/graph/tests/test_grpc.py | 7 +-
 swh/graph/tests/test_http_client.py | 18 +-
 swh/graph/tests/test_luigi.py | 6 +-
 swh/graph/tests/test_origin_contributors.py | 186 +++++++++
 swh/graph/tests/test_toposort.py | 67 ++++
 88 files changed, 1762 insertions(+), 300 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 delete mode 100644 swh/graph/luigi.py
 create mode 100644 swh/graph/luigi/__init__.py
 create mode 100644 swh/graph/luigi/compressed_graph.py
 create mode 100644 swh/graph/luigi/misc_datasets.py
 create mode 100644 swh/graph/luigi/origin_contributors.py
 create mode 100644 swh/graph/luigi/utils.py
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py

Changes applied before test

commit b8ddd6ceadbdd4ce493fbc6fcf10b192b367475d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Wed Dec 7 10:04:39 2022 +0100

    origin_contributors: Fix typo and improve readability

commit b76801259953ce2f0035bc7a516ec9c17be4f83e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Dec 5 15:46:19 2022 +0100

    Add CLI script to generate Luigi config and call it

    It can be cumbersome to set paths for all (recursives) dependencies of the task we want to run; this CLI endpoint takes care of most of them.

commit e65858a73918698996a8066d0df45c3de29b9105
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules

    It is going to get large, with the future addition of tasks to generate the license dataset and the citation dataset.

commit dfd4c1dc3b224477f9adb33c15f6c75bcdf78244
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit 559d4068bfe1dd50d57062192c0e22664ada03c8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author

    This triggers a bug in ListOriginContributors, causing it to include "null" as a contributor. A future commit will fix this.

commit f3235e3184850b074b2a332686911688aafcdd84
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors

    This Java script (and related Luigi tasks) traverse the graph in topological order, building up the set of all contributors to a node and its ancestors, then dump the value of this set for every origin node they encounter.

commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node

    This allows readers to efficiently get ancestors of nodes with low indegree (ie. most revisions), as it avoids a random access / API call.

commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/317/

Comment Actions
Build has FAILED
Patch application report for D8930 (id=32164)
Could not rebase; Attempt merge onto 0a8ae5de6f...

Same patch application output, diffstat, and applied commits as in the report above.

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/318/