Split swh/graph/luigi.py into modules
Closed, Public

Authored by vlorentz on Dec 5 2022, 2:46 PM.

Details

Summary

It is going to get large, with the future addition of tasks to generate
the license dataset and the citation dataset.

Depends on D8912.
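
The diffstat in the reports below shows swh/graph/luigi.py being replaced by a swh/graph/luigi/ package with an __init__.py, but not what that __init__.py contains. As an illustration only, one common way to split a module into a package without breaking existing imports is to re-export the public names from __init__.py. The sketch below builds a throwaway package to show the idea; the package name demo_tasks and the class names are hypothetical and are not taken from swh-graph.

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; "demo_tasks" is a hypothetical name.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_tasks"
pkg.mkdir()

# Two submodules, each holding one (placeholder) task class.
(pkg / "compress.py").write_text("class CompressTask:\n    name = 'compress'\n")
(pkg / "datasets.py").write_text("class ExportTask:\n    name = 'export'\n")

# __init__.py re-exports the public names, so callers that used to import
# them from the single-file module keep working after the split.
(pkg / "__init__.py").write_text(
    "from .compress import CompressTask\n"
    "from .datasets import ExportTask\n"
    "__all__ = ['CompressTask', 'ExportTask']\n"
)

sys.path.insert(0, str(root))

# Same import path as before the split, even though the classes now live
# in separate submodules.
from demo_tasks import CompressTask, ExportTask

print(CompressTask.__module__)  # demo_tasks.compress
print(ExportTask.__module__)    # demo_tasks.datasets
```

Whether re-exporting is appropriate depends on how many callers rely on the old import path; new code can import from the submodules directly.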

Event Timeline

Build is green

Patch application report for D8917 (id=32125)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..c8057ac
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 151 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++
 mypy.ini                                           |   6 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/luigi.py                                 | 186 ---------
 swh/graph/luigi/__init__.py                        |  75 ++++
 swh/graph/luigi/compressed_graph.py                | 438 +++++++++++++++++++++
 swh/graph/luigi/misc_datasets.py                   |  70 ++++
 swh/graph/luigi/origin_contributors.py             | 188 +++++++++
 swh/graph/luigi/utils.py                           |  34 ++
 .../dataset/compressed/example-labelled.labelobl   | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets       |   3 +-
 .../dataset/compressed/example-labelled.labels     |   2 +-
 .../dataset/compressed/example-labelled.properties |   2 +-
 .../example-transposed-labelled.labelobl           | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets       |   3 +-
 .../compressed/example-transposed-labelled.labels  |   3 +-
 .../example-transposed-labelled.properties         |   2 +-
 .../dataset/compressed/example-transposed.graph    |   2 +-
 .../dataset/compressed/example-transposed.obl      | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets  |   3 +-
 .../compressed/example-transposed.properties       |  52 +--
 .../dataset/compressed/example.edges.count.txt     |   2 +-
 .../dataset/compressed/example.edges.stats.txt     |   8 +-
 swh/graph/tests/dataset/compressed/example.graph   |   2 +-
 .../tests/dataset/compressed/example.indegree      |   5 +-
 .../dataset/compressed/example.labels.count.txt    |   2 +-
 .../dataset/compressed/example.labels.csv.zst      | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray        | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties       |   2 +-
 .../tests/dataset/compressed/example.labels.mph    | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph     | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin      | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt     |   2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt     |   6 +-
 swh/graph/tests/dataset/compressed/example.obl     | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets |   4 +-
 swh/graph/tests/dataset/compressed/example.order   | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree     |   4 +-
 .../tests/dataset/compressed/example.persons.mph   | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties    |  50 +--
 .../compressed/example.property.author_id.bin      | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin          | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin   | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin   | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin       | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin        | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin        |   2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin       |   1 +
 .../example.property.tag_name.offset.bin           | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats   |  28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst   | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst   | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst  | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst  | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py        |  46 ++-
 swh/graph/tests/dataset/img/example.dot            |  13 +-
 .../tests/dataset/orc/content/content-all.orc      | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc  | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc    | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc  | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc  | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc                    | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc      | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc    | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc                 | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc  | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc    | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc    | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc    | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py                        |   4 +-
 swh/graph/tests/test_grpc.py                       |   7 +-
 swh/graph/tests/test_http_client.py                |  18 +-
 swh/graph/tests/test_luigi.py                      |   6 +-
 swh/graph/tests/test_origin_contributors.py        | 186 +++++++++
 swh/graph/tests/test_toposort.py                   |  67 ++++
 87 files changed, 1527 insertions(+), 298 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 delete mode 100644 swh/graph/luigi.py
 create mode 100644 swh/graph/luigi/__init__.py
 create mode 100644 swh/graph/luigi/compressed_graph.py
 create mode 100644 swh/graph/luigi/misc_datasets.py
 create mode 100644 swh/graph/luigi/origin_contributors.py
 create mode 100644 swh/graph/luigi/utils.py
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit c8057accce23f72c763e0e7bee931568a45a3b7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules
    
    It is going to get large, with the future addition of tasks to generate
    the license dataset and the citation dataset.

commit 172ee6deae3102f904284533b657003daf8c0b21
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit ee09b16376dde6a033a4b6147237cdcfec3f081c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author
    
    This triggers a bug in ListOriginContributors, causing it to include
    "null" as a contributor.
    A future commit will fix this.

commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/305/ for more details.
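
The "Add ListOriginContributors" commit message above describes a traversal in topological order that accumulates the contributors of each node and its ancestors, then dumps the set at every origin node. A minimal Python sketch of that idea follows. It is not the Java implementation from this diff: the graph representation, the function names, and the toy data are assumptions, and the ordering uses Kahn's algorithm rather than whatever TopoSort.java does. Nodes with no author are skipped, in the spirit of the "Ignore null author/committer" commit.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Set


def topological_order(successors: Dict[str, List[str]]) -> List[str]:
    """Kahn's algorithm: every node is listed after all of its predecessors.

    Every node must appear as a key of `successors`, even if it has no
    outgoing edge.
    """
    indegree = {node: 0 for node in successors}
    for targets in successors.values():
        for target in targets:
            indegree[target] += 1
    ready = deque(sorted(node for node, deg in indegree.items() if deg == 0))
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in successors[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(successors):
        raise ValueError("graph contains a cycle")
    return order


def origin_contributors(
    successors: Dict[str, List[str]],
    authors: Dict[str, Optional[str]],
    is_origin: Callable[[str], bool],
) -> Dict[str, Set[str]]:
    """Return, for each origin, the contributors of all of its ancestors."""
    contributors: Dict[str, Set[str]] = {node: set() for node in successors}
    result: Dict[str, Set[str]] = {}
    for node in topological_order(successors):
        person = authors.get(node)
        if person is not None:  # skip nodes with a null author
            contributors[node].add(person)
        if is_origin(node):
            result[node] = set(contributors[node])
        # Push the accumulated set down to every successor.
        for target in successors[node]:
            contributors[target] |= contributors[node]
    return result


# Hypothetical toy graph, with edges pointing from revisions towards origins.
SUCCESSORS = {
    "rev1": ["rev2"],
    "rev2": ["rel1", "snp1"],
    "rel1": ["snp1"],
    "snp1": ["ori1"],
    "ori1": [],
}
AUTHORS = {"rev1": "alice", "rev2": "bob", "rel1": None}  # release with no author

RESULT = origin_contributors(SUCCESSORS, AUTHORS, lambda n: n.startswith("ori"))
print({origin: sorted(people) for origin, people in RESULT.items()})
# prints {'ori1': ['alice', 'bob']}
```

Keeping one set per node is simple but memory-hungry. On a large graph, a real implementation would presumably free each set once all successors have consumed it.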

This revision is now accepted and ready to land. Dec 6 2022, 2:46 PM

Build was aborted

Patch application report for D8917 (id=32162)

Could not rebase; Attempt merge onto 0a8ae5de6f...

Updating 0a8ae5d..e65858a
Fast-forward
Changes applied before test
commit e65858a73918698996a8066d0df45c3de29b9105
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules
    
    It is going to get large, with the future addition of tasks to generate
    the license dataset and the citation dataset.

commit dfd4c1dc3b224477f9adb33c15f6c75bcdf78244
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit 559d4068bfe1dd50d57062192c0e22664ada03c8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author
    
    This triggers a bug in ListOriginContributors, causing it to include
    "null" as a contributor.
    A future commit will fix this.

commit f3235e3184850b074b2a332686911688aafcdd84
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/315/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/315/console

This revision was landed with ongoing or failed builds. Dec 7 2022, 10:40 AM
This revision was automatically updated to reflect the committed changes.