Add CLI script to generate Luigi config and call it
Closed, Public

Authored by vlorentz on Dec 5 2022, 3:46 PM.

Details

Summary

It can be cumbersome to set paths for all (recursive) dependencies of
the task we want to run; this CLI endpoint takes care of most of them.

Depends on D8917.
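
As a rough illustration of the idea in the summary (not the code in this diff), the sketch below builds a Luigi configuration with one section per task, resolving every path under a single base directory, and then builds the `luigi` command line. The task name `ExampleCompress`, the parameter name `local_graph_path`, and the three helper functions are hypothetical. It only relies on Luigi reading per-task parameters from `[TaskName]` sections of the file named by `LUIGI_CONFIG_PATH`, and on the `luigi --module <module> <Task>` invocation.

```python
import configparser
import os
import subprocess
from pathlib import Path
from typing import Dict, List


def build_luigi_config(
    base_dir: Path, task_paths: Dict[str, Dict[str, str]]
) -> configparser.ConfigParser:
    """Return a config with one section per task (hypothetical helper).

    Every relative path is resolved under base_dir, so the caller only has
    to choose one directory instead of one path per dependency.
    """
    config = configparser.ConfigParser()
    for task_name, params in task_paths.items():
        config[task_name] = {
            param: str(base_dir / relative_path)
            for param, relative_path in params.items()
        }
    return config


def build_luigi_command(module: str, task: str, extra_args: List[str]) -> List[str]:
    """Build the argument list for the luigi executable (hypothetical helper)."""
    return ["luigi", "--module", module, task, *extra_args]


def run_luigi(
    config: configparser.ConfigParser, command: List[str], config_path: Path
) -> None:
    """Write the config file, then run luigi with LUIGI_CONFIG_PATH pointing at it."""
    with open(config_path, "w") as fd:
        config.write(fd)
    env = {**os.environ, "LUIGI_CONFIG_PATH": str(config_path)}
    subprocess.run(command, env=env, check=True)
```

A caller would pass something like `{"ExampleCompress": {"local_graph_path": "compressed"}}` as `task_paths`; parameters that cannot be derived from the base directory would still have to be given explicitly.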

Diff Detail

Repository
rDGRPH Compressed graph representation
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8919 (id=32132)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..aae3da5
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 151 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++
 mypy.ini                                           |   6 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/cli.py                                   | 180 ++++++++-
 swh/graph/luigi.py                                 | 186 ---------
 swh/graph/luigi/__init__.py                        |  75 ++++
 swh/graph/luigi/compressed_graph.py                | 438 +++++++++++++++++++++
 swh/graph/luigi/misc_datasets.py                   |  70 ++++
 swh/graph/luigi/origin_contributors.py             | 188 +++++++++
 swh/graph/luigi/utils.py                           |  34 ++
 .../dataset/compressed/example-labelled.labelobl   | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets       |   3 +-
 .../dataset/compressed/example-labelled.labels     |   2 +-
 .../dataset/compressed/example-labelled.properties |   2 +-
 .../example-transposed-labelled.labelobl           | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets       |   3 +-
 .../compressed/example-transposed-labelled.labels  |   3 +-
 .../example-transposed-labelled.properties         |   2 +-
 .../dataset/compressed/example-transposed.graph    |   2 +-
 .../dataset/compressed/example-transposed.obl      | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets  |   3 +-
 .../compressed/example-transposed.properties       |  52 +--
 .../dataset/compressed/example.edges.count.txt     |   2 +-
 .../dataset/compressed/example.edges.stats.txt     |   8 +-
 swh/graph/tests/dataset/compressed/example.graph   |   2 +-
 .../tests/dataset/compressed/example.indegree      |   5 +-
 .../dataset/compressed/example.labels.count.txt    |   2 +-
 .../dataset/compressed/example.labels.csv.zst      | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray        | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties       |   2 +-
 .../tests/dataset/compressed/example.labels.mph    | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph     | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin      | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt     |   2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt     |   6 +-
 swh/graph/tests/dataset/compressed/example.obl     | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets |   4 +-
 swh/graph/tests/dataset/compressed/example.order   | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree     |   4 +-
 .../tests/dataset/compressed/example.persons.mph   | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties    |  50 +--
 .../compressed/example.property.author_id.bin      | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin          | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin   | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin   | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin       | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin        | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin        |   2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin       |   1 +
 .../example.property.tag_name.offset.bin           | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats   |  28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst   | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst   | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst  | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst  | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py        |  46 ++-
 swh/graph/tests/dataset/img/example.dot            |  13 +-
 .../tests/dataset/orc/content/content-all.orc      | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc  | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc    | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc  | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc  | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc                    | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc      | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc    | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc                 | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc  | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc    | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc    | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc    | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py                        |   4 +-
 swh/graph/tests/test_grpc.py                       |   7 +-
 swh/graph/tests/test_http_client.py                |  18 +-
 swh/graph/tests/test_luigi.py                      |   6 +-
 swh/graph/tests/test_origin_contributors.py        | 186 +++++++++
 swh/graph/tests/test_toposort.py                   |  67 ++++
 88 files changed, 1706 insertions(+), 299 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 delete mode 100644 swh/graph/luigi.py
 create mode 100644 swh/graph/luigi/__init__.py
 create mode 100644 swh/graph/luigi/compressed_graph.py
 create mode 100644 swh/graph/luigi/misc_datasets.py
 create mode 100644 swh/graph/luigi/origin_contributors.py
 create mode 100644 swh/graph/luigi/utils.py
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit aae3da585907336157e4bacef42ccacbe8c8082f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Dec 5 15:46:19 2022 +0100

    Add CLI script to generate Luigi config and call it
    
    It can be cumbersome to set paths for all (recursives) dependencies of
    the task we want to run; this CLI endpoint takes care of most of them.

commit c8057accce23f72c763e0e7bee931568a45a3b7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules
    
    It is going to get large, with the future addition of tasks to generate
    the license dataset and the citation dataset.

commit 172ee6deae3102f904284533b657003daf8c0b21
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit ee09b16376dde6a033a4b6147237cdcfec3f081c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author
    
    This triggers a bug in ListOriginContributors, causing it to include
    "null" as a contributor.
    A future commit will fix this.

commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/306/ for more details.
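
The "Add ListOriginContributors" commit in the report above describes a traversal in topological order that accumulates contributor sets and emits them at origin nodes. The Python sketch below illustrates that technique on a tiny in-memory graph; it is not the Java implementation. The node naming (`ori`, `snp`, `rel`, `rev` prefixes), the edge orientation (each node maps to the nodes that must be visited before it), and the function name are assumptions made for the example. Null contributors are skipped, in the spirit of the later "Ignore null author/committer" commit.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, Iterator, Optional, Set, Tuple


def list_origin_contributors(
    predecessors: Dict[str, Set[str]],
    own_contributors: Dict[str, Set[Optional[str]]],
) -> Iterator[Tuple[str, Set[str]]]:
    """Yield (origin, contributors) pairs for every origin node.

    predecessors maps each node to the nodes visited before it;
    own_contributors maps a node to the people attached to it directly.
    """
    accumulated: Dict[str, Set[str]] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        # Start from the node's own contributors, ignoring null entries.
        contributors = {c for c in own_contributors.get(node, ()) if c is not None}
        # Merge in everything already accumulated on the predecessors.
        for pred in predecessors.get(node, ()):
            contributors |= accumulated[pred]
        accumulated[node] = contributors
        if node.startswith("ori"):
            yield node, contributors


example_predecessors = {
    "rev2": {"rev1"},
    "rel1": {"rev2"},
    "snp1": {"rev2", "rel1"},
    "ori1": {"snp1"},
}
example_contributors = {"rev1": {"alice"}, "rev2": {"bob"}, "rel1": {None}}
result = dict(list_origin_contributors(example_predecessors, example_contributors))
```

This sketch keeps every accumulated set in memory, which is only reasonable for a toy graph.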

anlambert added a subscriber: anlambert.

Could you add a test checking that the Luigi parameters are correctly passed to the subprocess.run call?
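
One possible shape for such a test is sketched below, using `unittest.mock` to replace `subprocess.run` and assert on its arguments. The `run_luigi` function is a hypothetical stand-in for the CLI endpoint, and the task name and parameters are arbitrary; the real test would invoke the actual command instead.

```python
import subprocess
from typing import List
from unittest import mock


def run_luigi(task: str, luigi_params: List[str]) -> None:
    """Hypothetical stand-in for the CLI endpoint: forwards parameters to luigi."""
    subprocess.run(
        ["luigi", "--module", "swh.graph.luigi", task, *luigi_params],
        check=True,
    )


def test_luigi_params_are_forwarded() -> None:
    # Patch subprocess.run so that no process is actually started.
    with mock.patch("subprocess.run") as fake_run:
        run_luigi("ExampleTask", ["--local-scheduler", "--log-level", "INFO"])

    # The parameters must reach subprocess.run unchanged and in order.
    fake_run.assert_called_once_with(
        [
            "luigi",
            "--module",
            "swh.graph.luigi",
            "ExampleTask",
            "--local-scheduler",
            "--log-level",
            "INFO",
        ],
        check=True,
    )
```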

Inline comments on swh/graph/cli.py:

Line 266: s/It/Its/

Line 289: s/It/Its/

Line 297: followed by ???

This revision now requires changes to proceed. (Dec 6 2022, 2:54 PM)
vlorentz marked 3 inline comments as done.

rebase + fix typos + improve readability

Build was aborted

Patch application report for D8919 (id=32163)

Could not rebase; Attempt merge onto 0a8ae5de6f...

Updating 0a8ae5d..b768012
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 151 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++
 mypy.ini                                           |   6 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/cli.py                                   | 178 ++++++++-
 swh/graph/luigi.py                                 | 186 ---------
 swh/graph/luigi/__init__.py                        |  75 ++++
 swh/graph/luigi/compressed_graph.py                | 438 +++++++++++++++++++++
 swh/graph/luigi/misc_datasets.py                   |  70 ++++
 swh/graph/luigi/origin_contributors.py             | 188 +++++++++
 swh/graph/luigi/utils.py                           |  34 ++
 .../dataset/compressed/example-labelled.labelobl   | Bin 0 -> 772 bytes
 .../compressed/example-labelled.labeloffsets       |   3 +-
 .../dataset/compressed/example-labelled.labels     |   2 +-
 .../dataset/compressed/example-labelled.properties |   2 +-
 .../example-transposed-labelled.labelobl           | Bin 0 -> 772 bytes
 .../example-transposed-labelled.labeloffsets       |   3 +-
 .../compressed/example-transposed-labelled.labels  |   3 +-
 .../example-transposed-labelled.properties         |   2 +-
 .../dataset/compressed/example-transposed.graph    |   2 +-
 .../dataset/compressed/example-transposed.obl      | Bin 772 -> 772 bytes
 .../dataset/compressed/example-transposed.offsets  |   3 +-
 .../compressed/example-transposed.properties       |  52 +--
 .../dataset/compressed/example.edges.count.txt     |   2 +-
 .../dataset/compressed/example.edges.stats.txt     |   8 +-
 swh/graph/tests/dataset/compressed/example.graph   |   2 +-
 .../tests/dataset/compressed/example.indegree      |   5 +-
 .../dataset/compressed/example.labels.count.txt    |   2 +-
 .../dataset/compressed/example.labels.csv.zst      | Bin 115 -> 131 bytes
 .../compressed/example.labels.fcl.bytearray        | Bin 110 -> 128 bytes
 .../dataset/compressed/example.labels.fcl.pointers | Bin 16 -> 24 bytes
 .../compressed/example.labels.fcl.properties       |   2 +-
 .../tests/dataset/compressed/example.labels.mph    | Bin 1521 -> 1529 bytes
 swh/graph/tests/dataset/compressed/example.mph     | Bin 961 -> 961 bytes
 .../dataset/compressed/example.node2swhid.bin      | Bin 462 -> 528 bytes
 .../tests/dataset/compressed/example.node2type.map | Bin 353 -> 361 bytes
 .../dataset/compressed/example.nodes.count.txt     |   2 +-
 .../tests/dataset/compressed/example.nodes.csv.zst | Bin 150 -> 181 bytes
 .../dataset/compressed/example.nodes.stats.txt     |   6 +-
 swh/graph/tests/dataset/compressed/example.obl     | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/compressed/example.offsets |   4 +-
 swh/graph/tests/dataset/compressed/example.order   | Bin 168 -> 192 bytes
 .../tests/dataset/compressed/example.outdegree     |   4 +-
 .../tests/dataset/compressed/example.persons.mph   | Bin 961 -> 961 bytes
 .../tests/dataset/compressed/example.properties    |  50 +--
 .../compressed/example.property.author_id.bin      | Bin 84 -> 2112 bytes
 .../example.property.author_timestamp.bin          | Bin 168 -> 4224 bytes
 .../example.property.author_timestamp_offset.bin   | Bin 42 -> 1056 bytes
 .../compressed/example.property.committer_id.bin   | Bin 84 -> 2112 bytes
 .../example.property.committer_timestamp.bin       | Bin 168 -> 4224 bytes
 ...example.property.committer_timestamp_offset.bin | Bin 42 -> 1056 bytes
 .../example.property.content.is_skipped.bin        | Bin 85 -> 149 bytes
 .../compressed/example.property.content.length.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.message.bin        |   2 +
 .../compressed/example.property.message.offset.bin | Bin 168 -> 4224 bytes
 .../compressed/example.property.tag_name.bin       |   1 +
 .../example.property.tag_name.offset.bin           | Bin 168 -> 4224 bytes
 swh/graph/tests/dataset/compressed/example.stats   |  28 +-
 .../dataset/edges/origin/graph-all.edges.csv.zst   | Bin 82 -> 109 bytes
 .../dataset/edges/origin/graph-all.nodes.csv.zst   | Bin 64 -> 95 bytes
 .../dataset/edges/release/graph-all.edges.csv.zst  | Bin 56 -> 73 bytes
 .../dataset/edges/release/graph-all.nodes.csv.zst  | Bin 38 -> 42 bytes
 .../dataset/edges/snapshot/graph-all.edges.csv.zst | Bin 94 -> 128 bytes
 .../dataset/edges/snapshot/graph-all.nodes.csv.zst | Bin 33 -> 38 bytes
 swh/graph/tests/dataset/generate_dataset.py        |  46 ++-
 swh/graph/tests/dataset/img/example.dot            |  13 +-
 .../tests/dataset/orc/content/content-all.orc      | Bin 1240 -> 1226 bytes
 .../tests/dataset/orc/directory/directory-all.orc  | Bin 578 -> 563 bytes
 .../orc/directory_entry/directory_entry-all.orc    | Bin 1126 -> 1115 bytes
 swh/graph/tests/dataset/orc/origin/origin-all.orc  | Bin 817 -> 935 bytes
 .../dataset/orc/origin_visit/origin_visit-all.orc  | Bin 898 -> 924 bytes
 .../origin_visit_status-all.orc                    | Bin 1150 -> 1191 bytes
 .../tests/dataset/orc/release/release-all.orc      | Bin 1361 -> 1407 bytes
 .../tests/dataset/orc/revision/revision-all.orc    | Bin 1658 -> 1643 bytes
 .../revision_extra_headers-all.orc                 | Bin 253 -> 236 bytes
 .../orc/revision_history/revision_history-all.orc  | Bin 700 -> 685 bytes
 .../orc/skipped_content/skipped_content-all.orc    | Bin 1177 -> 1160 bytes
 .../tests/dataset/orc/snapshot/snapshot-all.orc    | Bin 459 -> 456 bytes
 .../orc/snapshot_branch/snapshot_branch-all.orc    | Bin 865 -> 921 bytes
 swh/graph/tests/test_cli.py                        |  54 ++-
 swh/graph/tests/test_grpc.py                       |   7 +-
 swh/graph/tests/test_http_client.py                |  18 +-
 swh/graph/tests/test_luigi.py                      |   6 +-
 swh/graph/tests/test_origin_contributors.py        | 186 +++++++++
 swh/graph/tests/test_toposort.py                   |  67 ++++
 88 files changed, 1753 insertions(+), 300 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 delete mode 100644 swh/graph/luigi.py
 create mode 100644 swh/graph/luigi/__init__.py
 create mode 100644 swh/graph/luigi/compressed_graph.py
 create mode 100644 swh/graph/luigi/misc_datasets.py
 create mode 100644 swh/graph/luigi/origin_contributors.py
 create mode 100644 swh/graph/luigi/utils.py
 create mode 100644 swh/graph/tests/dataset/compressed/example-labelled.labelobl
 create mode 100644 swh/graph/tests/dataset/compressed/example-transposed-labelled.labelobl
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit b76801259953ce2f0035bc7a516ec9c17be4f83e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Dec 5 15:46:19 2022 +0100

    Add CLI script to generate Luigi config and call it
    
    It can be cumbersome to set paths for all (recursives) dependencies of
    the task we want to run; this CLI endpoint takes care of most of them.

commit e65858a73918698996a8066d0df45c3de29b9105
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Dec 5 14:42:12 2022 +0100

    Split swh/graph/luigi.py into modules
    
    It is going to get large, with the future addition of tasks to generate
    the license dataset and the citation dataset.

commit dfd4c1dc3b224477f9adb33c15f6c75bcdf78244
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:35:43 2022 +0100

    ListOriginContributors: Ignore null author/committer in revisions/releases

commit 559d4068bfe1dd50d57062192c0e22664ada03c8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 11:27:35 2022 +0100

    Regenerate the test dataset to include a release with no author
    
    This triggers a bug in ListOriginContributors, causing it to include
    "null" as a contributor.
    A future commit will fix this.

commit f3235e3184850b074b2a332686911688aafcdd84
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/316/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/316/console

This revision was not accepted when it landed; it landed in state Needs Review. (Dec 7 2022, 10:40 AM)
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.