
Add ListOriginContributors
ClosedPublic

Authored by vlorentz on Dec 1 2022, 10:40 AM.

Details

Summary

This Java script (and related Luigi tasks) traverse the graph in
topological order, building up the set of all contributors to a
node and its ancestors, then dump the value of this set for every
origin node they encounter.

Depends on D8883.
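
For readers unfamiliar with the approach, the traversal described in the summary can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the Java code from this diff: the names `topological_order`, `list_origin_contributors`, `successors`, `own_contributors` and `is_origin` are hypothetical, the topological sort is Kahn's algorithm (the diff's own TopoSort may work differently), and edges are assumed to point in the traversal direction, so that origins come last.

```python
from collections import deque


def topological_order(successors):
    """Kahn's algorithm: yield every node after all of its predecessors."""
    nodes = set(successors)
    for targets in successors.values():
        nodes.update(targets)
    indegree = {node: 0 for node in nodes}
    for targets in successors.values():
        for target in targets:
            indegree[target] += 1
    # Start from the nodes that have no predecessor at all.
    ready = deque(sorted(node for node in nodes if indegree[node] == 0))
    while ready:
        node = ready.popleft()
        yield node
        for target in successors.get(node, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)


def list_origin_contributors(successors, own_contributors, is_origin):
    """Map each origin to the contributors of itself and of its ancestors."""
    accumulated = {}  # node -> contributors inherited from already-visited nodes
    result = {}
    for node in topological_order(successors):
        # The inherited set is complete here, because every predecessor
        # was visited (and pushed its set forward) before this node.
        contributors = accumulated.pop(node, set())
        contributors |= set(own_contributors.get(node, ()))
        if is_origin(node):
            result[node] = contributors
        for target in successors.get(node, ()):
            accumulated.setdefault(target, set()).update(contributors)
    return result
```

Popping a node's set once it has been pushed to its successors keeps only the "frontier" in memory, which is the main reason to process nodes in topological order in a sketch like this one.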

Diff Detail

Repository
rDGRPH Compressed graph representation
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8908 (id=32109)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..603e24a
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 143 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini                                           |   3 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/luigi.py                                 | 468 ++++++++++++++++++++-
 swh/graph/tests/test_origin_contributors.py        | 180 ++++++++
 swh/graph/tests/test_toposort.py                   |  59 +++
 10 files changed, 989 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit 603e24a498964309f1c42ac47fd3b8f3caa83405
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/297/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/297/console

Harbormaster returned this revision to the author for changes because remote builds failed. Dec 1 2022, 10:41 AM
Harbormaster failed remote builds in B33062: Diff 32109!

Build is green

Patch application report for D8908 (id=32111)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..36f6230
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 143 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini                                           |   6 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/luigi.py                                 | 468 ++++++++++++++++++++-
 swh/graph/tests/test_origin_contributors.py        | 180 ++++++++
 swh/graph/tests/test_toposort.py                   |  59 +++
 10 files changed, 992 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit 36f62302756d56c3d61234e2d46d582fe5125853
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/298/ for more details.

Build is green

Patch application report for D8908 (id=32112)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..9972a08
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 141 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini                                           |   6 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/luigi.py                                 | 468 ++++++++++++++++++++-
 swh/graph/tests/test_origin_contributors.py        | 180 ++++++++
 swh/graph/tests/test_toposort.py                   |  59 +++
 10 files changed, 990 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit 9972a08685c3d6e45119494ee6404c66a6374f26
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/299/ for more details.

anlambert added a subscriber: anlambert.

LGTM, added a couple of nitpicks as inline comments.

requirements-swh.txt
3

put it in requirements-test.txt then

swh/graph/luigi.py
456–458
with open(tmp_output_path, "wb") as tmp_output:
    subprocess.run(
        ["bash", "-c", script.strip()], stdout=tmp_output, env=env, check=True
    )

I find it more readable this way but maybe there is a reason you cannot use it.
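
For context, a self-contained version of this suggestion could look like the sketch below. The helper name `run_script`, the choice of `env`, and the final rename of the temporary file are assumptions made for illustration; only the `subprocess.run` call itself is taken from the comment above.

```python
import os
import subprocess
from pathlib import Path


def run_script(script: str, output_path: Path) -> None:
    """Run a bash script and store its standard output in output_path."""
    # Assumption: the script simply inherits the caller's environment.
    env = os.environ.copy()
    # Assumption: output goes to a temporary sibling file first, so a failed
    # run never leaves a truncated file at the final location.
    tmp_output_path = output_path.with_name(output_path.name + ".tmp")
    with open(tmp_output_path, "wb") as tmp_output:
        # stdout is written directly to the file by the child process;
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(
            ["bash", "-c", script.strip()], stdout=tmp_output, env=env, check=True
        )
    tmp_output_path.replace(output_path)
```

Because the child process writes to the file descriptor itself, no data passes through Python, which is consistent with the throughput staying the same.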

548–553

More readable by wrapping lines imho.

_run_script(
    f"""
    psql '{self.storage_dsn}' -c "\
        COPY (select encode(digest(fullname, 'sha256'), 'base64') as sha256_base64,\
                      encode(fullname, 'base64') as base64,\
                      encode(fullname, 'escape') as escaped from person)\
        TO STDOUT CSV HEADER\
    " | zstdmt -19
    """,
    self.deanonymization_table_path,
)
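
To make the three exported columns concrete, here is a rough Python equivalent of what the query computes for a single `fullname` value. It is only an illustration: the function names are hypothetical, `digest` is assumed to be the pgcrypto function, and PostgreSQL's `base64` encoding wraps long output over several lines whereas `base64.b64encode` does not.

```python
import base64
import hashlib


def escape_bytea(value: bytes) -> str:
    """Mimic PostgreSQL's encode(..., 'escape') for a bytea value."""
    parts = []
    for byte in value:
        if byte == 0x5C:
            parts.append("\\\\")  # backslashes are doubled
        elif byte == 0 or byte >= 0x80:
            parts.append("\\%03o" % byte)  # zero and high-bit bytes as \nnn octal
        else:
            parts.append(chr(byte))  # every other byte is kept literally
    return "".join(parts)


def deanonymization_row(fullname: bytes):
    """Return the (sha256_base64, base64, escaped) columns for one person."""
    sha256_base64 = base64.b64encode(hashlib.sha256(fullname).digest()).decode("ascii")
    fullname_base64 = base64.b64encode(fullname).decode("ascii")
    return sha256_base64, fullname_base64, escape_bytea(fullname)
```

The first column serves as the lookup key, and the other two allow the original `fullname` to be recovered from it.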

563

s/cas/was/

This revision is now accepted and ready to land. Dec 6 2022, 4:04 PM
vlorentz added inline comments.
requirements-swh.txt
3

It would need to be in requirements-swh-test.txt so it's installed from the local clone, but our tooling doesn't currently support that.

swh/graph/luigi.py
456–458

You're right, it works and has the same throughput and CPU usage.

Build was aborted

Patch application report for D8908 (id=32159)

Could not rebase; Attempt merge onto 0a8ae5de6f...

Updating 0a8ae5d..f3235e3
Fast-forward
 conftest.py                                        |   1 +
 .../graph/utils/ListOriginContributors.java        | 141 +++++++
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 ++++++
 mypy.ini                                           |   6 +
 requirements-luigi.txt                             |   2 +
 requirements-swh-luigi.txt                         |   2 +-
 requirements-swh.txt                               |   1 +
 swh/graph/luigi.py                                 | 468 ++++++++++++++++++++-
 swh/graph/tests/test_origin_contributors.py        | 180 ++++++++
 swh/graph/tests/test_toposort.py                   |  59 +++
 10 files changed, 990 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ListOriginContributors.java
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_origin_contributors.py
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit f3235e3184850b074b2a332686911688aafcdd84
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Dec 1 10:39:09 2022 +0100

    Add ListOriginContributors
    
    This Java script (and related Luigi tasks) traverse the graph in
    topological order, building up the set of all contributors to a
    node and its ancestors, then dump the value of this set for every
    origin node they encounter.

commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/312/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/312/console

This revision was landed with ongoing or failed builds. Dec 7 2022, 10:40 AM
This revision was automatically updated to reflect the committed changes.