Page MenuHomeSoftware Heritage

Add a script to generate a topological sort
ClosedPublic

Authored by vlorentz on Nov 24 2022, 4:20 PM.

Details

Summary

This will (probably) be used to compute the set of all contributors to
an origin (T4695) by propagating them through revision history.

This uses the order induced by a DFS, so that most revisions will be
processed right after their unique ancestor (or at least one of them)
to maximize cache hits when propogating the author set.

Additionally, this includes a count of the number of successors of each
revision, so it can be recorded and decremented on every cache hit,
allowing to discard a revision's author set from the cache once all its
successors are processed.

This is implemented as a single-thread algorithm because:

  1. it's simpler
  2. my attempts multi-threading resulted in worse performance, probably because most of the time is spent locking the head of the stack (or the head and/or tail of a queue, as I tried a BFS too)
Test Plan

Works on teaser dataset and the full graph (takes ~35 hours)

Diff Detail

Event Timeline

vlorentz retitled this revision from Add a script to generate a topological sort to [wip] Add a script to generate a topological sort.

Build is green

Patch application report for D8883 (id=32029)

Rebasing onto b27c3958ca...

Current branch diff-target is up to date.
Changes applied before test
commit b2509d9ec107b93933249faa51dd297585fba2ea
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 0c6d796c9d70d84b88c5fe58197003d9594e53cb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit fd4f0769e5a8a37a64f0bbb10a1edccfcef40abe
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:43:44 2022 +0100

    actually multi-thread...

commit b885b8552a97b59c05f2303824fdf655eb8fd9a6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit dbfd98789034b6fafd398b55af62047a9ddd1ec7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    [wip] toposort

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/280/ for more details.

Build is green

Patch application report for D8883 (id=32044)

Rebasing onto 6572c43136...

First, rewinding head to replay your work on top of it...
Applying: [wip] toposort
Applying: tentative multithread DFS
Applying: actually multi-thread...
Applying: revert multithreading, it's actually twice as slow as singlethread
Applying: Add a sample of two ancestor with each node
Applying: Improve comments
Changes applied before test
commit 404b96d2ac504ea81a2731563eea500500275fdc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 28978deef1de3632945e487314deac238f89f35f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit d1050b788641575474ba8f9df800ce8abb6ac4fb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit c05d9d311f543b1e7e458df87edcf362f49dfbe0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:43:44 2022 +0100

    actually multi-thread...

commit 3ae6803cb47de97e6a9b09907b58136b4509aaeb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit d7fcc8ee74cd969fd7e0fbf65548bf90383ce5e7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    [wip] toposort

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/281/ for more details.

vlorentz retitled this revision from [wip] Add a script to generate a topological sort to Add a script to generate a topological sort.Nov 29 2022, 5:55 PM
vlorentz edited the test plan for this revision. (Show Details)
  • Add Luigi task TopoSort and add a simple test

Build is green

Patch application report for D8883 (id=32086)

Rebasing onto ec7f568b13...

Current branch diff-target is up to date.
Changes applied before test
commit 57c6a086b8f20ec292e24b4e130cc5d0b6fe82fe
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 00c0ec8d5b01b2451006398d1bc276854ee2bdd3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 1126ed4c15d950c9358db0af18b3f3bbeb9b4dd0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 142ccc631a1b4a408ad75492ba110f7c61bda5b4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 97e856fbf478c0d911445a58bc46c269a0732679
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:43:44 2022 +0100

    actually multi-thread...

commit 586289fde6e9f52f0dc92601cea6d3e441a9de0e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit 716d8941606dc560bad52b3b949a4bee5d7c0552
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    [wip] toposort

commit 10697bd84d38c33cddc0f7d43fea952c3a1bf912
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/291/ for more details.

Build is green

Patch application report for D8883 (id=32088)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..d00b3a3
Fast-forward
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++++
 mypy.ini                                           |   3 +
 requirements-luigi.txt                             |   1 +
 requirements-swh-luigi.txt                         |   2 +-
 swh/graph/luigi.py                                 | 298 ++++++++++++++++++++-
 swh/graph/tests/test_toposort.py                   |  58 ++++
 6 files changed, 492 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit d00b3a32b29eee9bbaca4f058642b491e71a7e42
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 9f0c1b0c93572fbd3fe2ba3ca6487585357e4f5d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 09474668851de7a370a1197ac4534788afd541fc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 1853c5f9ecda55dfdfa24b9931dabacd9c967279
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 304ae15f001809d8df06eaea2ab4ec7ef90b6604
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit 99820dd121db972af015fe25e5eacc5933240648
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 10697bd84d38c33cddc0f7d43fea952c3a1bf912
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/293/ for more details.

Build is green

Patch application report for D8883 (id=32087)

Rebasing onto ec7f568b13...

Current branch diff-target is up to date.
Changes applied before test
commit d00b3a32b29eee9bbaca4f058642b491e71a7e42
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 9f0c1b0c93572fbd3fe2ba3ca6487585357e4f5d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 09474668851de7a370a1197ac4534788afd541fc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 1853c5f9ecda55dfdfa24b9931dabacd9c967279
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 304ae15f001809d8df06eaea2ab4ec7ef90b6604
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit 99820dd121db972af015fe25e5eacc5933240648
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 10697bd84d38c33cddc0f7d43fea952c3a1bf912
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/292/ for more details.

vlorentz edited the test plan for this revision. (Show Details)
vlorentz edited the test plan for this revision. (Show Details)

Build is green

Patch application report for D8883 (id=32090)

Rebasing onto ec7f568b13...

Current branch diff-target is up to date.
Changes applied before test
commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/295/ for more details.

Build is green

Patch application report for D8883 (id=32091)

Could not rebase; Attempt merge onto ec7f568b13...

Updating ec7f568..39fefbf
Fast-forward
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++++
 mypy.ini                                           |   3 +
 requirements-luigi.txt                             |   1 +
 requirements-swh-luigi.txt                         |   2 +-
 swh/graph/luigi.py                                 | 305 ++++++++++++++++++++-
 swh/graph/tests/test_toposort.py                   |  58 ++++
 6 files changed, 499 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit 39fefbfc108087b4b7f86c39312d1f94f06cc16a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 78b4d9016cfd5025811607c9f6069fea1b39eb23
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 0a651262c32ff3bca6951323a2ab9fe5e5204f97
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 23f9256cd34f97bc3e6dd9eda51c07232f736e0f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit a62fa7f4b7c468ee7ef731986c7d7fc33c7f4042
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ab744a8ada1de4cb6a9d3d904406f9e40d74a3db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit 550235e4e7a04f10e5c9869e5717b16ca5a2edf8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/296/ for more details.

ardumont added inline comments.
java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
75

an ancestor or a successor?

anlambert added a subscriber: anlambert.

LGTM, been a while since I read java code, so verbose (especially for iterations).

java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
75

successor as a leaf has no child nodes.

This revision is now accepted and ready to land.Dec 6 2022, 3:17 PM

LGTM, been a while since I read java code, so verbose (especially for iterations).

yes, same.
I had a hard time going through it...

Build was aborted

Patch application report for D8883 (id=32158)

Could not rebase; Attempt merge onto 0a8ae5de6f...

Updating 0a8ae5d..ab2703e
Fast-forward
 .../org/softwareheritage/graph/utils/TopoSort.java | 134 +++++++++
 mypy.ini                                           |   3 +
 requirements-luigi.txt                             |   1 +
 requirements-swh-luigi.txt                         |   2 +-
 swh/graph/luigi.py                                 | 305 ++++++++++++++++++++-
 swh/graph/tests/test_toposort.py                   |  58 ++++
 6 files changed, 499 insertions(+), 4 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/TopoSort.java
 create mode 100644 swh/graph/tests/test_toposort.py
Changes applied before test
commit ab2703efcb9ad93a3d959596ed7edef27d908164
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:54:30 2022 +0100

    Add Luigi task TopoSort and add a simple test

commit 58f44785816bde0f6cdbf86e3ff6f1fbf385a487
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Nov 28 16:02:56 2022 +0100

    Improve comments

commit 922894410b6e14f5a9eeec445d4a0b503df77a9e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 16:15:04 2022 +0100

    Add a sample of two ancestor with each node
    
    This allows readers to efficiently get ancestors of nodes with low indegree
    (ie. most revisions), as it avoids a random access / API call.

commit 7bee5d47a6eb49ac594f2d019222c176373a5248
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:54:14 2022 +0100

    revert multithreading, it's actually twice as slow as singlethread

commit 30dad16a2365021bedf72df78d0753e125765016
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 12:06:21 2022 +0100

    tentative multithread DFS

commit ed6636c26be869a7309581d0ec664488b4d69e9f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 24 11:49:32 2022 +0100

    Implement a naive topological sort

commit b8dc411ccd304597df96d7dd36158fb86e5239fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 29 17:01:45 2022 +0100

    luigi: Add tasks UploadGraphToS3 and DownloadGraphFromS3

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/311/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/311/console