Page MenuHomeSoftware Heritage

Add LLP compression to the WebGraph pipeline
Changes PlannedPublic

Authored by seirl on Thu, Jan 7, 5:44 PM.

Diff Detail

Repository
rDGRPH Graph service
Branch
llp_pipeline
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18217
Build 28118: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 28117: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4821 (id=17075)

Could not rebase; Attempt merge onto 07a8f25eae...

Updating 07a8f25..6d9e6ae
Fast-forward
 java/pom.xml                                       |   5 ++
 .../graph/utils/ComposePermutations.java           |  44 ++++++++++++
 swh/graph/cli.py                                   |  10 +--
 swh/graph/config.py                                |   4 +-
 .../tests/dataset/output/example-transposed.graph  |   2 +-
 .../tests/dataset/output/example-transposed.obl    | Bin 772 -> 772 bytes
 .../dataset/output/example-transposed.offsets      |   3 +-
 .../dataset/output/example-transposed.properties   |  44 ++++++------
 swh/graph/tests/dataset/output/example.graph       |   2 +-
 swh/graph/tests/dataset/output/example.mph         | Bin 957 -> 957 bytes
 .../tests/dataset/output/example.node2swhid.bin    | Bin 462 -> 462 bytes
 .../tests/dataset/output/example.node2type.map     | Bin 353 -> 353 bytes
 swh/graph/tests/dataset/output/example.obl         | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/output/example.offsets     |   6 +-
 swh/graph/tests/dataset/output/example.order       | Bin 168 -> 168 bytes
 swh/graph/tests/dataset/output/example.properties  |  44 ++++++------
 swh/graph/tests/dataset/output/example.stats       |  16 ++---
 .../tests/dataset/output/example.swhid2node.bin    | Bin 630 -> 630 bytes
 swh/graph/webgraph.py                              |  77 ++++++++++++++-------
 19 files changed, 168 insertions(+), 89 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java
Changes applied before test
commit 6d9e6aea93478b91aa980f6562f53c85b1794fbb
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 17:21:58 2021 +0100

    pom.xml: add dependency to old commons-lang version to workaround an incorrect import in LLP

commit 918305047e39cf98182d3603bf674a0f7aeb99e1
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 17:19:39 2021 +0100

    Update test dataset to use LLP compressed version

commit 8a3234dbeae4ffe806a37e09b407b10cbe783947
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 16:16:02 2021 +0100

    Add LLP compression to the WebGraph pipeline

commit bcc0e1e5a41249cfd6fd428fcda8aac36b19a51f
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 17:42:39 2021 +0100

    config: sane default for batch_size using a heuristic on ram size

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/62/ for more details.

seirl requested review of this revision.Thu, Jan 7, 5:46 PM
zack requested changes to this revision.Fri, Jan 8, 4:09 PM
zack added inline comments.
java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java
10–12

A docstring stating what this is used for in the context of swh-graph would be nice to have here.
(What a permutation composition is is pretty obvious both in absolute terms and in the code, but a big picture kind of brief explanation would be useful for whoever opens this file.)

swh/graph/webgraph.py
69–73

Still not a fan of this. I understand your argument that is not really useful because the first naive BV graph is loaded only once, but (1) it only takes 12 minutes in a process that takes more than 1 weeky and (2) the subsequent step might fail for whatever reason, and if it does and we have to restart it, we might gain something. Also, the steps are configurable, so people who don't like this can just skipping this step.

90–92

It's probably worth to make these values configurable right now before landing, rather than later. It's easy to do and we still want to play with them quite a bit I think.

Aside from the TODO, are these the "good ones" based on the limited experience we have thus far with this?

130

this should stay if you accept my suggestion of keeping the -bv.obl step

This revision now requires changes to proceed.Fri, Jan 8, 4:09 PM
seirl planned changes to this revision.Fri, Jan 8, 4:17 PM

I'm realizing that this is missing the "simplified" step and needs more changes.

swh/graph/webgraph.py
90–92

Yes, these are the good ones.

130

Not really since we remove the -bv graph entirely, the .obl will be useless.