Page MenuHomeSoftware Heritage

Add LLP compression to the WebGraph pipeline
ClosedPublic

Authored by seirl on Jan 7 2021, 5:44 PM.

Diff Detail

Event Timeline

Build is green

Patch application report for D4821 (id=17075)

Could not rebase; Attempt merge onto 07a8f25eae...

Updating 07a8f25..6d9e6ae
Fast-forward
 java/pom.xml                                       |   5 ++
 .../graph/utils/ComposePermutations.java           |  44 ++++++++++++
 swh/graph/cli.py                                   |  10 +--
 swh/graph/config.py                                |   4 +-
 .../tests/dataset/output/example-transposed.graph  |   2 +-
 .../tests/dataset/output/example-transposed.obl    | Bin 772 -> 772 bytes
 .../dataset/output/example-transposed.offsets      |   3 +-
 .../dataset/output/example-transposed.properties   |  44 ++++++------
 swh/graph/tests/dataset/output/example.graph       |   2 +-
 swh/graph/tests/dataset/output/example.mph         | Bin 957 -> 957 bytes
 .../tests/dataset/output/example.node2swhid.bin    | Bin 462 -> 462 bytes
 .../tests/dataset/output/example.node2type.map     | Bin 353 -> 353 bytes
 swh/graph/tests/dataset/output/example.obl         | Bin 772 -> 772 bytes
 swh/graph/tests/dataset/output/example.offsets     |   6 +-
 swh/graph/tests/dataset/output/example.order       | Bin 168 -> 168 bytes
 swh/graph/tests/dataset/output/example.properties  |  44 ++++++------
 swh/graph/tests/dataset/output/example.stats       |  16 ++---
 .../tests/dataset/output/example.swhid2node.bin    | Bin 630 -> 630 bytes
 swh/graph/webgraph.py                              |  77 ++++++++++++++-------
 19 files changed, 168 insertions(+), 89 deletions(-)
 create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java
Changes applied before test
commit 6d9e6aea93478b91aa980f6562f53c85b1794fbb
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 17:21:58 2021 +0100

    pom.xml: add dependency to old commons-lang version to workaround an incorrect import in LLP

commit 918305047e39cf98182d3603bf674a0f7aeb99e1
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 17:19:39 2021 +0100

    Update test dataset to use LLP compressed version

commit 8a3234dbeae4ffe806a37e09b407b10cbe783947
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 16:16:02 2021 +0100

    Add LLP compression to the WebGraph pipeline

commit bcc0e1e5a41249cfd6fd428fcda8aac36b19a51f
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 17:42:39 2021 +0100

    config: sane default for batch_size using a heuristic on ram size

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/62/ for more details.

seirl requested review of this revision.Jan 7 2021, 5:46 PM
zack requested changes to this revision.Jan 8 2021, 4:09 PM
zack added inline comments.
java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java
10–12

A docstring stating what this is used for in the context of swh-graph would be nice to have here.
(What a permutation composition is is pretty obvious both in absolute terms and in the code, but a big picture kind of brief explanation would be useful for whoever opens this file.)

swh/graph/webgraph.py
69–73

Still not a fan of this. I understand your argument that is not really useful because the first naive BV graph is loaded only once, but (1) it only takes 12 minutes in a process that takes more than 1 weeky and (2) the subsequent step might fail for whatever reason, and if it does and we have to restart it, we might gain something. Also, the steps are configurable, so people who don't like this can just skipping this step.

90–92

It's probably worth to make these values configurable right now before landing, rather than later. It's easy to do and we still want to play with them quite a bit I think.

Aside from the TODO, are these the "good ones" based on the limited experience we have thus far with this?

130

this should stay if you accept my suggestion of keeping the -bv.obl step

This revision now requires changes to proceed.Jan 8 2021, 4:09 PM
seirl planned changes to this revision.Jan 8 2021, 4:17 PM

I'm realizing that this is missing the "simplified" step and needs more changes.

swh/graph/webgraph.py
90–92

Yes, these are the good ones.

130

Not really since we remove the -bv graph entirely, the .obl will be useless.

Add simplify step, fix various review comments

seirl marked 3 inline comments as done.Dec 4 2021, 2:07 AM

Build is green

Patch application report for D4821 (id=24490)

Rebasing onto 58de681bd7...

Current branch diff-target is up to date.
Changes applied before test
commit 00112952614e0e0b72e216dca3ab5ef1439b2a82
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Thu Jan 7 16:16:02 2021 +0100

    Add LLP compression to the WebGraph pipeline
    
    Closes T2647

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/143/ for more details.

vlorentz added inline comments.
java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java
1

you should add a copyright header to all Java files in the repo, btw

swh/graph/config.py
45–46 ↗(On Diff #24490)

could you document this somewhere?

LGTM.

Just to be sure: test_pipeline() from test_cli.py is now run with all new passes as well, and as such it also testes the LLP step(s), correct?
It seems that way to me because test_pipeline() seems to be running all passes, but I'd like this to be double-checked before landing.

swh/graph/config.py
45–46 ↗(On Diff #24490)

Good point. It looks like we don't document configuration options anywhere, whereas we should.
@seirl: please file a separate task about this.

This revision is now accepted and ready to land.Dec 6 2021, 6:08 PM

Yes, this is implicitly all tested under the compression pipeline test.