Details
- Reviewers
zack - Maniphest Tasks
- T2647: add LLP support to graph compression pipeline
- Commits
- rDGRPH00112952614e: Add LLP compression to the WebGraph pipeline
Diff Detail
- Repository
- rDGRPH Compressed graph representation
- Lint
Automatic diff as part of commit; lint not applicable. - Unit
Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build is green
Patch application report for D4821 (id=17075)
Could not rebase; Attempt merge onto 07a8f25eae...
Updating 07a8f25..6d9e6ae Fast-forward java/pom.xml | 5 ++ .../graph/utils/ComposePermutations.java | 44 ++++++++++++ swh/graph/cli.py | 10 +-- swh/graph/config.py | 4 +- .../tests/dataset/output/example-transposed.graph | 2 +- .../tests/dataset/output/example-transposed.obl | Bin 772 -> 772 bytes .../dataset/output/example-transposed.offsets | 3 +- .../dataset/output/example-transposed.properties | 44 ++++++------ swh/graph/tests/dataset/output/example.graph | 2 +- swh/graph/tests/dataset/output/example.mph | Bin 957 -> 957 bytes .../tests/dataset/output/example.node2swhid.bin | Bin 462 -> 462 bytes .../tests/dataset/output/example.node2type.map | Bin 353 -> 353 bytes swh/graph/tests/dataset/output/example.obl | Bin 772 -> 772 bytes swh/graph/tests/dataset/output/example.offsets | 6 +- swh/graph/tests/dataset/output/example.order | Bin 168 -> 168 bytes swh/graph/tests/dataset/output/example.properties | 44 ++++++------ swh/graph/tests/dataset/output/example.stats | 16 ++--- .../tests/dataset/output/example.swhid2node.bin | Bin 630 -> 630 bytes swh/graph/webgraph.py | 77 ++++++++++++++------- 19 files changed, 168 insertions(+), 89 deletions(-) create mode 100644 java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java
Changes applied before test
commit 6d9e6aea93478b91aa980f6562f53c85b1794fbb Author: Antoine Pietri <antoine.pietri1@gmail.com> Date: Thu Jan 7 17:21:58 2021 +0100 pom.xml: add dependency to old commons-lang version to workaround an incorrect import in LLP commit 918305047e39cf98182d3603bf674a0f7aeb99e1 Author: Antoine Pietri <antoine.pietri1@gmail.com> Date: Thu Jan 7 17:19:39 2021 +0100 Update test dataset to use LLP compressed version commit 8a3234dbeae4ffe806a37e09b407b10cbe783947 Author: Antoine Pietri <antoine.pietri1@gmail.com> Date: Thu Jan 7 16:16:02 2021 +0100 Add LLP compression to the WebGraph pipeline commit bcc0e1e5a41249cfd6fd428fcda8aac36b19a51f Author: Antoine Pietri <antoine.pietri1@gmail.com> Date: Thu Jan 7 17:42:39 2021 +0100 config: sane default for batch_size using a heuristic on ram size
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/62/ for more details.
java/src/main/java/org/softwareheritage/graph/utils/ComposePermutations.java | ||
---|---|---|
10–12 | A docstring stating what this is used for in the context of swh-graph would be nice to have here. | |
swh/graph/webgraph.py | ||
73–77 | Still not a fan of this. I understand your argument that is not really useful because the first naive BV graph is loaded only once, but (1) it only takes 12 minutes in a process that takes more than 1 weeky and (2) the subsequent step might fail for whatever reason, and if it does and we have to restart it, we might gain something. Also, the steps are configurable, so people who don't like this can just skipping this step. | |
130–132 | It's probably worth to make these values configurable right now before landing, rather than later. It's easy to do and we still want to play with them quite a bit I think. Aside from the TODO, are these the "good ones" based on the limited experience we have thus far with this? | |
134 | this should stay if you accept my suggestion of keeping the -bv.obl step |
Build is green
Patch application report for D4821 (id=24490)
Rebasing onto 58de681bd7...
Current branch diff-target is up to date.
Changes applied before test
commit 00112952614e0e0b72e216dca3ab5ef1439b2a82 Author: Antoine Pietri <antoine.pietri1@gmail.com> Date: Thu Jan 7 16:16:02 2021 +0100 Add LLP compression to the WebGraph pipeline Closes T2647
See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/143/ for more details.
LGTM.
Just to be sure: test_pipeline() from test_cli.py is now run with all new passes as well, and as such it also testes the LLP step(s), correct?
It seems that way to me because test_pipeline() seems to be running all passes, but I'd like this to be double-checked before landing.
swh/graph/config.py | ||
---|---|---|
44–45 | Good point. It looks like we don't document configuration options anywhere, whereas we should. |