
ExtractNodes: read ORC files in parallel
Closed, Public

Authored by seirl on May 3 2022, 9:11 PM.

Details

Summary

Spawn many sort(1) processes in parallel to avoid locking, then run a single sort -m to merge all the sorted batches.

Benchmarks on popular-3k-python:

Before: 51:52.18 total
After: 5:53.77 total
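
The approach in the summary can be illustrated with a minimal sketch. It is not the code from this diff: the helper names `sort_batch` and `parallel_sort`, the in-memory batches, and the worker count are assumptions made for illustration. Each batch is handed to its own sort(1) process, and the already-sorted outputs are then combined with one `sort -m`, which merges without re-sorting.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Force byte-wise ordering so every sort(1) process agrees on the collation.
SORT_ENV = {**os.environ, "LC_ALL": "C"}


def sort_batch(lines, tmpdir):
    """Sort one batch in its own sort(1) process and return the output path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".sorted")
    os.close(fd)
    data = "".join(line + "\n" for line in lines).encode()
    # Each batch has a private process and output file, so nothing is shared.
    subprocess.run(["sort", "-o", path], input=data, env=SORT_ENV, check=True)
    return path


def parallel_sort(batches, workers=4):
    """Sort batches concurrently, then merge them with a single `sort -m`."""
    with tempfile.TemporaryDirectory() as tmpdir:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            paths = list(pool.map(lambda b: sort_batch(b, tmpdir), batches))
        if not paths:
            return []
        # -m assumes its inputs are already sorted and only merges them.
        merged = subprocess.run(
            ["sort", "-m", *paths],
            env=SORT_ENV, check=True, capture_output=True,
        )
    return merged.stdout.decode().splitlines()
```

Threads are enough here because the work happens in the child sort(1) processes; the Python side only waits on them.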

Diff Detail

Repository
rDGRPH Compressed graph representation
Branch
parallel_extract_nodes
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29041
Build 45400: Phabricator diff pipeline on jenkins
Build 45399: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D7733 (id=27963)

Rebasing onto 2998fc43f8...

Current branch diff-target is up to date.
Changes applied before test
commit 3f6d9f0a9f0a21ee83f8bffb31fc5f16c0adf403
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 3 21:10:12 2022 +0200

    ExtractNodes: spawn many sort(1) in parallel to avoid locking, then a sort -m to merge all the batches

commit fffc2cc6318b19e4aa98982fd52523c682b7000e
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 3 19:25:35 2022 +0200

    ExtractNodes: read ORC files in parallel

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/180/ for more details.

seirl requested review of this revision. May 3 2022, 9:16 PM

Compute sane default for RAM usage

Build is green

Patch application report for D7733 (id=27964)

Rebasing onto 2998fc43f8...

Current branch diff-target is up to date.
Changes applied before test
commit cb3620595185bf8efb5f6237f61ae4c7dd8f25f8
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 3 21:10:12 2022 +0200

    ExtractNodes: spawn many sort(1) in parallel to avoid locking, then a sort -m to merge all the batches

commit fffc2cc6318b19e4aa98982fd52523c682b7000e
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 3 19:25:35 2022 +0200

    ExtractNodes: read ORC files in parallel

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/181/ for more details.

Fix sort buffer size argument (add b suffix)
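
For context on the suffix: GNU sort reads a bare number passed to `-S`/`--buffer-size` as KiB, while a trailing `b` means bytes. The sketch below shows one way a byte budget might be split across several parallel sorts. The function name `sort_buffer_arg` and the even split are assumptions for illustration, not the logic used in this diff.

```python
def sort_buffer_arg(total_bytes, num_sorts):
    """Return a buffer-size argument giving each sort(1) an equal share, in bytes."""
    if num_sorts <= 0:
        raise ValueError("num_sorts must be positive")
    per_sort = max(total_bytes // num_sorts, 1)
    # Without the `b` suffix GNU sort would read this number as KiB.
    return f"--buffer-size={per_sort}b"
```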

Build is green

Patch application report for D7733 (id=27968)

Rebasing onto 2998fc43f8...

Current branch diff-target is up to date.
Changes applied before test
commit 5ffb3f39f9fd4b52c5b5f4b63af7dd5a5c7770e5
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 3 21:10:12 2022 +0200

    ExtractNodes: spawn many sort(1) in parallel to avoid locking, then a sort -m to merge all the batches

commit fffc2cc6318b19e4aa98982fd52523c682b7000e
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue May 3 19:25:35 2022 +0200

    ExtractNodes: read ORC files in parallel

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/182/ for more details.

This revision is now accepted and ready to land. May 4 2022, 3:01 PM