
Add graph properties compressed from the ORC dataset
Closed, Public

Authored by seirl on Mar 10 2022, 2:16 PM.

Details

Summary

This commit adds the handling of graph *properties*, i.e., data attached
to nodes or edges (commit timestamps, commit messages, content lengths,
...) to swh-graph.

The class WriteNodeProperties is used to extract the node properties
from the ORCGraphDataset and write them in separate files, in compressed
format. The properties can then be read using the SwhGraphProperties
class.
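To illustrate the general idea of storing each node property in its own file, indexed by node ID, here is a minimal Python sketch. It is not the swh-graph implementation: the function names `write_property` and `read_property`, the `MISSING` sentinel, the file name, and the fixed-width big-endian int64 layout are all assumptions made for this example. The sketch also does no real compression; it only shows the "one file per property, random access by node ID" pattern.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical sentinel meaning "this node has no value for the property".
# This is an assumption for the sketch, not a constant taken from swh-graph.
MISSING = -(2**63)


def write_property(path, values, num_nodes):
    """Write one property as a flat array of big-endian int64 values.

    `values` maps node ID -> integer value; nodes absent from the mapping
    are stored as MISSING so that every node occupies exactly 8 bytes.
    """
    with open(path, "wb") as f:
        for node_id in range(num_nodes):
            f.write(struct.pack(">q", values.get(node_id, MISSING)))


def read_property(path, node_id):
    """Read the value for a single node by seeking to its 8-byte slot."""
    with open(path, "rb") as f:
        f.seek(node_id * 8)
        (value,) = struct.unpack(">q", f.read(8))
    return None if value == MISSING else value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name: one file per property.
        timestamps = Path(tmp) / "example.property.timestamp.bin"
        # Toy data: nodes 0 and 2 have a timestamp, node 1 does not.
        write_property(timestamps, {0: 1644017041, 2: 1648045709}, num_nodes=3)
        for node in range(3):
            print(node, read_property(timestamps, node))
```

Keeping each property in a separate fixed-width file means a reader only opens the properties it needs and can look up any node in constant time, at the cost of spending space on nodes that have no value.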

The compression pipeline and the tests were all changed to use the new
dataset format.

Unfortunately there are a lot of interlocking parts and refactors that I had to
work on in parallel, so this commit is not as... atomic as it could be.

The CI also won't pass until a new version of WebGraph is released.

Diff Detail

Repository
rDGRPH Compressed graph representation
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D7331 (id=26513)

Rebasing onto 45c609c9aa...

First, rewinding head to replay your work on top of it...
Applying: Add graph properties compressed from the ORC dataset
Changes applied before test
commit 08efb0926cd71d0606bf06258a860d659e6e1879
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Sat Feb 5 00:24:01 2022 +0100

    Add graph properties compressed from the ORC dataset
    
    This commit adds the handling of graph *properties*, i.e., data attached
    to nodes or edges (commit timestamps, commit messages, content lengths,
    ...) to swh-graph.
    
    The class WriteNodeProperties is used to extract the node properties
    from the ORCGraphDataset and write them in separate files, in compressed
    format. The properties can then be read using the SwhGraphProperties
    class.
    
    The compression pipeline and the tests were all changed to use the new
    dataset format.

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/168/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/168/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Mar 10 2022, 2:17 PM)
Harbormaster failed remote builds in B27398: Diff 26513!
seirl requested review of this revision. (Mar 18 2022, 3:17 PM)
This revision is now accepted and ready to land. (Mar 18 2022, 3:20 PM)

Rebase, fix WriteNodeProperties enum number

Build has FAILED

Patch application report for D7331 (id=27008)

Rebasing onto 8307b841e9...

Current branch diff-target is up to date.
Changes applied before test
commit 4187f8bd68bf6ddb79f980916022844b30c2928b
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Wed Mar 23 02:08:29 2022 +0100

    compression: add --batch-size to ScatteredArcsASCIIGraph

commit 352d27f4b3f6b230757966ea73eb7c123d56337c
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Sat Feb 5 00:24:01 2022 +0100

    Add graph properties compressed from the ORC dataset
    

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/172/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/172/console

  • Ignore typing in generate_dataset.py to avoid hard dependency to swh-dataset

Build has FAILED

Patch application report for D7331 (id=27019)

Rebasing onto 8307b841e9...

Current branch diff-target is up to date.
Changes applied before test
commit fc0a1f24e6b1eee41634d4ee98c5590412d091be
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Mar 29 16:34:29 2022 +0200

    Ignore typing in generate_dataset.py to avoid hard dependency to swh-dataset

commit 4187f8bd68bf6ddb79f980916022844b30c2928b
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Wed Mar 23 02:08:29 2022 +0100

    compression: add --batch-size to ScatteredArcsASCIIGraph

commit 352d27f4b3f6b230757966ea73eb7c123d56337c
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Sat Feb 5 00:24:01 2022 +0100

    Add graph properties compressed from the ORC dataset
    

Link to build: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/173/
See console output for more information: https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/173/console

  • Ignore typing in generate_dataset.py to avoid hard dependency to swh-dataset

Build is green

Patch application report for D7331 (id=27020)

Rebasing onto 8307b841e9...

Current branch diff-target is up to date.
Changes applied before test
commit 3563007e9adedc4addb27a2cd9299a94463d7e96
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Mar 29 16:34:29 2022 +0200

    Ignore typing in generate_dataset.py to avoid hard dependency to swh-dataset

commit 4187f8bd68bf6ddb79f980916022844b30c2928b
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Wed Mar 23 02:08:29 2022 +0100

    compression: add --batch-size to ScatteredArcsASCIIGraph

commit 352d27f4b3f6b230757966ea73eb7c123d56337c
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Sat Feb 5 00:24:01 2022 +0100

    Add graph properties compressed from the ORC dataset
    

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/174/ for more details.