Add a ORC file loading function
Needs Review, Public

Authored by douardda on Apr 7 2022, 12:04 PM.

Details

Reviewers: None
Group Reviewers: Reviewers
Summary

This generates swh.model objects from ORC files, which should make it
possible to rebuild a storage from an ORC dataset of the archive.

Note: not all object types are supported for now (e.g. ExtID and
metadata-related objects are not yet supported).

Depends on D7519
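
To illustrate the idea in the summary, here is a minimal sketch of turning tabular rows into typed objects through per-type converter functions. It is not the code from this diff: the `Origin` and `Content` dataclasses are stand-ins for swh.model classes, the column names and the hex encoding of `sha1` are assumptions, and the rows are plain tuples rather than the output of an actual ORC reader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Origin:
    """Stand-in for a model class; not the real swh.model Origin."""
    url: str


@dataclass(frozen=True)
class Content:
    """Stand-in for a model class; not the real swh.model Content."""
    sha1: bytes
    length: int


def cvrt_origin(row: Dict[str, object]) -> Origin:
    return Origin(url=str(row["url"]))


def cvrt_content(row: Dict[str, object]) -> Content:
    # Assumption: hashes are stored as hexadecimal strings in the dataset.
    return Content(sha1=bytes.fromhex(str(row["sha1"])), length=int(row["length"]))


# One converter per supported object type; unsupported types are simply absent.
CONVERTERS: Dict[str, Callable[[Dict[str, object]], object]] = {
    "origin": cvrt_origin,
    "content": cvrt_content,
}


def load_rows(
    object_type: str, columns: Sequence[str], rows: Iterable[Sequence[object]]
) -> Iterator[object]:
    """Yield one model object per row, pairing each value with its column name."""
    try:
        convert = CONVERTERS[object_type]
    except KeyError:
        raise NotImplementedError(
            f"object type {object_type!r} is not supported yet"
        ) from None
    for row in rows:
        yield convert(dict(zip(columns, row)))
```

With a real ORC reader, `columns` would come from the file schema and `rows` from iterating over the file; object types without a converter (for example ExtID in this sketch) raise `NotImplementedError`.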

Event Timeline

Build has FAILED

Patch application report for D7520 (id=27283)

Could not rebase; Attempt merge onto 9d97f0c082...

Updating 9d97f0c..0c0df2e
Fast-forward
 swh/dataset/cli.py                |  42 ++++---
 swh/dataset/orc_loader.py         | 254 ++++++++++++++++++++++++++++++++++++++
 swh/dataset/test/test_orc_load.py |  26 ++++
 3 files changed, 303 insertions(+), 19 deletions(-)
 create mode 100644 swh/dataset/orc_loader.py
 create mode 100644 swh/dataset/test/test_orc_load.py
Changes applied before test
commit 0c0df2e01c76aed77a662d7f22481af3d3da0c89
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 14:16:39 2022 +0100

    Add a ORC file loading function
    
    this generates swh.model objects from ORC files.
    This should allow to rebuild a storage from an ORC dataset of the
    archive.
    
    Note: not all object types are supported for now (eg. ExtID, metadata
          related objects, etc. are not yet supported).

commit 18325cc8e78e99ac35f550687e41b6f21c5d3a9f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Apr 6 16:16:09 2022 +0200

    Reduce cli's loading time by moving import statements in commands

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/125/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/125/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Apr 7 2022, 12:06 PM)
Harbormaster failed remote builds in B28213: Diff 27283!

Build has FAILED

Patch application report for D7520 (id=27320)

Rebasing onto 18325cc8e7...

Current branch diff-target is up to date.
Changes applied before test
commit 8be63a1a7d7430794b2e4e31aa6f8af50a074dd4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 14:16:39 2022 +0100

    Add a ORC file loading function
    
    this generates swh.model objects from ORC files.
    This should allow to rebuild a storage from an ORC dataset of the
    archive.
    
    Note: not all object types are supported for now (eg. ExtID, metadata
          related objects, etc. are not yet supported).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/126/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/126/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Apr 8 2022, 11:55 AM)
Harbormaster failed remote builds in B28252: Diff 27320!

add forgotten test/__init__.py

Build is green

Patch application report for D7520 (id=27322)

Rebasing onto 18325cc8e7...

Current branch diff-target is up to date.
Changes applied before test
commit dbf1b87b0cb59a8c76a9928f1efdacd87abcf4ad
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 14:16:39 2022 +0100

    Add a ORC file loading function
    
    this generates swh.model objects from ORC files.
    This should allow to rebuild a storage from an ORC dataset of the
    archive.
    
    Note: not all object types are supported for now (eg. ExtID, metadata
          related objects, etc. are not yet supported).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/127/ for more details.

Have you tried using swh.storage.postgresql.converters.db_to_*? They are very similar to these cvrt_* functions, so you would only need to convert dates from ISO 8601 strings to datetime objects.
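
For reference, the date conversion mentioned in this comment could look roughly like the sketch below. The helper name `iso8601_to_datetime` is hypothetical, and treating timezone-less values as UTC is an assumption about the dataset, not something stated in this review.

```python
from datetime import datetime, timezone


def iso8601_to_datetime(value: str) -> datetime:
    """Parse an ISO 8601 string into a timezone-aware datetime."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are expressed in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

The resulting datetime could then be placed in the row dict before handing it to a converter that expects datetime values.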

swh/dataset/orc_loader.py

Line 1: missing copyright notice

Lines 13–32: should be replaced by the list defined in swh-model, right?

Line 48: isn't prefix always author_date or committer_date?