
Add a ORC file loading function
Needs Review · Public

Authored by douardda on Apr 7 2022, 12:04 PM.

Details

Reviewers: None
Group Reviewers: Reviewers
Summary

This generates swh.model objects from ORC files.
This should make it possible to rebuild a storage from an ORC dataset of the
archive.

Note: not all object types are supported for now (e.g. ExtID and
metadata-related objects are not yet supported).

Depends on D7519

Diff Detail

Unit Tests: Failed

Time: 0 ms
Test: Jenkins > .tox.py3.lib.python3.7.site-packages.swh.dataset.test.test_orc_load

.tox/py3/lib/python3.7/site-packages/swh/dataset/test/test_orc_load.py:10: in <module>
    from .test_orc import orc_export
E   ImportError: attempted relative import with no known parent package

Event Timeline

Build has FAILED

Patch application report for D7520 (id=27283)

Could not rebase; Attempt merge onto 9d97f0c082...

Updating 9d97f0c..0c0df2e
Fast-forward
 swh/dataset/cli.py                |  42 ++++---
 swh/dataset/orc_loader.py         | 254 ++++++++++++++++++++++++++++++++++++++
 swh/dataset/test/test_orc_load.py |  26 ++++
 3 files changed, 303 insertions(+), 19 deletions(-)
 create mode 100644 swh/dataset/orc_loader.py
 create mode 100644 swh/dataset/test/test_orc_load.py
Changes applied before test
commit 0c0df2e01c76aed77a662d7f22481af3d3da0c89
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 14:16:39 2022 +0100

    Add a ORC file loading function
    
    this generates swh.model objects from ORC files.
    This should allow to rebuild a storage from an ORC dataset of the
    archive.
    
    Note: not all object types are supported for now (eg. ExtID, metadata
          related objects, etc. are not yet supported).

commit 18325cc8e78e99ac35f550687e41b6f21c5d3a9f
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Apr 6 16:16:09 2022 +0200

    Reduce cli's loading time by moving import statements in commands

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/125/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/125/console

Harbormaster returned this revision to the author for changes because remote builds failed (Apr 7 2022, 12:06 PM).
Harbormaster failed remote builds in B28213: Diff 27283!

Build has FAILED

Patch application report for D7520 (id=27320)

Rebasing onto 18325cc8e7...

Current branch diff-target is up to date.
Changes applied before test
commit 8be63a1a7d7430794b2e4e31aa6f8af50a074dd4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 14:16:39 2022 +0100

    Add a ORC file loading function
    
    this generates swh.model objects from ORC files.
    This should allow to rebuild a storage from an ORC dataset of the
    archive.
    
    Note: not all object types are supported for now (eg. ExtID, metadata
          related objects, etc. are not yet supported).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/126/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/126/console

Harbormaster returned this revision to the author for changes because remote builds failed (Apr 8 2022, 11:55 AM).
Harbormaster failed remote builds in B28252: Diff 27320!

add forgotten test/__init__.py
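
For context, here is a minimal sketch of why the missing test/__init__.py plausibly produced the ImportError reported by the failed builds. It assumes the tests are collected with pytest's default "prepend" import mode, which imports a test file as a top-level module when its directory has no __init__.py. The helper import_test_module, the throwaway package name demo_orc_pkg, and the file contents are hypothetical; they only mimic the layout and are not the actual swh.dataset code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_test_module(test_file: Path):
    """Simplified imitation of pytest's 'prepend' import mode (an assumption)."""
    # Every parent directory holding an __init__.py is treated as a package part.
    parts = [test_file.stem]
    directory = test_file.parent
    while (directory / "__init__.py").exists():
        parts.insert(0, directory.name)
        directory = directory.parent
    # The first directory without __init__.py becomes the import root.
    sys.path.insert(0, str(directory))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(".".join(parts))
    finally:
        sys.path.remove(str(directory))


def demo():
    """Import the same test file before and after adding test/__init__.py."""
    with tempfile.TemporaryDirectory() as tmp:
        test_dir = Path(tmp) / "demo_orc_pkg" / "test"
        test_dir.mkdir(parents=True)
        (test_dir.parent / "__init__.py").write_text("")
        (test_dir / "test_orc.py").write_text("orc_export = 'fixture'\n")
        (test_dir / "test_orc_load.py").write_text(
            "from .test_orc import orc_export\n"
        )

        # Without test/__init__.py the file is imported as a top-level module,
        # so the relative import has no parent package to resolve against.
        try:
            import_test_module(test_dir / "test_orc_load.py")
            error = None
        except ImportError as exc:
            error = str(exc)

        # With test/__init__.py the file is imported under its package name.
        (test_dir / "__init__.py").write_text("")
        module = import_test_module(test_dir / "test_orc_load.py")
        return error, module.__name__, module.orc_export


if __name__ == "__main__":
    print(demo())
```

In this sketch the first import fails with the same message as the Jenkins report, and the second succeeds once the __init__.py file exists.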

Build is green

Patch application report for D7520 (id=27322)

Rebasing onto 18325cc8e7...

Current branch diff-target is up to date.
Changes applied before test
commit dbf1b87b0cb59a8c76a9928f1efdacd87abcf4ad
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 14:16:39 2022 +0100

    Add a ORC file loading function
    
    this generates swh.model objects from ORC files.
    This should allow to rebuild a storage from an ORC dataset of the
    archive.
    
    Note: not all object types are supported for now (eg. ExtID, metadata
          related objects, etc. are not yet supported).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/127/ for more details.

Have you tried using swh.storage.postgresql.converters.db_to_*? They are very similar to these cvrt_* functions, so you'd only need to convert dates from ISO 8601 strings to datetime objects.
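
As a rough illustration of the date conversion step mentioned in this comment, the sketch below turns ISO 8601 strings into timezone-aware datetime objects using only the standard library. The helper names parse_iso8601 and convert_date_fields, the row layout, and the choice to treat naive timestamps as UTC are all assumptions made for the example; the actual ORC column names and the inputs expected by the db_to_* converters are not shown in this review.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Optional


def parse_iso8601(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string into a timezone-aware datetime."""
    if value is None:
        return None
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def convert_date_fields(row: Dict[str, Any], date_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of a (hypothetical) row dict with its date fields parsed."""
    fields = set(date_fields)
    return {
        key: parse_iso8601(val) if key in fields else val
        for key, val in row.items()
    }


if __name__ == "__main__":
    row = {"id": "abc", "date": "2022-03-18T14:16:39+01:00"}
    print(convert_date_fields(row, ["date"]))
```

A row prepared this way could then be handed to whichever converter applies, provided the remaining fields already match what that converter expects.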

swh/dataset/orc_loader.py

Line 1: missing copyright notice

Lines 13–32: should be replaced by the list defined in swh-model, right?

Line 48: isn't prefix always author_date or committer_date?