
Add support for limited row numbers in ORC files
Closed, Public

Authored by douardda on Mar 29 2022, 5:43 PM.

Details

Summary

Make it possible to specify a maximum number of rows a table can store
in a single ORC file. The limit can only be set on main tables for now
(i.e. cannot be specified for tables like revision_history or
directory_entry).

This can be set by configuration only (no extra CLI options).
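
As a rough illustration of the behaviour described in the summary, the sketch below starts a new output chunk once a per-table row limit is reached. It is not the code from this diff: `RotatingTableWriter`, `open_chunk` and the in-memory lists standing in for ORC files are all hypothetical names chosen for the example.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple


class RotatingTableWriter:
    """Hypothetical helper: starts a new output chunk every `max_rows` rows."""

    def __init__(
        self,
        open_chunk: Callable[[int], List[Row]],
        max_rows: Optional[int] = None,
    ) -> None:
        self.open_chunk = open_chunk  # called with the index of the chunk to open
        self.max_rows = max_rows      # None means "no limit"
        self.chunk_index = 0
        self.rows_in_chunk = 0
        self.current: Optional[List[Row]] = None

    def write(self, row: Row) -> None:
        full = self.max_rows is not None and self.rows_in_chunk >= self.max_rows
        if self.current is None or full:
            # Open the first chunk lazily, or rotate once the limit is reached.
            self.current = self.open_chunk(self.chunk_index)
            self.chunk_index += 1
            self.rows_in_chunk = 0
        self.current.append(row)
        self.rows_in_chunk += 1


# In-memory stand-ins for output files (an assumption made for this example).
chunks: List[List[Row]] = []


def open_chunk(index: int) -> List[Row]:
    chunks.append([])
    return chunks[-1]


writer = RotatingTableWriter(open_chunk, max_rows=2)
for i in range(5):
    writer.write((i,))
```

With a limit of two rows, the five rows written above end up in three chunks; with `max_rows=None` they would all land in a single one.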

Depends on D7389.

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7461 (id=27046)

Could not rebase; Attempt merge onto 5a8a8a7847...

Updating 5a8a8a7..e8ccb16
Fast-forward
 swh/dataset/exporters/orc.py |  95 ++++++++++++++++++++++++++++++++++------
 swh/dataset/relational.py    |  10 +++++
 swh/dataset/test/test_orc.py | 100 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 183 insertions(+), 22 deletions(-)
Changes applied before test
commit e8ccb166a6aaa82f5917388f9b995c830499170a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

commit fd3f9aa61de374655fd4bc4920d5047eb7d0c4ca
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:24:31 2022 +0100

    Add the raw_manifest column for revision, release and directory ORC files

commit 5c652bb058e2c1b59bafefd6817f392fdc171a20
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:22:47 2022 +0100

    Export revision extra headers in a dedicated ORC file

commit 45c8124b7a310963a868eb6602ea24e240d761e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Mar 18 12:20:20 2022 +0100

    Add the type fields for revision and origin_visit_status ORC table

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/98/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/dataset/exporters/orc.py
1

bump the date

137–148

so before this diff, self.writers were never closed?

159–163

shouldn't this fail? it will drop data if no one notices the warning...

swh/dataset/test/test_orc.py
6

missing a copyright header btw

This revision is now accepted and ready to land.
Mar 29 2022, 5:59 PM
swh/dataset/exporters/orc.py
137–148

they would thanks to the ExitStack used in the Exporter class
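
For readers unfamiliar with the pattern, here is a minimal sketch of how a `contextlib.ExitStack` can close writers when an exporter exits. It is not the actual `Exporter` class: `SketchExporter`, `FakeWriter` and `get_writer` are hypothetical names used only for illustration.

```python
from contextlib import ExitStack
from typing import Dict


class FakeWriter:
    """Stand-in for a file writer; only records whether it was closed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def __enter__(self) -> "FakeWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.closed = True


class SketchExporter:
    """Hypothetical exporter that registers every writer on an ExitStack."""

    def __init__(self) -> None:
        self.exit_stack = ExitStack()
        self.writers: Dict[str, FakeWriter] = {}

    def __enter__(self) -> "SketchExporter":
        self.exit_stack.__enter__()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Closing the stack exits every registered writer, in reverse order.
        return self.exit_stack.__exit__(exc_type, exc, tb)

    def get_writer(self, table: str) -> FakeWriter:
        if table not in self.writers:
            self.writers[table] = self.exit_stack.enter_context(FakeWriter(table))
        return self.writers[table]


with SketchExporter() as exporter:
    revision_writer = exporter.get_writer("revision")
    open_inside_block = not revision_writer.closed

closed_after_block = revision_writer.closed
```

Under this arrangement, no explicit close call is needed for each writer: leaving the `with` block is enough.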

159–163

yeah that's a concern I have, not sure yet...

douardda added inline comments.
swh/dataset/exporters/orc.py
159–163

Actually I don't think this is a big deal here. If we want to fail, it should be a check done before starting to export anything.

check for max_rows validity at ORCExporter class instantiation + copyrights
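
A fail-early check of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: the `MAIN_TABLES` set, the `validate_max_rows` function and the shape of the `max_rows` mapping (table name to positive integer) are hypothetical and not taken from the diff.

```python
from typing import Dict

# Assumed set of "main" tables, for illustration only.
MAIN_TABLES = {"origin", "snapshot", "release", "revision", "directory", "content"}


def validate_max_rows(max_rows: Dict[str, int]) -> Dict[str, int]:
    """Reject an invalid configuration before any export work starts."""
    invalid = sorted(set(max_rows) - MAIN_TABLES)
    if invalid:
        raise ValueError(
            "Limiting the number of rows is not supported for tables: "
            + ", ".join(invalid)
        )
    for table, limit in max_rows.items():
        if not isinstance(limit, int) or limit <= 0:
            raise ValueError(f"max_rows for {table} must be a positive integer")
    return max_rows


checked = validate_max_rows({"revision": 1000})
```

Raising at construction time means that a misconfigured limit on, for example, `revision_history` stops the run immediately instead of surfacing as a warning in the middle of an export.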

Build has FAILED

Patch application report for D7461 (id=27065)

Rebasing onto fd3f9aa61d...

Current branch diff-target is up to date.
Changes applied before test
commit 850ee3be47cf3b1e0ab53f8820cc5e4c86b94f38
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/104/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/104/console

Build is green

Patch application report for D7461 (id=27074)

Rebasing onto fd3f9aa61d...

Current branch diff-target is up to date.
Changes applied before test
commit 3df08fd71759487e963e6569c8dfd0c502b060de
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Mar 23 16:35:52 2022 +0100

    Add support for limited row numbers in ORC files
    
    Make it possible to specify a maximum number of rows a table can store
    in a single ORC file. The limit can only be set on main tables for now
    (i.e. cannot be specified for tables like revision_history or
    directory_entry).
    
    This can be set by configuration only (no extra cli options).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/109/ for more details.