Page MenuHomeSoftware Heritage

Add ORC exporter
AcceptedPublic

Authored by seirl on Dec 17 2020, 7:40 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Summary

This adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3.

Diff Detail

Repository
rDDATASET Datasets
Branch
orc_exporter
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18011
Build 27820: arc lint + arc unit

Event Timeline

seirl retitled this revision from Add ORC exporter This adds a new exporter in columnar format (Apache ORC) using the PyORC library. The output can be used on various clouds like AWS S3. to Add ORC exporterThis adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3..Dec 17 2020, 7:40 PM
seirl added a reviewer: Reviewers.
seirl retitled this revision from Add ORC exporterThis adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3. to Add ORC exporter.
seirl edited the summary of this revision. (Show Details)
This revision is now accepted and ready to land.Dec 18 2020, 8:34 AM
olasd added inline comments.
swh/dataset/exporters/orc.py
103–105

According to python docs datetime.fromtimestamp uses the UNIX localtime() and can overflow.

I suggest you use the other idiom from the docs:

datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc) + datetime.timedelta(seconds=timestamp["seconds"], microseconds=timestamp["microseconds"])

This form also has the benefit of not going through a float and potentially losing precision.

It also sets the tzinfo object to something explicit; I think your version would have given you a /localized/ datetime?

183–184

This is going to look weird for alias branches (but I don't think you can do much better...)

Ah, after reading through, you resolve alias branches. Nevermind. (I guess this needs to be documented in the export format!)