Page MenuHomeSoftware Heritage

Add ORC exporter
ClosedPublic

Authored by seirl on Dec 17 2020, 7:40 PM.

Details

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

seirl retitled this revision from Add ORC exporter This adds a new exporter in columnar format (Apache ORC) using the PyORC library. The output can be used on various clouds like AWS S3. to Add ORC exporterThis adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3..Dec 17 2020, 7:40 PM
seirl added a reviewer: Reviewers.
seirl retitled this revision from Add ORC exporterThis adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3. to Add ORC exporter.
seirl edited the summary of this revision. (Show Details)
This revision is now accepted and ready to land.Dec 18 2020, 8:34 AM
olasd added inline comments.
swh/dataset/exporters/orc.py
103–105

According to python docs datetime.fromtimestamp uses the UNIX localtime() and can overflow.

I suggest you use the other idiom from the docs:

datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc) + datetime.timedelta(seconds=timestamp["seconds"], microseconds=timestamp["microseconds"])

This form also has the benefit of not going through a float and potentially losing precision.

It also sets the tzinfo object to something explicit; I think your version would have given you a /localized/ datetime?

183–184

This is going to look weird for alias branches (but I don't think you can do much better...)

Ah, after reading through, you resolve alias branches. Nevermind. (I guess this needs to be documented in the export format!)

  • Add unit tests
  • Refactor export paths in the base Exporter class

ORC exporter: avoid fromtimestamp(), use datetime() from epoch instead

I added unit tests and reworked the logic, and also addressed @olasd 's comment. Could you please rereview? :-)

This revision is now accepted and ready to land.Feb 15 2021, 5:44 PM