This adds a new exporter that writes a columnar format (Apache ORC) using the PyORC library. The output can be used on various cloud platforms, such as AWS S3.
Details
- Reviewers
  zack, olasd
- Group Reviewers
  Reviewers
- Commits
  rDDATASETcf125983309e: Add ORC exporter
  rDDATASET35253c89a722: ORC exporter: Add unit tests
  rDDATASETbf8d2625d3b3: Refactor export paths in the base Exporter class
  rDDATASET40f068d648d2: ORC exporter: avoid fromtimestamp(), use datetime() from epoch instead
Diff Detail
- Repository
  rDDATASET Datasets
- Lint
  Automatic diff as part of commit; lint not applicable.
- Unit
  Automatic diff as part of commit; unit tests not applicable.
Event Timeline
swh/dataset/exporters/orc.py

| Lines | Comment |
|---|---|
| 103–105 | According to the Python docs, `datetime.fromtimestamp` uses the UNIX `localtime()` and can overflow. I suggest you use the other idiom from the docs: `datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc) + datetime.timedelta(seconds=timestamp["seconds"], microseconds=timestamp["microseconds"])`. This form also has the benefit of not going through a float and potentially losing precision. It also sets the tzinfo object to something explicit; I think your version would have given you a /localized/ datetime? |
| 183–184 | This is going to look weird for alias branches (but I don't think you can do much better...) Ah, after reading through, you resolve alias branches. Never mind. (I guess this needs to be documented in the export format!) |
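
For reference, the epoch-plus-timedelta idiom suggested in the comment on lines 103–105 can be sketched as below. This is only an illustration, not the code from the diff: the helper name `timestamp_to_datetime` and the `EPOCH` constant are hypothetical, and the input is assumed to be a dict with integer `seconds` and `microseconds` keys, as in the comment.

```python
import datetime

# Hypothetical constant: the UNIX epoch as a timezone-aware UTC datetime.
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def timestamp_to_datetime(timestamp):
    """Convert {"seconds": int, "microseconds": int} to an aware UTC datetime.

    Hypothetical helper: it uses integer arithmetic only, so it never goes
    through a float, and the result always carries an explicit UTC tzinfo.
    """
    return EPOCH + datetime.timedelta(
        seconds=timestamp["seconds"],
        microseconds=timestamp["microseconds"],
    )


# Example inputs, including a pre-1970 value (negative seconds).
after_epoch = timestamp_to_datetime({"seconds": 1, "microseconds": 500000})
before_epoch = timestamp_to_datetime({"seconds": -1, "microseconds": 0})
```

Values that fall outside the range `datetime` can represent would still raise `OverflowError`, so a caller may want to handle that case explicitly.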
I added unit tests and reworked the logic, and also addressed @olasd's comment. Could you please re-review? :-)