This adds a new exporter for a columnar format (Apache ORC) using the PyORC library. The output can be used on various cloud platforms, such as AWS S3.
Diff Detail
- Repository: rDDATASET Datasets
- Branch: orc_exporter
- Lint: No Linters Available
- Unit: No Unit Test Coverage
- Build Status: Buildable 18011, Build 27820: arc lint + arc unit
Event Timeline
swh/dataset/exporters/orc.py | ||
---|---|---|
103–105 | According to the Python docs, `datetime.fromtimestamp` uses the UNIX `localtime()` and can overflow. I suggest you use the other idiom from the docs: `datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc) + datetime.timedelta(seconds=timestamp["seconds"], microseconds=timestamp["microseconds"])`. This form also has the benefit of not going through a float and potentially losing precision. It also sets the tzinfo object to something explicit; I think your version would have given you a /localized/ datetime? | |
183–184 | This is going to look weird for alias branches (but I don't think you can do much better...). Ah, after reading through, I see that you resolve alias branches. Never mind. (I guess this needs to be documented in the export format!) |
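
The idiom suggested in the comment on lines 103–105 can be sketched as follows. This is a minimal illustration, not the code from the diff: the helper name `timestamp_to_datetime` and the constant `EPOCH` are hypothetical, and the input is assumed to be a dict with integer `seconds` and `microseconds` keys, as in the comment.

```python
import datetime

# Fixed, timezone-aware reference point (the UNIX epoch in UTC).
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def timestamp_to_datetime(timestamp):
    """Convert a {"seconds": int, "microseconds": int} dict to an aware datetime.

    Hypothetical helper: it adds an integer-based timedelta to the epoch, so
    the value never passes through a float and the result carries an
    explicit UTC tzinfo instead of depending on the local timezone.
    """
    return EPOCH + datetime.timedelta(
        seconds=timestamp["seconds"],
        microseconds=timestamp["microseconds"],
    )


# Example input, chosen for illustration only.
example = {"seconds": 1_600_000_000, "microseconds": 123456}
converted = timestamp_to_datetime(example)
print(converted.isoformat())  # 2020-09-13T12:26:40.123456+00:00
```

The same helper also handles timestamps before the epoch, since a negative `seconds` value simply yields a negative `timedelta`. Values far enough outside the range `datetime` supports would still raise `OverflowError`, so an exporter may want to handle that case explicitly.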