Add ORC exporter

Description

This new exporter allows us to export the SWH graph dataset as a set of
relational tables in a static columnar format called ORC. These files can
then be loaded into data processing engines like Amazon Athena, BigQuery or
Azure Databricks.

This replaces the old scripts that extracted the data directly from the
PostgreSQL database; the export is now integrated with the journal instead.
A notable change is that we now use the ORC format instead of Parquet,
as it supports streamed writes, which simplifies data buffering and
allows for larger dataset files.

Details

Provenance
seirl Authored on Dec 16 2020, 5:00 PM
seirl Pushed on Feb 12 2021, 9:54 PM
Differential Revision
D4762: Add ORC exporter
Parents
rDDATASETe439aa686f22: Edge exporter: use common remove_pull_requests() function