diff --git a/docs/graph/dataset.rst b/docs/graph/dataset.rst
--- a/docs/graph/dataset.rst
+++ b/docs/graph/dataset.rst
@@ -1,40 +1,201 @@
 Dataset
 =======

-We provide the full graph dataset along with two "teaser" datasets that can be
-used for trying out smaller-scale experiments before using the full graph.
+We aim to provide regular exports of the Software Heritage graph in two
+different formats:
+
+- **Columnar data storage**: a set of relational tables stored in a columnar
+  format such as `Apache ORC `_, which is particularly
+  suited for scale-out analyses on data lakes and big data processing
+  ecosystems such as the Hadoop environment.
+
+- **Compressed graph**: a compact and highly efficient representation of the
+  graph dataset, suited for scale-up analysis on high-end machines with large
+  amounts of memory. The graph is compressed in *Boldi-Vigna representation*,
+  designed to be loaded by the `WebGraph framework
+  `_, specifically using our `swh-graph
+  library `_.
+
+
+Summary of dataset versions
+---------------------------
+
+**Full graph**:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - # Nodes
+     - # Edges
+     - Columnar
+     - Compressed
+
+   * - `2021-03-23`_
+     - 20,667,308,808
+     - 232,748,148,441
+     - ✔
+     - ✔
+
+   * - `2020-12-15`_
+     - 19,330,739,526
+     - 213,848,749,638
+     - ✗
+     - ✔
+
+   * - `2020-05-20`_
+     - 17,075,708,289
+     - 203,351,589,619
+     - ✗
+     - ✔
+
+   * - `2019-01-28`_
+     - 11,683,687,950
+     - 159,578,271,511
+     - ✔
+     - ✔
+
+
+**Teaser datasets**:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - # Nodes
+     - # Edges
+     - Columnar
+     - Compressed
+
+   * - `2020-12-15-gitlab-all`_
+     - 1,083,011,764
+     - 27,919,670,049
+     - ✗
+     - ✔
+
+   * - `2020-12-15-gitlab-100k`_
+     - 304,037,235
+     - 9,516,984,175
+     - ✗
+     - ✔
+
+   * - `2019-01-28-popular-4k`_
+     - ?
+     - ?
+     - ✔
+     - ✗
+
+   * - `2019-01-28-popular-3k-python`_
+     - 27,363,226
+     - 346,413,337
+     - ✔
+     - ✔
+
+Full graph datasets
+-------------------
+
+
+2021-03-23
+~~~~~~~~~~

-All the main URLs are relative to our dataset prefix:
-`https://annex.softwareheritage.org/public/dataset/ `__.
+A full export of the graph dated from March 2021.

-The Software Heritage Graph Dataset contains a table representation of the full
-Software Heritage Graph. It is available in the following formats:
+- **Columnar tables (Apache ORC)**:

-- **PostgreSQL (compressed)**:
+  - **Total size**: 8.4 TiB
+  - **URL**: `/graph/2021-03-23/orc/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2021-03-23/orc``

-  - **Total size**: 1.2 TiB
-  - **URL**: `/graph/latest/sql/
-    `_
+- **Compressed graph**:
+
+  - **URL**: `/graph/2021-03-23/compressed/
+    `_
+
+
+2020-12-15
+~~~~~~~~~~
+
+A full export of the graph dated from December 2020. Only available in
+compressed representation.
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2020-12-15/compressed/
+    `_
+
+
+2020-05-20
+~~~~~~~~~~
+
+
+A full export of the graph dated from May 2020. Only available in
+compressed representation.
+**(DEPRECATED: known issue with missing snapshot edges.)**
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2020-05-20/compressed/
+    `_
+
+
+2019-01-28
+~~~~~~~~~~
+
+A full export of the graph dated from January 2019. The export was done in two
+phases, one of them called "2018-09-25" and the other "2019-01-28". They both
+refer to the same dataset, but the different formats have various
+inconsistencies between them.
+**(DEPRECATED: early export pipeline, various inconsistencies).**

-- **Apache Parquet**:
+- **Columnar tables (Apache Parquet)**:

   - **Total size**: 1.2 TiB
-  - **URL**: `/graph/latest/parquet/
-    `_
-  - **S3**: ``s3://softwareheritage/graph``
+  - **URL**: `/graph/2019-01-28/parquet/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2018-09-25/parquet``
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2019-01-28/compressed/
+    `_
+

 Teaser datasets
 ---------------

-If the above dataset is too big, we also provide the following "teaser"
+If the above datasets are too big, we also provide "teaser"
 datasets that can get you started and have a smaller size fingerprint.

-popular-4k
-~~~~~~~~~~
+2020-12-15-gitlab-all
+~~~~~~~~~~~~~~~~~~~~~
+
+A teaser dataset containing the entirety of Gitlab, exported in December 2020.
+Available in compressed graph format.
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2020-12-15-gitlab-all/compressed/
+    `_
+
+2020-12-15-gitlab-100k
+~~~~~~~~~~~~~~~~~~~~~~
+
+A teaser dataset containing the 100k most popular Gitlab repositories,
+exported in December 2020. Available in compressed graph format.
+
+- **Compressed graph**:

-The ``popular-4k`` teaser contains a subset of 4000 popular
-repositories from GitHub, Gitlab, PyPI and Debian. The selection criteria to
-pick the software origins was the following:
+  - **URL**: `/graph/2020-12-15-gitlab-100k/compressed/
+    `_
+
+
+2019-01-28-popular-4k
+~~~~~~~~~~~~~~~~~~~~~
+
+This teaser dataset contains a subset of 4000 popular repositories from GitHub,
+Gitlab, PyPI and Debian. The selection criteria to pick the software origins
+were the following:

 - The 1000 most popular GitHub projects (by number of stars)
 - The 1000 most popular Gitlab projects (by number of stars)
@@ -43,23 +204,15 @@
 - The 1000 most popular Debian packages (by "votes" according to the
   `Debian Popularity Contest `_ database)

-This teaser is available in the following formats:
-
-- **PostgreSQL (compressed)**:
-
-  - **Total size**: 23 GiB
-  - **URL**: `/graph/latest/popular-4k/sql/
-    `_
-
-- **Apache Parquet**:
+- **Columnar (Apache Parquet)**:

   - **Total size**: 27 GiB
-  - **URL**: `/graph/latest/popular-4k/parquet/
-    `_
-  - **S3**: ``s3://softwareheritage/teasers/popular-4k``
+  - **URL**: `/graph/2019-01-28-popular-4k/parquet/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/``

-popular-3k-python
-~~~~~~~~~~~~~~~~~
+2019-01-28-popular-3k-python
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The ``popular-3k-python`` teaser contains a subset of 3052 popular
 repositories **tagged as being written in the Python language**, from GitHub,
@@ -75,15 +228,9 @@
   "votes" according to the `Debian Popularity Contest
   `_ database).

-- **PostgreSQL (compressed)**:
-
-  - **Total size**: 4.7 GiB
-  - **URL**: `/graph/latest/popular-3k-python/sql/
-    `_
-
-- **Apache Parquet**:
+- **Columnar (Apache Parquet)**:

   - **Total size**: 5.3 GiB
-  - **URL**: `/graph/latest/popular-3k-python/parquet/
-    `_
-  - **S3**: ``s3://softwareheritage/teasers/popular-4k``
+  - **URL**: `/graph/2019-01-28-popular-3k-python/parquet/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-3k-python/parquet/``
diff --git a/docs/graph/index.rst b/docs/graph/index.rst
--- a/docs/graph/index.rst
+++ b/docs/graph/index.rst
@@ -17,9 +17,9 @@
 artifacts have been observed in the wild.

 The Software Heritage graph dataset is available in multiple formats,
-including downloadable CSV dumps and Apache Parquet files for local use,
-as well as a public instance on Amazon Athena interactive query service
-for ready-to-use powerful analytical processing.
+including relational Apache ORC files for local use, as well as a public
+instance on Amazon Athena interactive query service for ready-to-use powerful
+analytical processing.

 By accessing the dataset, you agree with the Software Heritage `Ethical Charter for using the archive