diff --git a/docs/graph/dataset.rst b/docs/graph/dataset.rst
index 855b7a7..e19e578 100644
--- a/docs/graph/dataset.rst
+++ b/docs/graph/dataset.rst
@@ -1,89 +1,236 @@
 Dataset
 =======
 
-We provide the full graph dataset along with two "teaser" datasets that can be
-used for trying out smaller-scale experiments before using the full graph.
+We aim to provide regular exports of the Software Heritage graph in two
+different formats:
+
+- **Columnar data storage**: a set of relational tables stored in a columnar
+  format such as `Apache ORC `_, which is particularly
+  suited for scale-out analyses on data lakes and big data processing
+  ecosystems such as the Hadoop environment.
+
+- **Compressed graph**: a compact and highly-efficient representation of the
+  graph dataset, suited for scale-up analysis on high-end machines with large
+  amounts of memory. The graph is compressed in *Boldi-Vigna representation*,
+  designed to be loaded by the `WebGraph framework
+  `_, specifically using our `swh-graph
+  library `_.
+
+
+Summary of dataset versions
+---------------------------
+
+**Full graph**:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - # Nodes
+     - # Edges
+     - Columnar
+     - Compressed
+
+   * - `2021-03-23`_
+     - 20,667,308,808
+     - 232,748,148,441
+     - ✔
+     - ✔
+
+   * - `2020-12-15`_
+     - 19,330,739,526
+     - 213,848,749,638
+     - ✗
+     - ✔
+
+   * - `2020-05-20`_
+     - 17,075,708,289
+     - 203,351,589,619
+     - ✗
+     - ✔
+
+   * - `2019-01-28`_
+     - 11,683,687,950
+     - 159,578,271,511
+     - ✔
+     - ✔
+
+
+**Teaser datasets**:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - # Nodes
+     - # Edges
+     - Columnar
+     - Compressed
+
+   * - `2020-12-15-gitlab-all`_
+     - 1,083,011,764
+     - 27,919,670,049
+     - ✗
+     - ✔
+
+   * - `2020-12-15-gitlab-100k`_
+     - 304,037,235
+     - 9,516,984,175
+     - ✗
+     - ✔
+
+   * - `2019-01-28-popular-4k`_
+     - ?
+     - ?
+     - ✔
+     - ✗
+
+   * - `2019-01-28-popular-3k-python`_
+     - 27,363,226
+     - 346,413,337
+     - ✔
+     - ✔
+
+Full graph datasets
+-------------------
+
+
+2021-03-23
+~~~~~~~~~~
 
-All the main URLs are relative to our dataset prefix:
-`https://annex.softwareheritage.org/public/dataset/ `__.
+A full export of the graph dated from March 2021.
 
-The Software Heritage Graph Dataset contains a table representation of the full
-Software Heritage Graph. It is available in the following formats:
+- **Columnar tables (Apache ORC)**:
 
-- **PostgreSQL (compressed)**:
+  - **Total size**: 8.4 TiB
+  - **URL**: `/graph/2021-03-23/orc/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2021-03-23/orc``
 
-  - **Total size**: 1.2 TiB
-  - **URL**: `/graph/latest/sql/
-    `_
+- **Compressed graph**:
+
+  - **URL**: `/graph/2021-03-23/compressed/
+    `_
+
+
+2020-12-15
+~~~~~~~~~~
+
+A full export of the graph dated from December 2020. Only available in
+compressed representation.
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2020-12-15/compressed/
+    `_
+
+
+2020-05-20
+~~~~~~~~~~
+
+
+A full export of the graph dated from May 2020. Only available in
+compressed representation.
+**(DEPRECATED: known issue with missing snapshot edges.)**
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2020-05-20/compressed/
+    `_
+
+
+2019-01-28
+~~~~~~~~~~
+
+A full export of the graph dated from January 2019. The export was done in two
+phases, one of them called "2018-09-25" and the other "2019-01-28". They both
+refer to the same dataset, but the different formats have various
+inconsistencies between them.
+**(DEPRECATED: early export pipeline, various inconsistencies).**
 
-- **Apache Parquet**:
+- **Columnar tables (Apache Parquet)**:
 
   - **Total size**: 1.2 TiB
-  - **URL**: `/graph/latest/parquet/
-    `_
-  - **S3**: ``s3://softwareheritage/graph``
+  - **URL**: `/graph/2019-01-28/parquet/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2018-09-25/parquet``
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2019-01-28/compressed/
+    `_
+
 
 Teaser datasets
 ---------------
 
-If the above dataset is too big, we also provide the following "teaser"
+If the above datasets are too big, we also provide "teaser"
 datasets that can get you started and have a smaller size fingerprint.
 
-popular-4k
-~~~~~~~~~~
+2020-12-15-gitlab-all
+~~~~~~~~~~~~~~~~~~~~~
+
+A teaser dataset containing the entirety of Gitlab, exported in December 2020.
+Available in compressed graph format.
+
+- **Compressed graph**:
+
+  - **URL**: `/graph/2020-12-15-gitlab-all/compressed/
+    `_
+
+2020-12-15-gitlab-100k
+~~~~~~~~~~~~~~~~~~~~~~
+
+A teaser dataset containing the 100k most popular Gitlab repositories,
+exported in December 2020. Available in compressed graph format.
+
+- **Compressed graph**:
 
-The ``popular-4k`` teaser contains a subset of 4000 popular
-repositories from GitHub, Gitlab, PyPI and Debian. The selection criteria to
-pick the software origins was the following:
+  - **URL**: `/graph/2020-12-15-gitlab-100k/compressed/
+    `_
+
+
+2019-01-28-popular-4k
+~~~~~~~~~~~~~~~~~~~~~
+
+This teaser dataset contains a subset of 4000 popular repositories from GitHub,
+Gitlab, PyPI and Debian. The selection criteria to pick the software origins
+was the following:
 
 - The 1000 most popular GitHub projects (by number of stars)
 - The 1000 most popular Gitlab projects (by number of stars)
 - The 1000 most popular PyPI projects (by usage statistics, according to the
   `Top PyPI Packages `_ database),
 - The 1000 most popular Debian packages (by "votes" according to the `Debian
   Popularity Contest `_ database)
 
-This teaser is available in the following formats:
-
-- **PostgreSQL (compressed)**:
-
-  - **Total size**: 23 GiB
-  - **URL**: `/graph/latest/popular-4k/sql/
-    `_
-
-- **Apache Parquet**:
+- **Columnar (Apache Parquet)**:
 
   - **Total size**: 27 GiB
-  - **URL**: `/graph/latest/popular-4k/parquet/
-    `_
-  - **S3**: ``s3://softwareheritage/teasers/popular-4k``
+  - **URL**: `/graph/2019-01-28-popular-4k/parquet/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/``
 
-popular-3k-python
-~~~~~~~~~~~~~~~~~
+2019-01-28-popular-3k-python
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The ``popular-3k-python`` teaser contains a subset of 3052 popular
 repositories **tagged as being written in the Python language**, from
 GitHub, Gitlab, PyPI and Debian. The selection criteria to pick the software
 origins was the following, similar to ``popular-4k``:
 
 - the 1000 most popular GitHub projects written in Python (by number of stars),
 - the 131 Gitlab projects written in Python that have 2 stars or more,
 - the 1000 most popular PyPI projects (by usage statistics, according to the
   `Top PyPI Packages `_ database),
 - the 1000 most popular Debian packages with the `debtag `_
   ``implemented-in::python`` (by "votes" according to the `Debian Popularity
   Contest `_ database).
 
-- **PostgreSQL (compressed)**:
-
-  - **Total size**: 4.7 GiB
-  - **URL**: `/graph/latest/popular-3k-python/sql/
-    `_
-
-- **Apache Parquet**:
+- **Columnar (Apache Parquet)**:
 
   - **Total size**: 5.3 GiB
-  - **URL**: `/graph/latest/popular-3k-python/parquet/
-    `_
-  - **S3**: ``s3://softwareheritage/teasers/popular-4k``
+  - **URL**: `/graph/2019-01-28-popular-3k-python/parquet/
+    `_
+  - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-3k-python/parquet/``
diff --git a/docs/graph/index.rst b/docs/graph/index.rst
index 58b96ef..bba72f9 100644
--- a/docs/graph/index.rst
+++ b/docs/graph/index.rst
@@ -1,56 +1,56 @@
 .. _swh-graph-dataset:
 
 Software Heritage Graph Dataset
 ===============================
 
 This is the Software Heritage graph dataset: a fully-deduplicated Merkle
 DAG representation of the Software Heritage archive. The dataset links
 together file content identifiers, source code directories, Version
 Control System (VCS) commits tracking evolution over time, up to the
 full states of VCS repositories as observed by Software Heritage during
 periodic crawls. The dataset’s contents come from major development
 forges (including `GitHub `__ and `GitLab `__), FOSS distributions
 (e.g., `Debian `__), and language-specific package managers (e.g.,
 `PyPI `__). Crawling information is also included, providing
 timestamps about when and where all archived source code artifacts have
 been observed in the wild.
 
 The Software Heritage graph dataset is available in multiple formats,
-including downloadable CSV dumps and Apache Parquet files for local use,
-as well as a public instance on Amazon Athena interactive query service
-for ready-to-use powerful analytical processing.
+including relational Apache ORC files for local use, as well as a public
+instance on Amazon Athena interactive query service for ready-to-use powerful
+analytical processing.
 
 By accessing the dataset, you agree with the Software Heritage `Ethical
 Charter for using the archive data `__, and the `terms of use for bulk
 access `__.
 
 
 If you use this dataset for research purposes, please cite the following paper:
 
 * | Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
   | *The Software Heritage Graph Dataset: Public software development under one roof.*
   | In proceedings of `MSR 2019 `_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 `_.
   | `preprint `_, `bibtex `_
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
    :titlesonly:
 
    dataset
    schema
    postgresql
    athena
    databricks
 
 
 Indices and tables
 ------------------
 
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`