D7487.id27156.diff
diff --git a/docs/graph/dataset.rst b/docs/graph/dataset.rst
--- a/docs/graph/dataset.rst
+++ b/docs/graph/dataset.rst
@@ -1,40 +1,198 @@
Dataset
=======
-We provide the full graph dataset along with two "teaser" datasets that can be
-used for trying out smaller-scale experiments before using the full graph.
+We aim to provide regular exports of the Software Heritage graph in two
+different formats:
+
+- **Columnar data storage**: a set of relational tables stored in a columnar
+ format such as `Apache ORC <https://orc.apache.org/>`_, which is particularly
+ suited for scale-out analyses on data lakes and big data processing
+ ecosystems such as the Hadoop environment.
+
+- **Compressed graph**: a compact and highly efficient representation of the
+ graph dataset, suited for scale-up analysis on high-end machines with large
+ amounts of memory. The graph is compressed in *Boldi-Vigna representation*,
+ designed to be loaded by the `WebGraph framework
+ <https://webgraph.di.unimi.it/>`_, specifically using our `swh-graph
+ library <https://docs.softwareheritage.org/devel/swh-graph/index.html>`_.
+
+
+Summary of dataset versions
+---------------------------
+
+**Full graph**:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - # Nodes
+     - # Edges
+     - Columnar
+     - Compressed
+
+   * - `2021-03-23`_
+     - 20667308808
+     - 232748148441
+     - ✔
+     - ✔
+
+   * - `2020-12-15`_
+     - 19330739526
+     - 213848749638
+     - ✗
+     - ✔
+
+   * - `2020-05-20`_
+     - 17075708289
+     - 203351589619
+     - ✗
+     - ✔
+
+   * - `2019-01-28`_
+     - 11683687950
+     - 159578271511
+     - ✔
+     - ✔
+
+
+**Teaser datasets**:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - # Nodes
+     - # Edges
+     - Columnar
+     - Compressed
+
+   * - `2020-12-15-gitlab-all`_
+     - 1083011764
+     - 27919670049
+     - ✗
+     - ✔
+
+   * - `2020-12-15-gitlab-100k`_
+     - 304037235
+     - 9516984175
+     - ✗
+     - ✔
+
+   * - `2019-01-28-popular-4k`_
+     - ?
+     - ?
+     - ✔
+     - ✗
+
+   * - `2019-01-28-popular-3k-python`_
+     - 27363226
+     - 346413337
+     - ✔
+     - ✔
+
+Full graph datasets
+-------------------
+
+
+2021-03-23
+~~~~~~~~~~
-All the main URLs are relative to our dataset prefix:
-`https://annex.softwareheritage.org/public/dataset/ <https://annex.softwareheritage.org/public/dataset/>`__.
+A full export of the graph, dated March 2021.
-The Software Heritage Graph Dataset contains a table representation of the full
-Software Heritage Graph. It is available in the following formats:
+- **Columnar tables (Apache ORC)**:
-- **PostgreSQL (compressed)**:
+ - **Total size**: 8.4 TiB
+ - **URL**: `/graph/2021-03-23/orc/
+ <https://annex.softwareheritage.org/public/dataset/graph/2021-03-23/orc/>`_
+ - **S3**: ``s3://softwareheritage/graph/2021-03-23/orc``
- - **Total size**: 1.2 TiB
- - **URL**: `/graph/latest/sql/
- <https://annex.softwareheritage.org/public/dataset/graph/latest/sql/>`_
+- **Compressed graph**:
+
+ - **URL**: `/graph/2021-03-23/compressed/
+ <https://annex.softwareheritage.org/public/dataset/graph/2021-03-23/compressed/>`_
+
+
+2020-12-15
+~~~~~~~~~~
+
+A full export of the graph, dated December 2020. Only available in
+compressed representation.
+
+- **Compressed graph**:
+
+ - **URL**: `/graph/2020-12-15/compressed/
+ <https://annex.softwareheritage.org/public/dataset/graph/2020-12-15/compressed/>`_
+
+
+2020-05-20
+~~~~~~~~~~
+
+
+A full export of the graph, dated May 2020. Only available in
+compressed representation.
+**(DEPRECATED: known issue with missing snapshot edges.)**
+
+- **Compressed graph**:
+
+ - **URL**: `/graph/2020-05-20/compressed/
+ <https://annex.softwareheritage.org/public/dataset/graph/2020-05-20/compressed/>`_
+
+
+2019-01-28
+~~~~~~~~~~
+
+A full export of the graph, dated January 2019.
+**(DEPRECATED: early export pipeline with various inconsistencies.)**
-- **Apache Parquet**:
+- **Columnar tables (Apache Parquet)**:
   - **Total size**: 1.2 TiB
- - **URL**: `/graph/latest/parquet/
- <https://annex.softwareheritage.org/public/dataset/graph/latest/parquet/>`_
- - **S3**: ``s3://softwareheritage/graph``
+ - **URL**: `/graph/2019-01-28/parquet/
+ <https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/parquet/>`_
+ - **S3**: ``s3://softwareheritage/graph/2019-01-28/parquet``
+
+- **Compressed graph**:
+
+ - **URL**: `/graph/2019-01-28/compressed/
+ <https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/compressed/>`_
+
Teaser datasets
---------------
-If the above dataset is too big, we also provide the following "teaser"
+If the above datasets are too big, we also provide "teaser"
datasets that can get you started and have a smaller size fingerprint.
-popular-4k
-~~~~~~~~~~
+2020-12-15-gitlab-all
+~~~~~~~~~~~~~~~~~~~~~
+
+A teaser dataset containing the entirety of GitLab, exported in December 2020.
+Available in compressed graph format.
+
+- **Compressed graph**:
+
+ - **URL**: `/graph/2020-12-15-gitlab-all/compressed/
+ <https://annex.softwareheritage.org/public/dataset/graph/2020-12-15-gitlab-all/compressed/>`_
+
+2020-12-15-gitlab-100k
+~~~~~~~~~~~~~~~~~~~~~~
+
+A teaser dataset containing the 100k most popular GitLab repositories,
+exported in December 2020. Available in compressed graph format.
+
+- **Compressed graph**:
-The ``popular-4k`` teaser contains a subset of 4000 popular
-repositories from GitHub, Gitlab, PyPI and Debian. The selection criteria to
-pick the software origins was the following:
+ - **URL**: `/graph/2020-12-15-gitlab-100k/compressed/
+ <https://annex.softwareheritage.org/public/dataset/graph/2020-12-15-gitlab-100k/compressed/>`_
+
+
+2019-01-28-popular-4k
+~~~~~~~~~~~~~~~~~~~~~
+
+This teaser dataset contains a subset of 4000 popular repositories from GitHub,
+GitLab, PyPI and Debian. The criteria used to select the software origins
+were the following:
 - The 1000 most popular GitHub projects (by number of stars)
 - The 1000 most popular Gitlab projects (by number of stars)
@@ -43,23 +201,15 @@
- The 1000 most popular Debian packages (by "votes" according to the `Debian
Popularity Contest <https://popcon.debian.org/>`_ database)
-This teaser is available in the following formats:
-
-- **PostgreSQL (compressed)**:
-
- - **Total size**: 23 GiB
- - **URL**: `/graph/latest/popular-4k/sql/
- <https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/sql/>`_
-
-- **Apache Parquet**:
+- **Columnar tables (Apache Parquet)**:
   - **Total size**: 27 GiB
- - **URL**: `/graph/latest/popular-4k/parquet/
- <https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/parquet/>`_
- - **S3**: ``s3://softwareheritage/teasers/popular-4k``
+ - **URL**: `/graph/2019-01-28-popular-4k/parquet/
+ <https://annex.softwareheritage.org/public/dataset/graph/2019-01-28-popular-4k/parquet/>`_
+ - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/``
-popular-3k-python
-~~~~~~~~~~~~~~~~~
+2019-01-28-popular-3k-python
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``popular-3k-python`` teaser contains a subset of 3052 popular
repositories **tagged as being written in the Python language**, from GitHub,
@@ -75,15 +225,9 @@
"votes" according to the `Debian Popularity Contest
<https://popcon.debian.org/>`_ database).
-- **PostgreSQL (compressed)**:
-
- - **Total size**: 4.7 GiB
- - **URL**: `/graph/latest/popular-3k-python/sql/
- <https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/sql/>`_
-
-- **Apache Parquet**:
+- **Columnar tables (Apache Parquet)**:
   - **Total size**: 5.3 GiB
- - **URL**: `/graph/latest/popular-3k-python/parquet/
- <https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/parquet/>`_
- - **S3**: ``s3://softwareheritage/teasers/popular-4k``
+ - **URL**: `/graph/2019-01-28-popular-3k-python/parquet/
+ <https://annex.softwareheritage.org/public/dataset/graph/2019-01-28-popular-3k-python/parquet/>`_
+ - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-3k-python/parquet/``
diff --git a/docs/graph/index.rst b/docs/graph/index.rst
--- a/docs/graph/index.rst
+++ b/docs/graph/index.rst
@@ -17,9 +17,9 @@
artifacts have been observed in the wild.
The Software Heritage graph dataset is available in multiple formats,
-including downloadable CSV dumps and Apache Parquet files for local use,
-as well as a public instance on Amazon Athena interactive query service
-for ready-to-use powerful analytical processing.
+including relational Apache ORC files for local use, as well as a public
+instance on the Amazon Athena interactive query service for ready-to-use,
+powerful analytical processing.
By accessing the dataset, you agree with the Software Heritage `Ethical
Charter for using the archive
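
Reviewer note: the dataset.rst entries added in this diff all follow one naming pattern, ``graph/<name>/<format>/`` under the annex prefix and ``graph/<name>/<format>`` under the S3 bucket. The sketch below only restates that pattern in Python. The helper names (``annex_url``, ``s3_uri``) and the ``KNOWN_FORMATS`` set are hypothetical, introduced here for illustration. The functions do not check that a given export actually exists in a given format (for example, the listing shows no columnar export for 2020-12-15), and they assume S3 locations are only listed for the columnar formats, as in this diff.

```python
"""Build download locations that follow the pattern used in dataset.rst.

Hypothetical helpers for illustration only; they format strings and do not
contact any server or verify that a dataset/format combination exists.
"""

# Prefixes copied from the URLs and S3 locations shown in the diff.
ANNEX_PREFIX = "https://annex.softwareheritage.org/public/dataset/"
S3_PREFIX = "s3://softwareheritage/"

# Format directory names that appear in the listing (assumed to be exhaustive
# for the entries in this diff, not for the archive in general).
KNOWN_FORMATS = {"orc", "parquet", "compressed"}


def annex_url(name: str, fmt: str) -> str:
    """Return the annex URL for dataset `name` in format `fmt`."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{ANNEX_PREFIX}graph/{name}/{fmt}/"


def s3_uri(name: str, fmt: str) -> str:
    """Return the S3 location for dataset `name` in a columnar format.

    The diff lists S3 locations only for the ORC and Parquet exports, so the
    compressed format is rejected here (an assumption based on the listing).
    """
    if fmt not in KNOWN_FORMATS - {"compressed"}:
        raise ValueError(f"no S3 location listed for format: {fmt!r}")
    return f"{S3_PREFIX}graph/{name}/{fmt}"


if __name__ == "__main__":
    print(annex_url("2021-03-23", "orc"))
    print(s3_uri("2021-03-23", "orc"))
```

Keeping the dataset name and format as separate arguments mirrors the dated directory layout this revision introduces, in place of the former ``latest`` paths.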
Attached To
D7487: Docs: update dataset list with recent datasets