diff --git a/docs/graph/dataset.rst b/docs/graph/dataset.rst
index e19e578..19ab422 100644
--- a/docs/graph/dataset.rst
+++ b/docs/graph/dataset.rst
@@ -1,236 +1,238 @@
+.. _swh-dataset-list:
+
Dataset
=======
We aim to provide regular exports of the Software Heritage graph in two
different formats:
- **Columnar data storage**: a set of relational tables stored in a columnar
format such as `Apache ORC `_, which is particularly
suited for scale-out analyses on data lakes and big data processing
ecosystems such as the Hadoop environment.
- **Compressed graph**: a compact and highly-efficient representation of the
graph dataset, suited for scale-up analysis on high-end machines with large
amounts of memory. The graph is compressed in *Boldi-Vigna representation*,
designed to be loaded by the `WebGraph framework
`_, specifically using our `swh-graph
library `_.
Summary of dataset versions
---------------------------
**Full graph**:
.. list-table::
:header-rows: 1
* - Name
- # Nodes
- # Edges
- Columnar
- Compressed
* - `2021-03-23`_
- 20,667,308,808
- 232,748,148,441
- ✔
- ✔
* - `2020-12-15`_
- 19,330,739,526
- 213,848,749,638
- ✗
- ✔
* - `2020-05-20`_
- 17,075,708,289
- 203,351,589,619
- ✗
- ✔
* - `2019-01-28`_
- 11,683,687,950
- 159,578,271,511
- ✔
- ✔
**Teaser datasets**:
.. list-table::
:header-rows: 1
* - Name
- # Nodes
- # Edges
- Columnar
- Compressed
* - `2020-12-15-gitlab-all`_
- 1,083,011,764
- 27,919,670,049
- ✗
- ✔
* - `2020-12-15-gitlab-100k`_
- 304,037,235
- 9,516,984,175
- ✗
- ✔
* - `2019-01-28-popular-4k`_
- ?
- ?
- ✔
- ✗
* - `2019-01-28-popular-3k-python`_
- 27,363,226
- 346,413,337
- ✔
- ✔
Full graph datasets
-------------------
2021-03-23
~~~~~~~~~~
A full export of the graph dated from March 2021.
- **Columnar tables (Apache ORC)**:
- **Total size**: 8.4 TiB
- **URL**: `/graph/2021-03-23/orc/
`_
- **S3**: ``s3://softwareheritage/graph/2021-03-23/orc``
- **Compressed graph**:
- **URL**: `/graph/2021-03-23/compressed/
`_
2020-12-15
~~~~~~~~~~
A full export of the graph dated from December 2020. Only available in
compressed representation.
- **Compressed graph**:
- **URL**: `/graph/2020-12-15/compressed/
`_
2020-05-20
~~~~~~~~~~
A full export of the graph dated from May 2020. Only available in
compressed representation.
**(DEPRECATED: known issue with missing snapshot edges.)**
- **Compressed graph**:
- **URL**: `/graph/2020-05-20/compressed/
`_
2019-01-28
~~~~~~~~~~
A full export of the graph dated from January 2019. The export was done in two
phases, one of them called "2018-09-25" and the other "2019-01-28". They both
refer to the same dataset, but the different formats have various
inconsistencies between them.
**(DEPRECATED: early export pipeline, various inconsistencies).**
- **Columnar tables (Apache Parquet)**:
- **Total size**: 1.2 TiB
- **URL**: `/graph/2019-01-28/parquet/
`_
- **S3**: ``s3://softwareheritage/graph/2018-09-25/parquet``
- **Compressed graph**:
- **URL**: `/graph/2019-01-28/compressed/
`_
Teaser datasets
---------------
If the above datasets are too big, we also provide "teaser" datasets that
have a smaller footprint and can get you started.
2020-12-15-gitlab-all
~~~~~~~~~~~~~~~~~~~~~
A teaser dataset containing the entirety of Gitlab, exported in December 2020.
Available in compressed graph format.
- **Compressed graph**:
- **URL**: `/graph/2020-12-15-gitlab-all/compressed/
`_
2020-12-15-gitlab-100k
~~~~~~~~~~~~~~~~~~~~~~
A teaser dataset containing the 100k most popular Gitlab repositories,
exported in December 2020. Available in compressed graph format.
- **Compressed graph**:
- **URL**: `/graph/2020-12-15-gitlab-100k/compressed/
`_
2019-01-28-popular-4k
~~~~~~~~~~~~~~~~~~~~~
This teaser dataset contains a subset of 4000 popular repositories from GitHub,
Gitlab, PyPI and Debian. The criteria used to select the software origins
were the following:
- The 1000 most popular GitHub projects (by number of stars)
- The 1000 most popular Gitlab projects (by number of stars)
- The 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages `_ database),
- The 1000 most popular Debian packages (by "votes" according to the `Debian
Popularity Contest `_ database)
- **Columnar (Apache Parquet)**:
- **Total size**: 27 GiB
- **URL**: `/graph/2019-01-28-popular-4k/parquet/
`_
- **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/``
2019-01-28-popular-3k-python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``popular-3k-python`` teaser contains a subset of 3052 popular
repositories **tagged as being written in the Python language**, from GitHub,
Gitlab, PyPI and Debian. The criteria used to select the software origins
were the following, similar to those of ``popular-4k``:
- the 1000 most popular GitHub projects written in Python (by number of stars),
- the 131 Gitlab projects written in Python that have 2 stars or more,
- the 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages `_ database),
- the 1000 most popular Debian packages with the
`debtag `_ ``implemented-in::python`` (by
"votes" according to the `Debian Popularity Contest
`_ database).
- **Columnar (Apache Parquet)**:
- **Total size**: 5.3 GiB
- **URL**: `/graph/2019-01-28-popular-3k-python/parquet/
`_
- **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-3k-python/parquet/``
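The S3 locations listed on this page are given as ``s3://`` URIs, while most S3
clients expect the bucket name and the key prefix as separate arguments. The
following is a minimal sketch of that split; the helper name ``split_s3_uri`` is
hypothetical and not part of any Software Heritage tooling.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri: str) -> tuple:
    """Split an s3:// URI into (bucket, key_prefix)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Drop the leading slash and make sure the prefix ends with one, so that
    # it only matches keys located under that "directory".
    prefix = parts.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parts.netloc, prefix


if __name__ == "__main__":
    bucket, prefix = split_s3_uri("s3://softwareheritage/graph/2021-03-23/orc")
    print(bucket, prefix)
```

The resulting pair can then be passed to whichever S3 client you use to list or
download the files under that prefix.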
diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst
index 8905106..016b4fe 100644
--- a/docs/graph/schema.rst
+++ b/docs/graph/schema.rst
@@ -1,136 +1,139 @@
+.. _swh-dataset-schema:
+
Relational schema
=================
The Merkle DAG of the Software Heritage archive is encoded in the dataset as a
set of relational tables.
This page documents the relational schema of the **latest version** of the
graph dataset.
-A simplified view of the corresponding database schema is shown here:
+..
+ A simplified view of the corresponding database schema is shown here:
-.. image:: _images/dataset-schema.svg
+ .. image:: _images/dataset-schema.svg
**Note**: To limit abuse, some columns containing personal information are
pseudonymized in the dataset using a hash algorithm. Individual authors may be
retrieved by querying the Software Heritage API.
- **content**: contains information on the contents stored in
the archive.
- ``sha1`` (string): the SHA-1 of the content (hexadecimal)
- ``sha1_git`` (string): the Git SHA-1 of the content (hexadecimal)
- ``sha256`` (string): the SHA-256 of the content (hexadecimal)
- ``blake2s256`` (bytes): the BLAKE2s-256 of the content (hexadecimal)
- ``length`` (integer): the length of the content
- ``status`` (string): the visibility status of the content
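As an illustration of how the checksum columns of the **content** table relate
to the raw bytes of a file, the sketch below recomputes them with Python's
standard ``hashlib`` module. The function name ``content_hashes`` is
hypothetical. The ``sha1_git`` value follows Git's blob hashing convention
(SHA-1 over a ``blob <length>\0`` header followed by the data).

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Return the hexadecimal checksums and the length of a blob."""
    # Git hashes a blob as SHA-1 over a "blob <length>\0" header plus the data.
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # blake2s defaults to a 32-byte (256-bit) digest.
        "blake2s256": hashlib.blake2s(data).hexdigest(),
        "length": len(data),
    }


if __name__ == "__main__":
    for column, value in content_hashes(b"hello\n").items():
        print(column, value)
```

Any one of these checksums can be used to look up the row matching a file you
have locally.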
- **skipped_content**: contains information on the contents that were not
archived for various reasons.
- ``sha1`` (string): the SHA-1 of the skipped content (hexadecimal)
- ``sha1_git`` (string): the Git SHA-1 of the skipped content (hexadecimal)
- ``sha256`` (string): the SHA-256 of the skipped content (hexadecimal)
- ``blake2s256`` (bytes): the BLAKE2s-256 of the skipped content
(hexadecimal)
- ``length`` (integer): the length of the skipped content
- ``status`` (string): the visibility status of the skipped content
- ``reason`` (string): the reason why the content was skipped
- **directory**: contains the directories stored in the archive.
- ``id`` (string): the intrinsic hash of the directory (hexadecimal),
recursively computed with the Git SHA-1 algorithm
- **directory_entry**: contains the entries in directories.
- ``directory_id`` (string): the Git SHA-1 of the directory
containing the entry (hexadecimal).
- ``name`` (bytes): the name of the file (basename of its path)
- ``type`` (string): the type of object the entry points to (either
``revision``, ``directory`` or ``content``).
- ``target`` (string): the Git SHA-1 of the object this
entry points to (hexadecimal).
- ``perms`` (integer): the permissions of the object
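To show how **directory_entry** rows encode a source tree, here is a minimal
sketch that rebuilds the file paths under a root directory from rows held in
memory. The function name ``list_paths`` and the short identifiers in the sample
rows are hypothetical (real identifiers are hexadecimal Git SHA-1 hashes), and
the ``type`` values are the ones listed above.

```python
from collections import defaultdict


def list_paths(entries, root_id):
    """Yield (path, target) for every content reachable from root_id."""
    # Index the rows by the directory that contains them.
    children = defaultdict(list)
    for entry in entries:
        children[entry["directory_id"]].append(entry)

    stack = [(root_id, b"")]
    while stack:
        directory_id, prefix = stack.pop()
        for entry in children[directory_id]:
            path = prefix + entry["name"]  # names are bytes, not str
            if entry["type"] == "directory":
                stack.append((entry["target"], path + b"/"))
            elif entry["type"] == "content":
                yield path, entry["target"]
            # Entries of type "revision" are not followed in this sketch.


# Hypothetical sample rows describing README and src/main.py.
ENTRIES = [
    {"directory_id": "d1", "name": b"README", "type": "content",
     "target": "c1", "perms": 0o100644},
    {"directory_id": "d1", "name": b"src", "type": "directory",
     "target": "d2", "perms": 0o040000},
    {"directory_id": "d2", "name": b"main.py", "type": "content",
     "target": "c2", "perms": 0o100644},
]

if __name__ == "__main__":
    for path, target in sorted(list_paths(ENTRIES, "d1")):
        print(path.decode(), target)
```

On the real dataset the same traversal is better expressed as a join in your
query engine; the in-memory version only illustrates the relationships between
the columns.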
- **revision**: contains the revisions stored in the archive.
- ``id`` (string): the intrinsic hash of the revision (hexadecimal),
recursively computed with the Git SHA-1 algorithm. For Git repositories,
this corresponds to the commit hash.
- ``message`` (bytes): the revision message
- ``author`` (string): an anonymized hash of the author of the revision.
- ``date`` (timestamp): the date the revision was authored
- ``date_offset`` (integer): the offset of the timezone of ``date``
- ``committer`` (string): an anonymized hash of the committer of the revision.
- ``committer_date`` (timestamp): the date the revision was committed
- ``committer_date_offset`` (integer): the offset of the timezone of
``committer_date``
- ``directory`` (string): the Git SHA-1 of the directory the revision points
to (hexadecimal). Every revision points to the root directory of the
project source tree to which it corresponds.
- **revision_history**: contains the ordered set of parents of each revision.
Each revision has an ordered set of parents (0 for the initial commit of a
repository, 1 for a regular commit, 2 for a regular merge commit and 3 or
more for octopus-style merge commits).
- ``id`` (string): the Git SHA-1 identifier of the revision (hexadecimal)
- ``parent_id`` (string): the Git SHA-1 identifier of the parent (hexadecimal)
- ``parent_rank`` (integer): the rank of the parent, which defines the
ordering between the parents of the revision
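The role of ``parent_rank`` can be sketched as follows: rows are grouped by
revision and sorted by rank to recover the ordered list of parents. The function
names and the sample identifiers are hypothetical. The sketch also assumes that
the table holds one row per parent edge, so a revision without parents does not
appear in the result.

```python
from collections import defaultdict


def ordered_parents(rows):
    """Map each revision id to its parent ids, sorted by parent_rank."""
    by_revision = defaultdict(list)
    for row in rows:
        by_revision[row["id"]].append((row["parent_rank"], row["parent_id"]))
    return {
        revision: [parent for _, parent in sorted(pairs)]
        for revision, pairs in by_revision.items()
    }


def classify(parent_count):
    """Name the kind of revision from its number of parents."""
    if parent_count == 0:
        return "initial"
    if parent_count == 1:
        return "regular"
    if parent_count == 2:
        return "merge"
    return "octopus merge"


# Hypothetical rows, deliberately listed out of rank order.
ROWS = [
    {"id": "r3", "parent_id": "r2", "parent_rank": 1},
    {"id": "r3", "parent_id": "r1", "parent_rank": 0},
    {"id": "r2", "parent_id": "r0", "parent_rank": 0},
]

if __name__ == "__main__":
    for revision, parents in ordered_parents(ROWS).items():
        print(revision, parents, classify(len(parents)))
```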
- **release**: contains the releases stored in the archive.
- ``id`` (string): the intrinsic hash of the release (hexadecimal),
recursively computed with the Git SHA-1 algorithm
- ``target`` (string): the Git SHA-1 of the object the release points to
(hexadecimal)
- ``date`` (timestamp): the date the release was created
- ``author`` (integer): the author of the release
- ``name`` (bytes): the release name
- ``message`` (bytes): the release message
- **snapshot**: contains the list of snapshots stored in the archive.
- ``id`` (string): the intrinsic hash of the snapshot (hexadecimal),
recursively computed with the Git SHA-1 algorithm.
- **snapshot_branch**: contains the list of branches associated with
each snapshot.
- ``snapshot_id`` (string): the intrinsic hash of the snapshot (hexadecimal)
- ``name`` (bytes): the name of the branch
- ``target`` (string): the intrinsic hash of the object the branch points to
(hexadecimal)
- ``target_type`` (string): the type of object the branch points to (either
``release``, ``revision``, ``directory`` or ``content``).
- **origin**: the software origins from which the projects in the dataset were
archived.
- ``url`` (bytes): the URL of the origin
- **origin_visit**: the different visits of each origin. Since Software
Heritage archives software continuously, software origins are crawled more
than once. Each of these "visits" is an entry in this table.
- ``origin`` (string): the URL of the origin visited
- ``visit`` (integer): an integer identifier of the visit
- ``date`` (timestamp): the date at which the origin was visited
- ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``, ``hg``,
``svn``, ``ftp``, ``deb``, ...)
- **origin_visit_status**: the status of each visit.
- ``origin`` (string): the URL of the origin visited
- ``visit`` (integer): an integer identifier of the visit
- ``date`` (timestamp): the date at which the origin was visited
- ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``, ``hg``,
``svn``, ``ftp``, ``deb``, ...)
- ``snapshot_id`` (string): the intrinsic hash of the snapshot archived in
this visit (hexadecimal).
- ``status`` (string): the status of the visit, either ``partial`` for partial
visits or ``full`` for full visits.
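A common use of **origin_visit_status** is to find the most recent complete
snapshot of each origin. The sketch below does this over rows held in memory.
The function name ``latest_full_snapshots``, the sample URLs and the sample
identifiers are hypothetical, and the sketch assumes that ``snapshot_id`` may be
empty for visits that did not produce a snapshot.

```python
from datetime import datetime


def latest_full_snapshots(statuses):
    """Map each origin URL to the snapshot of its most recent full visit."""
    latest = {}
    for row in statuses:
        # Ignore partial visits and visits without a snapshot.
        if row["status"] != "full" or not row["snapshot_id"]:
            continue
        current = latest.get(row["origin"])
        if current is None or row["date"] > current["date"]:
            latest[row["origin"]] = row
    return {origin: row["snapshot_id"] for origin, row in latest.items()}


# Hypothetical sample rows.
STATUSES = [
    {"origin": "https://example.org/a.git", "visit": 1, "type": "git",
     "date": datetime(2020, 1, 1), "snapshot_id": "s1", "status": "full"},
    {"origin": "https://example.org/a.git", "visit": 2, "type": "git",
     "date": datetime(2020, 6, 1), "snapshot_id": "s2", "status": "full"},
    {"origin": "https://example.org/a.git", "visit": 3, "type": "git",
     "date": datetime(2020, 9, 1), "snapshot_id": "s3", "status": "partial"},
    {"origin": "https://example.org/b.git", "visit": 1, "type": "git",
     "date": datetime(2020, 3, 1), "snapshot_id": None, "status": "partial"},
]

if __name__ == "__main__":
    print(latest_full_snapshots(STATUSES))
```

The returned snapshot identifiers can then be matched against ``snapshot_id`` in
the **snapshot_branch** table to reach the branches of each origin.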