diff --git a/docs/graph/dataset.rst b/docs/graph/dataset.rst index e19e578..19ab422 100644 --- a/docs/graph/dataset.rst +++ b/docs/graph/dataset.rst @@ -1,236 +1,238 @@ +.. _swh-dataset-list: + Dataset ======= We aim to provide regular exports of the Software Heritage graph in two different formats: - **Columnar data storage**: a set of relational tables stored in a columnar format such as `Apache ORC `_, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment. - **Compressed graph**: a compact and highly-efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed in *Boldi-Vigna representation*, designed to be loaded by the `WebGraph framework `_, specifically using our `swh-graph library `_. Summary of dataset versions --------------------------- **Full graph**: .. list-table:: :header-rows: 1 * - Name - # Nodes - # Edges - Columnar - Compressed * - `2021-03-23`_ - 20,667,308,808 - 232,748,148,441 - ✔ - ✔ * - `2020-12-15`_ - 19,330,739,526 - 213,848,749,638 - ✗ - ✔ * - `2020-05-20`_ - 17,075,708,289 - 203,351,589,619 - ✗ - ✔ * - `2019-01-28`_ - 11,683,687,950 - 159,578,271,511 - ✔ - ✔ **Teaser datasets**: .. list-table:: :header-rows: 1 * - Name - # Nodes - # Edges - Columnar - Compressed * - `2020-12-15-gitlab-all`_ - 1,083,011,764 - 27,919,670,049 - ✗ - ✔ * - `2020-12-15-gitlab-100k`_ - 304,037,235 - 9,516,984,175 - ✗ - ✔ * - `2019-01-28-popular-4k`_ - ? - ? - ✔ - ✗ * - `2019-01-28-popular-3k-python`_ - 27,363,226 - 346,413,337 - ✔ - ✔ Full graph datasets ------------------- 2021-03-23 ~~~~~~~~~~ A full export of the graph dated from March 2021. 
- **Columnar tables (Apache ORC)**: - **Total size**: 8.4 TiB - **URL**: `/graph/2021-03-23/orc/ `_ - **S3**: ``s3://softwareheritage/graph/2021-03-23/orc`` - **Compressed graph**: - **URL**: `/graph/2021-03-23/compressed/ `_ 2020-12-15 ~~~~~~~~~~ A full export of the graph dated from December 2020. Only available in compressed representation. - **Compressed graph**: - **URL**: `/graph/2020-12-15/compressed/ `_ 2020-05-20 ~~~~~~~~~~ A full export of the graph dated from May 2020. Only available in compressed representation. **(DEPRECATED: known issue with missing snapshot edges.)** - **Compressed graph**: - **URL**: `/graph/2020-05-20/compressed/ `_ 2019-01-28 ~~~~~~~~~~ A full export of the graph dated from January 2019. The export was done in two phases, one of them called "2018-09-25" and the other "2019-01-28". They both refer to the same dataset, but the different formats have various inconsistencies between them. **(DEPRECATED: early export pipeline, various inconsistencies).** - **Columnar tables (Apache Parquet)**: - **Total size**: 1.2 TiB - **URL**: `/graph/2019-01-28/parquet/ `_ - **S3**: ``s3://softwareheritage/graph/2018-09-25/parquet`` - **Compressed graph**: - **URL**: `/graph/2019-01-28/compressed/ `_ Teaser datasets --------------- If the above datasets are too big, we also provide "teaser" datasets that can get you started and have a smaller footprint. 2020-12-15-gitlab-all ~~~~~~~~~~~~~~~~~~~~~ A teaser dataset containing the entirety of Gitlab, exported in December 2020. Available in compressed graph format. - **Compressed graph**: - **URL**: `/graph/2020-12-15-gitlab-all/compressed/ `_ 2020-12-15-gitlab-100k ~~~~~~~~~~~~~~~~~~~~~~ A teaser dataset containing the 100k most popular Gitlab repositories, exported in December 2020. Available in compressed graph format. 
- **Compressed graph**: - **URL**: `/graph/2020-12-15-gitlab-100k/compressed/ `_ 2019-01-28-popular-4k ~~~~~~~~~~~~~~~~~~~~~ This teaser dataset contains a subset of 4000 popular repositories from GitHub, Gitlab, PyPI and Debian. The selection criteria used to pick the software origins were the following: - The 1000 most popular GitHub projects (by number of stars) - The 1000 most popular Gitlab projects (by number of stars) - The 1000 most popular PyPI projects (by usage statistics, according to the `Top PyPI Packages `_ database), - The 1000 most popular Debian packages (by "votes" according to the `Debian Popularity Contest `_ database) - **Columnar (Apache Parquet)**: - **Total size**: 27 GiB - **URL**: `/graph/2019-01-28-popular-4k/parquet/ `_ - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/`` 2019-01-28-popular-3k-python ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``popular-3k-python`` teaser contains a subset of 3052 popular repositories **tagged as being written in the Python language**, from GitHub, Gitlab, PyPI and Debian. The selection criteria used to pick the software origins were the following, similar to ``popular-4k``: - the 1000 most popular GitHub projects written in Python (by number of stars), - the 131 Gitlab projects written in Python that have 2 stars or more, - the 1000 most popular PyPI projects (by usage statistics, according to the `Top PyPI Packages `_ database), - the 1000 most popular Debian packages with the `debtag `_ ``implemented-in::python`` (by "votes" according to the `Debian Popularity Contest `_ database). - **Columnar (Apache Parquet)**: - **Total size**: 5.3 GiB - **URL**: `/graph/2019-01-28-popular-3k-python/parquet/ `_ - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-3k-python/parquet/`` diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst index 8905106..016b4fe 100644 --- a/docs/graph/schema.rst +++ b/docs/graph/schema.rst @@ -1,136 +1,139 @@ +.. 
_swh-dataset-schema: + Relational schema ================= The Merkle DAG of the Software Heritage archive is encoded in the dataset as a set of relational tables. This page documents the relational schema of the **latest version** of the graph dataset. -A simplified view of the corresponding database schema is shown here: +.. + A simplified view of the corresponding database schema is shown here: -.. image:: _images/dataset-schema.svg + .. image:: _images/dataset-schema.svg **Note**: To limit abuse, some columns containing personal information are pseudonymized in the dataset using a hash algorithm. Individual authors may be retrieved by querying the Software Heritage API. - **content**: contains information on the contents stored in the archive. - ``sha1`` (string): the SHA-1 of the content (hexadecimal) - ``sha1_git`` (string): the Git SHA-1 of the content (hexadecimal) - ``sha256`` (string): the SHA-256 of the content (hexadecimal) - ``blake2s256`` (bytes): the BLAKE2s-256 of the content (hexadecimal) - ``length`` (integer): the length of the content - ``status`` (string): the visibility status of the content - **skipped_content**: contains information on the contents that were not archived for various reasons. - ``sha1`` (string): the SHA-1 of the skipped content (hexadecimal) - ``sha1_git`` (string): the Git SHA-1 of the skipped content (hexadecimal) - ``sha256`` (string): the SHA-256 of the skipped content (hexadecimal) - ``blake2s256`` (bytes): the BLAKE2s-256 of the skipped content (hexadecimal) - ``length`` (integer): the length of the skipped content - ``status`` (string): the visibility status of the skipped content - ``reason`` (string): the reason why the content was skipped - **directory**: contains the directories stored in the archive. - ``id`` (string): the intrinsic hash of the directory (hexadecimal), recursively computed with the Git SHA-1 algorithm - **directory_entry**: contains the entries in directories. 
- ``directory_id`` (string): the Git SHA-1 of the directory containing the entry (hexadecimal). - ``name`` (bytes): the name of the file (basename of its path) - ``type`` (string): the type of object the entry points to (either ``revision``, ``directory`` or ``content``). - ``target`` (string): the Git SHA-1 of the object this entry points to (hexadecimal). - ``perms`` (integer): the permissions of the object - **revision**: contains the revisions stored in the archive. - ``id`` (string): the intrinsic hash of the revision (hexadecimal), recursively computed with the Git SHA-1 algorithm. For Git repositories, this corresponds to the commit hash. - ``message`` (bytes): the revision message - ``author`` (string): an anonymized hash of the author of the revision. - ``date`` (timestamp): the date the revision was authored - ``date_offset`` (integer): the offset of the timezone of ``date`` - ``committer`` (string): an anonymized hash of the committer of the revision. - ``committer_date`` (timestamp): the date the revision was committed - ``committer_date_offset`` (integer): the offset of the timezone of ``committer_date`` - ``directory`` (string): the Git SHA-1 of the directory the revision points to (hexadecimal). Every revision points to the root directory of the project source tree to which it corresponds. - **revision_history**: contains the ordered set of parents of each revision. Each revision has an ordered set of parents (0 for the initial commit of a repository, 1 for a regular commit, 2 for a regular merge commit and 3 or more for octopus-style merge commits). - ``id`` (string): the Git SHA-1 identifier of the revision (hexadecimal) - ``parent_id`` (string): the Git SHA-1 identifier of the parent (hexadecimal) - ``parent_rank`` (integer): the rank of the parent, which defines the ordering between the parents of the revision - **release**: contains the releases stored in the archive. 
- ``id`` (string): the intrinsic hash of the release (hexadecimal), recursively computed with the Git SHA-1 algorithm - ``target`` (string): the Git SHA-1 of the object the release points to (hexadecimal) - ``date`` (timestamp): the date the release was created - ``author`` (integer): the author of the release - ``name`` (bytes): the release name - ``message`` (bytes): the release message - **snapshot**: contains the list of snapshots stored in the archive. - ``id`` (string): the intrinsic hash of the snapshot (hexadecimal), recursively computed with the Git SHA-1 algorithm. - **snapshot_branch**: contains the list of branches associated with each snapshot. - ``snapshot_id`` (string): the intrinsic hash of the snapshot (hexadecimal) - ``name`` (bytes): the name of the branch - ``target`` (string): the intrinsic hash of the object the branch points to (hexadecimal) - ``target_type`` (string): the type of object the branch points to (either ``release``, ``revision``, ``directory`` or ``content``). - **origin**: the software origins from which the projects in the dataset were archived. - ``url`` (bytes): the URL of the origin - **origin_visit**: the different visits of each origin. Since Software Heritage archives software continuously, software origins are crawled more than once. Each of these "visits" is an entry in this table. - ``origin`` (string): the URL of the origin visited - ``visit`` (integer): an integer identifier of the visit - ``date`` (timestamp): the date at which the origin was visited - ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``, ``hg``, ``svn``, ``ftp``, ``deb``, ...) - **origin_visit_status**: the status of each visit. - ``origin`` (string): the URL of the origin visited - ``visit`` (integer): an integer identifier of the visit - ``date`` (timestamp): the date at which the origin was visited - ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``, ``hg``, ``svn``, ``ftp``, ``deb``, ...) 
- ``snapshot_id`` (string): the intrinsic hash of the snapshot archived in this visit (hexadecimal). - ``status`` (string): the status of the visit, either ``partial`` for partial visits or ``full`` for full visits.
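
To illustrate how the tables above relate to each other, here is a minimal sketch that loads a few toy rows into an in-memory SQLite database and queries them. It is not how the dataset is distributed (the real tables are ORC or Parquet files): the SQLite storage, the short made-up identifiers such as ``r1`` or ``d3``, and the helper names ``parents`` and ``branch_root_directories`` are all assumptions made for illustration. Only the table and column names come from the schema documented on this page.

```python
import sqlite3

# Toy in-memory database reusing a subset of the documented table and column
# names. All identifiers below are made-up placeholders, not real hashes.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE revision (id TEXT, directory TEXT);
    CREATE TABLE revision_history (id TEXT, parent_id TEXT, parent_rank INTEGER);
    CREATE TABLE snapshot_branch (
        snapshot_id TEXT, name BLOB, target TEXT, target_type TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [("r1", "d1"), ("r2", "d2"), ("r3", "d3")],
)
# r3 is a merge of r1 (rank 0) and r2 (rank 1); r2 has r1 as its only parent.
conn.executemany(
    "INSERT INTO revision_history VALUES (?, ?, ?)",
    [("r3", "r2", 1), ("r3", "r1", 0), ("r2", "r1", 0)],
)
conn.execute(
    "INSERT INTO snapshot_branch VALUES (?, ?, ?, ?)",
    ("s1", b"refs/heads/main", "r3", "revision"),
)


def parents(conn, revision_id):
    """Return the parents of a revision, ordered by parent_rank."""
    rows = conn.execute(
        "SELECT parent_id FROM revision_history "
        "WHERE id = ? ORDER BY parent_rank",
        (revision_id,),
    )
    return [row[0] for row in rows]


def branch_root_directories(conn, snapshot_id):
    """Map each revision branch of a snapshot to its root directory."""
    rows = conn.execute(
        "SELECT sb.name, r.directory "
        "FROM snapshot_branch AS sb "
        "JOIN revision AS r ON r.id = sb.target "
        "WHERE sb.snapshot_id = ? AND sb.target_type = 'revision' "
        "ORDER BY sb.name",
        (snapshot_id,),
    )
    return [(name, directory) for name, directory in rows]


print(parents(conn, "r3"))
print(branch_root_directories(conn, "s1"))
```

The same joins should carry over to any SQL engine able to read the columnar files, since they rely only on the documented column names; branch names stay as bytes because the schema declares ``name`` as ``bytes``.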