diff --git a/docs/graph/dataset.rst b/docs/graph/dataset.rst
index d27822d..855b7a7 100644
--- a/docs/graph/dataset.rst
+++ b/docs/graph/dataset.rst
@@ -1,89 +1,89 @@
 Dataset
 =======

 We provide the full graph dataset along with two "teaser" datasets that can be
 used for trying out smaller-scale experiments before using the full graph.

 All the main URLs are relative to our dataset prefix:
 `https://annex.softwareheritage.org/public/dataset/ `__.

 The Software Heritage Graph Dataset contains a table representation of the full
 Software Heritage Graph. It is available in the following formats:

 - **PostgreSQL (compressed)**:

   - **Total size**: 1.2 TiB
   - **URL**: `/graph/latest/sql/ `_

 - **Apache Parquet**:

   - **Total size**: 1.2 TiB
   - **URL**: `/graph/latest/parquet/ `_
   - **S3**: ``s3://softwareheritage/graph``

 Teaser datasets
 ---------------

 If the above dataset is too big, we also provide the following "teaser"
 datasets that can get you started and have a smaller footprint.

 popular-4k
 ~~~~~~~~~~

 The ``popular-4k`` teaser contains a subset of 4000 popular
 repositories from GitHub, GitLab, PyPI and Debian.
 The selection criteria used to pick the software origins were the following:

 - the 1000 most popular GitHub projects (by number of stars),
 - the 1000 most popular GitLab projects (by number of stars),
 - the 1000 most popular PyPI projects (by usage statistics, according to the
   `Top PyPI Packages `_ database),
 - the 1000 most popular Debian packages (by "votes" according to the
   `Debian Popularity Contest `_ database).

 This teaser is available in the following formats:

 - **PostgreSQL (compressed)**:

   - **Total size**: 23 GiB
   - **URL**: `/graph/latest/popular-4k/sql/ `_

 - **Apache Parquet**:

   - **Total size**: 27 GiB
   - **URL**: `/graph/latest/popular-4k/parquet/ `_
   - **S3**: ``s3://softwareheritage/teasers/popular-4k``

 popular-3k-python
 ~~~~~~~~~~~~~~~~~

 The ``popular-3k-python`` teaser contains a subset of 3052 popular
 repositories **tagged as being written in the Python language**, from GitHub,
 GitLab, PyPI and Debian.

 The selection criteria used to pick the software origins were the following,
 similar to ``popular-4k``:

 - the 1000 most popular GitHub projects written in Python (by number of stars),
 - the 131 GitLab projects written in Python that have 2 stars or more,
 - the 1000 most popular PyPI projects (by usage statistics, according to the
   `Top PyPI Packages `_ database),
 - the 1000 most popular Debian packages with the
   `debtag `_ ``implemented-in::python`` (by "votes" according to the
   `Debian Popularity Contest `_ database).

 This teaser is available in the following formats:

 - **PostgreSQL (compressed)**:

   - **Total size**: 4.7 GiB
   - **URL**: `/graph/latest/popular-3k-python/sql/ `_

 - **Apache Parquet**:

   - **Total size**: 5.3 GiB
-  - **URL**: `/graph/latest/popular-3k-python/sql/
+  - **URL**: `/graph/latest/popular-3k-python/parquet/
     `_
   - **S3**: ``s3://softwareheritage/teasers/popular-3k-python``
diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst
index 13409b7..536d87f 100644
--- a/docs/graph/schema.rst
+++ b/docs/graph/schema.rst
@@ -1,142 +1,142 @@
 Relational schema
 =================

 The Merkle DAG of the Software Heritage archive is encoded in the dataset as a
 set of relational tables. A simplified view of the corresponding database schema
 is shown here:

 .. image:: _images/db-schema.svg

 This page documents the details of the schema.

 - **content**: contains information on the contents stored in the archive.

   - ``sha1`` (bytes): the SHA-1 of the content
   - ``sha1_git`` (bytes): the Git SHA-1 of the content
   - ``length`` (integer): the length of the content

 - **skipped_content**: contains information on the contents that were not archived for
-  various reasons.
+  various reasons.

   - ``sha1`` (bytes): the SHA-1 of the missing content
   - ``sha1_git`` (bytes): the Git SHA-1 of the missing content
   - ``length`` (integer): the length of the missing content

 - **directory**: contains the directories stored in the archive.

   - ``id`` (bytes): the intrinsic identifier of the directory, recursively
     computed with the Git SHA-1 algorithm
   - ``dir_entries`` (array of integers): the list of directories contained in
     this directory, as references to entries in the ``directory_entry_dir``
     table.
   - ``file_entries`` (array of integers): the list of files contained in
     this directory, as references to entries in the ``directory_entry_file``
     table.
   - ``rev_entries`` (array of integers): the list of revisions contained in
     this directory, as references to entries in the ``directory_entry_rev``
     table.

 - **directory_entry_file**: contains information about file entries in directories.

   - ``id`` (integer): unique identifier for the entry
   - ``target`` (bytes): the Git SHA-1 of the content this entry points to
   - ``name`` (bytes): the name of the file (basename of its path)
   - ``perms`` (integer): the permissions of the file

 - **directory_entry_dir**: contains information about directory entries in directories.

   - ``id`` (integer): unique identifier for the entry
   - ``target`` (bytes): the Git SHA-1 of the directory this entry points to
   - ``name`` (bytes): the name of the directory
   - ``perms`` (integer): the permissions of the directory

 - **directory_entry_rev**: contains information about revision entries in directories.

   - ``id`` (integer): unique identifier for the entry
   - ``target`` (bytes): the Git SHA-1 of the revision this entry points to
   - ``name`` (bytes): the name of the directory that contains this revision
   - ``perms`` (integer): the permissions of the revision

 - **person**: deduplicates commit authors by their names and e-mail addresses.
   For pseudonymization purposes and in order to prevent abuse, these columns
   were removed from the dataset, and this table only contains the ID of the
   author. Individual authors may be retrieved using this ID from the Software
   Heritage API.

   - ``id`` (integer): the identifier of the person

 - **revision**: contains the revisions stored in the archive.

   - ``id`` (bytes): the intrinsic identifier of the revision, recursively
     computed with the Git SHA-1 algorithm. For Git repositories, this
     corresponds to the revision hash.
   - ``date`` (timestamp): the date the revision was authored
   - ``committer_date`` (timestamp): the date the revision was committed
   - ``author`` (integer): the author of the revision
   - ``committer`` (integer): the committer of the revision
   - ``message`` (bytes): the revision message
   - ``directory`` (bytes): the Git SHA-1 of the directory the revision points to.
     Every revision points to the root directory of the project source tree to
     which it corresponds.

 - **revision_history**: contains the ordered set of parents of each revision.
   Each revision has an ordered set of parents (0 for the initial commit of a
   repository, 1 for a regular commit, 2 for a regular merge commit and 3 or
   more for octopus-style merge commits).

   - ``id`` (bytes): the Git SHA-1 identifier of the revision
   - ``parent_id`` (bytes): the Git SHA-1 identifier of the parent
   - ``parent_rank`` (integer): the rank of the parent, which defines the total
     order of the parents of the revision

 - **release**: contains the releases stored in the archive.

   - ``id`` (bytes): the intrinsic identifier of the release, recursively
     computed with the Git SHA-1 algorithm.
   - ``target`` (bytes): the Git SHA-1 of the object the release points to.
   - ``date`` (timestamp): the date the release was created
   - ``author`` (integer): the author of the release
   - ``name`` (bytes): the release name
   - ``message`` (bytes): the release message

 - **snapshot**: contains the list of snapshots stored in the archive.

   - ``id`` (bytes): the intrinsic identifier of the snapshot, recursively
     computed with the Git SHA-1 algorithm.
   - ``object_id`` (integer): the primary key of the snapshot

 - **snapshot_branches**: contains the identifiers of branches associated with
   each snapshot. This is an intermediary table that represents the
   many-to-many relationship between snapshots and branches.

   - ``snapshot_id`` (integer): the integer identifier of the snapshot
   - ``branch_id`` (integer): the identifier of the branch

 - **snapshot_branch**: contains the list of branches.

   - ``object_id`` (integer): the identifier of the branch
   - ``name`` (bytes): the name of the branch
   - ``target`` (bytes): the Git SHA-1 of the object the branch points to
   - ``target_type`` (string): the type of object the branch points to (either
     ``release``, ``revision``, ``directory`` or ``content``).

 - **origin**: the software origins from which the projects in the dataset were
   archived.

   - ``id`` (integer): the identifier of the origin
   - ``url`` (bytes): the URL of the origin
   - ``type`` (string): the type of origin (e.g. ``git``, ``pypi``, ``hg``,
     ``svn``, ``ftp``, ``deb``, ...)

 - **origin_visit**: the different visits of each origin. Since Software
   Heritage archives software continuously, software origins are crawled more
   than once. Each of these "visits" is an entry in this table.

   - ``origin`` (integer): the identifier of the origin visited
   - ``date`` (timestamp): the date at which the origin was visited
   - ``snapshot_id`` (integer): the integer identifier of the snapshot archived
     in this visit.
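
Reviewer note: the snapshot-related tables documented in ``schema.rst`` are easier to follow with a worked join. The sketch below is only an illustration, not part of the dataset tooling. It builds a toy in-memory SQLite database with a subset of the documented columns and lists the branches seen in the most recent visit of an origin. The sample rows, the example URL, and the helper names ``build_toy_db`` and ``branches_of_latest_visit`` are hypothetical. It also assumes that ``origin_visit.snapshot_id`` and ``snapshot_branches.snapshot_id`` both refer to ``snapshot.object_id``, which is how the column descriptions read.

```python
import sqlite3

# Toy tables mirroring a SUBSET of the documented columns (assumption:
# SQLite types chosen for illustration; the real dataset ships as
# PostgreSQL dumps and Parquet files).
SCHEMA = """
CREATE TABLE origin (id INTEGER, url BLOB, type TEXT);
CREATE TABLE origin_visit (origin INTEGER, date TEXT, snapshot_id INTEGER);
CREATE TABLE snapshot (id BLOB, object_id INTEGER);
CREATE TABLE snapshot_branches (snapshot_id INTEGER, branch_id INTEGER);
CREATE TABLE snapshot_branch (
    object_id INTEGER, name BLOB, target BLOB, target_type TEXT
);
"""


def build_toy_db():
    """Create an in-memory database filled with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO origin VALUES (1, ?, 'git')",
        (b"https://example.org/repo.git",),
    )
    # Two visits of the same origin, each archiving a different snapshot.
    conn.executemany(
        "INSERT INTO origin_visit VALUES (?, ?, ?)",
        [(1, "2019-01-01T00:00:00", 10), (1, "2019-02-01T00:00:00", 11)],
    )
    conn.executemany(
        "INSERT INTO snapshot VALUES (?, ?)",
        [(b"\x0a" * 20, 10), (b"\x0b" * 20, 11)],
    )
    # Many-to-many link: branch 100 is shared by both snapshots.
    conn.executemany(
        "INSERT INTO snapshot_branches VALUES (?, ?)",
        [(10, 100), (11, 100), (11, 101)],
    )
    conn.executemany(
        "INSERT INTO snapshot_branch VALUES (?, ?, ?, ?)",
        [
            (100, b"refs/heads/master", b"\x01" * 20, "revision"),
            (101, b"refs/tags/v1.0", b"\x02" * 20, "release"),
        ],
    )
    return conn


def branches_of_latest_visit(conn, origin_url):
    """Return (name, target_type) for each branch of the latest visit."""
    query = """
        SELECT sb.name, sb.target_type
        FROM origin AS o
        JOIN origin_visit AS ov ON ov.origin = o.id
        JOIN snapshot_branches AS sbs ON sbs.snapshot_id = ov.snapshot_id
        JOIN snapshot_branch AS sb ON sb.object_id = sbs.branch_id
        WHERE o.url = ?
          AND ov.date = (
              SELECT MAX(date) FROM origin_visit WHERE origin = o.id
          )
        ORDER BY sb.name
    """
    return conn.execute(query, (origin_url,)).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    for name, target_type in branches_of_latest_visit(
        db, b"https://example.org/repo.git"
    ):
        print(name.decode(), target_type)
```

The join goes through ``snapshot_branches`` because a branch row can be shared by several snapshots. Columns typed as bytes in the schema (``url``, ``name``, ``target``) are compared and returned as ``bytes`` here, so callers decode them explicitly.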