diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst
index 0d60d86..94c93f1 100644
--- a/docs/graph/schema.rst
+++ b/docs/graph/schema.rst
@@ -1,134 +1,134 @@
 Relational schema
 =================

 The Merkle DAG of the Software Heritage archive is encoded in the dataset as a
 set of relational tables.

 A simplified view of the corresponding database schema is shown here:

 .. image:: _images/db-schema.svg

 This page documents the details of the schema.

 **Note**: To limit abuse, some columns containing personal information are
 pseudonimized in the dataset using a hash algorithm. Individual authors may be
 retrieved by querying the Software Heritage API.

-- **content**: contains information on the contents stored in
-  the archive.
+- **content**: contains information on the contents stored in
+  the archive.

   - ``sha1`` (string): the SHA-1 of the content (hexadecimal)
   - ``sha1_git`` (string): the Git SHA-1 of the content (hexadecimal)
   - ``sha256`` (string): the SHA-256 of the content (hexadecimal)
   - ``blake2s256`` (bytes): the BLAKE2s-256 of the content (hexadecimal)
   - ``length`` (integer): the length of the content
   - ``status`` (string): the visibility status of the content

-- **skipped_content**: contains information on the contents that were not
+- **skipped_content**: contains information on the contents that were not
   archived for various reasons.

   - ``sha1`` (string): the SHA-1 of the skipped content (hexadecimal)
   - ``sha1_git`` (string): the Git SHA-1 of the skipped content (hexadecimal)
   - ``sha256`` (string): the SHA-256 of the skipped content (hexadecimal)
   - ``blake2s256`` (bytes): the BLAKE2s-256 of the skipped content
-    (hexadecimal)
+    (hexadecimal)
   - ``length`` (integer): the length of the skipped content
   - ``status`` (string): the visibility status of the skipped content
   - ``reason`` (string): the reason why the content was skipped

 - **directory**: contains the directories stored in the archive.

   - ``id`` (string): the intrinsic hash of the directory (hexadecimal),
     recursively computed with the Git SHA-1 algorithm

 - **directory_entry**: contains the entries in directories.

   - ``directory_id`` (string): the Git SHA-1 of the directory
-    containing the entry (hexadecimal).
+    containing the entry (hexadecimal).
   - ``name`` (bytes): the name of the file (basename of its path)
   - ``type`` (string): the type of object the branch points to (either
-    ``revision``, ``directory`` or ``content``).
+    ``revision``, ``directory`` or ``content``).
   - ``target`` (string): the Git SHA-1 of the object this
-    entry points to (hexadecimal).
+    entry points to (hexadecimal).
   - ``perms`` (integer): the permissions of the object

 - **revision**: contains the revisions stored in the archive.

   - ``id`` (string): the intrinsic hash of the revision (hexadecimal),
-    recursively computed with the Git SHA-1 algorithm. For Git
-    repositories, this corresponds to the commit hash.
+    recursively computed with the Git SHA-1 algorithm. For Git repositories,
+    this corresponds to the commit hash.
   - ``message`` (bytes): the revision message
   - ``author`` (string): an anonymized hash of the author of the revision.
   - ``date`` (timestamp): the date the revision was authored
   - ``date_offset`` (integer): the offset of the timezone of ``date``
   - ``committer`` (string): an anonymized hash of the committer of the
     revision.
   - ``committer_date`` (timestamp): the date the revision was committed
   - ``committer_date_offset`` (integer): the offset of the timezone of
-    ``committer_date``
+    ``committer_date``
   - ``directory`` (string): the Git SHA-1 of the directory the revision points
-    to (hexadecimal). Every revision points to the root directory of the
-    project source tree to which it corresponds.
+    to (hexadecimal). Every revision points to the root directory of the
+    project source tree to which it corresponds.

 - **revision_history**: contains the ordered set of parents of each revision.
   Each revision has an ordered set of parents (0 for the initial commit of a
   repository, 1 for a regular commit, 2 for a regular merge commit and 3 or
   more for octopus-style merge commits).

   - ``id`` (string): the Git SHA-1 identifier of the revision (hexadecimal)
   - ``parent_id`` (string): the Git SHA-1 identifier of the parent
     (hexadecimal)
   - ``parent_rank`` (integer): the rank of the parent, which defines the
     ordering between the parents of the revision

 - **release**: contains the releases stored in the archive.

   - ``id`` (string): the intrinsic hash of the release (hexadecimal),
-    recursively computed with the Git SHA-1 algorithm.
+    recursively computed with the Git SHA-1 algorithm
   - ``target`` (string): the Git SHA-1 of the object the release points to
-    (hexadecimal).
+    (hexadecimal)
   - ``date`` (timestamp): the date the release was created
   - ``author`` (integer): the author of the revision
   - ``name`` (bytes): the release name
   - ``message`` (bytes): the release message

 - **snapshot**: contains the list of snapshots stored in the archive.

   - ``id`` (string): the intrinsic hash of the snapshot (hexadecimal),
-    recursively computed with the Git SHA-1 algorithm.
+    recursively computed with the Git SHA-1 algorithm.

 - **snapshot_branch**: contains the list of branches associated with
   each snapshot.

   - ``snapshot_id`` (string): the intrinsic hash of the snapshot (hexadecimal)
   - ``name`` (bytes): the name of the branch
   - ``target`` (string): the intrinsic hash of the object the branch points to
-    (hexadecimal)
+    (hexadecimal)
   - ``target_type`` (string): the type of object the branch points to (either
-    ``release``, ``revision``, ``directory`` or ``content``).
+    ``release``, ``revision``, ``directory`` or ``content``).

 - **origin**: the software origins from which the projects in the dataset were
   archived.

   - ``url`` (bytes): the URL of the origin

 - **origin_visit**: the different visits of each origin.
   Since Software Heritage archives software continuously, software origins
   are crawled more than once. Each of these "visits" is an entry in this
   table.

   - ``origin``: (string) the URL of the origin visited
   - ``visit``: (integer) an integer identifier of the visit
   - ``date``: (timestamp) the date at which the origin was visited
   - ``type`` (string): the type of origin visited (e.g ``git``, ``pypi``,
     ``hg``, ``svn``, ``git``, ``ftp``, ``deb``, ...)

 - **origin_visit_status**: the status of each visit.

   - ``origin``: (string) the URL of the origin visited
   - ``visit``: (integer) an integer identifier of the visit
   - ``date``: (timestamp) the date at which the origin was visited
   - ``type`` (string): the type of origin visited (e.g ``git``, ``pypi``,
     ``hg``, ``svn``, ``git``, ``ftp``, ``deb``, ...)
   - ``snapshot_id`` (string): the intrinsic hash of the snapshot archived in
-    this visit (hexadecimal).
-  - ``status`` (string): the integer identifier of the snapshot archived
-    in this visit, either ``partial`` for partial visits or ``full`` for
-    full visits.
+    this visit (hexadecimal).
+  - ``status`` (string): the integer identifier of the snapshot archived in
+    this visit, either ``partial`` for partial visits or ``full`` for full
+    visits.
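
As a rough illustration of how the tables touched by this patch fit together, the sketch below loads a few made-up rows into an in-memory SQLite database and follows a snapshot branch to the ordered parents of the revision it targets. The choice of SQLite, the placeholder identifiers (``rev_merge``, ``snap_1``, ...), the rank values and the helper names ``build_toy_db``, ``ordered_parents`` and ``branch_head_parents`` are all assumptions made for the example. Only the table and column names come from the schema page.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with two of the documented tables.

    Column names follow the schema page; the rows are invented placeholders,
    not real hashes from the archive.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision_history (
            id TEXT, parent_id TEXT, parent_rank INTEGER
        );
        CREATE TABLE snapshot_branch (
            snapshot_id TEXT, name BLOB, target TEXT, target_type TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO revision_history VALUES (?, ?, ?)",
        [
            # Deliberately inserted out of order: parent_rank defines order.
            ("rev_merge", "rev_b", 1),
            ("rev_merge", "rev_a", 0),
            ("rev_a", "rev_root", 0),
            ("rev_b", "rev_root", 0),
        ],
    )
    conn.executemany(
        "INSERT INTO snapshot_branch VALUES (?, ?, ?, ?)",
        [
            ("snap_1", b"refs/heads/master", "rev_merge", "revision"),
            ("snap_1", b"refs/tags/v1", "rel_1", "release"),
        ],
    )
    return conn


def ordered_parents(conn: sqlite3.Connection, revision_id: str) -> list:
    """Return the parents of a revision, sorted by parent_rank."""
    rows = conn.execute(
        "SELECT parent_id FROM revision_history "
        "WHERE id = ? ORDER BY parent_rank",
        (revision_id,),
    ).fetchall()
    return [row[0] for row in rows]


def branch_head_parents(conn: sqlite3.Connection, snapshot_id: str) -> dict:
    """Map each branch that targets a revision to that revision's parents."""
    branches = conn.execute(
        "SELECT name, target FROM snapshot_branch "
        "WHERE snapshot_id = ? AND target_type = 'revision' ORDER BY name",
        (snapshot_id,),
    ).fetchall()
    return {name: ordered_parents(conn, target) for name, target in branches}
```

A revision with no rows in ``revision_history`` (here ``rev_root``) simply yields an empty list, which matches the "0 parents for the initial commit" case described in the schema. Branches whose ``target_type`` is not ``revision`` are filtered out instead of being joined against the wrong table.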