diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst index f4fa884..e2518f6 100644 --- a/docs/graph/schema.rst +++ b/docs/graph/schema.rst @@ -1,128 +1,142 @@ Relational schema ================= The Merkle DAG of the Software Heritage archive is encoded in the dataset as a set of relational tables. A simplified view of the corresponding database schema is shown here: .. image:: _images/db-schema.svg This page documents the details of the schema. - **content**: contains information on the contents stored in the archive. - - ``sha1`` (bytes): the SHA-1 of the content - - ``sha1_git`` (bytes): the Git SHA-1 of the content - - ``length`` (integer): the length of the content + - ``sha1`` (bytes): the SHA-1 of the content + - ``sha1_git`` (bytes): the Git SHA-1 of the content + - ``length`` (integer): the length of the content - **skipped_content**: contains information on the contents that were not archived for various reasons. - - ``sha1`` (bytes): the SHA-1 of the missing content - - ``sha1_git`` (bytes): the Git SHA-1 of the missing content - - ``length`` (integer): the length of the missing content + - ``sha1`` (bytes): the SHA-1 of the missing content + - ``sha1_git`` (bytes): the Git SHA-1 of the missing content + - ``length`` (integer): the length of the missing content - **directory**: contains the directories stored in the archive. - - ``id`` (bytes): the intrinsic identifier of the directory, recursively - computed with the Git SHA-1 algorithm - - ``dir_entries`` (array of integers): the list of directories contained in - this directory, as references to an entry in the ``directory_entry_dir`` - table. - - ``file_entries`` (array of integers): the list of files contained in - this directory, as references to an entry in the ``directory_entry_file`` - table. - - ``rev_entries`` (array of integers): the list of revisions contained in - this directory, as references to an entry in the ``directory_entry_rev`` - table. 
+ - ``id`` (bytes): the intrinsic identifier of the directory, recursively + computed with the Git SHA-1 algorithm + - ``dir_entries`` (array of integers): the list of directories contained in + this directory, as references to an entry in the ``directory_entry_dir`` + table. + - ``file_entries`` (array of integers): the list of files contained in + this directory, as references to an entry in the ``directory_entry_file`` + table. + - ``rev_entries`` (array of integers): the list of revisions contained in + this directory, as references to an entry in the ``directory_entry_rev`` + table. - **directory_entry_file**: contains informations about file entries in directories. - - ``id`` (integer): unique identifier for the entry - - ``target`` (bytes): the Git SHA-1 of the content this entry points to - - ``name`` (bytes): the name of the file (basename of its path) - - ``perms`` (integer): the permissions of the file + - ``id`` (integer): unique identifier for the entry + - ``target`` (bytes): the Git SHA-1 of the content this entry points to + - ``name`` (bytes): the name of the file (basename of its path) + - ``perms`` (integer): the permissions of the file - **directory_entry_dir**: contains informations about directory entries in directories. - - ``id`` (integer): unique identifier for the entry - - ``target`` (bytes): the Git SHA-1 of the directory this entry points to - - ``name`` (bytes): the name of the directory - - ``perms`` (integer): the permissions of the directory + - ``id`` (integer): unique identifier for the entry + - ``target`` (bytes): the Git SHA-1 of the directory this entry points to + - ``name`` (bytes): the name of the directory + - ``perms`` (integer): the permissions of the directory - **directory_entry_rev**: contains informations about revision entries in directories. 
- - ``id`` (integer): unique identifier for the entry - - ``target`` (bytes): the Git SHA-1 of the revision this entry points to - - ``name`` (bytes): the name of the directory that contains this revision - - ``perms`` (integer): the permissions of the revision + - ``id`` (integer): unique identifier for the entry + - ``target`` (bytes): the Git SHA-1 of the revision this entry points to + - ``name`` (bytes): the name of the directory that contains this revision + - ``perms`` (integer): the permissions of the revision -- **revision**: contains the revisions stored in the archive. +- **person**: deduplicates commit authors by their names and e-mail addresses. + For pseudonymization purposes and in order to prevent abuse, these columns + were removed from the dataset, and this table only contains the ID of the + author. Individual authors may be retrieved using this ID from the Software + Heritage API. + + - ``id`` (integer): the identifier of the person - - ``id`` (bytes): the intrinsic identifier of the revision, recursively - computed with the Git SHA-1 algorithm. For Git repositories, this - corresponds to the commit hash. - -- The ``revision`` table contains all the revisions, identified by - their intrinsic hash in the ``id`` field. Each revision points to the - root directory of the project source tree, identified by the - ``directory`` field which references the ``sha1_git`` cryptographic - hash of the directory. The table also contains metadata on the - revisions, notably the ``author`` and ``committer`` fields, the - ``date`` and ``committer_date`` fields and the ``message`` field. - - Each revision has an ordered set of parents (0 for the initial commit - of a repository, 1 for a normal commit and 2 or more for a merge - commit). These parents are stored in the ``revision_history`` table, - one row per parent. 
Each parent is identified by the ``id`` - identifier, pointing to the hash of the revision, the ``parent_id`` - identifier, pointing to the hash of the parent revision, and the - ``parent_rank`` integer which defines the order of the parents of - each revision. - -- The ``person`` table deduplicates commit authors by their name and - e-mail addresses. For pseudonymization purposes and in order to - prevent abuse, these columns were removed from the dataset, and this - table only contains the ``id`` column referenced by the ``author`` - and ``committer`` fields of the ``revision`` table. Individual - authors may be retrieved using this ID from the Software Heritage - api. - -- The ``release`` table contains the releases in the archive. They are - also identified by their intrinsic hash ``id`` and point to a - revision referenced by its hash in the ``target`` field. The metadata - fields are semantically similar to the ``revision`` table (i.e - ``author``, ``date``, ``message``). - -- The ``snapshot`` table contains the list of snapshots identified by - their intrinsic hash ``id``, and their integer primary key in the - archive ``object_id``. Each snapshot maps to a list of branches - listed in the table ``snapshot_branch`` through the many-to-many - relationship intermediate table ``snapshot_branches``, which - references the ``object_id`` fields of the ``snapshot`` and - ``snapshot_branch`` tables. The ``snapshot_branch`` table also - contains the ``name`` of the branch and the ``target`` it points to - (identified by its intrinsic hash), either a ``release``, - ``revision``, ``directory`` or ``content`` object depending on the - value of the ``target_type`` field. - -In addition to the nodes and edges of the graph, the dataset also -contains crawling information, as a set of triples capturing where (an -origin url) and when (a timestamp) a given snapshot has been -encountered. 
- - The ``origin`` table contains the origins from which the software - projects in the dataset were archived, identified by their ``id`` - identifier, and ``type`` and ``url`` metadata. - - Since Software Heritage archives software continuously, software - origins are crawled more than once. Every “visit” of an origin is - stored in the ``origin_visit`` table, which contains the identifier - ``origin`` of the origin visited, the ``date`` of the visit and a - ``snapshot_id`` integer which points to the ``object_id`` identifier - of the ``snapshot`` table. +- **revision**: contains the revisions stored in the archive. + - ``id`` (bytes): the intrinsic identifier of the revision, recursively + computed with the Git SHA-1 algorithm. For Git repositories, this + corresponds to the commit hash. + - ``date`` (timestamp): the date the revision was authored + - ``committer_date`` (timestamp): the date the revision was committed + - ``author`` (integer): the author of the revision + - ``committer`` (integer): the committer of the revision + - ``message`` (bytes): the revision message + - ``directory`` (bytes): the Git SHA-1 of the directory the revision points + to. Every revision points to the root directory of the project source + tree to which it corresponds. + +- **revision_history**: contains the ordered set of parents of each revision. + Each revision has an ordered set of parents (0 for the initial commit of a + repository, 1 for a regular commit, 2 for a regular merge commit and 3 or + more for octopus-style merge commits). + + - ``id`` (bytes): the Git SHA-1 identifier of the revision + - ``parent_id`` (bytes): the Git SHA-1 identifier of the parent + - ``parent_rank`` (integer): the rank of the parent which defines the total + order of the parents of the revision + +- **release**: contains the releases stored in the archive. + + - ``id`` (bytes): the intrinsic identifier of the release, recursively + computed with the Git SHA-1 algorithm. 
+ - ``target`` (bytes): the Git SHA-1 of the object the release points to. + - ``date`` (timestamp): the date the release was created + - ``author`` (integer): the author of the release + - ``name`` (bytes): the release name + - ``message`` (bytes): the release message + +- **snapshot**: contains the list of snapshots stored in the archive. + + - ``id`` (bytes): the intrinsic identifier of the snapshot, recursively + computed with the Git SHA-1 algorithm. + - ``object_id`` (integer): the primary key of the snapshot + +- **snapshot_branches**: contains the identifiers of branches associated with + each snapshot. This is an intermediary table that represents the + many-to-many relationship between snapshots and branches. + + - ``snapshot_id`` (integer): the integer identifier of the snapshot + - ``branch_id`` (integer): the identifier of the branch + +- **snapshot_branch**: contains the list of branches. + + - ``object_id`` (integer): the identifier of the branch + - ``name`` (bytes): the name of the branch + - ``target`` (bytes): the Git SHA-1 of the object the branch points to + - ``target_type`` (string): the type of object the branch points to (either + ``release``, ``revision``, ``directory`` or ``content``). + +- **origin**: the software origins from which the projects in the dataset were + archived. + + - ``id`` (integer): the identifier of the origin + - ``url`` (bytes): the URL of the origin + - ``type`` (string): the type of origin (e.g. ``git``, ``pypi``, ``hg``, + ``svn``, ``ftp``, ``deb``, ...) + +- **origin_visit**: the different visits of each origin. Since Software + Heritage archives software continuously, software origins are crawled more + than once. Each of these "visits" is an entry in this table. + + - ``origin`` (integer): the identifier of the origin visited + - ``date`` (timestamp): the date at which the origin was visited + - ``snapshot_id`` (integer): the integer identifier of the snapshot archived + in this visit.
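
Reviewer note: the ``revision_history`` description in this patch may be easier to follow with a small worked example. The sketch below is not part of the dataset tooling. It assumes a throwaway in-memory SQLite database, a reduced ``revision`` table with only ``id`` and ``message``, short placeholder identifiers instead of real 20-byte hashes, and ``parent_rank`` values starting at 0. The helper names (``build_demo_db``, ``parents``, ``parent_counts``) are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding a tiny, made-up revision graph."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (id BLOB PRIMARY KEY, message BLOB);
        CREATE TABLE revision_history (
            id BLOB,             -- the revision
            parent_id BLOB,      -- one of its parents
            parent_rank INTEGER  -- position of that parent (assumed 0-based)
        );
        """
    )
    # Placeholder identifiers; real ones are Git SHA-1 hashes.
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?)",
        [(b"r0", b"initial"), (b"r1", b"feature"),
         (b"r2", b"fix"), (b"r3", b"merge")],
    )
    # One row per (revision, parent) pair; r3 is a two-parent merge.
    conn.executemany(
        "INSERT INTO revision_history VALUES (?, ?, ?)",
        [(b"r1", b"r0", 0), (b"r2", b"r0", 0),
         (b"r3", b"r2", 1), (b"r3", b"r1", 0)],
    )
    return conn


def parents(conn, rev_id):
    """Return the parents of a revision, ordered by parent_rank."""
    rows = conn.execute(
        "SELECT parent_id FROM revision_history "
        "WHERE id = ? ORDER BY parent_rank",
        (rev_id,),
    )
    return [row[0] for row in rows]


def parent_counts(conn):
    """Map every revision to its number of parents (0 for root commits)."""
    rows = conn.execute(
        """
        SELECT r.id, COUNT(h.parent_id)
        FROM revision r
        LEFT JOIN revision_history h ON h.id = r.id
        GROUP BY r.id
        """
    )
    return dict(rows)
```

The ``LEFT JOIN`` is what keeps initial commits in the result: they have no rows in ``revision_history``, so an inner join would silently drop them.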