diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst index d82bb55a..b3d1a84c 100644 --- a/docs/extrinsic-metadata-specification.rst +++ b/docs/extrinsic-metadata-specification.rst @@ -1,251 +1,251 @@ :orphan: .. _extrinsic-metadata-specification: Extrinsic metadata specification ================================ :term:`Extrinsic metadata` is information about software that is not part of the source code itself but still closely related to the software. Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information available at source code archival time. Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed. This specification assumes the reader is familiar with Software Heritage's :ref:`architecture` and :ref:`data-model`. Metadata sources ---------------- Authorities ^^^^^^^^^^^ Metadata authorities are entities that provide metadata about an :term:`origin`. Metadata authorities include: code hosting places, :term:`deposit` submitters, and registries (eg. Wikidata). An authority is uniquely defined by these properties: * its type, representing the kind of authority, which is one of these values: - * `deposit`, for metadata pushed to Software Heritage at the same time + * `deposit_client`, for metadata pushed to Software Heritage at the same time as a software artifact * `forge`, for metadata pulled from the same source as the one hosting the software artifacts (which includes package managers) * `registry`, for metadata pulled from a third-party * its URL, which unambiguously identifies an instance of the authority type. Examples: =============== ================================= type url =============== ================================= -deposit https://hal.archives-ouvertes.fr/ -deposit https://hal.inria.fr/ -deposit https://software.intel.com/ +deposit_client https://hal.archives-ouvertes.fr/ +deposit_client https://hal.inria.fr/ +deposit_client https://software.intel.com/ forge https://gitlab.com/ forge https://gitlab.inria.fr/ forge https://0xacab.org/ forge https://github.com/ registry https://www.wikidata.org/ registry https://swmath.org/ registry https://ascl.net/ =============== ================================= Metadata fetchers ^^^^^^^^^^^^^^^^^ Metadata fetchers are software components used to fetch metadata from a metadata authority, and ingest them into the Software Heritage archive. A metadata fetcher is uniquely defined by these properties: * its type * its version Examples: * :term:`loaders <loader>`, which may either discover metadata as a side-effect of loading source code, or be dedicated to fetching metadata. * :term:`listers <lister>`, which may discover metadata as a side-effect of discovering origins. * :term:`deposit` submitters, which push metadata to SWH from a third-party; usually at the same time as a :term:`software artifact` * crawlers, which fetch metadata from an authority in a way that is none of the above (eg. by querying a specific API of the origin's forge). Storage API ----------- Authorities and metadata fetchers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The :term:`storage` API offers these endpoints to manipulate metadata authorities and metadata fetchers: * ``metadata_authority_add(type, url, metadata)`` which adds a new metadata authority to the storage. 
* ``metadata_authority_get(type, url)``
  which looks up a known authority (there is at most one) and if it is known,
  returns a dictionary with keys ``type``, ``url``, and ``metadata``.

* ``metadata_fetcher_add(name, version, metadata)``
  which adds a new metadata fetcher to the storage.

* ``metadata_fetcher_get(name, version)``
  which looks up a known fetcher (there is at most one) and if it is known,
  returns a dictionary with keys ``name``, ``version``, and ``metadata``.

These `metadata` fields contain JSON-encodable dictionaries with information
about the authority/fetcher, in a format specific to each authority/fetcher.
For authorities, the `metadata` field is reserved for information describing
and qualifying the authority.
For fetchers, the `metadata` field is reserved for configuration metadata and
other technical usage.

Origin metadata
^^^^^^^^^^^^^^^

Extrinsic metadata are stored in SWH's :term:`storage database`. The storage
API offers three endpoints to manipulate origin metadata:

* Adding metadata::

      origin_metadata_add(origin_url, discovery_date,
                          authority, fetcher,
                          format, metadata)

  which adds a new `metadata` byte string obtained from a given authority and
  associated with the origin.
  `discovery_date` is a Python datetime.
  `authority` must be a dict containing keys `type` and `url`, and `fetcher`
  a dict containing keys `name` and `version`.
  The authority and fetcher must be known to the storage before using this
  endpoint.
  `format` is a text field indicating the format of the content of the
  `metadata` byte string.

* Getting latest metadata::

      origin_metadata_get_latest(origin_url, authority)

  where `authority` must be a dict containing keys `type` and `url`, which
  returns a dictionary corresponding to the latest metadata entry added for
  this origin, in the format::

      {
        'origin_url': ...,
        'authority': {'type': ..., 'url': ...},
        'fetcher': {'name': ..., 'version': ...},
        'discovery_date': ...,
        'format': '...',
        'metadata': b'...'
      }

* Getting all metadata::

      origin_metadata_get(origin_url, authority,
                          page_token, limit)

  where `authority` must be a dict containing keys `type` and `url`, which
  returns a dictionary with keys:

  * `next_page_token`, which is an opaque token to be used as `page_token`
    for retrieving the next page. If absent, there are no more pages to
    gather.
  * `results`: list of dictionaries, one for each metadata item deposited,
    corresponding to the given origin and obtained from the specified
    authority.

  Each of these dictionaries is in the following format::

      {
        'authority': {'type': ..., 'url': ...},
        'fetcher': {'name': ..., 'version': ...},
        'discovery_date': ...,
        'format': '...',
        'metadata': b'...'
      }

The parameters ``page_token`` and ``limit`` are used for pagination based on
an arbitrary order. An initial query to ``origin_metadata_get`` must set
``page_token`` to ``None``, and further queries must use the value from the
previous query's ``next_page_token`` to get the next page of results.

``metadata`` is a bytes array (possibly encoded using Base64).
Its format is specific to each authority, and is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope of
this specification.
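To make the flow above concrete, here is a minimal usage sketch written
against the dict-based endpoints described in this section. The ``storage``
object, the authority and fetcher values, the format string, and the metadata
payload are assumptions made up for the example; they are not part of this
specification::

    import datetime

    # `storage` is assumed to be any client exposing the storage API above.
    authority = {"type": "forge", "url": "https://gitlab.example.org/"}
    fetcher = {"name": "example-loader", "version": "1.0.0"}
    origin_url = "https://gitlab.example.org/example/project"

    # Authorities and fetchers must be registered before adding metadata.
    storage.metadata_authority_add(authority["type"], authority["url"], metadata={})
    storage.metadata_fetcher_add(fetcher["name"], fetcher["version"], metadata={})

    storage.origin_metadata_add(
        origin_url,
        datetime.datetime.now(tz=datetime.timezone.utc),
        authority,
        fetcher,
        format="application/json",
        metadata=b'{"description": "An example project"}',
    )

    # Latest entry from this authority, if any.
    latest = storage.origin_metadata_get_latest(origin_url, authority)

    # All entries, one page at a time.
    page_token = None
    while True:
        page = storage.origin_metadata_get(origin_url, authority,
                                           page_token=page_token, limit=10)
        for entry in page["results"]:
            ...  # entry["format"], entry["metadata"], entry["discovery_date"], ...
        page_token = page.get("next_page_token")
        if page_token is None:
            break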
Artifact metadata
^^^^^^^^^^^^^^^^^

In addition to origin metadata, the storage database stores metadata on all
software artifacts supported by the data model.

This works similarly to origin metadata, with one major difference: extrinsic
metadata can be given on a specific artifact within a specified context (for
example: a directory in a specific revision from a specific visit on a
specific origin) which will be stored alongside the metadata itself.

For example, two origins may develop the same file independently; the
information about authorship, licensing, or even description may vary for the
same artifact depending on the context. This is why it is important to
qualify the metadata with the complete context for which it is intended, if
any.

For each artifact type ``<X>``, there are two endpoints to manipulate
metadata associated with artifacts of that type:

* Adding metadata::

      <X>_metadata_add(id, context, discovery_date,
                       authority, fetcher,
                       format, metadata)

* Getting all metadata::

      <X>_metadata_get(id, authority,
                       after,
                       page_token, limit)

defined similarly to ``origin_metadata_add`` and ``origin_metadata_get``,
but where ``id`` is a core SWHID (with type matching ``<X>``), and with an
extra ``context`` (argument when adding metadata, and dictionary key when
getting them) that is a dictionary with keys depending on the artifact type
``<X>``:

* for ``snapshot``: ``origin`` (a URL) and ``visit`` (an integer)
* for ``release``: those above, plus ``snapshot``
  (the core SWHID of a snapshot)
* for ``revision``: all those above, plus ``release``
  (the core SWHID of a release)
* for ``directory``: all those above, plus ``revision``
  (the core SWHID of a revision)
  and ``path`` (a byte string), representing the path to this directory
  from the root of the ``revision``
* for ``content``: all those above, plus ``directory``
  (the core SWHID of a directory)

All keys are optional, but should be provided whenever possible.
The dictionary may be empty, if metadata is fully independent of its context.

In all cases, ``visit`` should only be provided if ``origin`` is
(as visit ids are only unique with respect to an origin).

diff --git a/swh/storage/interface.py b/swh/storage/interface.py
index cd5c214b..ffe1ef6e 100644
--- a/swh/storage/interface.py
+++ b/swh/storage/interface.py
@@ -1,1293 +1,1293 @@
# Copyright (C) 2015-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import datetime

from typing import Any, Dict, Iterable, List, Optional, Union

from swh.core.api import remote_api_endpoint
from swh.model.identifiers import SWHID
from swh.model.model import (
    Content,
    Directory,
    Origin,
    OriginVisit,
    OriginVisitStatus,
    Revision,
    Release,
    Snapshot,
    SkippedContent,
    MetadataAuthority,
    MetadataAuthorityType,
    MetadataFetcher,
    MetadataTargetType,
    RawExtrinsicMetadata,
)


def deprecated(f):
    f.deprecated_endpoint = True
    return f


class StorageInterface:
    @remote_api_endpoint("check_config")
    def check_config(self, *, check_write):
        """Check that the storage is configured and ready to go."""
        ...

    @remote_api_endpoint("content/add")
    def content_add(self, content: Iterable[Content]) -> Dict:
        """Add content blobs to the storage

        Args:
            contents (iterable): iterable of dictionaries representing
                individual pieces of content to add.
Each dictionary has the following keys: - data (bytes): the actual content - length (int): content length - one key for each checksum algorithm in :data:`swh.model.hashutil.ALGORITHMS`, mapped to the corresponding checksum - status (str): one of visible, hidden Raises: The following exceptions can occur: - HashCollision in case of collision - Any other exceptions raise by the db In case of errors, some of the content may have been stored in the DB and in the objstorage. Since additions to both idempotent, that should not be a problem. Returns: Summary dict with the following keys and associated values: content:add: New contents added content:add:bytes: Sum of the contents' length data """ ... @remote_api_endpoint("content/update") def content_update(self, content, keys=[]): """Update content blobs to the storage. Does nothing for unknown contents or skipped ones. Args: content (iterable): iterable of dictionaries representing individual pieces of content to update. Each dictionary has the following keys: - data (bytes): the actual content - length (int): content length (default: -1) - one key for each checksum algorithm in :data:`swh.model.hashutil.ALGORITHMS`, mapped to the corresponding checksum - status (str): one of visible, hidden, absent keys (list): List of keys (str) whose values needs an update, e.g., new hash column """ ... @remote_api_endpoint("content/add_metadata") def content_add_metadata(self, content: Iterable[Content]) -> Dict: """Add content metadata to the storage (like `content_add`, but without inserting to the objstorage). Args: content (iterable): iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys: - length (int): content length (default: -1) - one key for each checksum algorithm in :data:`swh.model.hashutil.ALGORITHMS`, mapped to the corresponding checksum - status (str): one of visible, hidden, absent - reason (str): if status = absent, the reason why - origin (int): if status = absent, the origin we saw the content in - ctime (datetime): time of insertion in the archive Returns: Summary dict with the following key and associated values: content:add: New contents added skipped_content:add: New skipped contents (no data) added """ ... @remote_api_endpoint("content/data") def content_get(self, content): """Retrieve in bulk contents and their data. This generator yields exactly as many items than provided sha1 identifiers, but callers should not assume this will always be true. It may also yield `None` values in case an object was not found. Args: content: iterables of sha1 Yields: Dict[str, bytes]: Generates streams of contents as dict with their raw data: - sha1 (bytes): content id - data (bytes): content's raw data Raises: ValueError in case of too much contents are required. cf. BULK_BLOCK_CONTENT_LEN_MAX """ ... @deprecated @remote_api_endpoint("content/range") def content_get_range(self, start, end, limit=1000): """Retrieve contents within range [start, end] bound by limit. Note that this function may return more than one blob per hash. The limit is enforced with multiplicity (ie. two blobs with the same hash will count twice toward the limit). Args: **start** (bytes): Starting identifier range (expected smaller than end) **end** (bytes): Ending identifier range (expected larger than start) **limit** (int): Limit result (default to 1000) Returns: a dict with keys: - contents [dict]: iterable of contents in between the range. 
- next (bytes): There remains content in the range starting from this next sha1 """ ... @remote_api_endpoint("content/partition") def content_get_partition( self, partition_id: int, nb_partitions: int, limit: int = 1000, page_token: str = None, ): """Splits contents into nb_partitions, and returns one of these based on partition_id (which must be in [0, nb_partitions-1]) There is no guarantee on how the partitioning is done, or the result order. Args: partition_id (int): index of the partition to fetch nb_partitions (int): total number of partitions to split into limit (int): Limit result (default to 1000) page_token (Optional[str]): opaque token used for pagination. Returns: a dict with keys: - contents (List[dict]): iterable of contents in the partition. - **next_page_token** (Optional[str]): opaque token to be used as `page_token` for retrieving the next page. if absent, there is no more pages to gather. """ ... @remote_api_endpoint("content/metadata") def content_get_metadata(self, contents: List[bytes]) -> Dict[bytes, List[Dict]]: """Retrieve content metadata in bulk Args: content: iterable of content identifiers (sha1) Returns: a dict with keys the content's sha1 and the associated value either the existing content's metadata or None if the content does not exist. """ ... @remote_api_endpoint("content/missing") def content_missing(self, content, key_hash="sha1"): """List content missing from storage Args: content ([dict]): iterable of dictionaries whose keys are either 'length' or an item of :data:`swh.model.hashutil.ALGORITHMS`; mapped to the corresponding checksum (or length). key_hash (str): name of the column to use as hash id result (default: 'sha1') Returns: iterable ([bytes]): missing content ids (as per the key_hash column) Raises: TODO: an exception when we get a hash collision. """ ... @remote_api_endpoint("content/missing/sha1") def content_missing_per_sha1(self, contents): """List content missing from storage based only on sha1. Args: contents: Iterable of sha1 to check for absence. Returns: iterable: missing ids Raises: TODO: an exception when we get a hash collision. """ ... @remote_api_endpoint("content/missing/sha1_git") def content_missing_per_sha1_git(self, contents): """List content missing from storage based only on sha1_git. Args: contents (Iterable): An iterable of content id (sha1_git) Yields: missing contents sha1_git """ ... @remote_api_endpoint("content/present") def content_find(self, content): """Find a content hash in db. Args: content: a dictionary representing one content hash, mapping checksum algorithm names (see swh.model.hashutil.ALGORITHMS) to checksum values Returns: a triplet (sha1, sha1_git, sha256) if the content exist or None otherwise. Raises: ValueError: in case the key of the dictionary is not sha1, sha1_git nor sha256. """ ... @remote_api_endpoint("content/get_random") def content_get_random(self): """Finds a random content id. Returns: a sha1_git """ ... @remote_api_endpoint("content/skipped/add") def skipped_content_add(self, content: Iterable[SkippedContent]) -> Dict: """Add contents to the skipped_content list, which contains (partial) information about content missing from the archive. Args: contents (iterable): iterable of dictionaries representing individual pieces of content to add. 
Each dictionary has the following keys: - length (Optional[int]): content length (default: -1) - one key for each checksum algorithm in :data:`swh.model.hashutil.ALGORITHMS`, mapped to the corresponding checksum; each is optional - status (str): must be "absent" - reason (str): the reason why the content is absent - origin (int): if status = absent, the origin we saw the content in Raises: The following exceptions can occur: - HashCollision in case of collision - Any other exceptions raise by the backend In case of errors, some content may have been stored in the DB and in the objstorage. Since additions to both idempotent, that should not be a problem. Returns: Summary dict with the following key and associated values: skipped_content:add: New skipped contents (no data) added """ ... @remote_api_endpoint("content/skipped/missing") def skipped_content_missing(self, contents): """List skipped_content missing from storage Args: content: iterable of dictionaries containing the data for each checksum algorithm. Returns: iterable: missing signatures """ ... @remote_api_endpoint("directory/add") def directory_add(self, directories: Iterable[Directory]) -> Dict: """Add directories to the storage Args: directories (iterable): iterable of dictionaries representing the individual directories to add. Each dict has the following keys: - id (sha1_git): the id of the directory to add - entries (list): list of dicts for each entry in the directory. Each dict has the following keys: - name (bytes) - type (one of 'file', 'dir', 'rev'): type of the directory entry (file, directory, revision) - target (sha1_git): id of the object pointed at by the directory entry - perms (int): entry permissions Returns: Summary dict of keys with associated count as values: directory:add: Number of directories actually added """ ... @remote_api_endpoint("directory/missing") def directory_missing(self, directories): """List directories missing from storage Args: directories (iterable): an iterable of directory ids Yields: missing directory ids """ ... @remote_api_endpoint("directory/ls") def directory_ls(self, directory, recursive=False): """Get entries for one directory. Args: - directory: the directory to list entries from. - recursive: if flag on, this list recursively from this directory. Returns: List of entries for such directory. If `recursive=True`, names in the path of a dir/file not at the root are concatenated with a slash (`/`). """ ... @remote_api_endpoint("directory/path") def directory_entry_get_by_path(self, directory, paths): """Get the directory entry (either file or dir) from directory with path. Args: - directory: sha1 of the top level directory - paths: path to lookup from the top level directory. From left (top) to right (bottom). Returns: The corresponding directory entry if found, None otherwise. """ ... @remote_api_endpoint("directory/get_random") def directory_get_random(self): """Finds a random directory id. Returns: a sha1_git """ ... @remote_api_endpoint("revision/add") def revision_add(self, revisions: Iterable[Revision]) -> Dict: """Add revisions to the storage Args: revisions (Iterable[dict]): iterable of dictionaries representing the individual revisions to add. 
Each dict has the following keys: - **id** (:class:`sha1_git`): id of the revision to add - **date** (:class:`dict`): date the revision was written - **committer_date** (:class:`dict`): date the revision got added to the origin - **type** (one of 'git', 'tar'): type of the revision added - **directory** (:class:`sha1_git`): the directory the revision points at - **message** (:class:`bytes`): the message associated with the revision - **author** (:class:`Dict[str, bytes]`): dictionary with keys: name, fullname, email - **committer** (:class:`Dict[str, bytes]`): dictionary with keys: name, fullname, email - **metadata** (:class:`jsonb`): extra information as dictionary - **synthetic** (:class:`bool`): revision's nature (tarball, directory creates synthetic revision`) - **parents** (:class:`list[sha1_git]`): the parents of this revision date dictionaries have the form defined in :mod:`swh.model`. Returns: Summary dict of keys with associated count as values revision:add: New objects actually stored in db """ ... @remote_api_endpoint("revision/missing") def revision_missing(self, revisions): """List revisions missing from storage Args: revisions (iterable): revision ids Yields: missing revision ids """ ... @remote_api_endpoint("revision") def revision_get(self, revisions): """Get all revisions from storage Args: revisions: an iterable of revision ids Returns: iterable: an iterable of revisions as dictionaries (or None if the revision doesn't exist) """ ... @remote_api_endpoint("revision/log") def revision_log(self, revisions, limit=None): """Fetch revision entry from the given root revisions. Args: revisions: array of root revision to lookup limit: limitation on the output result. Default to None. Yields: List of revision log from such revisions root. """ ... @remote_api_endpoint("revision/shortlog") def revision_shortlog(self, revisions, limit=None): """Fetch the shortlog for the given revisions Args: revisions: list of root revisions to lookup limit: depth limitation for the output Yields: a list of (id, parents) tuples. """ ... @remote_api_endpoint("revision/get_random") def revision_get_random(self): """Finds a random revision id. Returns: a sha1_git """ ... @remote_api_endpoint("release/add") def release_add(self, releases: Iterable[Release]) -> Dict: """Add releases to the storage Args: releases (Iterable[dict]): iterable of dictionaries representing the individual releases to add. Each dict has the following keys: - **id** (:class:`sha1_git`): id of the release to add - **revision** (:class:`sha1_git`): id of the revision the release points to - **date** (:class:`dict`): the date the release was made - **name** (:class:`bytes`): the name of the release - **comment** (:class:`bytes`): the comment associated with the release - **author** (:class:`Dict[str, bytes]`): dictionary with keys: name, fullname, email the date dictionary has the form defined in :mod:`swh.model`. Returns: Summary dict of keys with associated count as values release:add: New objects contents actually stored in db """ ... @remote_api_endpoint("release/missing") def release_missing(self, releases): """List releases missing from storage Args: releases: an iterable of release ids Returns: a list of missing release ids """ ... @remote_api_endpoint("release") def release_get(self, releases): """Given a list of sha1, return the releases's information Args: releases: list of sha1s Yields: dicts with the same keys as those given to `release_add` (or ``None`` if a release does not exist) """ ... 
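# Illustrative sketch (not part of the interface): reading back revisions and
# releases with the endpoints documented above. `storage` is assumed to be any
# object implementing StorageInterface; the identifiers are made-up sha1_git
# values used only for this example.
root_revision = bytes.fromhex("aa" * 20)  # hypothetical revision id

# revision_get yields one dict per requested id, or None for unknown ids.
known_revisions = [r for r in storage.revision_get([root_revision]) if r is not None]

# revision_log walks the ancestry (parents, grand-parents, ...) reachable from
# the given root revisions, optionally bounded by `limit`.
for revision in storage.revision_log([root_revision], limit=100):
    ...

# release_missing filters out identifiers already present in the archive,
# which is a common client-side pattern before fetching or re-adding them.
candidate_releases = [bytes.fromhex("bb" * 20), bytes.fromhex("cc" * 20)]
releases_to_fetch = list(storage.release_missing(candidate_releases))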
@remote_api_endpoint("release/get_random") def release_get_random(self): """Finds a random release id. Returns: a sha1_git """ ... @remote_api_endpoint("snapshot/add") def snapshot_add(self, snapshots: Iterable[Snapshot]) -> Dict: """Add snapshots to the storage. Args: snapshot ([dict]): the snapshots to add, containing the following keys: - **id** (:class:`bytes`): id of the snapshot - **branches** (:class:`dict`): branches the snapshot contains, mapping the branch name (:class:`bytes`) to the branch target, itself a :class:`dict` (or ``None`` if the branch points to an unknown object) - **target_type** (:class:`str`): one of ``content``, ``directory``, ``revision``, ``release``, ``snapshot``, ``alias`` - **target** (:class:`bytes`): identifier of the target (currently a ``sha1_git`` for all object kinds, or the name of the target branch for aliases) Raises: ValueError: if the origin or visit id does not exist. Returns: Summary dict of keys with associated count as values snapshot:add: Count of object actually stored in db """ ... @remote_api_endpoint("snapshot/missing") def snapshot_missing(self, snapshots): """List snapshots missing from storage Args: snapshots (iterable): an iterable of snapshot ids Yields: missing snapshot ids """ ... @remote_api_endpoint("snapshot") def snapshot_get(self, snapshot_id): """Get the content, possibly partial, of a snapshot with the given id The branches of the snapshot are iterated in the lexicographical order of their names. .. warning:: At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method :meth:`snapshot_get_branches` should be used instead. Args: snapshot_id (bytes): identifier of the snapshot Returns: dict: a dict with three keys: * **id**: identifier of the snapshot * **branches**: a dict of branches contained in the snapshot whose keys are the branches' names. * **next_branch**: the name of the first branch not returned or :const:`None` if the snapshot has less than 1000 branches. """ ... @remote_api_endpoint("snapshot/by_origin_visit") def snapshot_get_by_origin_visit(self, origin, visit): """Get the content, possibly partial, of a snapshot for the given origin visit The branches of the snapshot are iterated in the lexicographical order of their names. .. warning:: At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method :meth:`snapshot_get_branches` should be used instead. Args: origin (int): the origin identifier visit (int): the visit identifier Returns: dict: None if the snapshot does not exist; a dict with three keys otherwise: * **id**: identifier of the snapshot * **branches**: a dict of branches contained in the snapshot whose keys are the branches' names. * **next_branch**: the name of the first branch not returned or :const:`None` if the snapshot has less than 1000 branches. """ ... @remote_api_endpoint("snapshot/count_branches") def snapshot_count_branches(self, snapshot_id): """Count the number of branches in the snapshot with the given id Args: snapshot_id (bytes): identifier of the snapshot Returns: dict: A dict whose keys are the target types of branches and values their corresponding amount """ ... 
@remote_api_endpoint("snapshot/get_branches") def snapshot_get_branches( self, snapshot_id, branches_from=b"", branches_count=1000, target_types=None ): """Get the content, possibly partial, of a snapshot with the given id The branches of the snapshot are iterated in the lexicographical order of their names. Args: snapshot_id (bytes): identifier of the snapshot branches_from (bytes): optional parameter used to skip branches whose name is lesser than it before returning them branches_count (int): optional parameter used to restrain the amount of returned branches target_types (list): optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are `'content', 'directory', 'revision', 'release', 'snapshot', 'alias'`) Returns: dict: None if the snapshot does not exist; a dict with three keys otherwise: * **id**: identifier of the snapshot * **branches**: a dict of branches contained in the snapshot whose keys are the branches' names. * **next_branch**: the name of the first branch not returned or :const:`None` if the snapshot has less than `branches_count` branches after `branches_from` included. """ ... @remote_api_endpoint("snapshot/get_random") def snapshot_get_random(self): """Finds a random snapshot id. Returns: a sha1_git """ ... @remote_api_endpoint("origin/visit/add") def origin_visit_add(self, visits: Iterable[OriginVisit]) -> Iterable[OriginVisit]: """Add visits to storage. If the visits have no id, they will be created and assigned one. The resulted visits are visits with their visit id set. Args: visits: Iterable of OriginVisit objects to add Raises: StorageArgumentException if some origin visit reference unknown origins Returns: Iterable[OriginVisit] stored """ ... @remote_api_endpoint("origin/visit_status/add") def origin_visit_status_add( self, visit_statuses: Iterable[OriginVisitStatus], ) -> None: """Add origin visit statuses. If there is already a status for the same origin and visit id at the same date, the new one will be either dropped or will replace the existing one (it is unspecified which one of these two behaviors happens). Args: visit_statuses: origin visit statuses to add Raises: StorageArgumentException if the origin of the visit status is unknown """ ... @remote_api_endpoint("origin/visit/get") def origin_visit_get( self, origin: str, last_visit: Optional[int] = None, limit: Optional[int] = None, order: str = "asc", ) -> Iterable[Dict[str, Any]]: """Retrieve all the origin's visit's information. Args: origin: The visited origin last_visit: Starting point from which listing the next visits Default to None limit: Number of results to return from the last visit. Default to None order: Order on visit id fields to list origin visits (default to asc) Yields: List of visits. """ ... @remote_api_endpoint("origin/visit/find_by_date") def origin_visit_find_by_date( self, origin: str, visit_date: datetime.datetime ) -> Optional[Dict[str, Any]]: """Retrieves the origin visit whose date is closest to the provided timestamp. In case of a tie, the visit with largest id is selected. Args: origin: origin (URL) visit_date: expected visit date Returns: A visit """ ... @remote_api_endpoint("origin/visit/getby") def origin_visit_get_by(self, origin: str, visit: int) -> Optional[Dict[str, Any]]: """Retrieve origin visit's information. Args: origin: origin (URL) visit: visit id Returns: The information on that particular (origin, visit) or None if it does not exist """ ... 
@remote_api_endpoint("origin/visit/get_latest") def origin_visit_get_latest( self, origin: str, type: Optional[str] = None, allowed_statuses: Optional[List[str]] = None, require_snapshot: bool = False, ) -> Optional[Dict[str, Any]]: """Get the latest origin visit for the given origin, optionally looking only for those with one of the given allowed_statuses or for those with a snapshot. Args: origin: origin URL type: Optional visit type to filter on (e.g git, tar, dsc, svn, hg, npm, pypi, ...) allowed_statuses: list of visit statuses considered to find the latest visit. For instance, ``allowed_statuses=['full']`` will only consider visits that have successfully run to completion. require_snapshot: If True, only a visit with a snapshot will be returned. Returns: dict: a dict with the following keys: - **origin**: the URL of the origin - **visit**: origin visit id - **type**: type of loader used for the visit - **date**: timestamp of such visit - **status**: Visit's new status - **metadata**: Data associated to the visit - **snapshot** (Optional[sha1_git]): identifier of the snapshot associated to the visit """ ... @remote_api_endpoint("origin/visit_status/get_latest") def origin_visit_status_get_latest( self, origin_url: str, visit: int, allowed_statuses: Optional[List[str]] = None, require_snapshot: bool = False, ) -> Optional[OriginVisitStatus]: """Get the latest origin visit status for the given origin visit, optionally looking only for those with one of the given allowed_statuses or with a snapshot. Args: origin: origin URL allowed_statuses: list of visit statuses considered to find the latest visit. Possible values are {created, ongoing, partial, full}. For instance, ``allowed_statuses=['full']`` will only consider visits that have successfully run to completion. require_snapshot: If True, only a visit with a snapshot will be returned. Returns: The OriginVisitStatus matching the criteria """ ... @remote_api_endpoint("origin/visit/get_random") def origin_visit_get_random(self, type: str) -> Optional[Dict[str, Any]]: """Randomly select one successful origin visit with <type> made in the last 3 months. Returns: dict representing an origin visit, in the same format as :py:meth:`origin_visit_get`. """ ... @remote_api_endpoint("object/find_by_sha1_git") def object_find_by_sha1_git(self, ids): """Return the objects found with the given ids. Args: ids: a generator of sha1_gits Returns: dict: a mapping from id to the list of objects found. Each object found is itself a dict with keys: - sha1_git: the input id - type: the type of object found """ ... @remote_api_endpoint("origin/get") def origin_get(self, origins): """Return origins, either all identified by their ids or all identified by tuples (type, url). If the url is given and the type is omitted, one of the origins with that url is returned. Args: origin: a list of dictionaries representing the individual origins to find. These dicts have the key url: - url (bytes): the url the origin points to Returns: dict: the origin dictionary with the keys: - id: origin's id - url: origin's url Raises: ValueError: if the url or the id don't exist. """ ... @remote_api_endpoint("origin/get_sha1") def origin_get_by_sha1(self, sha1s): """Return origins, identified by the sha1 of their URLs. Args: sha1s (list[bytes]): a list of sha1s Yields: dicts containing origin information as returned by :meth:`swh.storage.storage.Storage.origin_get`, or None if an origin matching the sha1 is not found. """ ... 
@deprecated @remote_api_endpoint("origin/get_range") def origin_get_range(self, origin_from=1, origin_count=100): """Retrieve ``origin_count`` origins whose ids are greater or equal than ``origin_from``. Origins are sorted by id before retrieving them. Args: origin_from (int): the minimum id of origins to retrieve origin_count (int): the maximum number of origins to retrieve Yields: dicts containing origin information as returned by :meth:`swh.storage.storage.Storage.origin_get`. """ ... @remote_api_endpoint("origin/list") def origin_list(self, page_token: Optional[str] = None, limit: int = 100) -> dict: """Returns the list of origins Args: page_token: opaque token used for pagination. limit: the maximum number of results to return Returns: dict: dict with the following keys: - **next_page_token** (str, optional): opaque token to be used as `page_token` for retrieving the next page. if absent, there is no more pages to gather. - **origins** (List[dict]): list of origins, as returned by `origin_get`. """ ... @remote_api_endpoint("origin/search") def origin_search( self, url_pattern, offset=0, limit=50, regexp=False, with_visit=False ): """Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is performed in a case insensitive way. Args: url_pattern (str): the string pattern to search for in origin urls offset (int): number of found origins to skip before returning results limit (int): the maximum number of found origins to return regexp (bool): if True, consider the provided pattern as a regular expression and return origins whose urls match it with_visit (bool): if True, filter out origins with no visit Yields: dicts containing origin information as returned by :meth:`swh.storage.storage.Storage.origin_get`. """ ... @deprecated @remote_api_endpoint("origin/count") def origin_count(self, url_pattern, regexp=False, with_visit=False): """Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is performed in a case insensitive way. Args: url_pattern (str): the string pattern to search for in origin urls regexp (bool): if True, consider the provided pattern as a regular expression and return origins whose urls match it with_visit (bool): if True, filter out origins with no visit Returns: int: The number of origins matching the search criterion. """ ... @remote_api_endpoint("origin/add_multi") def origin_add(self, origins: Iterable[Origin]) -> Dict[str, int]: """Add origins to the storage Args: origins: list of dictionaries representing the individual origins, with the following keys: - type: the origin type ('git', 'svn', 'deb', ...) - url (bytes): the url the origin points to Returns: Summary dict of keys with associated count as values origin:add: Count of object actually stored in db """ ... @deprecated @remote_api_endpoint("origin/add") def origin_add_one(self, origin: Origin) -> str: """Add origin to the storage Args: origin: dictionary representing the individual origin to add. This dict has the following keys: - type (FIXME: enum TBD): the origin type ('git', 'wget', ...) - url (bytes): the url the origin points to Returns: the id of the added origin, or of the identical one that already exists. """ ... def stat_counters(self): """compute statistics about the number of tuples in various tables Returns: dict: a dictionary mapping textual labels (e.g., content) to integer values (e.g., the number of tuples in table content) """ ... 
def refresh_stat_counters(self): """Recomputes the statistics for `stat_counters`.""" ... @remote_api_endpoint("object_metadata/add") def object_metadata_add(self, metadata: Iterable[RawExtrinsicMetadata],) -> None: """Add extrinsic metadata on objects (contents, directories, ...). The authority and fetcher must be known to the storage before using this endpoint. If there is already metadata for the same object, authority, fetcher, and at the same date; the new one will be either dropped or will replace the existing one (it is unspecified which one of these two behaviors happens). Args: metadata: iterable of RawExtrinsicMetadata objects to be inserted. """ ... @remote_api_endpoint("object_metadata/get") def object_metadata_get( self, object_type: MetadataTargetType, id: Union[str, SWHID], authority: MetadataAuthority, after: Optional[datetime.datetime] = None, page_token: Optional[bytes] = None, limit: int = 1000, ) -> Dict[str, Union[Optional[bytes], List[RawExtrinsicMetadata]]]: """Retrieve list of all object_metadata entries for the id Args: object_type: one of the values of swh.model.model.MetadataTargetType id: an URL if object_type is 'origin', else a core SWHID authority: a dict containing keys `type` and `url`. after: minimum discovery_date for a result to be returned page_token: opaque token, used to get the next page of results limit: maximum number of results to be returned Returns: dict with keys `next_page_token` and `results`. `next_page_token` is an opaque token that is used to get the next page of results, or `None` if there are no more results. `results` is a list of RawExtrinsicMetadata objects: """ ... @remote_api_endpoint("metadata_fetcher/add") def metadata_fetcher_add(self, fetchers: Iterable[MetadataFetcher],) -> None: """Add new metadata fetchers to the storage. Their `name` and `version` together are unique identifiers of this fetcher; and `metadata` is an arbitrary dict of JSONable data with information about this fetcher, which must not be `None` (but may be empty). Args: fetchers: iterable of MetadataFetcher to be inserted """ ... @remote_api_endpoint("metadata_fetcher/get") def metadata_fetcher_get( self, name: str, version: str ) -> Optional[MetadataFetcher]: """Retrieve information about a fetcher Args: name: the name of the fetcher version: version of the fetcher Returns: a MetadataFetcher object (with a non-None metadata field) if it is known, else None. """ ... @remote_api_endpoint("metadata_authority/add") def metadata_authority_add(self, authorities: Iterable[MetadataAuthority]) -> None: """Add new metadata authorities to the storage. Their `type` and `url` together are unique identifiers of this authority; and `metadata` is an arbitrary dict of JSONable data with information about this authority, which must not be `None` (but may be empty). Args: authorities: iterable of MetadataAuthority to be inserted """ ... @remote_api_endpoint("metadata_authority/get") def metadata_authority_get( self, type: MetadataAuthorityType, url: str ) -> Optional[MetadataAuthority]: """Retrieve information about an authority Args: - type: one of "deposit", "forge", or "registry" + type: one of "deposit_client", "forge", or "registry" url: unique URI identifying the authority Returns: a MetadataAuthority object (with a non-None metadata field) if it is known, else None. """ ... 
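# Illustrative sketch (not part of the interface): registering an authority
# and a fetcher, then attaching raw extrinsic metadata to an origin and
# reading it back through the endpoints documented above. `storage`, the URLs,
# the format string and the payload are assumptions for this example, and the
# exact field names accepted by the swh.model constructors are assumed to
# match the classes imported at the top of this module; check swh.model.model
# for the authoritative signatures.
import datetime

authority = MetadataAuthority(
    type=MetadataAuthorityType.FORGE,
    url="https://gitlab.example.org/",
    metadata={},
)
fetcher = MetadataFetcher(name="example-loader", version="1.0.0", metadata={})

# Both must be known to the storage before object_metadata_add is called.
storage.metadata_authority_add([authority])
storage.metadata_fetcher_add([fetcher])

storage.object_metadata_add([
    RawExtrinsicMetadata(
        type=MetadataTargetType.ORIGIN,
        id="https://gitlab.example.org/example/project",
        discovery_date=datetime.datetime.now(tz=datetime.timezone.utc),
        authority=authority,
        fetcher=fetcher,
        format="application/json",
        metadata=b'{"description": "An example project"}',
    )
])

result = storage.object_metadata_get(
    MetadataTargetType.ORIGIN,
    "https://gitlab.example.org/example/project",
    authority,
)
for item in result["results"]:
    ...  # RawExtrinsicMetadata objects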
@deprecated @remote_api_endpoint("algos/diff_directories") def diff_directories(self, from_dir, to_dir, track_renaming=False): """Compute the list of file changes introduced between two arbitrary directories (insertion / deletion / modification / renaming of files). Args: from_dir (bytes): identifier of the directory to compare from to_dir (bytes): identifier of the directory to compare to track_renaming (bool): whether or not to track files renaming Returns: A list of dict describing the introduced file changes (see :func:`swh.storage.algos.diff.diff_directories` for more details). """ ... @deprecated @remote_api_endpoint("algos/diff_revisions") def diff_revisions(self, from_rev, to_rev, track_renaming=False): """Compute the list of file changes introduced between two arbitrary revisions (insertion / deletion / modification / renaming of files). Args: from_rev (bytes): identifier of the revision to compare from to_rev (bytes): identifier of the revision to compare to track_renaming (bool): whether or not to track files renaming Returns: A list of dict describing the introduced file changes (see :func:`swh.storage.algos.diff.diff_directories` for more details). """ ... @deprecated @remote_api_endpoint("algos/diff_revision") def diff_revision(self, revision, track_renaming=False): """Compute the list of file changes introduced by a specific revision (insertion / deletion / modification / renaming of files) by comparing it against its first parent. Args: revision (bytes): identifier of the revision from which to compute the list of files changes track_renaming (bool): whether or not to track files renaming Returns: A list of dict describing the introduced file changes (see :func:`swh.storage.algos.diff.diff_directories` for more details). """ ... @remote_api_endpoint("clear/buffer") def clear_buffers(self, object_types: Optional[Iterable[str]] = None) -> None: """For backend storages (pg, storage, in-memory), this is a noop operation. For proxy storages (especially filter, buffer), this is an operation which cleans internal state. """ @remote_api_endpoint("flush") def flush(self, object_types: Optional[Iterable[str]] = None) -> Dict: """For backend storages (pg, storage, in-memory), this is expected to be a noop operation. For proxy storages (especially buffer), this is expected to trigger actual writes to the backend. """ ... 
diff --git a/swh/storage/sql/30-swh-schema.sql b/swh/storage/sql/30-swh-schema.sql index d267d380..bb7a8044 100644 --- a/swh/storage/sql/30-swh-schema.sql +++ b/swh/storage/sql/30-swh-schema.sql @@ -1,499 +1,499 @@ --- --- SQL implementation of the Software Heritage data model --- -- schema versions create table dbversion ( version int primary key, release timestamptz, description text ); comment on table dbversion is 'Details of current db version'; comment on column dbversion.version is 'SQL schema version'; comment on column dbversion.release is 'Version deployment timestamp'; comment on column dbversion.description is 'Release description'; -- latest schema version insert into dbversion(version, release, description) values(158, now(), 'Work In Progress'); -- a SHA1 checksum create domain sha1 as bytea check (length(value) = 20); -- a Git object ID, i.e., a Git-style salted SHA1 checksum create domain sha1_git as bytea check (length(value) = 20); -- a SHA256 checksum create domain sha256 as bytea check (length(value) = 32); -- a blake2 checksum create domain blake2s256 as bytea check (length(value) = 32); -- UNIX path (absolute, relative, individual path component, etc.) create domain unix_path as bytea; -- a set of UNIX-like access permissions, as manipulated by, e.g., chmod create domain file_perms as int; -- an SWHID create domain swhid as text check (value ~ '^swh:[0-9]+:.*'); -- Checksums about actual file content. Note that the content itself is not -- stored in the DB, but on external (key-value) storage. A single checksum is -- used as key there, but the other can be used to verify that we do not inject -- content collisions not knowingly. create table content ( sha1 sha1 not null, sha1_git sha1_git not null, sha256 sha256 not null, blake2s256 blake2s256 not null, length bigint not null, ctime timestamptz not null default now(), -- creation time, i.e. time of (first) injection into the storage status content_status not null default 'visible', object_id bigserial ); comment on table content is 'Checksums of file content which is actually stored externally'; comment on column content.sha1 is 'Content sha1 hash'; comment on column content.sha1_git is 'Git object sha1 hash'; comment on column content.sha256 is 'Content Sha256 hash'; comment on column content.blake2s256 is 'Content blake2s hash'; comment on column content.length is 'Content length'; comment on column content.ctime is 'First seen time'; comment on column content.status is 'Content status (absent, visible, hidden)'; comment on column content.object_id is 'Content identifier'; -- An origin is a place, identified by an URL, where software source code -- artifacts can be found. We support different kinds of origins, e.g., git and -- other VCS repositories, web pages that list tarballs URLs (e.g., -- http://www.kernel.org), indirect tarball URLs (e.g., -- http://www.example.org/latest.tar.gz), etc. The key feature of an origin is -- that it can be *fetched* from (wget, git clone, svn checkout, etc.) to -- retrieve all the contained software. create table origin ( id bigserial not null, url text not null ); comment on column origin.id is 'Artifact origin id'; comment on column origin.url is 'URL of origin'; -- Content blobs observed somewhere, but not ingested into the archive for -- whatever reason. 
This table is separate from the content table as we might -- not have the sha1 checksum of skipped contents (for instance when we inject -- git repositories, objects that are too big will be skipped here, and we will -- only know their sha1_git). 'reason' contains the reason the content was -- skipped. origin is a nullable column allowing to find out which origin -- contains that skipped content. create table skipped_content ( sha1 sha1, sha1_git sha1_git, sha256 sha256, blake2s256 blake2s256, length bigint not null, ctime timestamptz not null default now(), status content_status not null default 'absent', reason text not null, origin bigint, object_id bigserial ); comment on table skipped_content is 'Content blobs observed, but not ingested in the archive'; comment on column skipped_content.sha1 is 'Skipped content sha1 hash'; comment on column skipped_content.sha1_git is 'Git object sha1 hash'; comment on column skipped_content.sha256 is 'Skipped content sha256 hash'; comment on column skipped_content.blake2s256 is 'Skipped content blake2s hash'; comment on column skipped_content.length is 'Skipped content length'; comment on column skipped_content.ctime is 'First seen time'; comment on column skipped_content.status is 'Skipped content status (absent, visible, hidden)'; comment on column skipped_content.reason is 'Reason for skipping'; comment on column skipped_content.origin is 'Origin table identifier'; comment on column skipped_content.object_id is 'Skipped content identifier'; -- A file-system directory. A directory is a list of directory entries (see -- tables: directory_entry_{dir,file}). -- -- To list the contents of a directory: -- 1. list the contained directory_entry_dir using array dir_entries -- 2. list the contained directory_entry_file using array file_entries -- 3. list the contained directory_entry_rev using array rev_entries -- 4. UNION -- -- Synonyms/mappings: -- * git: tree create table directory ( id sha1_git not null, dir_entries bigint[], -- sub-directories, reference directory_entry_dir file_entries bigint[], -- contained files, reference directory_entry_file rev_entries bigint[], -- mounted revisions, reference directory_entry_rev object_id bigserial -- short object identifier ); comment on table directory is 'Contents of a directory, synonymous to tree (git)'; comment on column directory.id is 'Git object sha1 hash'; comment on column directory.dir_entries is 'Sub-directories, reference directory_entry_dir'; comment on column directory.file_entries is 'Contained files, reference directory_entry_file'; comment on column directory.rev_entries is 'Mounted revisions, reference directory_entry_rev'; comment on column directory.object_id is 'Short object identifier'; -- A directory entry pointing to a (sub-)directory. create table directory_entry_dir ( id bigserial, target sha1_git not null, -- id of target directory name unix_path not null, -- path name, relative to containing dir perms file_perms not null -- unix-like permissions ); comment on table directory_entry_dir is 'Directory entry for directory'; comment on column directory_entry_dir.id is 'Directory identifier'; comment on column directory_entry_dir.target is 'Target directory identifier'; comment on column directory_entry_dir.name is 'Path name, relative to containing directory'; comment on column directory_entry_dir.perms is 'Unix-like permissions'; -- A directory entry pointing to a file content. 
create table directory_entry_file ( id bigserial, target sha1_git not null, -- id of target file name unix_path not null, -- path name, relative to containing dir perms file_perms not null -- unix-like permissions ); comment on table directory_entry_file is 'Directory entry for file'; comment on column directory_entry_file.id is 'File identifier'; comment on column directory_entry_file.target is 'Target file identifier'; comment on column directory_entry_file.name is 'Path name, relative to containing directory'; comment on column directory_entry_file.perms is 'Unix-like permissions'; -- A directory entry pointing to a revision. create table directory_entry_rev ( id bigserial, target sha1_git not null, -- id of target revision name unix_path not null, -- path name, relative to containing dir perms file_perms not null -- unix-like permissions ); comment on table directory_entry_rev is 'Directory entry for revision'; comment on column directory_entry_dir.id is 'Revision identifier'; comment on column directory_entry_dir.target is 'Target revision in identifier'; comment on column directory_entry_dir.name is 'Path name, relative to containing directory'; comment on column directory_entry_dir.perms is 'Unix-like permissions'; -- A person referenced by some source code artifacts, e.g., a VCS revision or -- release metadata. create table person ( id bigserial, name bytea, -- advisory: not null if we managed to parse a name email bytea, -- advisory: not null if we managed to parse an email fullname bytea not null -- freeform specification; what is actually used in the checksums -- will usually be of the form 'name <email>' ); comment on table person is 'Person referenced in code artifact release metadata'; comment on column person.id is 'Person identifier'; comment on column person.name is 'Name'; comment on column person.email is 'Email'; comment on column person.fullname is 'Full name (raw name)'; -- The state of a source code tree at a specific point in time. -- -- Synonyms/mappings: -- * git / subversion / etc: commit -- * tarball: a specific tarball -- -- Revisions are organized as DAGs. Each revision points to 0, 1, or more (in -- case of merges) parent revisions. Each revision points to a directory, i.e., -- a file-system tree containing files and directories. create table revision ( id sha1_git not null, date timestamptz, date_offset smallint, committer_date timestamptz, committer_date_offset smallint, type revision_type not null, directory sha1_git, -- source code 'root' directory message bytea, author bigint, committer bigint, synthetic boolean not null default false, -- true iff revision has been created by Software Heritage metadata jsonb, -- extra metadata (tarball checksums, extra commit information, etc...) 
object_id bigserial, date_neg_utc_offset boolean, committer_date_neg_utc_offset boolean, extra_headers bytea[][] -- extra headers (used in hash computation) ); comment on table revision is 'A revision represents the state of a source code tree at a specific point in time'; comment on column revision.id is 'Git-style SHA1 commit identifier'; comment on column revision.date is 'Author timestamp as UNIX epoch'; comment on column revision.date_offset is 'Author timestamp timezone, as minute offsets from UTC'; comment on column revision.date_neg_utc_offset is 'True indicates a -0 UTC offset on author timestamp'; comment on column revision.committer_date is 'Committer timestamp as UNIX epoch'; comment on column revision.committer_date_offset is 'Committer timestamp timezone, as minute offsets from UTC'; comment on column revision.committer_date_neg_utc_offset is 'True indicates a -0 UTC offset on committer timestamp'; comment on column revision.type is 'Type of revision'; comment on column revision.directory is 'Directory identifier'; comment on column revision.message is 'Commit message'; comment on column revision.author is 'Author identity'; comment on column revision.committer is 'Committer identity'; comment on column revision.synthetic is 'True iff revision has been synthesized by Software Heritage'; comment on column revision.metadata is 'Extra revision metadata'; comment on column revision.object_id is 'Non-intrinsic, sequential object identifier'; comment on column revision.extra_headers is 'Extra revision headers; used in revision hash computation'; -- either this table or the sha1_git[] column on the revision table create table revision_history ( id sha1_git not null, parent_id sha1_git not null, parent_rank int not null default 0 -- parent position in merge commits, 0-based ); comment on table revision_history is 'Sequence of revision history with parent and position in history'; comment on column revision_history.id is 'Revision history git object sha1 checksum'; comment on column revision_history.parent_id is 'Parent revision git object identifier'; comment on column revision_history.parent_rank is 'Parent position in merge commits, 0-based'; -- Crawling history of software origins visited by Software Heritage. Each -- visit is a 3-way mapping between a software origin, a timestamp, and a -- snapshot object capturing the full-state of the origin at visit time. create table origin_visit ( origin bigint not null, visit bigint not null, date timestamptz not null, type text not null ); comment on column origin_visit.origin is 'Visited origin'; comment on column origin_visit.visit is 'Sequential visit number for the origin'; comment on column origin_visit.date is 'Visit timestamp'; comment on column origin_visit.type is 'Type of loader that did the visit (hg, git, ...)'; -- Crawling history of software origin visits by Software Heritage. 
Each -- visit see its history change through new origin visit status updates create table origin_visit_status ( origin bigint not null, visit bigint not null, date timestamptz not null, status origin_visit_state not null, metadata jsonb, snapshot sha1_git ); comment on column origin_visit_status.origin is 'Origin concerned by the visit update'; comment on column origin_visit_status.visit is 'Visit concerned by the visit update'; comment on column origin_visit_status.date is 'Visit update timestamp'; comment on column origin_visit_status.status is 'Visit status (ongoing, failed, full)'; comment on column origin_visit_status.metadata is 'Optional origin visit metadata'; comment on column origin_visit_status.snapshot is 'Optional, possibly partial, snapshot of the origin visit. It can be partial.'; -- A snapshot represents the entire state of a software origin as crawled by -- Software Heritage. This table is a simple mapping between (public) intrinsic -- snapshot identifiers and (private) numeric sequential identifiers. create table snapshot ( object_id bigserial not null, -- PK internal object identifier id sha1_git not null -- snapshot intrinsic identifier ); comment on table snapshot is 'State of a software origin as crawled by Software Heritage'; comment on column snapshot.object_id is 'Internal object identifier'; comment on column snapshot.id is 'Intrinsic snapshot identifier'; -- Each snapshot associate "branch" names to other objects in the Software -- Heritage Merkle DAG. This table describes branches as mappings between names -- and target typed objects. create table snapshot_branch ( object_id bigserial not null, -- PK internal object identifier name bytea not null, -- branch name, e.g., "master" or "feature/drag-n-drop" target bytea, -- target object identifier, e.g., a revision identifier target_type snapshot_target -- target object type, e.g., "revision" ); comment on table snapshot_branch is 'Associates branches with objects in Heritage Merkle DAG'; comment on column snapshot_branch.object_id is 'Internal object identifier'; comment on column snapshot_branch.name is 'Branch name'; comment on column snapshot_branch.target is 'Target object identifier'; comment on column snapshot_branch.target_type is 'Target object type'; -- Mapping between snapshots and their branches. create table snapshot_branches ( snapshot_id bigint not null, -- snapshot identifier, ref. snapshot.object_id branch_id bigint not null -- branch identifier, ref. snapshot_branch.object_id ); comment on table snapshot_branches is 'Mapping between snapshot and their branches'; comment on column snapshot_branches.snapshot_id is 'Snapshot identifier'; comment on column snapshot_branches.branch_id is 'Branch identifier'; -- A "memorable" point in time in the development history of a software -- project. 
-- A "memorable" point in time in the development history of a software
-- project.
--
-- Synonyms/mappings:
-- * git: tag (of the annotated kind, otherwise they are just references)
-- * tarball: the release version number
create table release
(
  id                   sha1_git not null,
  target               sha1_git,
  date                 timestamptz,
  date_offset          smallint,
  name                 bytea,
  comment              bytea,
  author               bigint,
  synthetic            boolean not null default false,  -- true iff release has been created by Software Heritage
  object_id            bigserial,
  target_type          object_type not null,
  date_neg_utc_offset  boolean
);

comment on table release is 'Details of a software release, synonymous with a tag (git) or version number (tarball)';
comment on column release.id is 'Release git identifier';
comment on column release.target is 'Target git identifier';
comment on column release.date is 'Release timestamp';
comment on column release.date_offset is 'Timestamp offset from UTC';
comment on column release.name is 'Name';
comment on column release.comment is 'Comment';
comment on column release.author is 'Author';
comment on column release.synthetic is 'Indicates if created by Software Heritage';
comment on column release.object_id is 'Object identifier';
comment on column release.target_type is 'Object type (''content'', ''directory'', ''revision'', ''release'', ''snapshot'')';
comment on column release.date_neg_utc_offset is 'True indicates -0 UTC offset for release timestamp';


-- Tools

create table metadata_fetcher
(
  id        serial not null,
  name      text not null,
  version   text not null,
  metadata  jsonb not null
);

comment on table metadata_fetcher is 'Tools used to retrieve metadata';
comment on column metadata_fetcher.id is 'Internal identifier of the fetcher';
comment on column metadata_fetcher.name is 'Fetcher name';
comment on column metadata_fetcher.version is 'Fetcher version';
comment on column metadata_fetcher.metadata is 'Extra information about the fetcher';


create table metadata_authority
(
  id        serial not null,
  type      text not null,
  url       text not null,
  metadata  jsonb not null
);

comment on table metadata_authority is 'Metadata authority information';
comment on column metadata_authority.id is 'Internal identifier of the authority';
-comment on column metadata_authority.type is 'Type of authority (deposit/forge/registry)';
+comment on column metadata_authority.type is 'Type of authority (deposit_client/forge/registry)';
comment on column metadata_authority.url is 'Authority''s uri';
comment on column metadata_authority.metadata is 'Other metadata about authority';
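-- Illustrative only, not applied by this schema nor by this patch: a fetcher
-- and an authority would be registered with rows such as the following
-- (commented out; the example values mirror the metadata_fetcher_add /
-- metadata_authority_add endpoints described earlier in this document).
--
--   insert into metadata_fetcher (name, version, metadata)
--   values ('swh-deposit', '0.0.1', '{"sword_version": "2"}');
--
--   insert into metadata_authority (type, url, metadata)
--   values ('deposit_client', 'https://hal.inria.fr/', '{"location": "France"}');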
-- Extrinsic metadata on DAG objects and origins.
create table object_metadata
(
  type  text not null,
  id    text not null,

  -- metadata source
  authority_id    bigint not null,
  fetcher_id      bigint not null,
  discovery_date  timestamptz not null,

  -- metadata itself
  format    text not null,
  metadata  bytea not null,

  -- context
  origin     text,
  visit      bigint,
  snapshot   swhid,
  release    swhid,
  revision   swhid,
  path       bytea,
  directory  swhid
);

comment on table object_metadata is 'keeps all metadata found concerning an object';
comment on column object_metadata.type is 'the type of object (content/directory/revision/release/snapshot/origin) the metadata is on';
comment on column object_metadata.id is 'the SWHID or origin URL for which the metadata was found';
comment on column object_metadata.discovery_date is 'the date of retrieval';
comment on column object_metadata.authority_id is 'the metadata provider: github, openhub, deposit, etc.';
comment on column object_metadata.fetcher_id is 'the tool used for extracting metadata: loaders, crawlers, etc.';
comment on column object_metadata.format is 'name of the format of metadata, used by readers to interpret it.';
comment on column object_metadata.metadata is 'original metadata in opaque format';


-- Keep a cache of object counts
create table object_counts
(
  object_type    text,         -- table for which we're counting objects (PK)
  value          bigint,       -- count of objects in the table
  last_update    timestamptz,  -- last update for the object count in this table
  single_update  boolean       -- whether we update this table standalone (true) or through bucketed counts (false)
);

comment on table object_counts is 'Cache of object counts';
comment on column object_counts.object_type is 'Object type (''content'', ''directory'', ''revision'', ''release'', ''snapshot'')';
comment on column object_counts.value is 'Count of objects in the table';
comment on column object_counts.last_update is 'Last update for object count';
comment on column object_counts.single_update is 'standalone (true) or bucketed counts (false)';


create table object_counts_bucketed
(
  line          serial not null,  -- PK
  object_type   text not null,    -- table for which we're counting objects
  identifier    text not null,    -- identifier across which we're bucketing objects
  bucket_start  bytea,            -- lower bound (inclusive) for the bucket
  bucket_end    bytea,            -- upper bound (exclusive) for the bucket
  value         bigint,           -- count of objects in the bucket
  last_update   timestamptz       -- last update for the object count in this bucket
);

comment on table object_counts_bucketed is 'Bucketed count for objects ordered by type';
comment on column object_counts_bucketed.line is 'Auto-incremented identifier value';
comment on column object_counts_bucketed.object_type is 'Object type (''content'', ''directory'', ''revision'', ''release'', ''snapshot'')';
comment on column object_counts_bucketed.identifier is 'Common identifier for bucketed objects';
comment on column object_counts_bucketed.bucket_start is 'Lower bound (inclusive) for the bucket';
comment on column object_counts_bucketed.bucket_end is 'Upper bound (exclusive) for the bucket';
comment on column object_counts_bucketed.value is 'Count of objects in the bucket';
comment on column object_counts_bucketed.last_update is 'Last update for the object count in this bucket';
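-- Illustrative only, not part of the schema nor of this patch: retrieving all
-- metadata recorded for one origin from the object_metadata table above,
-- together with the authority and fetcher it came from. The origin URL is a
-- hypothetical placeholder.
select om.discovery_date, ma.type, ma.url, mf.name, mf.version, om.format, om.metadata
from object_metadata om
join metadata_authority ma on ma.id = om.authority_id
join metadata_fetcher mf on mf.id = om.fetcher_id
where om.type = 'origin'
  and om.id = 'https://forge.example.org/project'
order by om.discovery_date;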
diff --git a/swh/storage/tests/storage_data.py b/swh/storage/tests/storage_data.py
index ccf25ab8..3c8e2337 100644
--- a/swh/storage/tests/storage_data.py
+++ b/swh/storage/tests/storage_data.py
@@ -1,597 +1,597 @@
# Copyright (C) 2015-2019 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import datetime

import attr

from swh.model.hashutil import hash_to_bytes, hash_to_hex
from swh.model import from_disk
from swh.model.identifiers import parse_swhid
from swh.model.model import (
    MetadataAuthority,
    MetadataAuthorityType,
    MetadataFetcher,
    RawExtrinsicMetadata,
    MetadataTargetType,
)


class StorageData:
    def __getattr__(self, key):
        try:
            v = globals()[key]
        except KeyError as e:
            raise AttributeError(e.args[0])
        if hasattr(v, "copy"):
            return v.copy()
        return v


data = StorageData()


cont = {
    "data": b"42\n",
    "length": 3,
    "sha1": hash_to_bytes("34973274ccef6ab4dfaaf86599792fa9c3fe4689"),
    "sha1_git": hash_to_bytes("d81cc0710eb6cf9efd5b920a8453e1e07157b6cd"),
    "sha256": hash_to_bytes("673650f936cb3b0a2f93ce09d81be10748b1b203c19e8176b4eefc1964a0cf3a"),
    "blake2s256": hash_to_bytes("d5fe1939576527e42cfd76a9455a2432fe7f56669564577dd93c4280e76d661d"),
    "status": "visible",
}

cont2 = {
    "data": b"4242\n",
    "length": 5,
    "sha1": hash_to_bytes("61c2b3a30496d329e21af70dd2d7e097046d07b7"),
    "sha1_git": hash_to_bytes("36fade77193cb6d2bd826161a0979d64c28ab4fa"),
    "sha256": hash_to_bytes("859f0b154fdb2d630f45e1ecae4a862915435e663248bb8461d914696fc047cd"),
    "blake2s256": hash_to_bytes("849c20fad132b7c2d62c15de310adfe87be94a379941bed295e8141c6219810d"),
    "status": "visible",
}

cont3 = {
    "data": b"424242\n",
    "length": 7,
    "sha1": hash_to_bytes("3e21cc4942a4234c9e5edd8a9cacd1670fe59f13"),
    "sha1_git": hash_to_bytes("c932c7649c6dfa4b82327d121215116909eb3bea"),
    "sha256": hash_to_bytes("92fb72daf8c6818288a35137b72155f507e5de8d892712ab96277aaed8cf8a36"),
    "blake2s256": hash_to_bytes("76d0346f44e5a27f6bafdd9c2befd304aff83780f93121d801ab6a1d4769db11"),
    "status": "visible",
    "ctime": "2019-12-01 00:00:00Z",
}

contents = (cont, cont2, cont3)

missing_cont = {
    "length": 8,
    "sha1": hash_to_bytes("f9c24e2abb82063a3ba2c44efd2d3c797f28ac90"),
    "sha1_git": hash_to_bytes("33e45d56f88993aae6a0198013efa80716fd8919"),
    "sha256": hash_to_bytes("6bbd052ab054ef222c1c87be60cd191addedd24cc882d1f5f7f7be61dc61bb3a"),
    "blake2s256": hash_to_bytes("306856b8fd879edb7b6f1aeaaf8db9bbecc993cd7f776c333ac3a782fa5c6eba"),
    "reason": "Content too long",
    "status": "absent",
}

skipped_cont = {
    "length": 1024 * 1024 * 200,
    "sha1_git": hash_to_bytes("33e45d56f88993aae6a0198013efa80716fd8920"),
    "sha1": hash_to_bytes("43e45d56f88993aae6a0198013efa80716fd8920"),
    "sha256": hash_to_bytes("7bbd052ab054ef222c1c87be60cd191addedd24cc882d1f5f7f7be61dc61bb3a"),
    "blake2s256": hash_to_bytes("ade18b1adecb33f891ca36664da676e12c772cc193778aac9a137b8dc5834b9b"),
    "reason": "Content too long",
    "status": "absent",
    "origin": "file:///dev/zero",
}

skipped_cont2 = {
    "length": 1024 * 1024 * 300,
    "sha1_git": hash_to_bytes("44e45d56f88993aae6a0198013efa80716fd8921"),
    "sha1": hash_to_bytes("54e45d56f88993aae6a0198013efa80716fd8920"),
    "sha256": hash_to_bytes("8cbd052ab054ef222c1c87be60cd191addedd24cc882d1f5f7f7be61dc61bb3a"),
    "blake2s256": hash_to_bytes("9ce18b1adecb33f891ca36664da676e12c772cc193778aac9a137b8dc5834b9b"),
    "reason": "Content too long",
    "status": "absent",
}

skipped_contents = (skipped_cont, skipped_cont2)
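# Illustrative sketch (not part of the fixtures nor of this patch): how the
# checksum fields above relate to the raw bytes. "sha1" is a plain SHA1 of the
# data, while "sha1_git" is the Git blob hash, i.e. SHA1 over a
# "blob <length>\0" header followed by the data. For the `cont` fixture:
#
#   import hashlib
#   raw = b"42\n"
#   hashlib.sha1(raw).hexdigest()
#   # -> '34973274ccef6ab4dfaaf86599792fa9c3fe4689'  (cont["sha1"])
#   hashlib.sha1(b"blob %d\x00" % len(raw) + raw).hexdigest()
#   # -> 'd81cc0710eb6cf9efd5b920a8453e1e07157b6cd'  (cont["sha1_git"])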
b"12345678901234567890", "perms": from_disk.DentryPerms.directory, }, ), } dir2 = { "id": hash_to_bytes("8505808532953da7d2581741f01b29c04b1cb9ab"), "entries": ( { "name": b"oof", "type": "file", "target": hash_to_bytes( # cont2 "36fade77193cb6d2bd826161a0979d64c28ab4fa" ), "perms": from_disk.DentryPerms.content, }, ), } dir3 = { "id": hash_to_bytes("4ea8c6b2f54445e5dd1a9d5bb2afd875d66f3150"), "entries": ( { "name": b"foo", "type": "file", "target": hash_to_bytes("d81cc0710eb6cf9efd5b920a8453e1e07157b6cd"), # cont "perms": from_disk.DentryPerms.content, }, { "name": b"subdir", "type": "dir", "target": hash_to_bytes("34f335a750111ca0a8b64d8034faec9eedc396be"), # dir "perms": from_disk.DentryPerms.directory, }, { "name": b"hello", "type": "file", "target": b"12345678901234567890", "perms": from_disk.DentryPerms.content, }, ), } dir4 = { "id": hash_to_bytes("377aa5fcd944fbabf502dbfda55cd14d33c8c3c6"), "entries": ( { "name": b"subdir1", "type": "dir", "target": hash_to_bytes("4ea8c6b2f54445e5dd1a9d5bb2afd875d66f3150"), # dir3 "perms": from_disk.DentryPerms.directory, }, ), } directories = (dir, dir2, dir3, dir4) minus_offset = datetime.timezone(datetime.timedelta(minutes=-120)) plus_offset = datetime.timezone(datetime.timedelta(minutes=120)) revision = { "id": hash_to_bytes("066b1b62dbfa033362092af468bf6cfabec230e7"), "message": b"hello", "author": { "name": b"Nicolas Dandrimont", "email": b"nicolas@example.com", "fullname": b"Nicolas Dandrimont <nicolas@example.com> ", }, "date": { "timestamp": {"seconds": 1234567890, "microseconds": 0}, "offset": 120, "negative_utc": False, }, "committer": { "name": b"St\xc3fano Zacchiroli", "email": b"stefano@example.com", "fullname": b"St\xc3fano Zacchiroli <stefano@example.com>", }, "committer_date": { "timestamp": {"seconds": 1123456789, "microseconds": 0}, "offset": 0, "negative_utc": True, }, "parents": (b"01234567890123456789", b"23434512345123456789"), "type": "git", "directory": hash_to_bytes("34f335a750111ca0a8b64d8034faec9eedc396be"), # dir "metadata": { "checksums": {"sha1": "tarball-sha1", "sha256": "tarball-sha256",}, "signed-off-by": "some-dude", }, "extra_headers": ( (b"gpgsig", b"test123"), (b"mergetag", b"foo\\bar"), (b"mergetag", b"\x22\xaf\x89\x80\x01\x00"), ), "synthetic": True, } revision2 = { "id": hash_to_bytes("df7a6f6a99671fb7f7343641aff983a314ef6161"), "message": b"hello again", "author": { "name": b"Roberto Dicosmo", "email": b"roberto@example.com", "fullname": b"Roberto Dicosmo <roberto@example.com>", }, "date": { "timestamp": {"seconds": 1234567843, "microseconds": 220000,}, "offset": -720, "negative_utc": False, }, "committer": { "name": b"tony", "email": b"ar@dumont.fr", "fullname": b"tony <ar@dumont.fr>", }, "committer_date": { "timestamp": {"seconds": 1123456789, "microseconds": 0}, "offset": 0, "negative_utc": False, }, "parents": (b"01234567890123456789",), "type": "git", "directory": hash_to_bytes("8505808532953da7d2581741f01b29c04b1cb9ab"), # dir2 "metadata": None, "extra_headers": (), "synthetic": False, } revision3 = { "id": hash_to_bytes("2cbd7bb22c653bbb23a29657852a50a01b591d46"), "message": b"a simple revision with no parents this time", "author": { "name": b"Roberto Dicosmo", "email": b"roberto@example.com", "fullname": b"Roberto Dicosmo <roberto@example.com>", }, "date": { "timestamp": {"seconds": 1234567843, "microseconds": 220000,}, "offset": -720, "negative_utc": False, }, "committer": { "name": b"tony", "email": b"ar@dumont.fr", "fullname": b"tony <ar@dumont.fr>", }, "committer_date": { "timestamp": 
{"seconds": 1127351742, "microseconds": 0}, "offset": 0, "negative_utc": False, }, "parents": (), "type": "git", "directory": hash_to_bytes("8505808532953da7d2581741f01b29c04b1cb9ab"), # dir2 "metadata": None, "extra_headers": (), "synthetic": True, } revision4 = { "id": hash_to_bytes("88cd5126fc958ed70089d5340441a1c2477bcc20"), "message": b"parent of self.revision2", "author": { "name": b"me", "email": b"me@soft.heri", "fullname": b"me <me@soft.heri>", }, "date": { "timestamp": {"seconds": 1244567843, "microseconds": 220000,}, "offset": -720, "negative_utc": False, }, "committer": { "name": b"committer-dude", "email": b"committer@dude.com", "fullname": b"committer-dude <committer@dude.com>", }, "committer_date": { "timestamp": {"seconds": 1244567843, "microseconds": 220000,}, "offset": -720, "negative_utc": False, }, "parents": ( hash_to_bytes("2cbd7bb22c653bbb23a29657852a50a01b591d46"), ), # revision3 "type": "git", "directory": hash_to_bytes("34f335a750111ca0a8b64d8034faec9eedc396be"), # dir "metadata": None, "extra_headers": (), "synthetic": False, } revisions = (revision, revision2, revision3, revision4) origin = { "url": "file:///dev/null", } origin2 = { "url": "file:///dev/zero", } origins = (origin, origin2) metadata_authority = MetadataAuthority( - type=MetadataAuthorityType.DEPOSIT, + type=MetadataAuthorityType.DEPOSIT_CLIENT, url="http://hal.inria.example.com/", metadata={"location": "France"}, ) metadata_authority2 = MetadataAuthority( type=MetadataAuthorityType.REGISTRY, url="http://wikidata.example.com/", metadata={}, ) metadata_fetcher = MetadataFetcher( name="swh-deposit", version="0.0.1", metadata={"sword_version": "2"}, ) metadata_fetcher2 = MetadataFetcher(name="swh-example", version="0.0.1", metadata={},) date_visit1 = datetime.datetime(2015, 1, 1, 23, 0, 0, tzinfo=datetime.timezone.utc) type_visit1 = "git" date_visit2 = datetime.datetime(2017, 1, 1, 23, 0, 0, tzinfo=datetime.timezone.utc) type_visit2 = "hg" date_visit3 = datetime.datetime(2018, 1, 1, 23, 0, 0, tzinfo=datetime.timezone.utc) type_visit3 = "deb" origin_visit = { "origin": origin["url"], "visit": 1, "date": date_visit1, "type": type_visit1, } origin_visit2 = { "origin": origin["url"], "visit": 2, "date": date_visit2, "type": type_visit1, } origin_visit3 = { "origin": origin2["url"], "visit": 1, "date": date_visit1, "type": type_visit2, } origin_visits = [origin_visit, origin_visit2, origin_visit3] release = { "id": hash_to_bytes("a673e617fcc6234e29b2cad06b8245f96c415c61"), "name": b"v0.0.1", "author": { "name": b"olasd", "email": b"nic@olasd.fr", "fullname": b"olasd <nic@olasd.fr>", }, "date": { "timestamp": {"seconds": 1234567890, "microseconds": 0}, "offset": 42, "negative_utc": False, }, "target": b"43210987654321098765", "target_type": "revision", "message": b"synthetic release", "synthetic": True, } release2 = { "id": hash_to_bytes("6902bd4c82b7d19a421d224aedab2b74197e420d"), "name": b"v0.0.2", "author": { "name": b"tony", "email": b"ar@dumont.fr", "fullname": b"tony <ar@dumont.fr>", }, "date": { "timestamp": {"seconds": 1634366813, "microseconds": 0}, "offset": -120, "negative_utc": False, }, "target": b"432109\xa9765432\xc309\x00765", "target_type": "revision", "message": b"v0.0.2\nMisc performance improvements + bug fixes", "synthetic": False, } release3 = { "id": hash_to_bytes("3e9050196aa288264f2a9d279d6abab8b158448b"), "name": b"v0.0.2", "author": { "name": b"tony", "email": b"tony@ardumont.fr", "fullname": b"tony <tony@ardumont.fr>", }, "date": { "timestamp": {"seconds": 1634336813, 
"microseconds": 0}, "offset": 0, "negative_utc": False, }, "target": hash_to_bytes("df7a6f6a99671fb7f7343641aff983a314ef6161"), "target_type": "revision", "message": b"yet another synthetic release", "synthetic": True, } releases = (release, release2, release3) snapshot = { "id": hash_to_bytes("409ee1ff3f10d166714bc90581debfd0446dda57"), "branches": { b"master": { "target": hash_to_bytes("066b1b62dbfa033362092af468bf6cfabec230e7"), "target_type": "revision", }, }, } empty_snapshot = { "id": hash_to_bytes("1a8893e6a86f444e8be8e7bda6cb34fb1735a00e"), "branches": {}, } complete_snapshot = { "id": hash_to_bytes("a56ce2d81c190023bb99a3a36279307522cb85f6"), "branches": { b"directory": { "target": hash_to_bytes("1bd0e65f7d2ff14ae994de17a1e7fe65111dcad8"), "target_type": "directory", }, b"directory2": { "target": hash_to_bytes("1bd0e65f7d2ff14ae994de17a1e7fe65111dcad8"), "target_type": "directory", }, b"content": { "target": hash_to_bytes("fe95a46679d128ff167b7c55df5d02356c5a1ae1"), "target_type": "content", }, b"alias": {"target": b"revision", "target_type": "alias",}, b"revision": { "target": hash_to_bytes("aafb16d69fd30ff58afdd69036a26047f3aebdc6"), "target_type": "revision", }, b"release": { "target": hash_to_bytes("7045404f3d1c54e6473c71bbb716529fbad4be24"), "target_type": "release", }, b"snapshot": { "target": hash_to_bytes("1a8893e6a86f444e8be8e7bda6cb34fb1735a00e"), "target_type": "snapshot", }, b"dangling": None, }, } snapshots = (snapshot, empty_snapshot, complete_snapshot) content_metadata = RawExtrinsicMetadata( type=MetadataTargetType.CONTENT, id=parse_swhid(f"swh:1:cnt:{hash_to_hex(cont['sha1_git'])}"), origin=origin["url"], discovery_date=datetime.datetime( 2015, 1, 1, 21, 0, 0, tzinfo=datetime.timezone.utc ), authority=attr.evolve(metadata_authority, metadata=None), fetcher=attr.evolve(metadata_fetcher, metadata=None), format="json", metadata=b'{"foo": "bar"}', ) content_metadata2 = RawExtrinsicMetadata( type=MetadataTargetType.CONTENT, id=parse_swhid(f"swh:1:cnt:{hash_to_hex(cont['sha1_git'])}"), origin=origin2["url"], discovery_date=datetime.datetime( 2017, 1, 1, 22, 0, 0, tzinfo=datetime.timezone.utc ), authority=attr.evolve(metadata_authority, metadata=None), fetcher=attr.evolve(metadata_fetcher, metadata=None), format="yaml", metadata=b"foo: bar", ) content_metadata3 = RawExtrinsicMetadata( type=MetadataTargetType.CONTENT, id=parse_swhid(f"swh:1:cnt:{hash_to_hex(cont['sha1_git'])}"), discovery_date=datetime.datetime( 2017, 1, 1, 22, 0, 0, tzinfo=datetime.timezone.utc ), authority=attr.evolve(metadata_authority2, metadata=None), fetcher=attr.evolve(metadata_fetcher2, metadata=None), format="yaml", metadata=b"foo: bar", origin=origin["url"], visit=42, snapshot=parse_swhid(f"swh:1:snp:{hash_to_hex(snapshot['id'])}"), release=parse_swhid(f"swh:1:rel:{hash_to_hex(release['id'])}"), revision=parse_swhid(f"swh:1:rev:{hash_to_hex(revision['id'])}"), directory=parse_swhid(f"swh:1:dir:{hash_to_hex(dir['id'])}"), path=b"/foo/bar", ) origin_metadata = RawExtrinsicMetadata( type=MetadataTargetType.ORIGIN, id=origin["url"], discovery_date=datetime.datetime( 2015, 1, 1, 21, 0, 0, tzinfo=datetime.timezone.utc ), authority=attr.evolve(metadata_authority, metadata=None), fetcher=attr.evolve(metadata_fetcher, metadata=None), format="json", metadata=b'{"foo": "bar"}', ) origin_metadata2 = RawExtrinsicMetadata( type=MetadataTargetType.ORIGIN, id=origin["url"], discovery_date=datetime.datetime( 2017, 1, 1, 22, 0, 0, tzinfo=datetime.timezone.utc ), authority=attr.evolve(metadata_authority, 
origin_metadata2 = RawExtrinsicMetadata(
    type=MetadataTargetType.ORIGIN,
    id=origin["url"],
    discovery_date=datetime.datetime(2017, 1, 1, 22, 0, 0, tzinfo=datetime.timezone.utc),
    authority=attr.evolve(metadata_authority, metadata=None),
    fetcher=attr.evolve(metadata_fetcher, metadata=None),
    format="yaml",
    metadata=b"foo: bar",
)

origin_metadata3 = RawExtrinsicMetadata(
    type=MetadataTargetType.ORIGIN,
    id=origin["url"],
    discovery_date=datetime.datetime(2017, 1, 1, 22, 0, 0, tzinfo=datetime.timezone.utc),
    authority=attr.evolve(metadata_authority2, metadata=None),
    fetcher=attr.evolve(metadata_fetcher2, metadata=None),
    format="yaml",
    metadata=b"foo: bar",
)

person = {
    "name": b"John Doe",
    "email": b"john.doe@institute.org",
    "fullname": b"John Doe <john.doe@institute.org>",
}

objects = {
    "content": contents,
    "skipped_content": skipped_contents,
    "directory": directories,
    "revision": revisions,
    "origin": origins,
    "release": releases,
    "snapshot": snapshots,
}
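# Illustrative usage note (not part of the fixtures nor of this patch):
# attribute access on `data` goes through StorageData.__getattr__, which
# returns a copy of the module-level value whenever it has a `copy` method,
# so tests may mutate what they get back without corrupting the shared
# fixtures. For example:
#
#   c = data.cont           # fresh copy of the `cont` dict defined above
#   c["status"] = "hidden"  # only affects this copy
#   assert cont["status"] == "visible"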