diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 250b735e..1ead4950 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,41 +1,41 @@ repos: - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v4.1.0 + rev: v4.3.0 hooks: - id: trailing-whitespace - id: check-json - id: check-yaml - - repo: https://gitlab.com/pycqa/flake8 - rev: 4.0.1 + - repo: https://github.com/pycqa/flake8 + rev: 5.0.4 hooks: - id: flake8 - additional_dependencies: [flake8-bugbear==22.3.23] + additional_dependencies: [flake8-bugbear==22.9.23] - repo: https://github.com/codespell-project/codespell - rev: v2.1.0 + rev: v2.2.2 hooks: - id: codespell name: Check source code spelling args: [-L iff, -L gae, -L sur] stages: [commit] - repo: local hooks: - id: mypy name: mypy entry: mypy args: [swh] pass_filenames: false language: system types: [python] - repo: https://github.com/PyCQA/isort rev: 5.10.1 hooks: - id: isort - repo: https://github.com/python/black - rev: 22.3.0 + rev: 22.10.0 hooks: - id: black

diff --git a/PKG-INFO b/PKG-INFO index 58c4f10c..d94d6ef5 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,246 +1,246 @@ Metadata-Version: 2.1 Name: swh.storage -Version: 1.7.0 +Version: 1.7.1 Summary: Software Heritage storage manager Home-page: https://forge.softwareheritage.org/diffusion/DSTO/ Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-storage Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-storage/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing Provides-Extra: journal License-File: LICENSE License-File: AUTHORS

swh-storage
===========

Abstraction layer over the archive, allowing access to all stored source code artifacts as well as their metadata.

See the [documentation](https://docs.softwareheritage.org/devel/swh-storage/index.html) for more details.

## Quick start

### Dependencies

Python tests for this module include tests that cannot be run without a local PostgreSQL database, so you need the PostgreSQL server executable on your machine (no need to have a running PostgreSQL server). They also expect a Cassandra server.

#### Debian-like host

```
$ sudo apt install libpq-dev postgresql-11 cassandra
```

#### Non Debian-like host

The tests expect the path to the `cassandra` executable either to be unspecified, in which case it is looked up at `/usr/sbin/cassandra`, or to be specified through the `SWH_CASSANDRA_BIN` environment variable.

Optionally, you can avoid running the Cassandra tests:

```
(swh) :~/swh-storage$ tox -- -m 'not cassandra'
```

### Installation

It is strongly recommended to use a virtualenv. In the following, we assume you work in a virtualenv named `swh`. See the [developer setup guide](https://docs.softwareheritage.org/devel/developer-setup.html#developer-setup) for more details on how to set up a working environment.

You can install the package directly from [PyPI](https://pypi.org/p/swh.storage):

```
(swh) :~$ pip install swh.storage
[...]
```

Or from sources:

```
(swh) :~$ git clone https://forge.softwareheritage.org/source/swh-storage.git
[...]
(swh) :~$ cd swh-storage
(swh) :~/swh-storage$ pip install .
[...]
```

Then you can check that it is properly installed:

```
(swh) :~$ swh storage --help
Usage: swh storage [OPTIONS] COMMAND [ARGS]...

  Software Heritage Storage tools.

Options:
  -h, --help  Show this message and exit.

Commands:
  rpc-serve  Software Heritage Storage RPC server.
```

## Tests

The best way of running Python tests for this module is to use [tox](https://tox.readthedocs.io/).

```
(swh) :~$ pip install tox
```

### tox

From the sources directory, simply use tox:

```
(swh) :~/swh-storage$ tox
[...]
========= 315 passed, 6 skipped, 15 warnings in 40.86 seconds ==========
_______________________________ summary ________________________________
  flake8: commands succeeded
  py3: commands succeeded
  congratulations :)
```

Note: it is possible to set the `JAVA_HOME` environment variable to specify the version of the JVM to be used by Cassandra. For example, at the time of writing this, Cassandra does not support Java 14, so one may want to use Java 11 instead:

```
(swh) :~/swh-storage$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
(swh) :~/swh-storage$ tox
[...]
```

## Development

The storage server can be started locally. It requires a configuration file and a running PostgreSQL database.

### Sample configuration

A typical `storage.yml` configuration file is:

```
storage:
  cls: postgresql
  db: "dbname=softwareheritage-dev user= password="
  objstorage:
    cls: pathslicing
    root: /tmp/swh-storage/
    slicing: 0:2/2:4/4:6
```

which means this uses:

- a local storage instance whose database connection is to the local `softwareheritage-dev` instance,
- a local objstorage instance whose:
  - `root` path is /tmp/swh-storage,
  - slicing scheme is `0:2/2:4/4:6`. This means that the content identifier (a sha1) determines where the raw content is stored on disk: the first directory level is named after its first 2 hex characters, the second level after the next 2, the third level after the next 2, and the file holding the raw content is named after the complete hash. For example, 00062f8bd330715c4f819373653d97b3cd34394c will be stored at 00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c.

Note that the `root` path should exist on disk before starting the server.
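To make the slicing rule concrete, here is a minimal sketch of how such a path can be computed from a hex digest. This is illustrative only; the real implementation lives in the `pathslicing` objstorage backend, and the `sliced_path` helper below is hypothetical:

```
import os

def sliced_path(root: str, hexdigest: str, slicing: str = "0:2/2:4/4:6") -> str:
    """Compute the on-disk path of a content from its hex digest,
    following a pathslicing scheme such as '0:2/2:4/4:6'."""
    levels = []
    for bounds in slicing.split("/"):
        start, stop = (int(i) for i in bounds.split(":"))
        levels.append(hexdigest[start:stop])
    return os.path.join(root, *levels, hexdigest)

print(sliced_path("/tmp/swh-storage", "00062f8bd330715c4f819373653d97b3cd34394c"))
# -> /tmp/swh-storage/00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c
```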

### Starting the storage server

If the Python package has been properly installed (e.g. in a virtualenv), you should be able to use the command:

```
(swh) :~/swh-storage$ swh storage rpc-serve storage.yml
```

This runs a local swh-storage API on port 5002.

```
(swh) :~/swh-storage$ curl http://127.0.0.1:5002
Software Heritage storage server

You have reached the Software Heritage storage server.
See its documentation and API for more information

```
### And then what?

In your upper layer ([loader-git](https://forge.softwareheritage.org/source/swh-loader-git/), [loader-svn](https://forge.softwareheritage.org/source/swh-loader-svn/), etc.), you can define a remote storage with this snippet of YAML configuration:

```
storage:
  cls: remote
  url: http://localhost:5002/
```
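The same remote storage can also be instantiated directly from Python. A minimal sketch, assuming the `get_storage` factory exposed by the `swh.storage` package, a server listening on the URL above, and a `check_config` method on the storage interface:

```
from swh.storage import get_storage

# A client proxying every storage call over HTTP to the local RPC server.
storage = get_storage(cls="remote", url="http://localhost:5002/")

# Any method of the storage interface can now be called remotely, e.g.:
print(storage.check_config(check_write=False))
```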
You could also directly define a PostgreSQL storage with the following snippet:

```
storage:
  cls: postgresql
  db: service=swh-dev
  objstorage:
    cls: pathslicing
    root: /home/storage/swh-storage/
    slicing: 0:2/2:4/4:6
```

## Cassandra

As an alternative to PostgreSQL, swh-storage can use Cassandra as a database backend. It can be used like this:

```
storage:
  cls: cassandra
  hosts:
    - localhost
  objstorage:
    cls: pathslicing
    root: /home/storage/swh-storage/
    slicing: 0:2/2:4/4:6
```

The Cassandra swh-storage implementation supports both Cassandra >= 4.0-alpha2 and ScyllaDB >= 4.4 (and possibly earlier versions, but this is untested). While the main code supports both transparently, running tests or configuring the schema requires specific code when using ScyllaDB, enabled by setting the `SWH_USE_SCYLLADB=1` environment variable.

diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst index 9d225230..d1b0b8e4 100644 --- a/docs/extrinsic-metadata-specification.rst +++ b/docs/extrinsic-metadata-specification.rst @@ -1,362 +1,365 @@

:orphan:

.. _extrinsic-metadata-specification:

Extrinsic metadata specification
================================

:term:`Extrinsic metadata` is information about software that is not part of the source code itself but still closely related to the software. Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information available at source code archival time.

Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed.

This specification assumes the reader is familiar with Software Heritage's :ref:`architecture` and :ref:`data-model`.

Metadata sources
----------------

Authorities
^^^^^^^^^^^

Metadata authorities are entities that provide metadata about an :term:`origin`. Metadata authorities include: code hosting places, :term:`deposit` submitters, and registries (e.g. Wikidata).

An authority is uniquely defined by these properties:

* its type, representing the kind of authority, which is one of these values:

  * ``deposit_client``, for metadata pushed to Software Heritage at the same time as a software artifact
  * ``forge``, for metadata pulled from the same source as the one hosting the software artifacts (which includes package managers)
  * ``registry``, for metadata pulled from a third party

* its URL, which unambiguously identifies an instance of the authority type.

Examples:

=============== =================================
type            url
=============== =================================
deposit_client  https://hal.archives-ouvertes.fr/
deposit_client  https://hal.inria.fr/
deposit_client  https://software.intel.com/
forge           https://gitlab.com/
forge           https://gitlab.inria.fr/
forge           https://0xacab.org/
forge           https://github.com/
registry        https://www.wikidata.org/
registry        https://swmath.org/
registry        https://ascl.net/
=============== =================================

Metadata fetchers
^^^^^^^^^^^^^^^^^

Metadata fetchers are software components used to fetch metadata from a metadata authority and ingest them into the Software Heritage archive.

A metadata fetcher is uniquely defined by these properties:

* its type
* its version

Examples:

* :term:`loaders <loader>`, which may either discover metadata as a side-effect of loading source code, or be dedicated to fetching metadata.
* :term:`listers <lister>`, which may discover metadata as a side-effect of discovering origins.
* :term:`deposit` submitters, which push metadata to SWH from a third party, usually at the same time as a :term:`software artifact`.
* crawlers, which fetch metadata from an authority in a way that is none of the above (e.g. by querying a specific API of the origin's forge).

Storage API
-----------

Authorities and metadata fetchers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Data model
~~~~~~~~~~

The :term:`storage` API uses these structures to represent metadata authorities and metadata fetchers (simplified Python code)::

    class MetadataAuthorityType(Enum):
        DEPOSIT_CLIENT = "deposit_client"
        FORGE = "forge"
        REGISTRY = "registry"

    class MetadataAuthority(BaseModel):
        """Represents an entity that provides metadata about an origin or
        software artifact."""

        object_type = "metadata_authority"

        type: MetadataAuthorityType
        url: str

    class MetadataFetcher(BaseModel):
        """Represents a software component used to fetch metadata from a metadata
        authority, and ingest them into the Software Heritage archive."""

        object_type = "metadata_fetcher"

        name: str
        version: str

Storage API
~~~~~~~~~~~

* ``metadata_authority_add(authorities: List[MetadataAuthority])`` which adds a list of ``MetadataAuthority`` to the storage.
* ``metadata_authority_get(type: MetadataAuthorityType, url: str) -> Optional[MetadataAuthority]`` which looks up a known authority (there is at most one) and, if it is known, returns the corresponding ``MetadataAuthority``.
* ``metadata_fetcher_add(fetchers: List[MetadataFetcher])`` which adds a list of ``MetadataFetcher`` to the storage.
* ``metadata_fetcher_get(name: str, version: str) -> Optional[MetadataFetcher]`` which looks up a known fetcher (there is at most one) and, if it is known, returns the corresponding ``MetadataFetcher``.
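For illustration, these four endpoints could be exercised as follows, given a ``storage`` object implementing this interface (a sketch; it assumes the model classes above are importable from ``swh.model.model``, and that re-adding an already known value is a no-op)::

    from swh.model.model import (
        MetadataAuthority,
        MetadataAuthorityType,
        MetadataFetcher,
    )

    authority = MetadataAuthority(
        type=MetadataAuthorityType.DEPOSIT_CLIENT,
        url="https://hal.archives-ouvertes.fr/",
    )
    fetcher = MetadataFetcher(name="swh-deposit", version="1.0.0")

    # Authorities and fetchers must be registered before metadata
    # objects can reference them.
    storage.metadata_authority_add([authority])
    storage.metadata_fetcher_add([fetcher])

    # Look them up again by their defining properties.
    assert storage.metadata_authority_get(
        MetadataAuthorityType.DEPOSIT_CLIENT, "https://hal.archives-ouvertes.fr/"
    ) == authority
    assert storage.metadata_fetcher_get("swh-deposit", "1.0.0") == fetcher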
Artifact metadata
^^^^^^^^^^^^^^^^^

Data model
~~~~~~~~~~

The storage database stores metadata on origins, and on all software artifacts supported by the data model. They are represented using this structure (simplified Python code)::

    class RawExtrinsicMetadata(HashableObject, BaseModel):
        object_type = "raw_extrinsic_metadata"

        # target object
        target: ExtendedSWHID

        # source
        discovery_date: datetime.datetime
        authority: MetadataAuthority
        fetcher: MetadataFetcher

        # the metadata itself
        format: str
        metadata: bytes

        # context
        origin: Optional[str] = None
        visit: Optional[int] = None
        snapshot: Optional[CoreSWHID] = None
        release: Optional[CoreSWHID] = None
        revision: Optional[CoreSWHID] = None
        path: Optional[bytes] = None
        directory: Optional[CoreSWHID] = None

        id: Sha1Git

The ``target`` may be:

* a regular :ref:`core SWHID`,
* a SWHID-like string with type ``ori`` and the SHA1 of an origin URL,
* a SWHID-like string with type ``emd`` and the SHA1 of another ``RawExtrinsicMetadata`` object (to represent metadata on metadata objects).

``id`` is a sha1 hash of the ``RawExtrinsicMetadata`` object itself; it may be used as the ``target`` of other ``RawExtrinsicMetadata`` objects.

``discovery_date`` is a Python datetime. ``authority`` must be a dict containing keys ``type`` and ``url``, and ``fetcher`` a dict containing keys ``name`` and ``version``. The authority and fetcher must be known to the storage before using this endpoint. ``format`` is a text field indicating the format of the content of the ``metadata`` byte string, see `extrinsic-metadata-formats`_. ``metadata`` is a byte array. Its format is specific to each authority and is treated as an opaque value by the storage. Unifying these various formats into a common language is outside the scope of this specification.

Finally, the remaining fields allow metadata to be given on a specific artifact within a specified context (for example: a directory in a specific revision from a specific visit on a specific origin), which will be stored along with the metadata itself. For example, two origins may develop the same file independently; the information about authorship, licensing or even description may vary for the same artifact in different contexts. This is why it is important to qualify the metadata with the complete context for which it is intended, if any.

The allowed context fields for each ``target`` type are:

* for ``emd`` (extrinsic metadata) and ``ori`` (origin): none
* for ``snp`` (snapshot): ``origin`` (a URL) and ``visit`` (an integer)
* for ``rel`` (release): those above, plus ``snapshot`` (the core SWHID of a snapshot)
* for ``rev`` (revision): all those above, plus ``release`` (the core SWHID of a release)
* for ``dir`` (directory): all those above, plus ``revision`` (the core SWHID of a revision) and ``path`` (a byte string), representing the path to this directory from the root of the ``revision``
* for ``cnt`` (content): all those above, plus ``directory`` (the core SWHID of a directory)

All keys are optional, but should be provided whenever possible. The dictionary may be empty if the metadata is fully independent from context. In all cases, ``visit`` should only be provided if ``origin`` is (as visit ids are only unique with respect to an origin).
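For instance, metadata on a directory, qualified by the origin and revision it was found in, could be built like this (a sketch; it assumes these classes are importable from ``swh.model.model`` and ``swh.model.swhids``, and uses placeholder hashes and URLs)::

    import datetime

    from swh.model.model import (
        MetadataAuthority,
        MetadataAuthorityType,
        MetadataFetcher,
        RawExtrinsicMetadata,
    )
    from swh.model.swhids import CoreSWHID, ExtendedSWHID

    metadata = RawExtrinsicMetadata(
        # the directory the metadata is about
        target=ExtendedSWHID.from_string("swh:1:dir:" + "00" * 20),
        # source of the metadata
        discovery_date=datetime.datetime.now(tz=datetime.timezone.utc),
        authority=MetadataAuthority(
            type=MetadataAuthorityType.FORGE, url="https://github.com/"
        ),
        fetcher=MetadataFetcher(name="swh.loader.git", version="1.0.0"),
        # the metadata itself
        format="application/vnd.github.v3+json",
        metadata=b'{"description": "an example"}',
        # context fields allowed for a ``dir`` target
        origin="https://github.com/user/repo",
        revision=CoreSWHID.from_string("swh:1:rev:" + "11" * 20),
    )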
Storage API
~~~~~~~~~~~

The storage API offers these endpoints to manipulate origin metadata:

* Adding metadata::

      raw_extrinsic_metadata_add(metadata: List[RawExtrinsicMetadata])

  which adds a list of ``RawExtrinsicMetadata`` objects, whose ``metadata`` field is a byte string obtained from a given authority and associated to the ``target``.

* Getting all metadata::

      raw_extrinsic_metadata_get(
          target: ExtendedSWHID,
          authority: MetadataAuthority,
          after: Optional[datetime.datetime] = None,
          page_token: Optional[bytes] = None,
          limit: int = 1000,
      ) -> PagedResult[RawExtrinsicMetadata]:

  which returns a list of ``RawExtrinsicMetadata`` with the given ``target`` and from the given ``authority``. If ``after`` is provided, only objects whose discovery date is more recent are returned. ``PagedResult`` is a structure containing the results themselves and a ``next_page_token`` used to fetch the next list of results, if any.
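Continuing the construction sketch above, ingesting the object and then paging through everything this authority recorded about the target could look like this (again assuming a ``storage`` object implementing this interface)::

    storage.raw_extrinsic_metadata_add([metadata])

    page_token = None
    while True:
        page = storage.raw_extrinsic_metadata_get(
            target=metadata.target,
            authority=metadata.authority,
            page_token=page_token,
        )
        for remd in page.results:
            print(remd.discovery_date, remd.format)
        page_token = page.next_page_token
        if page_token is None:
            break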
.. _extrinsic-metadata-formats:

Extrinsic metadata formats
--------------------------

Formats are identified by an opaque string. When possible, it should be the MIME type already in use to describe the metadata format outside Software Heritage. Otherwise it should be unambiguous, printable ASCII without spaces, and human-readable. Here is a list of all the metadata formats stored:

``application/vnd.github.v3+json`` The metadata is the response of an API call to GitHub.
+``cpan-release-json`` + The metadata is the response of an API call to `CPAN's /release/{author}/{release} + `_ endpoint.
``gitlab-project-json`` The metadata is the response of an API call to `Gitlab's /api/v4/projects/:id - ` endpoint. + `_ endpoint.
``gitea-project-json`` The metadata is the response of an API call to `Gitea's /api/v1/repos/{owner}/{repo} `_ endpoint.
``gogs-project-json`` The metadata is the response of an API call to `Gogs's /api/v1/repos/:owner/:repo `_ endpoint.
``pypi-project-json`` The metadata is a release entry from a PyPI project's JSON file, extracted and re-serialized.
``replicate-npm-package-json`` ditto, but from a replicate.npmjs.com project.
``nixguix-sources-json`` ditto, but from https://nix-community.github.io/nixpkgs-swh/
``original-artifacts-json`` tarball data, see below.
``sword-v2-atom-codemeta`` XML Atom document, with Codemeta metadata, as sent by a deposit client, see the :ref:`Deposit protocol reference`.
``sword-v2-atom-codemeta-v2`` Deprecated alias of ``sword-v2-atom-codemeta``.
``sword-v2-atom-codemeta-v2-in-json`` Deprecated, JSON serialization of a ``sword-v2-atom-codemeta`` document.
``xml-deposit-info`` Information about a deposit, to identify the provenance of a metadata object sent via swh-deposit, see below.

Details on some of these formats:

.. _extrinsic-metadata-original-artifacts-json:

original-artifacts-json
^^^^^^^^^^^^^^^^^^^^^^^

This is a loosely defined format, originally used as a ``metadata`` column on the ``revision`` table, that changed over the years. It is a JSON array, and each entry is a JSON object representing an archive (tarball, zipball, ...) that was unpackaged by the SWH loader before loading its content into Software Heritage. When writing this specification, it was stabilized to this format::

    [
        {
            "length": <length>,
            "filename": "<filename>",
            "checksums": {
                "sha1": "<sha1>",
                "sha256": "<sha256>",
            },
            "url": "<url>"
        },
        ...
    ]

Older ``original-artifacts-json`` were migrated to use this format, but may be missing some of the keys.

xml-deposit-info
^^^^^^^^^^^^^^^^

Deposits with code objects are loaded as their own origin, so we can look them up in the deposit database from their metadata (which holds the origin as context). This is not true for metadata-only deposits, because we don't create an origin for them, so we need to store this information somewhere. The naive solution would be to insert it into the Atom entry provided by the client, but that would mean altering a document before we archive it, which potentially corrupts it or loses part of the data. Therefore, on each metadata-only deposit, the deposit creates an extra "metametadata" object, with the original metadata object as target, using this format::

    {{ deposit.id }}
    {{ deposit.client.provider_url }}
    {{ deposit.collection.name }}
diff --git a/swh/storage/tests/test_retry.py b/swh/storage/tests/test_retry.py index dde53861..33800a1f 100644 --- a/swh/storage/tests/test_retry.py +++ b/swh/storage/tests/test_retry.py @@ -1,267 +1,270 @@ # Copyright (C) 2020 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from unittest.mock import call import attr import psycopg2 import pytest from swh.core.api import TransientRemoteException from swh.storage.exc import HashCollision, StorageArgumentException from swh.storage.utils import now @pytest.fixture -def monkeypatch_sleep(monkeypatch, swh_storage): - """In test context, we don't want to wait, make test faster""" - from swh.storage.proxies.retry import RetryingProxyStorage - - for method_name, method in RetryingProxyStorage.__dict__.items(): +def storage_retry_sleep_mocks(mocker, swh_storage): + """In a test context we don't want to wait; make tests faster by mocking + the sleep function of retryable storage methods""" + mocks = {} + for method_name in dir( + swh_storage + ): # swh_storage is an instance of RetryingProxyStorage if "_add" in method_name or "_update" in method_name: - monkeypatch.setattr(method.retry, "sleep", lambda x: None) + method = getattr(swh_storage, method_name) + mocks[method_name] = mocker.patch.object(method.retry, "sleep") - return monkeypatch + return mocks @pytest.fixture def fake_hash_collision(sample_data): return HashCollision("sha1", "38762cf7f55934b34d179ae6a4c80cadccbb7f0a", []) @pytest.fixture def swh_storage_backend_config(): yield { "cls": "pipeline", "steps": [ {"cls": "retry"}, {"cls": "memory"}, ], } def test_retrying_proxy_storage_content_add(swh_storage, sample_data): """Standard content_add works as before""" sample_content = sample_data.content content = swh_storage.content_get_data(sample_content.sha1) assert content is None s = swh_storage.content_add([sample_content]) assert s == { "content:add": 1, "content:add:bytes": sample_content.length, } content = swh_storage.content_get_data(sample_content.sha1) assert content == sample_content.data def test_retrying_proxy_storage_content_add_with_retry( - monkeypatch_sleep, + storage_retry_sleep_mocks, swh_storage, sample_data, mocker, fake_hash_collision, ): """Multiple retries for hash
collision and psycopg2 error but finally ok""" mock_memory = mocker.patch("swh.storage.in_memory.InMemoryStorage.content_add") mock_memory.side_effect = [ # first try goes ko fake_hash_collision, # second try goes ko psycopg2.IntegrityError("content already inserted"), # ok then! {"content:add": 1}, ] sample_content = sample_data.content - sleep = mocker.patch("time.sleep") + sleep = storage_retry_sleep_mocks["content_add"] content = swh_storage.content_get_data(sample_content.sha1) assert content is None s = swh_storage.content_add([sample_content]) assert s == {"content:add": 1} mock_memory.assert_has_calls( [ call([sample_content]), call([sample_content]), call([sample_content]), ] ) assert len(sleep.mock_calls) == 2 - (_name, args1, _kwargs) = sleep.mock_calls[0] - (_name, args2, _kwargs) = sleep.mock_calls[1] + (_, args1, _) = sleep.mock_calls[0] + (_, args2, _) = sleep.mock_calls[1] assert 0 < args1[0] < 1 assert 0 < args2[0] < 2 def test_retrying_proxy_storage_content_add_with_retry_of_transient( - monkeypatch_sleep, + storage_retry_sleep_mocks, swh_storage, sample_data, mocker, ): """Multiple retries for transient errors, but finally ok after many attempts""" mock_memory = mocker.patch("swh.storage.in_memory.InMemoryStorage.content_add") mock_memory.side_effect = [ TransientRemoteException("temporary failure"), TransientRemoteException("temporary failure"), # ok then! {"content:add": 1}, ] sample_content = sample_data.content content = swh_storage.content_get_data(sample_content.sha1) assert content is None - sleep = mocker.patch("time.sleep") + sleep = storage_retry_sleep_mocks["content_add"] s = swh_storage.content_add([sample_content]) assert s == {"content:add": 1} mock_memory.assert_has_calls( [ call([sample_content]), call([sample_content]), call([sample_content]), ] ) assert len(sleep.mock_calls) == 2 - (_name, args1, _kwargs) = sleep.mock_calls[0] - (_name, args2, _kwargs) = sleep.mock_calls[1] + (_, args1, _) = sleep.mock_calls[0] + (_, args2, _) = sleep.mock_calls[1] assert 10 < args1[0] < 11 assert 10 < args2[0] < 12 def test_retrying_proxy_swh_storage_content_add_failure( swh_storage, sample_data, mocker ): """Unfiltered errors are raised without retrying""" mock_memory = mocker.patch("swh.storage.in_memory.InMemoryStorage.content_add") mock_memory.side_effect = StorageArgumentException("Refuse to add content always!") sample_content = sample_data.content content = swh_storage.content_get_data(sample_content.sha1) assert content is None with pytest.raises(StorageArgumentException, match="Refuse to add"): swh_storage.content_add([sample_content]) assert mock_memory.call_count == 1 def test_retrying_proxy_storage_content_add_metadata(swh_storage, sample_data): """Standard content_add_metadata works as before""" sample_content = sample_data.content content = attr.evolve(sample_content, data=None) pk = content.sha1 content_metadata = swh_storage.content_get([pk]) assert content_metadata == [None] s = swh_storage.content_add_metadata([attr.evolve(content, ctime=now())]) assert s == { "content:add": 1, } content_metadata = swh_storage.content_get([pk]) assert len(content_metadata) == 1 assert content_metadata[0].sha1 == pk def test_retrying_proxy_storage_content_add_metadata_with_retry( - monkeypatch_sleep, swh_storage, sample_data, mocker, fake_hash_collision + storage_retry_sleep_mocks, swh_storage, sample_data, mocker, fake_hash_collision ): """Multiple retries for hash collision and psycopg2 error but finally ok""" mock_memory = mocker.patch(
"swh.storage.in_memory.InMemoryStorage.content_add_metadata" ) mock_memory.side_effect = [ # first try goes ko fake_hash_collision, # second try goes ko psycopg2.IntegrityError("content_metadata already inserted"), # ok then! {"content:add": 1}, ] sample_content = sample_data.content content = attr.evolve(sample_content, data=None) s = swh_storage.content_add_metadata([content]) assert s == {"content:add": 1} mock_memory.assert_has_calls( [ call([content]), call([content]), call([content]), ] ) def test_retrying_proxy_swh_storage_content_add_metadata_failure( swh_storage, sample_data, mocker ): """Unfiltered errors are raising without retry""" mock_memory = mocker.patch( "swh.storage.in_memory.InMemoryStorage.content_add_metadata" ) mock_memory.side_effect = StorageArgumentException( "Refuse to add content_metadata!" ) sample_content = sample_data.content content = attr.evolve(sample_content, data=None) with pytest.raises(StorageArgumentException, match="Refuse to add"): swh_storage.content_add_metadata([content]) assert mock_memory.call_count == 1 def test_retrying_proxy_swh_storage_keyboardinterrupt(swh_storage, sample_data, mocker): """Unfiltered errors are raising without retry""" mock_memory = mocker.patch("swh.storage.in_memory.InMemoryStorage.content_add") mock_memory.side_effect = KeyboardInterrupt() sample_content = sample_data.content content = swh_storage.content_get_data(sample_content.sha1) assert content is None with pytest.raises(KeyboardInterrupt): swh_storage.content_add([sample_content]) assert mock_memory.call_count == 1 def test_retrying_proxy_swh_storage_content_add_metadata_retry_failed( swh_storage, sample_data, mocker ): """When retrying fails every time, the last exception gets re-raised instead of a RetryError""" mock_memory = mocker.patch( "swh.storage.in_memory.InMemoryStorage.content_add_metadata" ) mock_memory.side_effect = [ ValueError(f"This will get raised eventually (attempt {i})") for i in range(1, 10) ] sample_content = sample_data.content content = attr.evolve(sample_content, data=None) with pytest.raises(ValueError, match="(attempt 3)"): swh_storage.content_add_metadata([content]) assert mock_memory.call_count == 3 diff --git a/tox.ini b/tox.ini index 7d233f81..49a9fcdc 100644 --- a/tox.ini +++ b/tox.ini @@ -1,84 +1,85 @@ [tox] envlist=black,flake8,mypy,py3 [testenv] extras = testing deps = pytest-cov dev: ipdb passenv = SWH_CASSANDRA_BIN SWH_CASSANDRA_LOG JAVA_HOME commands = pytest \ !slow: --hypothesis-profile=fast \ slow: --hypothesis-profile=slow \ --cov={envsitepackagesdir}/swh/storage \ {envsitepackagesdir}/swh/storage \ --doctest-modules \ --cov-branch {posargs} [testenv:black] skip_install = true deps = - black==22.3.0 + black==22.10.0 commands = {envpython} -m black --check swh [testenv:flake8] skip_install = true deps = - flake8==4.0.1 - flake8-bugbear==22.3.23 + flake8==5.0.4 + flake8-bugbear==22.9.23 + pycodestyle==2.9.1 commands = {envpython} -m flake8 [testenv:mypy] extras = testing deps = mypy==0.942 commands = mypy swh # build documentation outside swh-environment using the current # git HEAD of swh-docs, is executed on CI for each diff to prevent # breaking doc build [testenv:sphinx] whitelist_externals = make usedevelop = true extras = testing deps = # fetch and install swh-docs in develop mode -e git+https://forge.softwareheritage.org/source/swh-docs#egg=swh.docs pifpaf setenv = SWH_PACKAGE_DOC_TOX_BUILD = 1 # turn warnings into errors SPHINXOPTS = -W commands = {envpython} -m pifpaf run postgresql -- make -I 
../.tox/sphinx/src/swh-docs/swh/ -C docs # build documentation only inside swh-environment using local state # of swh-docs package [testenv:sphinx-dev] whitelist_externals = make usedevelop = true extras = testing deps = # install swh-docs in develop mode -e ../swh-docs pifpaf setenv = SWH_PACKAGE_DOC_TOX_BUILD = 1 # turn warnings into errors SPHINXOPTS = -W commands = {envpython} -m pifpaf run postgresql -- make -I ../.tox/sphinx-dev/src/swh-docs/swh/ -C docs