diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..39d60022 --- /dev/null +++ b/.gitignore @@ -0,0 +1,15 @@ +*.pyc +*.sw? +*~ +/.coverage +/.coverage.* +.eggs/ +__pycache__ +build/ +dist/ +*.egg-info +version.txt +.vscode/ +.hypothesis/ +/.tox/ +.mypy_cache/ diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 00000000..7ee9db8a --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,48 @@ +repos: +- repo: https://github.com/pre-commit/pre-commit-hooks + rev: v2.4.0 + hooks: + - id: trailing-whitespace + - id: flake8 + - id: check-json + - id: check-yaml + +- repo: https://github.com/codespell-project/codespell + rev: v1.16.0 + hooks: + - id: codespell + exclude: TODO + args: [-L iff] + +- repo: local + hooks: + - id: mypy + name: mypy + entry: mypy + args: [swh] + pass_filenames: false + language: system + types: [python] + + - id: check-bumped-dbversion + name: check-bumped-dbversion + files: 'sql/upgrades/.*\.sql' + entry: grep + args: ['insert into dbversion'] + language: system + +- repo: https://github.com/python/black + rev: 19.10b0 + hooks: + - id: black + +# unfortunately, we are far from being able to enable this... +#- repo: https://github.com/PyCQA/pydocstyle.git +# rev: 4.0.0 +# hooks: +# - id: pydocstyle +# name: pydocstyle +# description: pydocstyle is a static analysis tool for checking compliance with Python docstring conventions. +# entry: pydocstyle --convention=google +# language: python +# types: [python] diff --git a/AUTHORS b/AUTHORS new file mode 100644 index 00000000..2d0a34af --- /dev/null +++ b/AUTHORS @@ -0,0 +1,3 @@ +Copyright (C) 2015 The Software Heritage developers + +See http://www.softwareheritage.org/ for more information. 
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 00000000..0ad22b51 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,78 @@ +# Software Heritage Code of Conduct + +## Our Pledge + +In the interest of fostering an open and welcoming environment, we as Software +Heritage contributors and maintainers pledge to making participation in our +project and our community a harassment-free experience for everyone, regardless +of age, body size, disability, ethnicity, sex characteristics, gender identity +and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, religion, or sexual identity and +orientation. + +## Our Standards + +Examples of behavior that contributes to creating a positive environment +include: + +* Using welcoming and inclusive language +* Being respectful of differing viewpoints and experiences +* Gracefully accepting constructive criticism +* Focusing on what is best for the community +* Showing empathy towards other community members + +Examples of unacceptable behavior by participants include: + +* The use of sexualized language or imagery and unwelcome sexual attention or + advances +* Trolling, insulting/derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or electronic + address, without explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Our Responsibilities + +Project maintainers are responsible for clarifying the standards of acceptable +behavior and are expected to take appropriate and fair corrective action in +response to any instances of unacceptable behavior. 
+ +Project maintainers have the right and responsibility to remove, edit, or +reject comments, commits, code, wiki edits, issues, and other contributions +that are not aligned to this Code of Conduct, or to ban temporarily or +permanently any contributor for other behaviors that they deem inappropriate, +threatening, offensive, or harmful. + +## Scope + +This Code of Conduct applies within all project spaces, and it also applies when +an individual is representing the project or its community in public spaces. +Examples of representing a project or community include using an official +project e-mail address, posting via an official social media account, or acting +as an appointed representative at an online or offline event. Representation of +a project may be further defined and clarified by project maintainers. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported by contacting the project team at `conduct@softwareheritage.org`. All +complaints will be reviewed and investigated and will result in a response that +is deemed necessary and appropriate to the circumstances. The project team is +obligated to maintain confidentiality with regard to the reporter of an +incident. swh-storage
===========

Abstraction layer over the archive, providing access to all stored source code artifacts as well as their metadata.

See the [documentation](https://docs.softwareheritage.org/devel/swh-storage/index.html) for more details.

## Quick start

### Dependencies

Python tests for this module include tests that cannot be run without a local
Postgresql database, so you need the Postgresql server executable on your
machine (no need to have a running Postgresql server). They also expect a
cassandra server.

#### Debian-like host

```
$ sudo apt install libpq-dev postgresql-11 cassandra
```

#### Non Debian-like host

The tests expect `/usr/sbin/cassandra` to exist.
Optionally, you can avoid running the cassandra tests.

```
(swh) :~/swh-storage$ tox -- -m 'not cassandra'
```

### Installation

It is strongly recommended to use a virtualenv. In the following, we
consider you work in a virtualenv named `swh`. See the
[developer setup guide](https://docs.softwareheritage.org/devel/developer-setup.html#developer-setup)
for more details on how to set up a working environment.

You can install the package directly from
[pypi](https://pypi.org/p/swh.storage):

```
(swh) :~$ pip install swh.storage
[...]
```

Or from sources:

```
(swh) :~$ git clone https://forge.softwareheritage.org/source/swh-storage.git
[...]
(swh) :~$ cd swh-storage
(swh) :~/swh-storage$ pip install .
[...]
```

Then you can check it's properly installed:

```
(swh) :~$ swh storage --help
Usage: swh storage [OPTIONS] COMMAND [ARGS]...

  Software Heritage Storage tools.

Options:
  -h, --help  Show this message and exit.

Commands:
  rpc-serve  Software Heritage Storage RPC server.
```

## Tests

The best way of running Python tests for this module is to use
[tox](https://tox.readthedocs.io/).

```
(swh) :~$ pip install tox
```

### tox

From the sources directory, simply use tox:

```
(swh) :~/swh-storage$ tox
[...]
========= 315 passed, 6 skipped, 15 warnings in 40.86 seconds ==========
_______________________________ summary ________________________________
  flake8: commands succeeded
  py3: commands succeeded
  congratulations :)
```

## Development

The storage server can be started locally. It requires a configuration file and a
running Postgresql database.

### Sample configuration

A typical configuration `storage.yml` file is:

```
storage:
  cls: local
  args:
    db: "dbname=softwareheritage-dev user= password="
    objstorage:
      cls: pathslicing
      args:
        root: /tmp/swh-storage/
        slicing: 0:2/2:4/4:6
```

This configuration uses:

- a local storage instance whose db connection is to the local
  `softwareheritage-dev` instance,
- a local objstorage instance whose:

  - `root` path is /tmp/swh-storage,
  - slicing scheme is `0:2/2:4/4:6`. This means that the identifier of
    the content (sha1) determines where it is stored on disk: the first
    directory level is named after the first 2 hex characters, the second
    level after the next 2 hex characters, and the third level after the
    next 2 hex characters. Finally, the file holding the raw content is
    named after the complete hash.

    For example: 00062f8bd330715c4f819373653d97b3cd34394c will be stored at
    00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c

Note that the `root` path should exist on disk before starting the server.

### Starting the storage server

If the python package has been properly installed (e.g. in a virtual env), you
should be able to use the command:

```
(swh) :~/swh-storage$ swh storage rpc-serve storage.yml
```

This runs a local swh-storage API on port 5002.

```
(swh) :~/swh-storage$ curl
Software Heritage storage server

You have reached the Software Heritage storage server.
See its documentation and API for more information

``` ### And then what? In your upper layer ([loader-git](https://forge.softwareheritage.org/source/swh-loader-git/), [loader-svn](https://forge.softwareheritage.org/source/swh-loader-svn/), etc...), you can define a remote storage with this snippet of yaml configuration. ``` storage: cls: remote args: url: http://localhost:5002/ ``` You could directly define a local storage with the following snippet: ``` storage: cls: local args: db: service=swh-dev objstorage: cls: pathslicing args: root: /home/storage/swh-storage/ slicing: 0:2/2:4/4:6 ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing Provides-Extra: schemata Provides-Extra: journal diff --git a/docs/.gitignore b/docs/.gitignore new file mode 100644 index 00000000..58a761ea --- /dev/null +++ b/docs/.gitignore @@ -0,0 +1,3 @@ +_build/ +apidoc/ +*-stamp diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 00000000..b97c7532 --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,2 @@ +include ../../swh-docs/Makefile.sphinx +-include Makefile.local diff --git a/docs/Makefile.local b/docs/Makefile.local new file mode 100644 index 00000000..ed0b7b2a --- /dev/null +++ b/docs/Makefile.local @@ -0,0 +1,26 @@ +sphinx/html: sql-autodoc images +sphinx/clean: clean-sql-autodoc clean-images +assets: sql-autodoc images + +sql-autodoc: + make -C ../sql/ doc + +images: + make -C images/ +clean-images: + make -C images/ clean + +clean: clean-sql-autodoc clean-images +clean-sql-autodoc: + make -C ../sql/ clean + +distclean: clean distclean-sql-autodoc +distclean-sql-autodoc: + make -C ../sql/ distclean + +.PHONY: sql-autodoc clean-sql-autodoc images clean-images + + +# Local Variables: +# mode: 
makefile +# End: diff --git a/docs/_static/.placeholder b/docs/_static/.placeholder new file mode 100644 index 00000000..e69de29b diff --git a/docs/_templates/.placeholder b/docs/_templates/.placeholder new file mode 100644 index 00000000..e69de29b diff --git a/docs/archive-copies.rst b/docs/archive-copies.rst new file mode 100644 index 00000000..09f2ea40 --- /dev/null +++ b/docs/archive-copies.rst @@ -0,0 +1,48 @@ +:orphan: + +.. _archive-copies: + +Archive copies +============== + +.. _swh-storage-copies-layout: +.. figure:: images/swh-archive-copies.svg + :width: 1024px + :align: center + + Layout of Software Heritage archive copies (click to zoom). + +The Software Heritage archive exists in several copies, to minimize the risk of +losing archived source code artifacts. The layout of existing copies, their +relationships, as well as their geographical and administrative domains are +shown in the layout diagram above. + +We recall that the archive is conceptually organized as a graph, and +specifically a Merkle DAG, see :ref:`data model ` for more +information. + +Ingested source code artifacts land directly on the **primary copy**, which is +updated live and also used as reference for deduplication purposes. There, +different parts of the Merkle DAG are stored using different backend +technologies. The leaves of the graph, i.e., *content objects* (or "blobs"), +are stored in a key-value object storage, using their SHA1 identifiers as keys +(see :ref:`persistent identifiers `). SHA1 collision +avoidance is enforced by the :mod:`swh.storage` module. The *rest of the graph* +is stored in a Postgres database (see :ref:`SQL storage `). + +At the time of writing, the primary object storage contains about 5 billion +blobs with a median size of 3 KB---yes, that is *a lot of very small +files*---for a total compressed size of about 200 TB. The Postgres database +takes about 8 TB, half of which is required by indexes. 
In terms of graph metrics, +the Merkle DAG has about 10 B nodes and 100 B edges. + +The **secondary copy** is hosted on Microsoft Azure cloud, using its native +blob storage for the object storage and a large virtual machine to run a +Postgres instance there. The database is kept up-to-date w.r.t. the primary +copy using Postgres WAL replication. The object storage is kept up-to-date +using :mod:`swh.archiver`. + +Archive copies (as opposed to archive mirrors) are operated by the Software +Heritage Team at Inria. The primary archived copy is geographically located at +Rocquencourt, France; the secondary copy is hosted in the Europe West region of +the Azure cloud. diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 00000000..190deb7e --- /dev/null +++ b/docs/conf.py @@ -0,0 +1 @@ +from swh.docs.sphinx.conf import * # NoQA diff --git a/docs/extrinsic-metadata-specification.rst b/docs/extrinsic-metadata-specification.rst new file mode 100644 index 00000000..d82bb55a --- /dev/null +++ b/docs/extrinsic-metadata-specification.rst @@ -0,0 +1,251 @@ +:orphan: + +.. _extrinsic-metadata-specification: + +Extrinsic metadata specification +================================ + +:term:`Extrinsic metadata` is information about software that is not part +of the source code itself but still closely related to the software. +Typical sources for extrinsic metadata are: the hosting place of a +repository, which can offer metadata via its web view or API; external +registries like collaborative curation initiatives; and out-of-band +information available at source code archival time. + +Since they are not part of the source code, a dedicated mechanism to fetch +and store them is needed. + +This specification assumes the reader is familiar with Software Heritage's +:ref:`architecture` and :ref:`data-model`. + + +Metadata sources +---------------- + +Authorities +^^^^^^^^^^^ + +Metadata authorities are entities that provide metadata about an +:term:`origin`. 
Metadata authorities include: code hosting places, +:term:`deposit` submitters, and registries (eg. Wikidata). + +An authority is uniquely defined by these properties: + + * its type, representing the kind of authority, which is one of these values: + * `deposit`, for metadata pushed to Software Heritage at the same time + as a software artifact + * `forge`, for metadata pulled from the same source as the one hosting + the software artifacts (which includes package managers) + * `registry`, for metadata pulled from a third-party + * its URL, which unambiguously identifies an instance of the authority type. + +Examples: + +=============== ================================= +type url +=============== ================================= +deposit https://hal.archives-ouvertes.fr/ +deposit https://hal.inria.fr/ +deposit https://software.intel.com/ +forge https://gitlab.com/ +forge https://gitlab.inria.fr/ +forge https://0xacab.org/ +forge https://github.com/ +registry https://www.wikidata.org/ +registry https://swmath.org/ +registry https://ascl.net/ +=============== ================================= + +Metadata fetchers +^^^^^^^^^^^^^^^^^ + +Metadata fetchers are software components used to fetch metadata from +a metadata authority, and ingest them into the Software Heritage archive. + +A metadata fetcher is uniquely defined by these properties: + +* its type +* its version + +Examples: + +* :term:`loaders `, which may either discover metadata as a + side-effect of loading source code, or be dedicated to fetching metadata. + +* :term:`listers `, which may discover metadata as a side-effect + of discovering origins. + +* :term:`deposit` submitters, which push metadata to SWH from a + third-party; usually at the same time as a :term:`software artifact` + +* crawlers, which fetch metadata from an authority in a way that is + none of the above (eg. by querying a specific API of the origin's forge). 
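The identity rules of this section (an authority is identified by its type and URL, a fetcher by two fields) can be illustrated with a small Python sketch. This is not the storage's data model: the class names ``MetadataAuthority`` and ``MetadataFetcher`` are hypothetical, and the fetcher's first field is called ``name`` here on the assumption that it matches the ``metadata_fetcher_add(name, version, metadata)`` signature of the storage API.

```python
from dataclasses import dataclass

# The three authority types listed in this specification.
AUTHORITY_TYPES = frozenset({"deposit", "forge", "registry"})


@dataclass(frozen=True)
class MetadataAuthority:
    """Hypothetical value object: identity is the (type, url) pair."""

    type: str
    url: str

    def __post_init__(self):
        # Reject anything outside the documented set of authority types.
        if self.type not in AUTHORITY_TYPES:
            raise ValueError(f"unknown authority type: {self.type!r}")


@dataclass(frozen=True)
class MetadataFetcher:
    """Hypothetical value object: identity is the (name, version) pair."""

    name: str
    version: str


# Frozen dataclasses compare and hash by value, so a set keeps at most
# one entry per identity, mirroring "uniquely defined by these properties".
authorities = {
    MetadataAuthority("forge", "https://gitlab.com/"),
    MetadataAuthority("forge", "https://gitlab.com/"),
    MetadataAuthority("registry", "https://www.wikidata.org/"),
}
```

Using immutable value objects is only one possible design; plain dictionaries with ``type`` and ``url`` keys carry the same information.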
+ + +Storage API +----------- + +Authorities and metadata fetchers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The :term:`storage` API offers these endpoints to manipulate metadata +authorities and metadata fetchers: + +* ``metadata_authority_add(type, url, metadata)`` + which adds a new metadata authority to the storage. + +* ``metadata_authority_get(type, url)`` + which looks up a known authority (there is at most one) and if it is + known, returns a dictionary with keys ``type``, ``url``, and ``metadata``. + +* ``metadata_fetcher_add(name, version, metadata)`` + which adds a new metadata fetcher to the storage. + +* ``metadata_fetcher_get(name, version)`` + which looks up a known fetcher (there is at most one) and if it is + known, returns a dictionary with keys ``name``, ``version``, and + ``metadata``. + +These `metadata` fields contain JSON-encodable dictionaries +with information about the authority/fetcher, in a format specific to each +authority/fetcher. +With authority, the `metadata` field is reserved for information describing +and qualifying the authority. +With fetchers, the `metadata` field is reserved for configuration metadata +and other technical usage. + +Origin metadata +^^^^^^^^^^^^^^^ + +Extrinsic metadata are stored in SWH's :term:`storage database`. +The storage API offers three endpoints to manipulate origin metadata: + +* Adding metadata:: + + origin_metadata_add(origin_url, discovery_date, + authority, fetcher, + format, metadata) + + which adds a new `metadata` byte string obtained from a given authority + and associated to the origin. + `discovery_date` is a Python datetime. + `authority` must be a dict containing keys `type` and `url`, and + `fetcher` a dict containing keys `name` and `version`. + The authority and fetcher must be known to the storage before using this + endpoint. + `format` is a text field indicating the format of the content of the + `metadata` byte string. 
+ +* Getting latest metadata:: + + origin_metadata_get_latest(origin_url, authority) + + where `authority` must be a dict containing keys `type` and `url`, + which returns a dictionary corresponding to the latest metadata entry + added from this origin, in the format:: + + { + 'origin_url': ..., + 'authority': {'type': ..., 'url': ...}, + 'fetcher': {'name': ..., 'version': ...}, + 'discovery_date': ..., + 'format': '...', + 'metadata': b'...' + } + + +* Getting all metadata:: + + origin_metadata_get(origin_url, + authority, + page_token, limit) + + where `authority` must be a dict containing keys `type` and `url` + which returns a dictionary with keys: + + * `next_page_token`, which is an opaque token to be used as + `page_token` for retrieving the next page. If absent, there are + no more pages to gather. + * `results`: list of dictionaries, one for each metadata item + deposited, corresponding to the given origin and obtained from the + specified authority. + + Each of these dictionaries is in the following format:: + + { + 'authority': {'type': ..., 'url': ...}, + 'fetcher': {'name': ..., 'version': ...}, + 'discovery_date': ..., + 'format': '...', + 'metadata': b'...' + } + +The parameters ``page_token`` and ``limit`` are used for pagination based on +an arbitrary order. An initial query to ``origin_metadata_get`` must set +``page_token`` to ``None``, and each further query must use the value from the +previous query's ``next_page_token`` to get the next page of results. + +``metadata`` is a bytes array (possibly encoded using Base64). +Its format is specific to each authority, and it is treated as an opaque value +by the storage. +Unifying these various formats into a common language is outside the scope +of this specification. + +Artifact metadata +^^^^^^^^^^^^^^^^^ + +In addition to origin metadata, the storage database stores metadata on +all software artifacts supported by the data model. 
+ +This works similarly to origin metadata, with one major difference: +extrinsic metadata can be given on a specific artifact within a specified +context (for example: a directory in a specific revision from a specific +visit on a specific origin) which will be stored alongside the metadata itself. + +For example, two origins may develop the same file independently; +the information about authorship, licensing or even description may vary +about the same artifact in a different context. +This is why it is important to qualify the metadata with the complete +context for which it is intended, if any. + +For each artifact type ````, there are two endpoints +to manipulate metadata associated with artifacts of that type: + +* Adding metadata:: + + _metadata_add(id, context, discovery_date, + authority, fetcher, + format, metadata) + + +* Getting all metadata:: + + _metadata_get(id, + authority, + after, + page_token, limit) + + +defined similarly to ``origin_metadata_add`` and ``origin_metadata_get``, +but where ``id`` is a core SWHID (with type matching ````), +and with an extra ``context`` (argument when adding metadata, and dictionary +key when getting them) that is a dictionary with keys +depending on the artifact type ````: + +* for ``snapshot``: ``origin`` (a URL) and ``visit`` (an integer) +* for ``release``: those above, plus ``snapshot`` + (the core SWHID of a snapshot) +* for ``revision``: all those above, plus ``release`` + (the core SWHID of a release) +* for ``directory``: all those above, plus ``revision`` + (the core SWHID of a revision) + and ``path`` (a byte string), representing the path to this directory + from the root of the ``revision`` +* for ``content``: all those above, plus ``directory`` + (the core SWHID of a directory) + +All keys are optional, but should be provided whenever possible. +The dictionary may be empty, if metadata is fully independent from context. 
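The per-type ``context`` keys listed above can be written down as a small lookup table. The following Python sketch is only an illustration of that list, not code from the storage: ``CONTEXT_KEYS`` and ``check_context`` are hypothetical names, and the origin URL, visit number and path in the example are made-up values.

```python
# Allowed context keys per artifact type, transcribed from the list above:
# each type accepts the keys of the previous one plus its own additions.
CONTEXT_KEYS = {
    "snapshot": ("origin", "visit"),
    "release": ("origin", "visit", "snapshot"),
    "revision": ("origin", "visit", "snapshot", "release"),
    "directory": ("origin", "visit", "snapshot", "release", "revision", "path"),
    "content": (
        "origin", "visit", "snapshot", "release", "revision", "path", "directory",
    ),
}


def check_context(artifact_type, context):
    """Return `context` unchanged if it only uses keys allowed for the type.

    Every key is optional, so an empty dictionary is always accepted.
    """
    allowed = CONTEXT_KEYS[artifact_type]
    unexpected = sorted(set(context) - set(allowed))
    if unexpected:
        raise ValueError(
            f"unexpected context keys for {artifact_type}: {unexpected}"
        )
    return context


# Hypothetical context for metadata attached to a directory.
directory_context = check_context(
    "directory",
    {
        "origin": "https://example.org/repo.git",  # made-up origin URL
        "visit": 42,                               # made-up visit id
        "path": b"src/lib",                        # path is a byte string
    },
)
```

Whether the storage itself rejects unexpected keys is not stated in this specification; the check above is an assumption made for the sake of the example.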
+ +In all cases, ``visit`` should only be provided if ``origin`` is +(as visit ids are only unique with respect to an origin). diff --git a/docs/images/.gitignore b/docs/images/.gitignore new file mode 100644 index 00000000..542dcd32 --- /dev/null +++ b/docs/images/.gitignore @@ -0,0 +1,2 @@ +swh-archive-copies.pdf +swh-archive-copies.svg diff --git a/docs/images/Makefile b/docs/images/Makefile new file mode 100644 index 00000000..59782050 --- /dev/null +++ b/docs/images/Makefile @@ -0,0 +1,16 @@ + +BUILD_TARGETS = +BUILD_TARGETS += swh-archive-copies.pdf swh-archive-copies.svg + +all: $(BUILD_TARGETS) + + +%.svg: %.dia + inkscape -l $@ $< + +%.pdf: %.dia + inkscape -A $@ $< + + +clean: + -rm -f $(BUILD_TARGETS) diff --git a/docs/images/swh-archive-copies.dia b/docs/images/swh-archive-copies.dia new file mode 100644 index 00000000..bb64fb00 Binary files /dev/null and b/docs/images/swh-archive-copies.dia differ diff --git a/docs/index.rst b/docs/index.rst new file mode 100644 index 00000000..502967a3 --- /dev/null +++ b/docs/index.rst @@ -0,0 +1,45 @@ +.. _swh-storage: + +Software Heritage - Storage +=========================== + +Abstraction layer over the archive, providing access to all stored source code +artifacts as well as their metadata. + + +The Software Heritage storage consists of a high-level storage layer +(:mod:`swh.storage`) that exposes a client/server API +(:mod:`swh.storage.api`). The API is exposed by a server +(:mod:`swh.storage.api.server`) and accessible via a client +(:mod:`swh.storage.api.client`). + +The low-level implementation of the storage is split between an object storage +(:ref:`swh.objstorage `), which stores all "blobs" (i.e., the +leaves of the :ref:`data-model`) and a SQL representation of the rest of the +graph (:mod:`swh.storage.storage`). 
+ + +Database schema +--------------- + +* :ref:`sql-storage` + + +Archive copies +-------------- + +* :ref:`archive-copies` + +Specifications +-------------- + +* :ref:`extrinsic-metadata-specification` + + +Reference Documentation +----------------------- + +.. toctree:: + :maxdepth: 2 + + /apidoc/swh.storage diff --git a/docs/sql-storage.rst b/docs/sql-storage.rst new file mode 100644 index 00000000..01cc2e61 --- /dev/null +++ b/docs/sql-storage.rst @@ -0,0 +1,16 @@ +:orphan: + +.. _sql-storage: + +SQL storage +=========== + +Postgres DB schema +------------------ + +.. _swh-storage-db-schema: +.. figure:: ../sql/doc/sql/db-schema.svg + :width: 1024px + :align: center + + Postgres DB schema of high-level Software Heritage storage (click to zoom). diff --git a/mypy.ini b/mypy.ini new file mode 100644 index 00000000..99c0bcc6 --- /dev/null +++ b/mypy.ini @@ -0,0 +1,60 @@ +[mypy] +namespace_packages = True + +# due to the conditional import logic on swh.journal, in some cases a specific +# type: ignore is needed, in other it isn't... 
+warn_unused_ignores = False + +# support for sqlalchemy magic: see https://github.com/dropbox/sqlalchemy-stubs +plugins = sqlmypy + + +# 3rd party libraries without stubs (yet) + +[mypy-cassandra.*] +ignore_missing_imports = True + +[mypy-confluent_kafka.*] +ignore_missing_imports = True + +[mypy-deprecated.*] +ignore_missing_imports = True + +# only shipped indirectly via hypothesis +[mypy-django.*] +ignore_missing_imports = True + +[mypy-msgpack.*] +ignore_missing_imports = True + +[mypy-multiprocessing.util] +ignore_missing_imports = True + +[mypy-pkg_resources.*] +ignore_missing_imports = True + +[mypy-psycopg2.*] +ignore_missing_imports = True + +[mypy-pytest.*] +ignore_missing_imports = True + +[mypy-pytest_cov.*] +ignore_missing_imports = True + +[mypy-pytest_kafka.*] +ignore_missing_imports = True + +[mypy-systemd.daemon.*] +ignore_missing_imports = True + +[mypy-tenacity.*] +ignore_missing_imports = True + +# temporary work-around for landing typing support in spite of the current +# journal<->storage dependency loop +[mypy-swh.journal.*] +ignore_missing_imports = True + +[mypy-pytest_postgresql.*] +ignore_missing_imports = True diff --git a/setup.py b/setup.py index 1f37b14f..72480105 100755 --- a/setup.py +++ b/setup.py @@ -1,77 +1,79 @@ #!/usr/bin/env python3 -# Copyright (C) 2015-2018 The Software Heritage developers +# Copyright (C) 2015-2020 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from setuptools import setup, find_packages from os import path from io import open here = path.abspath(path.dirname(__file__)) # Get the long description from the README file with open(path.join(here, "README.md"), encoding="utf-8") as f: long_description = f.read() def parse_requirements(name=None): if name: reqf = "requirements-%s.txt" % name else: reqf = "requirements.txt" requirements = 
[] if not path.exists(reqf): return requirements with open(reqf) as f: for line in f.readlines(): line = line.strip() if not line or line.startswith("#"): continue requirements.append(line) return requirements setup( name="swh.storage", description="Software Heritage storage manager", long_description=long_description, long_description_content_type="text/markdown", python_requires=">=3.7", author="Software Heritage developers", author_email="swh-devel@inria.fr", url="https://forge.softwareheritage.org/diffusion/DSTO/", + setup_requires=["setuptools-scm"], packages=find_packages(), + use_scm_version=True, scripts=["bin/swh-storage-add-dir",], entry_points=""" [console_scripts] swh-storage=swh.storage.cli:main [swh.cli.subcommands] storage=swh.storage.cli:storage + [pytest11] + pytest_swh_storage=swh.storage.pytest_plugin """, - install_requires=parse_requirements() + parse_requirements("swh"), See the [documentation](https://docs.softwareheritage.org/devel/swh-storage/index.html) for more details. ## Quick start ### Dependencies Python tests for this module include tests that cannot be run without a local Postgresql database, so you need the Postgresql server executable on your machine (no need to have a running Postgresql server). They also expect a cassandra server. #### Debian-like host ``` $ sudo apt install libpq-dev postgresql-11 cassandra ``` #### Non Debian-like host The tests expects `/usr/sbin/cassandra` to exist. Optionally, you can avoid running the cassandra tests. ``` (swh) :~/swh-storage$ tox -- -m 'not cassandra' ``` ### Installation It is strongly recommended to use a virtualenv. In the following, we consider you work in a virtualenv named `swh`. See the [developer setup guide](https://docs.softwareheritage.org/devel/developer-setup.html#developer-setup) for a more details on how to setup a working environment. You can install the package directly from [pypi](https://pypi.org/p/swh.storage): ``` (swh) :~$ pip install swh.storage [...] 
``` Or from sources: ``` (swh) :~$ git clone https://forge.softwareheritage.org/source/swh-storage.git [...] (swh) :~$ cd swh-storage (swh) :~/swh-storage$ pip install . [...] ``` Then you can check it's properly installed: ``` (swh) :~$ swh storage --help Usage: swh storage [OPTIONS] COMMAND [ARGS]... Software Heritage Storage tools. Options: -h, --help Show this message and exit. Commands: rpc-serve Software Heritage Storage RPC server. ``` ## Tests The best way of running Python tests for this module is to use [tox](https://tox.readthedocs.io/). ``` (swh) :~$ pip install tox ``` ### tox From the sources directory, simply use tox: ``` (swh) :~/swh-storage$ tox [...] ========= 315 passed, 6 skipped, 15 warnings in 40.86 seconds ========== _______________________________ summary ________________________________ flake8: commands succeeded py3: commands succeeded congratulations :) ``` ## Development The storage server can be locally started. It requires a configuration file and a running Postgresql database. ### Sample configuration A typical configuration `storage.yml` file is: ``` storage: cls: local args: db: "dbname=softwareheritage-dev user= password=" objstorage: cls: pathslicing args: root: /tmp/swh-storage/ slicing: 0:2/2:4/4:6 ``` which means, this uses: - a local storage instance whose db connection is to `softwareheritage-dev` local instance, - the objstorage uses a local objstorage instance whose: - `root` path is /tmp/swh-storage, - slicing scheme is `0:2/2:4/4:6`. This means that the identifier of the content (sha1) which will be stored on disk at first level with the first 2 hex characters, the second level with the next 2 hex characters and the third level with the next 2 hex characters. And finally the complete hash file holding the raw content. For example: 00062f8bd330715c4f819373653d97b3cd34394c will be stored at 00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c Note that the `root` path should exist on disk before starting the server. 
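The slicing rule described above can be sketched in a few lines of Python. This is only an illustration of how a scheme such as `0:2/2:4/4:6` maps a hash to a path, not the actual objstorage implementation; `parse_slicing` and `sliced_path` are hypothetical helper names, and the sketch assumes every slice has explicit integer bounds.

```python
import posixpath


def parse_slicing(slicing):
    """Turn a scheme like '0:2/2:4/4:6' into [(0, 2), (2, 4), (4, 6)]."""
    bounds = []
    for part in slicing.split("/"):
        start, end = part.split(":")
        bounds.append((int(start), int(end)))
    return bounds


def sliced_path(root, hex_id, slicing="0:2/2:4/4:6"):
    """Build the on-disk path of a content from its hex identifier.

    Each slice of the identifier names one directory level; the file
    itself is named after the complete identifier.
    """
    levels = [hex_id[start:end] for start, end in parse_slicing(slicing)]
    return posixpath.join(root, *levels, hex_id)


path = sliced_path("/tmp/swh-storage/", "00062f8bd330715c4f819373653d97b3cd34394c")
```

With the sample configuration, `path` is the `root` followed by `00/06/2f/` and the full hash, which matches the example given above.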
### Starting the storage server If the python package has been properly installed (e.g. in a virtual env), you should be able to use the command: ``` (swh) :~/swh-storage$ swh storage rpc-serve storage.yml ``` This runs a local swh-storage api at 5002 port. ``` (swh) :~/swh-storage$ curl Software Heritage storage server

You have reached the Software Heritage storage server.
See its documentation and API for more information

``` ### And then what? In your upper layer ([loader-git](https://forge.softwareheritage.org/source/swh-loader-git/), [loader-svn](https://forge.softwareheritage.org/source/swh-loader-svn/), etc...), you can define a remote storage with this snippet of yaml configuration. ``` storage: cls: remote args: url: http://localhost:5002/ ``` You could directly define a local storage with the following snippet: ``` storage: cls: local args: db: service=swh-dev objstorage: cls: pathslicing args: root: /home/storage/swh-storage/ slicing: 0:2/2:4/4:6 ``` Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing Provides-Extra: schemata Provides-Extra: journal diff --git a/swh.storage.egg-info/SOURCES.txt 