diff --git a/docs/architecture/mirror.rst b/docs/architecture/mirror.rst
index 7885df3..03643d0 100644
--- a/docs/architecture/mirror.rst
+++ b/docs/architecture/mirror.rst
@@ -1,133 +1,133 @@
.. _mirror:

Mirroring
=========

Description
-----------

A mirror is a full copy of the |swh| archive, operated independently of the
Software Heritage initiative. A minimal mirror consists of two parts:

- the graph storage (typically an instance of :ref:`swh.storage `), which
  contains the Merkle DAG structure of the archive, *except* the actual
  content of source code files (AKA blobs),

- the object storage (typically an instance of :ref:`swh.objstorage `), which
  contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible by others. As such, a
proper mirror should also allow users to:

- navigate the archive copy using a Web browser and/or the Web API (typically
  using :ref:`the web application `),

- retrieve data from the copy of the archive (typically using :ref:`the vault
  service `).

A mirror is initially populated, and then kept up to date, by consuming data
from the |swh| Kafka-based :ref:`journal ` and retrieving the blob objects
(file contents) from the |swh| :ref:`object storage `.

.. note:: A mirror does not have to be deployed using the |swh| software
   stack. Other technologies, including different storage methods, can be
   used. This documentation, however, focuses on mirror deployments based on
   the |swh| software stack.

-.. thumbnail:: images/mirror-architecture.svg
+.. thumbnail:: ../images/mirror-architecture.svg

   General view of the |swh| mirroring architecture.

Mirroring the Graph Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The replication of the graph is based on a journal using Kafka_ as its event
streaming platform.

On the Software Heritage side, every addition made to the archive consists of
the addition of a :ref:`data-model` object. The new object is also serialized
as a msgpack_ bytestring, which is used as the value of a message added to a
Kafka topic dedicated to the object type.

The main Kafka topics for the |swh| :ref:`data-model` are:

- `swh.journal.objects.content`
- `swh.journal.objects.directory`
- `swh.journal.objects.metadata_authority`
- `swh.journal.objects.metadata_fetcher`
- `swh.journal.objects.origin_visit_status`
- `swh.journal.objects.origin_visit`
- `swh.journal.objects.origin`
- `swh.journal.objects.raw_extrinsic_metadata`
- `swh.journal.objects.release`
- `swh.journal.objects.revision`
- `swh.journal.objects.skipped_content`
- `swh.journal.objects.snapshot`

In order to set up a mirror of the graph, one needs to deploy a stack capable
of retrieving all these topics and storing their content reliably. For
example, a Kafka cluster configured as a replica of the main Kafka broker
hosted by |swh| would do the job (albeit not in a very useful manner by
itself).

A more useful mirror can be set up using the :ref:`storage ` component with
the help of the special service named `replayer` provided by the
-:doc:`apidoc/swh.storage.replay` module.
+:mod:`swh.storage.replay` module.

.. TODO: replace this previous link by a link to the 'swh storage replay'
   command once available, and ideally once
   https://github.com/sphinx-doc/sphinx/issues/880 is fixed

Mirroring the Object Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

File contents (blobs) are *not* directly stored in messages of the
`swh.journal.objects.content` Kafka topic, which only contains metadata about
them, such as various kinds of cryptographic hashes.
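For illustration, a decoded message from this topic looks roughly like the
following (the field set mirrors :py:class:`swh.model.model.Content`; the
hash values below are made-up placeholders, truncated for readability):

.. code:: python

    {
     'sha1': b'\x43\x6c...',        # made-up, truncated hash values
     'sha1_git': b'\x55\x66...',
     'sha256': b'\x9f\x20...',
     'blake2s256': b'\x01\xbb...',
     'length': 1432,                # size of the blob in bytes
     'status': 'visible'
    }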
A separate component is in charge of replicating blob objects from the archive
and storing them in the local object storage instance. A dedicated
`swh-journal` client should subscribe to the `swh.journal.objects.content`
topic to get the stream of blob object identifiers, retrieve the corresponding
blobs from the main Software Heritage object storage, and store them in the
local object storage. A reference implementation for this component is
available in the :ref:`content replayer `.

Installation
------------

When using the |swh| software stack to deploy a mirror, a number of |swh|
software components must be installed (cf. architecture diagram above):

- a database to store the graph of the |swh| archive,
- the :ref:`swh-storage` component,
- an object storage solution (can be cloud-based or on a local filesystem like
  ZFS pools),
- the :ref:`swh-objstorage` component,
-- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage`
+- the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
  package)
-- the :ref:`swh.objstorage.replayer.replay` service (from the
+- the :mod:`swh.objstorage.replayer.replay` service (from the
  :ref:`swh-objstorage-replayer` package).
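To make the role of the two replayer services more concrete, here is a
deliberately simplified sketch of the blob-replication loop performed by the
object storage replayer. This is *not* the reference implementation (use the
packaged services listed above); `journal_client`, `fetch_blob` and
`store_blob` are hypothetical stand-ins for the journal and object storage
clients:

.. code:: python

    def replay_content(journal_client, fetch_blob, store_blob):
        """Copy blobs referenced by the content topic into the local
        object storage.

        - journal_client: iterable of decoded messages from the
          `swh.journal.objects.content` topic (hypothetical helper)
        - fetch_blob: reads a blob from the main object storage, by hash
        - store_blob: writes a blob into the local object storage
        """
        for content in journal_client:
            # messages carry only metadata; the blob itself must be
            # fetched from the main object storage using its hash
            blob = fetch_blob(content["sha1"])
            store_blob(content["sha1"], blob)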
A `docker-swarm `_ based deployment solution is provided as a working example
of the mirror stack:

  https://forge.softwareheritage.org/source/swh-docker

It is strongly recommended to start from there before planning a
production-like deployment. See the `README `_ file of the `swh-docker `_
repository for details.

.. _Kafka: https://kafka.apache.org/
.. _msgpack: https://msgpack.org

diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst
index df8f8b7..d9000ee 100644
--- a/docs/architecture/overview.rst
+++ b/docs/architecture/overview.rst
@@ -1,275 +1,275 @@
.. _architecture-overview:

Software Architecture Overview
==============================

From an end-user point of view, the |swh| platform consists of the
:term:`archive`, which can be accessed using the web interface or its REST
API. Behind the scenes (and the web app) are several components/services that
expose different aspects of the |swh| :term:`archive` as internal RPC APIs.
Each of these internal APIs has a dedicated database, usually PostgreSQL_.

A global (and incomplete) view of this architecture looks like:

.. thumbnail:: ../images/general-architecture.svg

   General view of the |swh| architecture.

.. _architecture-tier-1:

Core components
---------------

The following components are the foundation of the entire |swh| architecture,
as they fetch data, store it, and make it available to every other service.

Data storage
^^^^^^^^^^^^

The :ref:`Storage ` provides an API to store and retrieve elements of the
:ref:`graph `, such as directory structure, revision history, and their
respective metadata. It relies on the :ref:`Object Storage ` service to store
the content of source code files themselves.

Both the Storage and Object Storage are designed as abstractions over
possible backends. The former supports both PostgreSQL (the current solution
in production) and Cassandra (a more scalable option we are exploring). The
latter supports a large variety of "cloud" object storage services as
backends, as well as a simple local filesystem.

Task management
^^^^^^^^^^^^^^^

The :ref:`Scheduler ` manages the entire choreography of jobs/tasks in |swh|,
from detecting and ingesting repositories, to extracting metadata from them,
to repackaging repositories into small downloadable archives. It does this by
managing its own database of tasks that need to run (either periodically or
only once), and passing them to celery_ for execution on dedicated workers.

Listers
^^^^^^^

:term:`Listers ` are a type of task, run by the Scheduler, aiming at scraping
a web site, a forge, etc. to gather all the source code repositories it can
find, also known as :term:`origins `. For each source code repository found,
a :term:`loader` task is created.

The following sequence diagram shows the interactions between these
components when a new forge needs to be archived. This example depicts the
case of a gitlab_ forge, but any other supported source type would be very
similar.

-.. thumbnail:: images/tasks-lister.svg
+.. thumbnail:: ../images/tasks-lister.svg

As one might observe in this diagram, the lister does two things:

- it asks the forge (a gitlab_ instance in this case) for the list of known
  repositories, and
- it inserts one :term:`loader` task for each source code repository; that
  task will be in charge of importing the content of the repository.

Note that most listers usually work in incremental mode, meaning they store
in a dedicated database the current state of the listing of the forge. On
subsequent executions, the lister then asks only for new repositories.

Also note that if the lister inserts a new loading task for a repository for
which a loading task already exists, the existing task will be updated (if
needed) instead of creating a new task.

Loaders
^^^^^^^

:term:`Loaders ` are also a type of task, but they aim at importing or
updating a source code repository. They are the ones that insert :term:`blob`
objects in the :term:`object storage`, and nodes and edges in the
:ref:`graph `.

The sequence diagram below describes this second step of importing the
content of a repository. Once again, we take the example of a git repository,
but any other type of repository would be very similar.

-.. thumbnail:: images/tasks-git-loader.svg
+.. thumbnail:: ../images/tasks-git-loader.svg

Journal
^^^^^^^

The last core component is the :term:`Journal `, which is a persistent logger
of every change in the archive, with publish-subscribe_ support, using Kafka.
The Storage writes to it every time a new object is added to the archive, and
many components read from it to be notified of these changes. For example, it
allows the Scheduler to know how often software repositories are updated by
their developers, to decide when next to visit these repositories. It is also
the foundation of the :ref:`mirror` infrastructure, as it allows mirrors to
stay up to date.

.. _architecture-tier-2:

Other major components
----------------------

All the components we saw above are critical to the |swh| archive, as they
are in charge of archiving source code. But they are not enough to provide
another important feature of |swh|: making this archive accessible and
searchable by anyone.

Archive website and API
^^^^^^^^^^^^^^^^^^^^^^^

First of all, the archive website and API, also known as :ref:`swh-web `, is
the main entry point of the archive.

This is the component that serves https://archive.softwareheritage.org/,
which is the window into the entire archive, as it provides access to it
through a web browser or the HTTP API.

It does so by querying most of the internal APIs of |swh|: the Data Storage
(to display source code repositories and their content), the Scheduler (to
allow manual scheduling of loader tasks through the `Save Code Now `_
feature), and many of the other services we will see below.
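For instance, anyone can query the archive programmatically through this HTTP
API. The snippet below is a small illustration only; see
https://archive.softwareheritage.org/api/ for the authoritative list of
endpoints and parameters:

.. code:: python

    import requests

    # search origins whose URL matches a pattern, through the public Web API
    resp = requests.get(
        "https://archive.softwareheritage.org/api/1/origin/search/gnu/",
        params={"limit": 5},
    )
    resp.raise_for_status()
    for origin in resp.json():
        print(origin["url"])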
Internal data mining
^^^^^^^^^^^^^^^^^^^^

:term:`Indexers ` are a type of task aimed at crawling the content of the
:term:`archive` to extract derived information. This ranges from detecting
the MIME type or license of individual files, to reading all types of
metadata files at the root of repositories and storing them together in a
unified format, CodeMeta_.

All results computed by Indexers are stored in a PostgreSQL database, the
Indexer Storage.

Vault
^^^^^

The :term:`Vault ` is an internal API, in charge of cooking compressed
archives (zip or tgz) of archived objects on request (via swh-web). These
compressed objects are typically directories or repositories. Since this can
be a rather long process, it is delegated to an asynchronous (celery) task,
through the Scheduler.

.. _architecture-tier-3:

Extra services
--------------

Finally, |swh| provides additional tools that, although not necessary to
operate the archive, provide convenient interfaces or performance benefits.
It is therefore possible to have a fully-functioning archive without any of
these services (our :ref:`development Docker environment ` disables most of
these by default).

Search
^^^^^^

The :ref:`swh-search ` service complements both the Storage and the Indexer
Storage, to provide efficient advanced reverse-index search queries, such as
full-text search on origin URLs and metadata. This service is a recent
addition to the |swh| architecture, based on Elasticsearch, and is currently
in use only for URL search.

Graph
^^^^^

:ref:`swh-graph ` is also a recent addition to the architecture, designed to
complement the Storage using a specialized backend. It leverages WebGraph_ to
store a compressed in-memory representation of the entire graph, and provides
fast implementations of graph traversal algorithms.

Counters
^^^^^^^^

The `archive's landing page `_ features counts of the total number of
files/directories/revisions/... in the archive. Perhaps surprisingly,
counting unique objects at |swh|'s scale is hard, and a performance
bottleneck when implemented purely in the Storage's SQL database.
:ref:`swh-counters ` provides an alternative design to solve this issue, by
reading new objects from the Journal and counting them using Redis_'
HyperLogLog_ feature; it keeps the history of these counters over time using
Prometheus_ (a toy sketch of the HyperLogLog idea is shown at the end of this
section).

Deposit
^^^^^^^

The :ref:`Deposit ` is an alternative way to add content to the archive.
While listers and loaders, as we saw above, **discover** repositories and
**pull** artifacts into the archive, the Deposit allows trusted partners to
**push** the content of their repository directly to the archive, and is
internally loaded by the :mod:`Deposit Loader `.

The Deposit is centered on the SWORDv2_ protocol, which allows depositing
archives (usually TAR or ZIP) along with metadata in XML.

The Deposit has its own HTTP interface, independent of swh-web. It also has
its own SWORD client, which is specialized to interact with the Deposit
server.

Authentication
^^^^^^^^^^^^^^

While the archive itself is public, |swh| reserves some features to
authenticated clients, such as higher rate limits, access to experimental
APIs (currently: the Graph service), or the Deposit. This is managed
centrally by :ref:`swh-auth ` using Keycloak.
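As promised above, here is a toy illustration of the HyperLogLog counting
idea used by swh-counters. It assumes a Redis server running locally and only
demonstrates the principle; it is not swh-counters' actual code:

.. code:: python

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    # Feed (possibly duplicated) object ids into a HyperLogLog. PFADD is
    # idempotent, so replaying the same journal message twice does not
    # inflate the counter.
    r.pfadd("counters:content", b"id-1", b"id-2", b"id-1")

    # PFCOUNT returns the approximate number of distinct ids, using a
    # small, bounded amount of memory regardless of cardinality.
    print(r.pfcount("counters:content"))  # -> 2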
Web Client, Fuse, Scanner
^^^^^^^^^^^^^^^^^^^^^^^^^

SWH provides a few tools to access the archive via the API:

* :ref:`swh-web-client`, a command-line interface to authenticate with SWH,
  and a library to access the API from Python programs
* :ref:`swh-fuse`, a Filesystem in USErspace implementation, that exposes the
  entire archive as a regular directory on your computer
* :ref:`swh-scanner`, a work-in-progress tool to check which of the files in
  a project are already in the archive, without submitting them

Replayers and backfillers
^^^^^^^^^^^^^^^^^^^^^^^^^

As the Journal and the various databases may get out of sync for various
reasons (scrub of either of them, migration, database addition, ...), and
because some databases need to follow the content of the Journal (mirrors),
some places in the |swh| codebase contain tools known as "replayers" and
"backfillers", designed to keep them in sync:

-* the :ref:`Object Storage Replayer ` copies the content
+* the :mod:`Object Storage Replayer ` copies the content
  of one object storage to another. It first performs a full copy, then
  streams new objects using the Journal to stay up to date
* the Storage Replayer loads the entire content of the Journal into a
  Storage database, and also keeps them in sync. This is used for mirrors,
  and when creating a new database.
* the Storage Backfiller, which does the opposite. This was initially used
  to populate the Journal from the database; it is occasionally used when one
  needs to clear a topic in the Journal and recreate it.

.. _celery: https://www.celeryproject.org
.. _CodeMeta: https://codemeta.github.io/
.. _gitlab: https://gitlab.com
.. _PostgreSQL: https://www.postgresql.org/
.. _Prometheus: https://prometheus.io/
.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
.. _Redis: https://redis.io/
.. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html
.. _HyperLogLog: https://redislabs.com/redis-best-practices/counting/hyperloglog/
.. _WebGraph: https://webgraph.di.unimi.it/

diff --git a/docs/archive-changelog.rst b/docs/archive-changelog.rst
index 115d583..509dc94 100644
--- a/docs/archive-changelog.rst
+++ b/docs/archive-changelog.rst
@@ -1,132 +1,132 @@
.. _archive-changelog:

Software Heritage --- Archive ChangeLog
=======================================

Below you can find a time-indexed list of notable events and changes to
archival policies in the Software Heritage Archive. Each of them might have
(had) an impact on how content is archived, and may explain apparent
statistical anomalies or other changes in archival behavior over time. They
are collected in this document for historical reasons.

2020
----

* **2020-10-06 - 2020-11-23:** source code crawlers were paused to avoid
  running out of disk space, due to an unexpected delay in the arrival of new
  storage hardware. Push archival (both deposit_ and `save code now`_)
  remained in operation.
  (tracking: `T2656 `_)
* **2020-09-15:** completed first archival of, and added to regular
  crawling, `GNU Guix System`_ (tracking: `T2594 `_)
* **2020-06-11:** completed integration with the IPOL_ journal, allowing
  paper authors to explicitly deposit_ source code to the archive
  (`announcement
-  `_)
+  `__)
* **2020-05-25:** completed first archival of, and added to regular
  crawling, NixOS_ (tracking: `T2411 `_)

2019
----

* **2019-09-10:** completed first archival of Bitbucket_ Git repositories
  and added Bitbucket as a regularly crawled forge (tracking: `T592 `_)
* **2019-06-30:** completed first archival of, and added to regular
  crawling, several GitLab_ instances: `0xacab.org `_, `framagit.org `_,
  `gite.lirmm.fr `_, `gitlab.common-lisp.net `_, `gitlab.freedesktop.org `_,
  `gitlab.gnome.org `_, `gitlab.inria.fr `_, `salsa.debian.org `_
* **2019-06-12:** completed first archival of CRAN_ packages and added CRAN
  as a regularly crawled package repository (tracking: `T1709 `_)
* **2019-06-11:** completed a full archival of GNU_ source code releases
  from `ftp.gnu.org`_, and added it to regular crawling (tracking:
  `T1722 `_)
* **2019-05-27:** completed a full archival of NPM_ packages and added it as
  a regularly crawled package repository (tracking: `T1378 `_)
* **2019-01-10:** enabled the `save code now`_ service, allowing users to
  explicitly request archival of a specific source code repository
  (`announcement
-  `_)
+  `__)

2018
----

* **2018-10-10:** completed first archival of PyPI_ packages and added PyPI
  as a regularly crawled package repository (`announcement
-  `_)
+  `__)
* **2018-09-25:** completed integration with HAL_, allowing paper authors to
  explicitly deposit_ source code to the archive (`announcement
-  `_)
+  `__)
* **2018-08-31:** completed first archival of public GitLab_ repositories
  from `gitlab.com `_ and added it as a regularly crawled forge (tracking:
  `T1111 `_)
* **2018-03-21:** completed archival of `Google Code`_ Mercurial
  repositories (tracking: `T682 `_)
* **2018-02-20:** completed archival of Debian_ packages and added Debian as
  a regularly crawled distribution (`announcement
-  `_)
+  `__)

2017
----

* **2017-10-02:** completed archival of `Google Code`_ Subversion
  repositories (tracking: `T617 `_)
* **2017-06-06:** completed archival of `Google Code`_ Git repositories
  (tracking: `T673 `_)

2016
----

* **2016-04-04:** completed archival of Gitorious_ (tracking: `T312 `_)

2015
----

* **2015-11-06:** archived all GNU_ source code releases from `ftp.gnu.org`_
  (tracking: `T90 `_)
* **2015-07-28:** started archiving public GitHub_ repositories

.. _Bitbucket: https://bitbucket.org
.. _CRAN: https://cran.r-project.org
.. _Debian: https://www.debian.org
.. _GNU Guix System: https://guix.gnu.org/
.. _GNU: https://www.gnu.org/
.. _GitHub: https://github.com
.. _GitLab: https://gitlab.com
.. _Gitorious: https://en.wikipedia.org/wiki/Gitorious
.. _Google Code: https://en.wikipedia.org/wiki/Google_Code
.. _HAL: https://hal.archives-ouvertes.fr
.. _IPOL: http://www.ipol.im
.. _NPM: https://www.npmjs.com
.. _NixOS: https://nixos.org/
.. _PyPI: https://pypi.org
.. _deposit: https://deposit.softwareheritage.org
.. _ftp.gnu.org: http://ftp.gnu.org
.. _save code now: https://save.softwareheritage.org

diff --git a/docs/contributing/code-review.rst b/docs/contributing/code-review.rst
index 94f1ec6..9d85ba7 100644
--- a/docs/contributing/code-review.rst
+++ b/docs/contributing/code-review.rst
@@ -1,52 +1,52 @@
.. _code-review:

Code Review
===========

This page documents code review practices used for Software Heritage
development.

Guidelines
----------

Please adhere to the following guidelines to perform and obtain code reviews
(CRs) in the context of Software Heritage development:

1. **CRs are strongly recommended** for any non-trivial code change, but not
   mandatory (nor enforced at the VCS level).
2. The CR :ref:`workflow ` is implemented using Phabricator/Differential.
3. Explicitly **suggest reviewer(s)** when submitting new CR requests: either
   the most knowledgeable person(s) for the target code or the general
   `reviewers `_ (which is the `default `_).
4. **Review anything you want**: no matter the suggested reviewer(s), feel
   free to review any outstanding CR.
5. **One LGTM is enough**: feel free to approve any outstanding CR.
6. **Review every day**: CRs should be timely, as fellow developers will be
   waiting for them. To make CRs sustainable, each developer should strive to
   dedicate a fixed minimum amount of CR time every (work) day.

For more detailed suggestions (and much more) on the motivational and
practical aspects of code reviews, see Good reads below.

Good reads
----------

Good reads on various angles of code review:

-* `Best practices `_ (Palantir) ← comprehensive and recommended read, especially if you're short on time
-* `Best practices `_ (Thoughtbot)
-* `Best practices `_ (Smart Bear)
+* `Best practices (Palantir) `_ ← comprehensive and recommended read, especially if you're short on time
+* `Best practices (Thoughtbot) `_
+* `Best practices (Smart Bear) `_
* `Review checklist `_ (Code Project)
* `Motivation: code quality `_ (Coding Horror)
* `Motivation: team culture `_ (Google & FullStory)
* `Motivation: humanizing peer reviews `_ (Wiegers)
* `Motivation: sharing knowledge `_ (Atlassian)

See also
--------

* :ref:`patch-submission`
* :ref:`python-style-guide`
* :ref:`git-style-guide`

diff --git a/docs/journal.rst b/docs/journal.rst
index 25746c4..f8db95a 100644
--- a/docs/journal.rst
+++ b/docs/journal.rst
@@ -1,673 +1,673 @@
.. _journal-specs:

Software Heritage Journal --- Specifications
============================================

The |swh| journal is a Kafka_-based stream of events for every added object
in the |swh| Archive and some of its related services, especially indexers.

Each topic_ will stream added elements for a given object type according to
the topic name.

Objects streamed in a topic are serialized versions of objects stored in the
|swh| Archive, as specified by the main |swh| :py:mod:`data model ` or the
:py:mod:`indexer object model `.

In this document we describe the expected messages in each topic, so a
potential consumer can easily cope with the |swh| journal without having to
read the source code or the |swh| :ref:`data model ` in detail (it is however
recommended to familiarize yourself with the latter).

Kafka message values are dictionary structures serialized as msgpack_, with a
few custom encodings. See the section `Kafka message format`_ below for a
complete description of the serialization format. Note that each example
given below shows the dictionary before it is serialized as a msgpack_ chunk.

Topics
------

There are several groups of topics:

- main storage Merkle-DAG related topics,
- other storage objects (not part of the Merkle DAG),
- indexer related objects (not yet documented below).

Topic prefixes can be either `swh.journal.objects` or
`swh.journal.objects_privileged` (see below).
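As a concrete starting point, the sketch below consumes one of these topics
with the plain confluent-kafka client and decodes the message values with
msgpack. The broker address and group id are placeholders, and a real
consumer would rather use the `swh.journal` client, which also handles the
custom encodings described in `Kafka message format`_:

.. code:: python

    import msgpack
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "broker.example.org:9092",  # placeholder
        "group.id": "my-consumer-group",                 # placeholder
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["swh.journal.objects.origin"])

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # message values are msgpack-encoded dictionaries
        origin = msgpack.unpackb(msg.value(), raw=False)
        print(origin["url"])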
Anonymized topics
+++++++++++++++++

For topics that transport messages with user information (name and email
address), namely `swh.journal.objects.release`_ and
`swh.journal.objects.revision`_, there are two versions: an anonymized topic,
in which user information is obfuscated, and a pristine version with clear
data. Access to pristine topics depends on ACLs linked to the credentials
used to connect to the Kafka cluster.

List of topics
++++++++++++++

- `swh.journal.objects.origin`_
- `swh.journal.objects.origin_visit`_
- `swh.journal.objects.origin_visit_status`_
- `swh.journal.objects.snapshot`_
- `swh.journal.objects.release`_
- `swh.journal.objects.privileged_release `_
- `swh.journal.objects.revision`_
- `swh.journal.objects.privileged_revision `_
- `swh.journal.objects.directory`_
- `swh.journal.objects.content`_
-- `swh.journal.objects.skippedcontent`_
+- `swh.journal.objects.skipped_content`_
- `swh.journal.objects.metadata_authority`_
- `swh.journal.objects.metadata_fetcher`_
- `swh.journal.objects.raw_extrinsic_metadata`_

Topics for Merkle-DAG objects
-----------------------------

These topics are for the various objects stored in the |swh| Merkle DAG; see
the :ref:`data model ` for more details.

`swh.journal.objects.snapshot`
++++++++++++++++++++++++++++++

Topic for :py:class:`swh.model.model.Snapshot` objects.

Message format:

- `branches` [dict] branches present in this snapshot,
- `id` [bytes] the intrinsic identifier of the
  :py:class:`swh.model.model.Snapshot` object

with `branches` being a dictionary whose keys are branch names [bytes], and
whose values are dictionaries of:

- `target` [bytes] intrinsic identifier of the targeted object
- `target_type` [string] the type of the targeted object (can be "content",
  "directory", "revision", "release", "snapshot" or "alias").

Example:

.. code:: python

    {
     'branches': {
       b'refs/pull/1/head': {
         'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c',
         'target_type': 'revision'
       },
       b'refs/pull/2/head': {
         'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ',
         'target_type': 'revision'
       },
       b'refs/heads/master': {
         'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT',
         'target_type': 'revision'
       },
       b'HEAD': {
         'target': b'refs/heads/master',
         'target_type': 'alias'
       }
     },
     'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^'
    }

`swh.journal.objects.release`
+++++++++++++++++++++++++++++

Topic for :py:class:`swh.model.model.Release` objects.

This topic is anonymized. The non-anonymized version of this topic is
`swh.journal.objects_privileged.release`.

Message format:

- `name` [bytes] name (typically the version) of the release
- `message` [bytes] message of the release
- `target` [bytes] identifier of the target object
- `target_type` [string] type of the target, can be "content", "directory",
  "revision", "release" or "snapshot"
- `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object
  has been forged by the loading process; this flag is not used for the id
  computation,
- `author` [dict] the author of the release
- `date` [gitdate] the date of the release
- `id` [bytes] the intrinsic identifier of the
  :py:class:`swh.model.model.Release` object

Example:
.. code:: python

    {
     'name': b'0.3',
     'message': b'',
     'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d',
     'target_type': 'revision',
     'synthetic': False,
     'author': {
       'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9',
       'name': None,
       'email': None
     },
     'date': {
       'timestamp': {
         'seconds': 1480432642,
         'microseconds': 0
       },
       'offset': 180,
       'negative_utc': False
     },
     'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86'
    }

`swh.journal.objects.revision`
++++++++++++++++++++++++++++++

Topic for :py:class:`swh.model.model.Revision` objects.

This topic is anonymized. The non-anonymized version of this topic is
`swh.journal.objects_privileged.revision`.

Message format:

- `message` [bytes] the commit message for the revision
- `author` [dict] the author of the revision
- `committer` [dict] the committer of the revision
- `date` [gitdate] the revision date
- `committer_date` [gitdate] the revision commit date
- `type` [string] the type of the revision (can be "git", "tar", "dsc",
  "svn", "hg")
- `directory` [bytes] the intrinsic identifier of the directory this revision
  links to
- `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is
  synthetic or not,
- `metadata` [bytes] the metadata linked to this
  :py:class:`swh.model.model.Revision` (not part of the intrinsic identifier
  computation),
- `parents` [list[bytes]] list of parent
  :py:class:`swh.model.model.Revision` intrinsic identifiers
- `id` [bytes] intrinsic identifier of the
  :py:class:`swh.model.model.Revision`
- `extra_headers` [list[(bytes, bytes)]] TODO

Example:

.. code:: python

    {
     'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. lufylib.red is a place for support code.\n',
     'author': {
       'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
       'name': None,
       'email': None
     },
     'committer': {
       'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
       'name': None,
       'email': None
     },
     'date': {
       'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
       'offset': 0,
       'negative_utc': False
     },
     'committer_date': {
       'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
       'offset': 0,
       'negative_utc': False
     },
     'type': 'svn',
     'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe',
     'synthetic': True,
     'metadata': None,
     'parents': [
       b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c',
     ],
     'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8'
    }

`swh.journal.objects.directory`
+++++++++++++++++++++++++++++++

Topic for :py:class:`swh.model.model.Directory` objects.

Message format:

- `entries` [list[dict]] the entries of the directory, each one being a
  dictionary of:

  - `name` [bytes] name of the entry
  - `type` [string] type of the entry (can be "file", "dir" or "rev")
  - `target` [bytes] intrinsic identifier of the object the entry points to
  - `perms` [int] permissions of the entry

- `id` [bytes] the intrinsic identifier of the
  :py:class:`swh.model.model.Directory` object

Example (the first entry's `name` and `target` are elided):

.. code:: python

    {
     'entries': [
       {'name': ...,
        'type': 'file',
        'target': ...,
        'perms': 33188},
       {'name': b'lib',
        'type': 'dir',
        'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U',
        'perms': 16384},
       {'name': b'package.json',
        'type': 'file',
        'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x',
        'perms': 33188}
     ],
     'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P'
    }

Other Objects Topics
--------------------

These topics are for objects of the |swh| archive that are not part of the
Merkle DAG but are essential parts of the archive; see the :ref:`data model `
for more details.

`swh.journal.objects.origin`
++++++++++++++++++++++++++++

Topic for :py:class:`swh.model.model.Origin` objects.

Message format:

- `url` [string] URL of the :py:class:`swh.model.model.Origin`

Example:
code:: python { "url": "https://github.com/vujkovicm/pml" } `swh.journal.objects.origin_visit` ++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisit` objects. Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `date` [timestamp] date of the visit - `type` [string] type of the loader used to perform the visit - `visit` [int] number of the visit for this `origin` Example: .. code:: python { 'origin': 'https://pypi.org/project/wasp-eureka/', 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'type': 'pypi', 'visit': 505} } `swh.journal.objects.origin_visit_status` +++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisitStatus` objects. Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `visit` [int] number of the visit for this `origin` this status concerns - `date` [timestamp] date of the visit status update - `status` [string] status (can be "created", "ongoing", "full" or "partial"), - `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snaphot` this visit resulted in (if `status` is "full" or "partial") - `metadata`: deprecated Example: .. code:: python { 'origin': 'https://pypi.org/project/stricttype/', 'visit': 524, 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'status': 'full', 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7", 'metadata': None } Extrinsic Metadata related Topics --------------------------------- Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. See :ref:`extrinsic-metadata-specification` for more details on the Extrinsic Metadata model. `swh.journal.objects.metadata_authority` ++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataAuthority` objects. Message format: - `type` [string] - `url` [string] - `metadata` [dict] Examples: .. code:: python { 'type': 'forge', 'url': 'https://guix.gnu.org/sources.json', 'metadata': {} } { 'type': 'deposit_client', 'url': 'https://www.softwareheritage.org', 'metadata': {'name': 'swh'} } `swh.journal.objects.metadata_fetcher` ++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataFetcher` objects. Message format: - `type` [string] - `version` [string] - `metadata` [dict] Example: .. code:: python { 'name': 'swh.loader.package.cran.loader.CRANLoader', 'version': '0.15.0', 'metadata': {} } `swh.journal.objects.raw_extrinsic_metadata` ++++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects. Message format: - `type` [string] - `target` [string] - `discovery_date` [timestamp] - `authority` [dict] - `fetcher` [dict] - `format` [string] - `metadata` [bytes] - `origin` [string] - `visit` [int] - `snapshot` [SWHID] - `release` [SWHID] - `revision` [SWHID] - `path` [bytes] - `directory` [SWHID] Example: .. 
.. code:: python

    {
     'type': 'snapshot',
     'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3',
     'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954),
     'authority': {
       'type': 'forge',
       'url': 'https://pypi.org/',
       'metadata': {}
     },
     'fetcher': {
       'name': 'swh.loader.package.pypi.loader.PyPILoader',
       'version': '0.10.0',
       'metadata': {}
     },
     'format': 'pypi-project-json',
     'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}',
     'origin': 'https://pypi.org/project/schwurbler/'
    }

Kafka message format
--------------------

Each value of a Kafka message in a topic is a dictionary-like structure
encoded as a msgpack_ byte string. Keys are ASCII strings.

All values are encoded using the default msgpack type system, except for long
integers, for which we use a custom format based on msgpack's `extended
type`_ to prevent overflows while packing some objects.

Integer
+++++++

For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a
custom `extended type`_ based encoding scheme is used. The `type` information
can be:

- `1` for positive (possibly long) integers,
- `2` for negative (possibly long) integers.

The payload is simply the bytes (big endian) representation of the absolute
value (always positive).

For example (adapted to standard integers for the sake of readability; these
values are small, so they will actually be encoded using the default msgpack
format for integers):

- `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka
  `0xd5013039`)
- `-42` would be encoded as the extension value `[2, [0x2A]]` (aka
  `0xd4022a`)

Datetime
++++++++

There are two types of dates that can be encoded in a Kafka message:

- dates for git-like objects (:py:class:`swh.model.model.Revision` and
  :py:class:`swh.model.model.Release`): these dates are part of the hash
  computation used as identifier in the Merkle DAG. In order to fully support
  git repositories, a custom encoding is required. These dates (coming from
  the git data model) are encoded as a dictionary with:

  - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2
    keys (`seconds` and `microseconds`)
  - `offset` [int] offset of the date (in minutes)
  - `negative_utc` [bool] only True for the very edge case where the date has
    a zero but negative offset value (which does not make much sense, but the
    git format technically permits it)

  Example:

  .. code:: python

      {
       'timestamp': {'seconds': 1480432642, 'microseconds': 0},
       'offset': 180,
       'negative_utc': False
      }

  These are denoted as `gitdate` below.

- other dates (resulting from the |swh| processing stack) are encoded using
  msgpack's Timestamp_ extended type. These are denoted as `timestamp` below.

  Note that these dates used to be encoded as a dictionary (beware: keys are
  bytes):

  .. code:: python

      {
       b"swhtype": "datetime",
       b"d": '2020-09-15T16:19:13.037809+00:00'
      }

Person
++++++

:py:class:`swh.model.model.Person` objects represent a person in the |swh|
Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or
committer, or a :py:class:`swh.model.model.Release` author.

:py:class:`swh.model.model.Person` objects are serialized as a dictionary
like:

.. code:: python

    {
     'fullname': 'John Doe <john.doe@example.com>',
     'name': 'John Doe',
     'email': 'john.doe@example.com'
    }

For anonymized topics, :py:class:`swh.model.model.Person` entities have been
anonymized prior to being serialized. The anonymized
:py:class:`swh.model.model.Person` object is a dictionary like:

.. code:: python

    {
     'fullname': <anonymized fullname>,
     'name': None,
     'email': None
    }

where `<anonymized fullname>` is computed from the original values as the
sha256 of the original's `fullname`.
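The encodings above can be checked with a few lines of Python using the
msgpack library; the sketch below decodes the two example values from the
Integer section and illustrates the anonymization rule for `fullname` (the
helper name `swh_ext_hook` is ours):

.. code:: python

    import hashlib

    import msgpack

    def swh_ext_hook(code, data):
        """Decode the journal's custom extended types for long integers."""
        if code == 1:   # positive (possibly long) integer
            return int.from_bytes(data, "big")
        if code == 2:   # negative (possibly long) integer
            return -int.from_bytes(data, "big")
        return msgpack.ExtType(code, data)  # leave unknown types untouched

    # the examples from the Integer section: 0xd5013039 and 0xd4022a
    assert msgpack.unpackb(b"\xd5\x01\x30\x39", ext_hook=swh_ext_hook) == 12345
    assert msgpack.unpackb(b"\xd4\x02\x2a", ext_hook=swh_ext_hook) == -42

    # the anonymized fullname is the sha256 digest of the original fullname
    anonymized = hashlib.sha256(b"John Doe <john.doe@example.com>").digest()
    assert len(anonymized) == 32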
.. _Kafka: https://kafka.apache.org
.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms
.. _msgpack: https://msgpack.org/
.. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
.. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type

diff --git a/requirements-swh-dev.txt b/requirements-swh-dev.txt
index 8347359..3c06a0e 100644
--- a/requirements-swh-dev.txt
+++ b/requirements-swh-dev.txt
@@ -1,30 +1,30 @@
# Add here internal Software Heritage dependencies, one per line.
# Dependencies need to be ordered in a way that ensures only
# development versions will be used (not the release ones hosted on PyPI).
#
# This is NOT in alphabetical order
../swh-core[http,db,logging]
../swh-auth[django]
../swh-model
../swh-journal
../swh-counters
../swh-objstorage[testing]
../swh-storage
../swh-objstorage-replayer
-../swh-scheduler
+../swh-scheduler[simulator]
../swh-deposit
../swh-graph
../swh-icinga-plugins
../swh-indexer
../swh-lister
../swh-loader-core
../swh-loader-git
../swh-loader-mercurial
../swh-loader-svn
../swh-search
../swh-vault
../swh-web
../swh-web-client
../swh-scanner
../swh-fuse

diff --git a/requirements-swh.txt b/requirements-swh.txt
index fe0d7ba..a693bc1 100644
--- a/requirements-swh.txt
+++ b/requirements-swh.txt
@@ -1,24 +1,24 @@
# Add here internal Software Heritage dependencies, one per line.
swh.auth[django]
swh.core[db,http]
swh.counters
swh.deposit[server]
swh.fuse
swh.graph
swh.indexer
swh.journal
swh.lister
swh.loader.core
swh.loader.git
swh.loader.mercurial
swh.loader.svn
swh.model
swh.objstorage[testing]
swh.objstorage.replayer
swh.scanner
-swh.scheduler
+swh.scheduler[simulator]
swh.search
swh.storage
swh.vault
swh.web
swh.web.client

diff --git a/requirements.txt b/requirements.txt
index 4f5324f..33e22ae 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,12 +1,13 @@
# Add here external Python modules dependencies, one per line. Module names
# should match https://pypi.python.org/pypi names. For the full spec of
# dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html
sphinx
sphinxcontrib-httpdomain
sphinxcontrib-images
sphinxcontrib-programoutput
sphinx-tabs
sphinx-reredirects
sphinx_rtd_theme
sphinx-click
myst-parser
+sphinx-celery

diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py
index 892ac56..9314acb 100755
--- a/swh/docs/sphinx/conf.py
+++ b/swh/docs/sphinx/conf.py
@@ -1,182 +1,186 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
import os
from typing import Dict

import django

# General information about the project.
project = "Software Heritage - Development Documentation"
copyright = "2015-2021 The Software Heritage developers"
author = "The Software Heritage developers"

# -- General configuration ------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "sphinxcontrib.httpdomain",
    "sphinx.ext.extlinks",
    "sphinxcontrib.images",
    "sphinxcontrib.programoutput",
    "sphinx.ext.viewcode",
    "sphinx_tabs.tabs",
    "sphinx_rtd_theme",
    "sphinx.ext.graphviz",
    "sphinx_click.ext",
    "myst_parser",
    "sphinx.ext.todo",
    "sphinx_reredirects",
    "swh.docs.sphinx.view_in_phabricator",
+
+    # swh.scheduler inherits some attribute descriptions from celery that use
+    # custom crossrefs (eg. :setting:`task_ignore_result`)
+    "sphinx_celery.setting_crossref",
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# The suffix(es) of source filenames.
# You can specify multiple suffixes as a list of strings:
source_suffix = ".rst"

# The master toctree document.
master_doc = "index"

# A string of reStructuredText that will be included at the beginning of every
# source file that is read.
# A bit hackish but should work both for each swh package and the whole swh-doc
rst_prolog = """
.. include:: /../../swh-docs/docs/swh_substitutions
"""

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = ""
# The full version, including alpha/beta/rc tags.
release = ""

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path
exclude_patterns = ["_build", "swh-icinga-plugins/index.rst"]

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"

html_favicon = "_static/favicon.ico"

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
    "collapse_navigation": True,
    "sticky_navigation": True,
}

html_logo = "_static/software-heritage-logo-title-motto-vertical-white.png"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

# make logo actually appear, avoiding gotcha due to alabaster default conf.
# https://github.com/bitprophet/alabaster/issues/97#issuecomment-303722935
html_sidebars = {
    "**": [
        "about.html",
        "globaltoc.html",
        "relations.html",
        "sourcelink.html",
        "searchbox.html",
    ]
}

# If not None, a 'Last updated on:' timestamp is inserted at every page
# bottom, using the given strftime format.
# The empty string is equivalent to '%b %d, %Y'.
html_last_updated_fmt = "%Y-%m-%d %H:%M:%S %Z"

# refer to the Python standard library.
intersphinx_mapping = {"python": ("https://docs.python.org/3", None)}

# Redirects for pages that were moved, so we don't break external links.
# Uses sphinx-reredirects
redirects = {
    "swh-deposit/spec-api": "api/api-documentation.html",
    "swh-deposit/metadata": "api/metadata.html",
    "swh-deposit/specs/blueprint": "../api/use-cases.html",
    "swh-deposit/user-manual": "api/user-manual.html",
    "architecture": "architecture/overview.html",
    "mirror": "architecture/mirror.html",
}

# -- autodoc configuration ----------------------------------------------
autodoc_default_flags = [
    "members",
    "undoc-members",
    "private-members",
    "special-members",
]
autodoc_member_order = "bysource"
autodoc_mock_imports = ["rados"]
modindex_common_prefix = ["swh."]

# For the todo extension. Todo and todolist produce output only if this is True
todo_include_todos = True

# for the extlinks extension, sub-projects should fill that dict
extlinks: Dict = {}


# XXX Kill this as soon as this PR is accepted and released
# https://github.com/sphinx-contrib/httpdomain/pull/19
def register_routingtable_as_label(app, document):
    from sphinx.locale import _  # noqa

    labels = app.env.domaindata["std"]["labels"]
    labels["routingtable"] = "http-routingtable", "", _("HTTP Routing Table")
    anonlabels = app.env.domaindata["std"]["anonlabels"]
    anonlabels["routingtable"] = "http-routingtable", ""


# hack to set the adequate django settings when building global swh doc
# to avoid autodoc build errors
def setup(app):
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "swh.docs.django_settings")
    django.setup()

    from distutils.version import StrictVersion  # noqa

    import pkg_resources  # noqa

    httpdomain = pkg_resources.get_distribution("sphinxcontrib-httpdomain")
    if StrictVersion(httpdomain.version) <= StrictVersion("1.7.0"):
        app.connect("doctree-read", register_routingtable_as_label)