diff --git a/docs/contributing/tutorial-docs-contribution.rst b/docs/contributing/tutorial-docs-contribution.rst index 1d2347b..fbd593b 100644 --- a/docs/contributing/tutorial-docs-contribution.rst +++ b/docs/contributing/tutorial-docs-contribution.rst @@ -1,192 +1,192 @@ .. _doc-contribution: Tutorial: Best practices when writing SWH docs ============================================== .. admonition:: Intended audience :class: important Members of the Software Heritage staff and external contributors who wish to contribute by writing documentation. - + A tutorial on how to contribute documentation to the Software Heritage world. Step 1: Identify your audience ------------------------------ #. Ask yourself: Who are the readers of the documentation that you are writing? In the Software Heritage community, three general types of personas are distinguished: * **visitors**: people who want to know what the SWH initiative and archive are * **users**: people who want to use the SWH features * as a service * as a module by running a local instance * **contributors**: people who are contributing to SWH (either external or SWH staff) * as developers * as sys-admins * in a support role #. Use the persona type to determine the document location in step 2 #. Add the intended audience at the top of the page Step 2: Determine the documentation location -------------------------------------------- Information should have a permanent home as documentation. Elements that are work in progress can live in the forge on issues or in hedgedoc, but these are not permanent locations. #. Choose a high-level location: Possible permanent locations include: * The WordPress website: for visitors * The archive web-app: for visitors and users (of the interface or API) * The Sphinx docs: * *devel* for contributors * *users* for users of the infrastructure and all the different services * *sysadm* for sys-admins #. For contributor documentation in devel: #. 
Decide whether the subject is a high-level (cross-module) section or belongs to a specific module * if the document relates to only one module, add it in the */docs* directory of that module * for cross-module documentation, use the swh-docs repository and the appropriate sub-directory (e.g. architecture) #. Decide if a subsection is needed with multiple pages (tutorials, how-tos, reference or explanation). #. For sys-admin (in */sysadm* folder) and user documentation (in */users* folder): #. Check whether an existing section already describes the topic that you want to document. #. Decide if a subsection is needed with multiple pages (tutorials, how-tos, reference or explanation). Step 3: Choose documentation type --------------------------------- We follow Divio's approach, which distinguishes four major types of documentation: * Tutorial: allows newcomers to get started and eases the onboarding of contributors and users. * How to: a practical step-by-step manual that solves a specific problem. * Reference: information-oriented theoretical/technical knowledge. * Explanation: understanding-oriented theoretical knowledge that analyzes, discusses and explains different decisions, including background and context. For more information see `the divio documentation `_ and/or `Daniele Procida's presentation `_ .. note:: We propose the following naming scheme, depending on the type of document: * Tutorial: [Tutorial name] * How to ... * Reference: [Reference name] * Explanation: [Explanation name] Step 4: Create a page or sub-section with multiple pages -------------------------------------------------------- #. Create a *.rst* file with a short name of your doc in the appropriate directory (see step 2). If this is a sub-section, the first file should be an *index.rst* file containing the list of the current sub-section files. #. For a page that is not ready yet, you can simply create an empty page using the template below. 
The template starts with a reference, so that you can link to this new page from elsewhere. The page name should follow the naming scheme from step 3. #. For an existing page, you can link the new page with the existing one containing the desired information. Empty page template ^^^^^^^^^^^^^^^^^^^ .. code-block:: rst .. _empty_page: Empty page ========== .. admonition:: Intended audience :class: important add the audience target(s) of this page - + .. todo:: This page is a work in progress. For now, please refer to the `existing documentation `_. Empty subsection template ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: rst .. _empty_subsection: Empty subsection ================ .. toctree:: :titlesonly: tutorial-my-first-tuto howto-do-things howto-test-stuff howto-dance reference-info reference-best-practices README in module ^^^^^^^^^^^^^^^^ We want to reduce redundancy in documentation as much as possible. The option we should strive for is adding a symlink to docs/README.rst in the repo's module. Furthermore, docs/index.rst should include docs/README.rst, as follows: .. code-block:: rst .. _swh-fuse: .. include:: README.rst .. toctree:: :maxdepth: 1 :caption: Overview cli configuration Design notes Tutorial Step 5: Add link to page/sub-section from an index.rst ------------------------------------------------------ Add the file name to the menu of the parent index.rst Step 6: Commit change for code review ------------------------------------- You should open a diff for a documentation change following the instructions in :ref:`code-review` diff --git a/docs/journal.rst b/docs/journal.rst index c5b02a1..83e554a 100644 --- a/docs/journal.rst +++ b/docs/journal.rst @@ -1,673 +1,673 @@ .. _journal-specs: Journal Specification ===================== The |swh| journal is a Kafka_-based stream of events for every added object in the |swh| Archive and some of its related services, especially indexers. 
Each topic_ streams the added elements of a given object type, as indicated by the topic name. Objects streamed in a topic are serialized versions of objects stored in the |swh| Archive, as specified by the main |swh| :py:mod:`data model ` or the :py:mod:`indexer object model `. This document describes the messages expected in each topic, so that a potential consumer can easily cope with the |swh| journal without having to read the source code or the |swh| :ref:`data model ` in detail (it is however recommended to familiarize yourself with the latter). Kafka message values are dictionary structures serialized as msgpack_, with a few custom encodings. See the section `Kafka message format`_ below for a complete description of the serialization format. Note that each example given below shows the dictionary before it is serialized as a msgpack_ chunk. Topics ------ There are several groups of topics: - main storage Merkle-DAG related topics, - other storage objects (not part of the Merkle DAG), - indexer related objects (not yet documented below). The topic prefix can be either `swh.journal.objects` or `swh.journal.objects_privileged` (see below). Anonymized topics +++++++++++++++++ Topics that transport messages with user information (name and email address), namely `swh.journal.objects.release`_ and `swh.journal.objects.revision`_, exist in 2 versions: an anonymized topic, in which user information is obfuscated, and a pristine version with clear data. Access to pristine topics depends on ACLs linked to the credentials used to connect to the Kafka cluster. 
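The two prefixes and the anonymized/pristine split described above can be summarized with a small helper. This is only a sketch: `topic_name` and the constants are hypothetical names introduced for illustration, not part of any |swh| package, and refusing a privileged topic for object types other than release and revision is an assumption drawn from the paragraph above.

```python
# Hypothetical helper, not part of any swh package.
ANONYMIZED_PREFIX = "swh.journal.objects"
PRIVILEGED_PREFIX = "swh.journal.objects_privileged"

# Object types whose messages carry user information (name and email),
# according to the "Anonymized topics" paragraph.
TYPES_WITH_USER_INFO = frozenset({"release", "revision"})


def topic_name(object_type: str, privileged: bool = False) -> str:
    """Build the topic name for an object type.

    Assumption of this sketch: a privileged (pristine) topic only
    exists for the object types that transport user information.
    """
    if privileged and object_type not in TYPES_WITH_USER_INFO:
        raise ValueError(f"no privileged topic for {object_type!r}")
    prefix = PRIVILEGED_PREFIX if privileged else ANONYMIZED_PREFIX
    return f"{prefix}.{object_type}"
```

For instance, `topic_name("revision", privileged=True)` yields `swh.journal.objects_privileged.revision`; whether a consumer can actually read that topic still depends on the ACLs attached to its credentials.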
List of topics ++++++++++++++ - `swh.journal.objects.origin`_ - `swh.journal.objects.origin_visit`_ - `swh.journal.objects.origin_visit_status`_ - `swh.journal.objects.snapshot`_ - `swh.journal.objects.release`_ - `swh.journal.objects.privileged_release `_ - `swh.journal.objects.revision`_ - `swh.journal.objects.privileged_revision `_ - `swh.journal.objects.directory`_ - `swh.journal.objects.content`_ - `swh.journal.objects.skipped_content`_ - `swh.journal.objects.metadata_authority`_ - `swh.journal.objects.metadata_fetcher`_ - `swh.journal.objects.raw_extrinsic_metadata`_ Topics for Merkle-DAG objects ----------------------------- These topics are for the various objects stored in the |swh| Merkle DAG, see the :ref:`data model ` for more details. `swh.journal.objects.snapshot` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Snapshot` objects. Message format: - `branches` [dict] branches present in this snapshot, - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Snapshot` object with `branches` being a dictionary whose keys are branch names [bytes] and whose values are dictionaries of: - `target` [bytes] intrinsic identifier of the targeted object - `target_type` [string] the type of the targeted object (can be "content", "directory", "revision", "release", "snapshot" or "alias"). Example: .. 
code:: python { 'branches': { b'refs/pull/1/head': { 'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c', 'target_type': 'revision' }, b'refs/pull/2/head': { 'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ', 'target_type': 'revision' }, b'refs/heads/master': { 'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT', 'target_type': 'revision' }, b'HEAD': { 'target': b'refs/heads/master', 'target_type': 'alias' } }, 'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^' } `swh.journal.objects.release` +++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Release` objects. This topic is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.release`. Message format: - `name` [bytes] name (typically the version) of the release - `message` [bytes] message of the release - `target` [bytes] identifier of the target object - `target_type` [string] type of the target, can be "content", "directory", "revision", "release" or "snapshot" - `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has been forged by the loading process; this flag is not used for the id computation, - `author` [dict] the author of the release - `date` [gitdate] the date of the release - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Release` object Example: .. 
code:: python { 'name': b'0.3', 'message': b'', 'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d', 'target_type': 'revision', 'synthetic': False, 'author': { 'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9', 'name': None, 'email': None }, 'date': { 'timestamp': { 'seconds': 1480432642, 'microseconds': 0 }, 'offset': 180, 'negative_utc': False }, 'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86' } `swh.journal.objects.revision` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Revision` objects. This topic is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.revision`. Message format: - `message` [bytes] the commit message for the revision - `author` [dict] the author of the revision - `committer` [dict] the committer of the revision - `date` [gitdate] the revision date - `committer_date` [gitdate] the revision commit date - `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg") - `directory` [bytes] the intrinsic identifier of the directory this revision links to - `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not, - `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the intrinsic identifier computation), - `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers - `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision` - `extra_headers` [list[(bytes, bytes)]] TODO Example: .. code:: python { 'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. 
lufylib.red is a place for support code.\n', 'author': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'committer': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'committer_date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'type': 'svn', 'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe', 'synthetic': True, 'metadata': None, 'parents': [ b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c', ], 'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8', 'perms': 33188}, {'name': b'lib', 'type': 'dir', 'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U', 'perms': 16384}, {'name': b'package.json', 'type': 'file', 'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x', 'perms': 33188} ], 'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P' } Other Objects Topics -------------------- These topics are for objects of the |swh| archive that are not part of the Merkle DAG but are essential parts of the archive; see the :ref:`data model ` for more details. `swh.journal.objects.origin` ++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Origin` objects. Message format: - `url` [string] URL of the :py:class:`swh.model.model.Origin` Example: .. code:: python { "url": "https://github.com/vujkovicm/pml" } `swh.journal.objects.origin_visit` ++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisit` objects. 
Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `date` [timestamp] date of the visit - `type` [string] type of the loader used to perform the visit - `visit` [int] number of the visit for this `origin` Example: .. code:: python { 'origin': 'https://pypi.org/project/wasp-eureka/', 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'type': 'pypi', 'visit': 505 } `swh.journal.objects.origin_visit_status` +++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisitStatus` objects. Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `visit` [int] number of the visit of this `origin` that this status concerns - `date` [timestamp] date of the visit status update - `status` [string] status (can be "created", "ongoing", "full" or "partial"), - `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snapshot` this visit resulted in (if `status` is "full" or "partial") - `metadata`: deprecated Example: .. code:: python { 'origin': 'https://pypi.org/project/stricttype/', 'visit': 524, 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'status': 'full', 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7", 'metadata': None } Extrinsic Metadata related Topics --------------------------------- Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. See :ref:`extrinsic-metadata-specification` for more details on the Extrinsic Metadata model. `swh.journal.objects.metadata_authority` ++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataAuthority` objects. Message format: - `type` [string] - `url` [string] - `metadata` [dict] Examples: .. 
code:: python { 'type': 'forge', 'url': 'https://guix.gnu.org/sources.json', 'metadata': {} } { 'type': 'deposit_client', 'url': 'https://www.softwareheritage.org', 'metadata': {'name': 'swh'} } `swh.journal.objects.metadata_fetcher` ++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataFetcher` objects. Message format: - `name` [string] - `version` [string] - `metadata` [dict] Example: .. code:: python { 'name': 'swh.loader.package.cran.loader.CRANLoader', 'version': '0.15.0', 'metadata': {} } `swh.journal.objects.raw_extrinsic_metadata` ++++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects. Message format: - `type` [string] - `target` [string] - `discovery_date` [timestamp] - `authority` [dict] - `fetcher` [dict] - `format` [string] - `metadata` [bytes] - `origin` [string] - `visit` [int] - `snapshot` [SWHID] - `release` [SWHID] - `revision` [SWHID] - `path` [bytes] - `directory` [SWHID] Example: .. code:: python { 'type': 'snapshot', 'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3', 'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'authority': { 'type': 'forge', 'url': 'https://pypi.org/', 'metadata': {} }, 'fetcher': { 'name': 'swh.loader.package.pypi.loader.PyPILoader', 'version': '0.10.0', 'metadata': {} }, 'format': 'pypi-project-json', 'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}', 'origin': 'https://pypi.org/project/schwurbler/' } Kafka message format -------------------- Each value of a Kafka message in a topic is a dictionary-like structure encoded as a msgpack_ byte string. Keys are ASCII strings. All values are encoded using the default msgpack type system, except for long integers, for which we use a custom format based on the msgpack `extended type`_ to prevent overflow while packing some objects. 
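The custom long-integer format mentioned above, and detailed in the Integer section that follows, can be sketched in plain Python. This is a minimal illustration and not the |swh| serializer: the function and constant names are hypothetical, the msgpack library itself is deliberately left out, and using the shortest possible payload is an assumption that matches the two examples given in the Integer section.

```python
from typing import Tuple

# Extension type codes, as listed in the Integer section.
POSITIVE_INT = 1
NEGATIVE_INT = 2

# Range given in the Integer section for natively encoded integers.
NATIVE_MIN = -(2**63)
NATIVE_MAX = 2**64 - 1


def needs_extension(value: int) -> bool:
    """Return True if the value falls outside the native integer range."""
    return not (NATIVE_MIN <= value <= NATIVE_MAX)


def encode_long_int(value: int) -> Tuple[int, bytes]:
    """Return (type code, payload) for the extension value.

    The payload is the big-endian representation of the absolute
    value; the sign is carried by the type code only.
    """
    type_code = NEGATIVE_INT if value < 0 else POSITIVE_INT
    magnitude = abs(value)
    # Shortest byte length holding the magnitude (at least one byte).
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return type_code, magnitude.to_bytes(length, "big")


def decode_long_int(type_code: int, payload: bytes) -> int:
    """Inverse of encode_long_int."""
    magnitude = int.from_bytes(payload, "big")
    if type_code == POSITIVE_INT:
        return magnitude
    if type_code == NEGATIVE_INT:
        return -magnitude
    raise ValueError(f"unknown extension type code: {type_code}")
```

With these definitions, `encode_long_int(12345)` gives `(1, b'\x30\x39')` and `encode_long_int(-42)` gives `(2, b'\x2a')`, the same type codes and payloads as the two worked examples in the Integer section. A real producer would only take this path when `needs_extension(value)` is true.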
Integer +++++++ For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a custom `extended type`_ based encoding scheme is used. The `type` information can be: - `1` for positive (possibly long) integers, - `2` for negative (possibly long) integers. The payload is simply the big-endian byte representation of the absolute value (always positive). For example (adapted to standard integers for the sake of readability; these values are small so they will actually be encoded using the default msgpack format for integers): - `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`) - `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`) Datetime ++++++++ There are 2 types of dates that can be encoded in a Kafka message: - dates for git-like objects (:py:class:`swh.model.model.Revision` and :py:class:`swh.model.model.Release`): these dates are part of the hash computation used as identifier in the Merkle DAG. In order to fully support git repositories, a custom encoding is required. These dates (coming from the git data model) are encoded as a dictionary with: - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys (`seconds` and `microseconds`) - `offset` [int] offset of the date (in minutes) - `negative_utc` [bool] only True for the very edge case where the date has a zero but negative offset value (which does not make much sense, but is technically permitted by the git format) Example: .. code:: python { 'timestamp': {'seconds': 1480432642, 'microseconds': 0}, 'offset': 180, 'negative_utc': False } These are denoted as `gitdate` in this document. - other dates (resulting from the |swh| processing stack) are encoded using msgpack's Timestamp_ extended type. These are denoted as `timestamp` in this document. Note that these dates used to be encoded as a dictionary (beware: keys are bytes): .. 
code:: python { b"swhtype": "datetime", b"d": '2020-09-15T16:19:13.037809+00:00' } Person ++++++ :py:class:`swh.model.model.Person` objects represent a person in the |swh| Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer, or a :py:class:`swh.model.model.Release` author. :py:class:`swh.model.model.Person` objects are serialized as a dictionary like: .. code:: python { 'fullname': 'John Doe <john.doe@example.com>', 'name': 'John Doe', 'email': 'john.doe@example.com' } For anonymized topics, :py:class:`swh.model.model.Person` entities have been anonymized prior to being serialized. The anonymized :py:class:`swh.model.model.Person` object is a dictionary like: .. code:: python { 'fullname': <hashed fullname>, 'name': None, 'email': None } where the `<hashed fullname>` is computed from the original values as a sha256 of the original's `fullname`. .. _Kafka: https://kafka.apache.org .. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms .. _msgpack: https://msgpack.org/ .. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types .. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type diff --git a/sysadm/user-management/how-to-manage-creds-store.rst b/sysadm/user-management/how-to-manage-creds-store.rst index 67b280a..c34dcbc 100644 --- a/sysadm/user-management/how-to-manage-creds-store.rst +++ b/sysadm/user-management/how-to-manage-creds-store.rst @@ -1,9 +1,9 @@ .. _how_to_manage_creds_store: How to manage the credentials store =================================== .. todo:: - This page is a work in progress. For now, please refer to the `existing documentation + This page is a work in progress. For now, please refer to the `existing documentation `_. 
diff --git a/sysadm/user-management/keycloak/authentification.rst b/sysadm/user-management/keycloak/authentication.rst similarity index 67% rename from sysadm/user-management/keycloak/authentification.rst rename to sysadm/user-management/keycloak/authentication.rst index b3bcb9a..3f3af1f 100644 --- a/sysadm/user-management/keycloak/authentification.rst +++ b/sysadm/user-management/keycloak/authentication.rst @@ -1,9 +1,9 @@ -.. _authentification: +.. _authentication: -Reference: Authentification services +Reference: Authentication services ==================================== .. todo:: - This page is a work in progress. For now, please refer to the `existing documentation + This page is a work in progress. For now, please refer to the `existing documentation `_. diff --git a/sysadm/user-management/keycloak/how-to-user-perms.rst b/sysadm/user-management/keycloak/how-to-user-perms.rst index 8e480e9..3fc3661 100644 --- a/sysadm/user-management/keycloak/how-to-user-perms.rst +++ b/sysadm/user-management/keycloak/how-to-user-perms.rst @@ -1,9 +1,9 @@ .. _how_to_user_perms: How to set user permissions in keycloak ======================================= .. todo:: - This page is a work in progress. For now, please refer to the `existing documentation + This page is a work in progress. For now, please refer to the `existing documentation `_. diff --git a/sysadm/user-management/keycloak/index.rst b/sysadm/user-management/keycloak/index.rst index 0fa09b1..ea9932c 100644 --- a/sysadm/user-management/keycloak/index.rst +++ b/sysadm/user-management/keycloak/index.rst @@ -1,9 +1,9 @@ Keycloak -------- .. toctree:: :titlesonly: how-to-user-perms - authentification + authentication diff --git a/sysadm/user-management/onboarding.rst b/sysadm/user-management/onboarding.rst index 6dd11c9..6c77be4 100644 --- a/sysadm/user-management/onboarding.rst +++ b/sysadm/user-management/onboarding.rst @@ -1,9 +1,9 @@ .. 
_onboarding: Reference: Onboarding checklist =============================== .. todo:: - This page is a work in progress. For now, please refer to the `existing documentation + This page is a work in progress. For now, please refer to the `existing documentation `_. diff --git a/sysadm/user-management/outboarding.rst b/sysadm/user-management/outboarding.rst index 76e736d..01772f1 100644 --- a/sysadm/user-management/outboarding.rst +++ b/sysadm/user-management/outboarding.rst @@ -1,9 +1,9 @@ .. _outboarding: Reference: Outboarding checklist ================================ .. todo:: - This page is a work in progress. For now, please refer to the `existing documentation + This page is a work in progress. For now, please refer to the `existing documentation `_.