diff --git a/docs/contributing/tutorial-docs-contribution.rst b/docs/contributing/tutorial-docs-contribution.rst
index 1d2347b..fbd593b 100644
--- a/docs/contributing/tutorial-docs-contribution.rst
+++ b/docs/contributing/tutorial-docs-contribution.rst
@@ -1,192 +1,192 @@
.. _doc-contribution:
Tutorial: Best practices when writing SWH docs
==============================================
.. admonition:: Intended audience
:class: important
Members of the Software Heritage staff and external contributors
who wish to contribute by writing documentation.
-
+
A tutorial on how to contribute documentation into the Software Heritage world.
Step 1: Identify your audience
------------------------------
#. Ask yourself: Who are the readers of the documentation that you are writing?
In the Software Heritage community, three general types of personas are
distinguished:
* **visitors**: people who want to know what the SWH initiative and archive are
* **users**: people who want to use the SWH features
* as a service
* as a module by running a local instance
* **contributors**: people who are contributing to SWH (either external or swh
staff)
* as developers
* as sys-admins
* as support role
#. Use the persona type to determine the document location in step 2
#. Add the intended audience at the top of the page
Step 2: Determine the documentation location
--------------------------------------------
Information should have a permanent home as documentation. Elements that are work in
progress can live in the forge on issues or in hedgedoc, but these are not permanent
locations.
#. Choose high-level location:
Possible permanent locations include:
* The WordPress website: for visitors
* The archive web-app: for visitors and users (of the interface or API)
* The Sphinx docs:
* *devel* for contributors
* *users* for users of the infrastructure and all the different services
* *sysadm* for sys-admins
#. For contributors documentation in devel:
#. Decide whether the subject belongs in a high-level (cross-module) section or in a
specific module
* if the document relates to only one module, add it to the */docs* directory of
that module
* for cross-module documentation, use the swh-docs repository and the appropriate
sub-directory (e.g. architecture)
#. Decide if a subsection is needed with multiple pages (tutorials, how-tos,
reference or explanation).
#. For sys-admin (in */sysadm* folder) and user documentation (in */users* folder):
#. Check if an existing section is already describing the theme that you want to
document.
#. Decide if a subsection is needed with multiple pages (tutorials, how-tos,
reference or explanation).
Step 3: Choose documentation type
---------------------------------
We are following Divio's approach with four major types of documentation:
* Tutorial: allowing newcomers to get started and easing the onboarding of contributors and
users.
* How to: how to solve a specific problem in a step-by-step practical manual.
* Reference: theoretical/technical knowledge which is information oriented.
* Explanation: understanding-oriented theoretical knowledge that analyzes, discusses and
explains different decisions, including background and context.
For more information see `the divio documentation `_
and/or `Daniele Procida's presentation `_
.. note::
We propose using the following naming scheme depending on the type of document:
* Tutorial: [Tutorial name]
* How to ...
* Reference: [Reference name]
* Explanation: [Explanation name]
Step 4: Create a page or sub-section with multiple pages
--------------------------------------------------------
#. Create a *.rst* file with a short name of your doc in the appropriate directory (see
step 2). If this is a sub-section, the first file should be an *index.rst* file
containing the list of the current sub-section files.
#. For a page that is not yet ready, you can simply create an empty page using the template
below. The template starts with a reference, so that you can link to this new page
from elsewhere. The page name should follow the naming scheme from step 3.
#. For an existing page, you can link the new page with the existing one containing the
desired information.
Empty page template
^^^^^^^^^^^^^^^^^^^
.. code-block:: rst
.. _empty_page:
Empty page
==========
.. admonition:: Intended audience
:class: important
add the audience target(s) of this page
-
+
.. todo::
This page is a work in progress. For now, please refer to the `existing documentation `_.
Empty subsection template
^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: rst
.. _empty_subsection:
Empty subsection
================
.. toctree::
:titlesonly:
tutorial-my-first-tuto
howto-do-things
howto-test-stuff
howto-dance
reference-info
reference-best-practices
README in module
^^^^^^^^^^^^^^^^
We want to reduce redundancy in documentation as much as possible. The option we should
strive for is adding a symlink to docs/README.rst in the repo's module. Furthermore,
docs/index.rst should include docs/README.rst, as follows:
.. code-block:: rst
.. _swh-fuse:
.. include:: README.rst
.. toctree::
:maxdepth: 1
:caption: Overview
cli
configuration
Design notes
Tutorial
Step 5: Add link to page/sub-section from an index.rst
------------------------------------------------------
Add the file name to the menu of the parent index.rst.
Step 6: Commit change for code review
-------------------------------------
You should open a diff for a documentation change following the instructions in
:ref:`code-review`
diff --git a/docs/journal.rst b/docs/journal.rst
index c5b02a1..83e554a 100644
--- a/docs/journal.rst
+++ b/docs/journal.rst
@@ -1,673 +1,673 @@
.. _journal-specs:
Journal Specification
=====================
The |swh| journal is a Kafka_-based stream of events for every added object in
the |swh| Archive and some of its related services, especially indexers.
Each topic_ will stream added elements for a given object type according to the
topic name.
Objects streamed in a topic are serialized versions of objects stored in the
|swh| Archive specified by the main |swh| :py:mod:`data model ` or
the :py:mod:`indexer object model `.
This document describes the expected messages in each topic, so that a
potential consumer can work with the |swh| journal without having to
read the source code or the |swh| :ref:`data model ` in detail (it
is however recommended to familiarize yourself with the latter).
Kafka message values are dictionary structures serialized as msgpack_, with a
few custom encodings. See the section `Kafka message format`_ below for a
complete description of the serialization format.
Note that each example given below shows the dictionary before it is serialized
as a msgpack_ chunk.
Topics
------
There are several groups of topics:
- main storage Merkle-DAG related topics,
- other storage objects (not part of the Merkle DAG),
- indexer related objects (not yet documented below).
The topic prefix can be either `swh.journal.objects` or
`swh.journal.objects_privileged` (see below).
Anonymized topics
+++++++++++++++++
Topics that transport messages with user information (name and email
address), namely `swh.journal.objects.release`_ and
`swh.journal.objects.revision`_, exist in 2 versions: an anonymized
topic, in which user information is obfuscated, and a pristine
topic with clear data.
Access to pristine topics depends on ACLs linked to credentials used to connect
to the Kafka cluster.
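As an illustration of the naming rule above, the sketch below builds a topic name from an object type. The helper `topic_name` and the `ANONYMIZED_OBJECT_TYPES` set are hypothetical names introduced here for the example; they are not part of any |swh| package.

```python
# Hypothetical helper, for illustration only: neither name below comes from
# an swh package.
ANONYMIZED_OBJECT_TYPES = {"release", "revision"}


def topic_name(object_type: str, privileged: bool = False) -> str:
    """Build the journal topic name for an object type.

    Only topics carrying user information (release and revision) have a
    privileged, non-anonymized variant, so the flag is rejected otherwise.
    """
    if privileged and object_type not in ANONYMIZED_OBJECT_TYPES:
        raise ValueError(f"no privileged topic for {object_type!r}")
    prefix = "swh.journal.objects_privileged" if privileged else "swh.journal.objects"
    return f"{prefix}.{object_type}"


print(topic_name("release"))
print(topic_name("revision", privileged=True))
```

Whether a consumer can actually read the privileged topic still depends on the ACLs attached to its credentials.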
List of topics
++++++++++++++
- `swh.journal.objects.origin`_
- `swh.journal.objects.origin_visit`_
- `swh.journal.objects.origin_visit_status`_
- `swh.journal.objects.snapshot`_
- `swh.journal.objects.release`_
- `swh.journal.objects_privileged.release `_
- `swh.journal.objects.revision`_
- `swh.journal.objects_privileged.revision `_
- `swh.journal.objects.directory`_
- `swh.journal.objects.content`_
- `swh.journal.objects.skipped_content`_
- `swh.journal.objects.metadata_authority`_
- `swh.journal.objects.metadata_fetcher`_
- `swh.journal.objects.raw_extrinsic_metadata`_
Topics for Merkle-DAG objects
-----------------------------
These topics are for the various objects stored in the |swh| Merkle DAG, see
the :ref:`data model ` for more details.
`swh.journal.objects.snapshot`
++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Snapshot` objects.
Message format:
- `branches` [dict] branches present in this snapshot,
- `id` [bytes] the intrinsic identifier of the
:py:class:`swh.model.model.Snapshot` object
with `branches` being a dictionary whose keys are branch names [bytes] and whose values are dictionaries of:
- `target` [bytes] intrinsic identifier of the targeted object
- `target_type` [string] the type of the targeted object (can be "content",
"directory", "revision", "release", "snapshot" or "alias").
Example:
.. code:: python
{
'branches': {
b'refs/pull/1/head': {
'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c',
'target_type': 'revision'
},
b'refs/pull/2/head': {
'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ',
'target_type': 'revision'
},
b'refs/heads/master': {
'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT',
'target_type': 'revision'
},
b'HEAD': {
'target': b'refs/heads/master',
'target_type': 'alias'
}
},
'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^'
}
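A consumer will often need to follow "alias" branches such as `HEAD` in the example above. The following is a minimal sketch of how that could be done on a decoded snapshot message. The function name `resolve_branch` and the `max_depth` guard are assumptions made for this example, and the byte strings in the sample data are placeholders rather than real identifiers.

```python
def resolve_branch(snapshot: dict, name: bytes, max_depth: int = 10):
    """Follow "alias" branches until a non-alias branch is reached.

    Returns the final branch dictionary, or None if the alias chain points
    to a missing branch or is longer than max_depth (an arbitrary guard
    against alias loops).
    """
    branches = snapshot["branches"]
    for _ in range(max_depth):
        branch = branches.get(name)
        if branch is None or branch["target_type"] != "alias":
            return branch
        # For an alias, `target` holds the name of another branch.
        name = branch["target"]
    return None


# Placeholder data shaped like a snapshot message (identifiers are dummies).
snapshot = {
    "branches": {
        b"refs/heads/master": {"target": bytes(20), "target_type": "revision"},
        b"HEAD": {"target": b"refs/heads/master", "target_type": "alias"},
    },
    "id": bytes(20),
}
head = resolve_branch(snapshot, b"HEAD")
print(head["target_type"])
```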
`swh.journal.objects.release`
+++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Release` objects.
This topic is anonymized. The non-anonymized version of this topic is
`swh.journal.objects_privileged.release`.
Message format:
- `name` [bytes] name (typically the version) of the release
- `message` [bytes] message of the release
- `target` [bytes] identifier of the target object
- `target_type` [string] type of the target, can be "content", "directory",
"revision", "release" or "snapshot"
- `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has
been forged by the loading process; this flag is not used for the id
computation,
- `author` [dict] the author of the release
- `date` [gitdate] the date of the release
- `id` [bytes] the intrinsic identifier of the
:py:class:`swh.model.model.Release` object
Example:
.. code:: python
{
'name': b'0.3',
'message': b'',
'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d',
'target_type': 'revision',
'synthetic': False,
'author': {
'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9',
'name': None,
'email': None
},
'date': {
'timestamp': {
'seconds': 1480432642,
'microseconds': 0
},
'offset': 180,
'negative_utc': False
},
'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86'
}
`swh.journal.objects.revision`
++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Revision` objects.
This topic is anonymized. The non-anonymized version of this topic is
`swh.journal.objects_privileged.revision`.
Message format:
- `message` [bytes] the commit message for the revision
- `author` [dict] the author of the revision
- `committer` [dict] the committer of the revision
- `date` [gitdate] the revision date
- `committer_date` [gitdate] the revision commit date
- `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg")
- `directory` [bytes] the intrinsic identifier of the directory this revision links to
- `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not,
- `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the
intrinsic identifier computation),
- `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers
- `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision`
- `extra_headers` [list[(bytes, bytes)]] TODO
Example:
.. code:: python
{
'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. lufylib.red is a place for support code.\n',
'author': {
'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
'name': None,
'email': None
},
'committer': {
'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
'name': None,
'email': None
},
'date': {
'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
'offset': 0,
'negative_utc': False
},
'committer_date': {
'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
'offset': 0,
'negative_utc': False
},
'type': 'svn',
'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe',
'synthetic': True,
'metadata': None,
'parents': [
b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c',
],
'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8',
'perms': 33188},
{'name': b'lib',
'type': 'dir',
'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U',
'perms': 16384},
{'name': b'package.json',
'type': 'file',
'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x',
'perms': 33188}
],
'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P'
}
Other Objects Topics
--------------------
These topics are for objects of the |swh| archive that are not part of the
Merkle DAG but are essential parts of the archive; see the :ref:`data model
` for more details.
`swh.journal.objects.origin`
++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Origin` objects.
Message format:
- `url` [string] URL of the :py:class:`swh.model.model.Origin`
Example:
.. code:: python
{
"url": "https://github.com/vujkovicm/pml"
}
`swh.journal.objects.origin_visit`
++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.OriginVisit` objects.
Message format:
- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin`
- `date` [timestamp] date of the visit
- `type` [string] type of the loader used to perform the visit
- `visit` [int] number of the visit for this `origin`
Example:
.. code:: python
{
'origin': 'https://pypi.org/project/wasp-eureka/',
'date': Timestamp(seconds=1606260407, nanoseconds=818259954),
'type': 'pypi',
'visit': 505
}
`swh.journal.objects.origin_visit_status`
+++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.OriginVisitStatus` objects.
Message format:
- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin`
- `visit` [int] number of the visit for this `origin` this status concerns
- `date` [timestamp] date of the visit status update
- `status` [string] status (can be "created", "ongoing", "full" or "partial"),
- `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snapshot` this
visit resulted in (if `status` is "full" or "partial")
- `metadata`: deprecated
Example:
.. code:: python
{
'origin': 'https://pypi.org/project/stricttype/',
'visit': 524,
'date': Timestamp(seconds=1606260407, nanoseconds=818259954),
'status': 'full',
'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7",
'metadata': None
}
Extrinsic Metadata related Topics
---------------------------------
Extrinsic metadata is information about software that is not part of the source
code itself but still closely related to the software. See
:ref:`extrinsic-metadata-specification` for more details on the Extrinsic
Metadata model.
`swh.journal.objects.metadata_authority`
++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.MetadataAuthority` objects.
Message format:
- `type` [string]
- `url` [string]
- `metadata` [dict]
Examples:
.. code:: python
{
'type': 'forge',
'url': 'https://guix.gnu.org/sources.json',
'metadata': {}
}
{
'type': 'deposit_client',
'url': 'https://www.softwareheritage.org',
'metadata': {'name': 'swh'}
}
`swh.journal.objects.metadata_fetcher`
++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.MetadataFetcher` objects.
Message format:
- `name` [string]
- `version` [string]
- `metadata` [dict]
Example:
.. code:: python
{
'name': 'swh.loader.package.cran.loader.CRANLoader',
'version': '0.15.0',
'metadata': {}
}
`swh.journal.objects.raw_extrinsic_metadata`
++++++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects.
Message format:
- `type` [string]
- `target` [string]
- `discovery_date` [timestamp]
- `authority` [dict]
- `fetcher` [dict]
- `format` [string]
- `metadata` [bytes]
- `origin` [string]
- `visit` [int]
- `snapshot` [SWHID]
- `release` [SWHID]
- `revision` [SWHID]
- `path` [bytes]
- `directory` [SWHID]
Example:
.. code:: python
{
'type': 'snapshot',
'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3',
'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954),
'authority': {
'type': 'forge',
'url': 'https://pypi.org/',
'metadata': {}
},
'fetcher': {
'name': 'swh.loader.package.pypi.loader.PyPILoader',
'version': '0.10.0',
'metadata': {}
},
'format': 'pypi-project-json',
'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}}',
'origin': 'https://pypi.org/project/schwurbler/'
}
Kafka message format
--------------------
Each value of a Kafka message in a topic is a dictionary-like structure
encoded as a msgpack_ byte string.
Keys are ASCII strings.
All values are encoded using the default msgpack type system except for long
integers for which we use a custom format using msgpack `extended type`_ to
prevent overflow while packing some objects.
Integer
+++++++
For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a
custom `extended type`_ based encoding scheme is used.
The `type` information can be:
- `1` for positive (possibly long) integers,
- `2` for negative (possibly long) integers.
The payload is simply the bytes (big endian) representation of the absolute
value (always positive).
For example (adapted to standard integers for the sake of readability; these
values are small so they will actually be encoded using the default msgpack
format for integers):
- `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`)
- `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`)
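The scheme above can be sketched in plain Python as follows. This only computes the extension type and payload; it does not produce the surrounding msgpack framing. The function and constant names are hypothetical, and using the shortest possible payload is an assumption based on the two examples above.

```python
from typing import Tuple

POSITIVE_INT_TYPE = 1  # extension type for positive integers
NEGATIVE_INT_TYPE = 2  # extension type for negative integers


def encode_long_int(value: int) -> Tuple[int, bytes]:
    """Return the (extension type, payload) pair for an integer."""
    ext_type = NEGATIVE_INT_TYPE if value < 0 else POSITIVE_INT_TYPE
    magnitude = abs(value)
    # Assumption: use the smallest number of bytes (at least one) that can
    # hold the absolute value.
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return ext_type, magnitude.to_bytes(length, "big")


def decode_long_int(ext_type: int, payload: bytes) -> int:
    """Inverse of encode_long_int."""
    magnitude = int.from_bytes(payload, "big")
    if ext_type == POSITIVE_INT_TYPE:
        return magnitude
    if ext_type == NEGATIVE_INT_TYPE:
        return -magnitude
    raise ValueError(f"unexpected extension type {ext_type}")


# A value outside the native msgpack integer range survives a round trip.
big = 2 ** 70
assert decode_long_int(*encode_long_int(big)) == big
```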
Datetime
++++++++
There are 2 types of dates that can be encoded in a Kafka message:
- dates for git-like objects (:py:class:`swh.model.model.Revision` and
:py:class:`swh.model.model.Release`): these dates are part of the hash
computation used as identifier in the Merkle DAG. In order to fully support
git repositories, a custom encoding is required. These dates (coming from the
git data model) are encoded as a dictionary with:
- `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys
(`seconds` and `microseconds`)
- `offset` [int] offset of the date (in minutes)
- `negative_utc` [bool] only True for the very edge case where the date has a
zero but negative offset value (which does not make much sense, but is
technically permitted by the git format)
Example:
.. code:: python
{
'timestamp': {'seconds': 1480432642, 'microseconds': 0},
'offset': 180,
'negative_utc': False
}
These are denoted as `gitdate` in this document.
- other dates (resulting from the |swh| processing stack) are encoded using
msgpack's Timestamp_ extended type.
These are denoted as `timestamp` in this document.
Note that these dates used to be encoded as a dictionary (beware: keys are bytes):
.. code:: python
{
b"swhtype": "datetime",
b"d": '2020-09-15T16:19:13.037809+00:00'
}
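As an illustration, the two dictionary-based encodings above could be converted to Python `datetime` objects roughly as follows. Both function names are hypothetical. Note two limitations of this sketch: a `datetime` cannot represent the `negative_utc` edge case, and `datetime.timezone` rejects offsets of 24 hours or more, so unusual git dates need separate handling.

```python
from datetime import datetime, timedelta, timezone


def gitdate_to_datetime(gitdate: dict) -> datetime:
    """Convert a `gitdate` dictionary to a timezone-aware datetime.

    The `negative_utc` flag is ignored: "-0000" and "+0000" map to the
    same datetime.
    """
    ts = gitdate["timestamp"]
    tz = timezone(timedelta(minutes=gitdate["offset"]))
    utc = datetime.fromtimestamp(ts["seconds"], tz=timezone.utc)
    return (utc + timedelta(microseconds=ts["microseconds"])).astimezone(tz)


def legacy_date_to_datetime(legacy: dict) -> datetime:
    """Convert the former dictionary encoding (bytes keys) to a datetime."""
    if legacy.get(b"swhtype") != "datetime":
        raise ValueError("not a legacy datetime dictionary")
    return datetime.fromisoformat(legacy[b"d"])


release_date = gitdate_to_datetime(
    {
        "timestamp": {"seconds": 1480432642, "microseconds": 0},
        "offset": 180,
        "negative_utc": False,
    }
)
print(release_date.isoformat())
```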
Person
++++++
:py:class:`swh.model.model.Person` objects represent a person in the |swh|
Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer,
or a :py:class:`swh.model.model.Release` author.
:py:class:`swh.model.model.Person` objects are serialized as a dictionary like:
.. code:: python
{
'fullname': 'John Doe ',
'name': 'John Doe',
'email': 'john.doe@example.com'
}
For anonymized topics, :py:class:`swh.model.model.Person` entities have been
anonymized prior to being serialized. The anonymized
:py:class:`swh.model.model.Person` object is a dictionary like:
.. code:: python
{
'fullname': <sha256 digest of the original fullname>,
'name': None,
'email': None
}
where the anonymized `fullname` is computed as a sha256 of the
original's `fullname`.
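A minimal sketch of this anonymization, using only the standard library, is given below. The function name `anonymize_person` is hypothetical. The sketch assumes that `fullname` is available as bytes and that the raw 32-byte digest (not its hexadecimal form) is stored, which is consistent with the 32-byte `fullname` values shown in the release and revision examples.

```python
import hashlib


def anonymize_person(person: dict) -> dict:
    """Return an anonymized copy of a Person dictionary.

    Assumption: `fullname` is a bytes value and the raw sha256 digest
    replaces it; `name` and `email` are dropped.
    """
    return {
        "fullname": hashlib.sha256(person["fullname"]).digest(),
        "name": None,
        "email": None,
    }


anonymized = anonymize_person(
    {
        "fullname": b"John Doe <john.doe@example.com>",
        "name": b"John Doe",
        "email": b"john.doe@example.com",
    }
)
print(anonymized["fullname"].hex())
```

Because the digest is deterministic, the same person maps to the same anonymized `fullname` across messages, so authorship can still be grouped without exposing names or email addresses.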
.. _Kafka: https://kafka.apache.org
.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms
.. _msgpack: https://msgpack.org/
.. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
.. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type
diff --git a/sysadm/user-management/how-to-manage-creds-store.rst b/sysadm/user-management/how-to-manage-creds-store.rst
index 67b280a..c34dcbc 100644
--- a/sysadm/user-management/how-to-manage-creds-store.rst
+++ b/sysadm/user-management/how-to-manage-creds-store.rst
@@ -1,9 +1,9 @@
.. _how_to_manage_creds_store:
How to manage the credentials store
===================================
.. todo::
- This page is a work in progress. For now, please refer to the `existing documentation
+ This page is a work in progress. For now, please refer to the `existing documentation
`_.
diff --git a/sysadm/user-management/keycloak/authentification.rst b/sysadm/user-management/keycloak/authentication.rst
similarity index 67%
rename from sysadm/user-management/keycloak/authentification.rst
rename to sysadm/user-management/keycloak/authentication.rst
index b3bcb9a..3f3af1f 100644
--- a/sysadm/user-management/keycloak/authentification.rst
+++ b/sysadm/user-management/keycloak/authentication.rst
@@ -1,9 +1,9 @@
-.. _authentification:
+.. _authentication:
-Reference: Authentification services
+Reference: Authentication services
====================================
.. todo::
- This page is a work in progress. For now, please refer to the `existing documentation
+ This page is a work in progress. For now, please refer to the `existing documentation
`_.
diff --git a/sysadm/user-management/keycloak/how-to-user-perms.rst b/sysadm/user-management/keycloak/how-to-user-perms.rst
index 8e480e9..3fc3661 100644
--- a/sysadm/user-management/keycloak/how-to-user-perms.rst
+++ b/sysadm/user-management/keycloak/how-to-user-perms.rst
@@ -1,9 +1,9 @@
.. _how_to_user_perms:
How to set user permissions in keycloak
=======================================
.. todo::
- This page is a work in progress. For now, please refer to the `existing documentation
+ This page is a work in progress. For now, please refer to the `existing documentation
`_.
diff --git a/sysadm/user-management/keycloak/index.rst b/sysadm/user-management/keycloak/index.rst
index 0fa09b1..ea9932c 100644
--- a/sysadm/user-management/keycloak/index.rst
+++ b/sysadm/user-management/keycloak/index.rst
@@ -1,9 +1,9 @@
Keycloak
--------
.. toctree::
:titlesonly:
how-to-user-perms
- authentification
+ authentication
diff --git a/sysadm/user-management/onboarding.rst b/sysadm/user-management/onboarding.rst
index 6dd11c9..6c77be4 100644
--- a/sysadm/user-management/onboarding.rst
+++ b/sysadm/user-management/onboarding.rst
@@ -1,9 +1,9 @@
.. _onboarding:
Reference: Onboarding checklist
===============================
.. todo::
- This page is a work in progress. For now, please refer to the `existing documentation
+ This page is a work in progress. For now, please refer to the `existing documentation
`_.
diff --git a/sysadm/user-management/outboarding.rst b/sysadm/user-management/outboarding.rst
index 76e736d..01772f1 100644
--- a/sysadm/user-management/outboarding.rst
+++ b/sysadm/user-management/outboarding.rst
@@ -1,9 +1,9 @@
.. _outboarding:
Reference: Outboarding checklist
================================
.. todo::
- This page is a work in progress. For now, please refer to the `existing documentation
+ This page is a work in progress. For now, please refer to the `existing documentation
`_.