docs/journal.rst
.. _journal-specs: | .. _journal-specs: | ||||
Software Heritage Journal --- Specifications | Software Heritage Journal --- Specifications | ||||
============================================ | ============================================ | ||||
The |swh| journal is a kafka_-based stream of events for every added object in | The |swh| journal is a Kafka_-based stream of events for every added object in | ||||
the |swh| Archive and some of its related services, especially indexers. | the |swh| Archive and some of its related services, especially indexers. | ||||
Each topic_ will stream added elements for a given object type according to the | Each topic_ will stream added elements for a given object type according to the | ||||
topic name. | topic name. | ||||
Objects streamed in a topic are serialized versions of objects stored in the | Objects streamed in a topic are serialized versions of objects stored in the | ||||
|swh| Archive specified by the main |swh| :py:mod:`data model <swh.model.model>` or | |swh| Archive specified by the main |swh| :py:mod:`data model <swh.model.model>` or | ||||
the :py:mod:`indexer object model <swh.indexer.storage.model>`. | the :py:mod:`indexer object model <swh.indexer.storage.model>`. | ||||
- `swh.journal.objects.content`_ | - `swh.journal.objects.content`_ | ||||
- `swh.journal.objects.skippedcontent`_ | - `swh.journal.objects.skippedcontent`_ | ||||
- `swh.journal.objects.metadata_authority`_ | - `swh.journal.objects.metadata_authority`_ | ||||
- `swh.journal.objects.metadata_fetcher`_ | - `swh.journal.objects.metadata_fetcher`_ | ||||
- `swh.journal.objects.raw_extrinsic_metadata`_ | - `swh.journal.objects.raw_extrinsic_metadata`_ | ||||
Topics for Merkel-DAG objects | Topics for Merkle-DAG objects | ||||
----------------------------- | ----------------------------- | ||||
These topics are for the various objects stored in the |swh| Merkle DAG; see | These topics are for the various objects stored in the |swh| Merkle DAG; see | ||||
the :ref:`data model <swh-model>` for more details. | the :ref:`data model <swh-model>` for more details. | ||||
`swh.journal.objects.snapshot` | `swh.journal.objects.snapshot` | ||||
++++++++++++++++++++++++++++++ | ++++++++++++++++++++++++++++++ | ||||
Kafka message format | Kafka message format | ||||
-------------------- | -------------------- | ||||
Each value of a kafka message in a topic is a dictionary-like structure | Each value of a Kafka message in a topic is a dictionary-like structure | ||||
encoded as a msgpack_ byte string. | encoded as a msgpack_ byte string. | ||||
Keys are ASCII strings. | Keys are ASCII strings. | ||||
All values are encoded using the default msgpack type system except for long | All values are encoded using the default msgpack type system except for long | ||||
integers for which we use a custom format using msgpack `extended type`_ to | integers for which we use a custom format using msgpack `extended type`_ to | ||||
prevent overflow while packing some objects. | prevent overflow while packing some objects. | ||||
- `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`) | - `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`) | ||||
- `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`) | - `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`) | ||||
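
The two examples above can be reproduced with a short standard-library sketch. It assumes, based only on those two examples, that extension type `1` carries non-negative values and type `2` negative ones, and that the payload is the big-endian magnitude stored in the fewest possible bytes. The helper name `encode_long_int` is hypothetical; the header bytes follow the fixext and ext 8 layouts of the msgpack specification.

```python
def encode_long_int(value: int) -> bytes:
    """Frame an integer as a msgpack extension value (illustrative sketch)."""
    # Assumption inferred from the examples: 1 = non-negative, 2 = negative.
    ext_type = 1 if value >= 0 else 2
    magnitude = abs(value)
    # Assumption: the magnitude is stored big-endian in the fewest bytes.
    length = max(1, (magnitude.bit_length() + 7) // 8)
    data = magnitude.to_bytes(length, "big")

    # msgpack "fixext N" headers for payloads of exactly 1, 2, 4, 8 or 16 bytes.
    fixext = {1: 0xD4, 2: 0xD5, 4: 0xD6, 8: 0xD7, 16: 0xD8}
    if length in fixext:
        header = bytes([fixext[length], ext_type])
    elif length < 256:
        # msgpack "ext 8": marker, one-byte payload length, then the type.
        header = bytes([0xC7, length, ext_type])
    else:
        raise ValueError("payload too large for this sketch")
    return header + data


if __name__ == "__main__":
    print(encode_long_int(12345).hex())  # d5013039
    print(encode_long_int(-42).hex())    # d4022a
```

The sketch only covers framing; which integers are actually routed through the extension type is decided by the serializer and is not modeled here.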
Datetime | Datetime | ||||
++++++++ | ++++++++ | ||||
There are 2 types of dates that can be encoded in a kafka message: | There are 2 types of dates that can be encoded in a Kafka message: | ||||
- dates for git-like objects (:py:class:`swh.model.model.Revision` and | - dates for git-like objects (:py:class:`swh.model.model.Revision` and | ||||
:py:class:`swh.model.model.Release`): these dates are part of the hash | :py:class:`swh.model.model.Release`): these dates are part of the hash | ||||
computation used as identifier in the Merkle DAG. In order to fully support | computation used as identifier in the Merkle DAG. In order to fully support | ||||
git repositories, a custom encoding is required. These dates (coming from the | git repositories, a custom encoding is required. These dates (coming from the | ||||
git data model) are encoded as a dictionary with: | git data model) are encoded as a dictionary with: | ||||
- `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys | - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys | ||||
.. code:: python | .. code:: python | ||||
{ | { | ||||
'fullname': <hashed value>, | 'fullname': <hashed value>, | ||||
'name': None, | 'name': None, | ||||
'email': None | 'email': None | ||||
} | } | ||||
where the `<hashed value>` is computed from original values as a sha256 of the | where the `<hashed value>` is computed from original values as a sha256 of the | ||||
orignal's `fullname`. | original's `fullname`. | ||||
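
As a rough illustration of the anonymization described above, the following sketch builds such a dictionary. The function name `anonymize_person` is hypothetical, and using the raw 32-byte digest rather than a hex string is an assumption: the text only states that the value is a sha256 of the original's `fullname`.

```python
import hashlib


def anonymize_person(fullname: bytes) -> dict:
    """Return an anonymized person dictionary (illustrative sketch)."""
    return {
        # Assumption: the raw digest is stored, not its hex representation.
        "fullname": hashlib.sha256(fullname).digest(),
        "name": None,
        "email": None,
    }


if __name__ == "__main__":
    person = anonymize_person(b"Jane Doe <jane@example.org>")
    print(len(person["fullname"]))  # 32
```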
.. _kafka: https://kafka.apache.org | .. _Kafka: https://kafka.apache.org | ||||
.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms | .. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms | ||||
.. _msgpack: https://msgpack.org/ | .. _msgpack: https://msgpack.org/ | ||||
.. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types | .. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types | ||||
.. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type | .. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type |