diff --git a/docs/journal.rst b/docs/journal.rst new file mode 100644 --- /dev/null +++ b/docs/journal.rst @@ -0,0 +1,426 @@ +.. _journal-specs: + +Software Heritage Journal --- Specifications +============================================ + +The |swh| journal is a kafka_-based stream of events for every added object in +the |swh| Archive and some of its related services, especially indexers. + +Each topic_ streams the elements added for a given object type, as indicated by +the topic name. + +Objects streamed in a topic are serialized versions of objects stored in the +|swh| Archive specified by the |swh| :ref:`object model `. + +This document describes the messages expected in each topic, so that a +potential consumer can easily cope with the |swh| journal without having to +read the source code or the |swh| :ref:`data model ` in detail (it +is, however, recommended to familiarize yourself with the latter). + +There are several groups of topics: + +- main storage Merkle-DAG related topics, +- metadata related topics, +- indexation related topics. + + +Kafka message formats +--------------------- + +Each value of a kafka message in a topic is a dictionary-like structure +encoded as a msgpack_ byte string. + +Keys (at first level) are ASCII strings. + +Most values are strings, simple integers or bytes, but there are also a +few custom formats (`extended types`_). + +- integers: to prevent overflow while packing some objects, an extended integer + format is used. The type information can be: + + - `1` for positive (possibly long) integers, + - `2` for negative (possibly long) integers. + + The value is simply the bytes (big endian) representation of the absolute + value (always positive). + + For example: + + - `12345` => `[1, [0x30, 0x39]]` + - `-42` => `[2, [0x2A]]` + +- dates: there are two types of dates that can be encoded in kafka messages, + depending on the object type they belong to. They are both encoded as dicts, + but with different keys. 
+ + - dates that come from the git data model (present in :py:class:`Revision` + and :py:class:`Release` objects) are encoded as: + + - `timestamp` [dict] POSIX timestamp of the date, as a dict with 2 keys + ("seconds" and "microseconds") + - `offset` [int] offset of the date (in minutes) + - `negative_utc` [bool] only True for the very edge case where the date has + a zero but negative offset value (which technically the git format + permits) + + example: + + ``` + { + 'timestamp': {'seconds': 1480432642, 'microseconds': 0}, + 'offset': 180, + 'negative_utc': False + } + ``` + + - dates that come from the |swh| archiving processes are encoded as a + dict **whose keys are bytes** and **whose values are strings**: + + - `swhtype`: "datetime" + - `d`: ISO 8601 representation of the date + + example: + + ``` + { + b'swhtype': 'datetime', + b'd': '2020-09-15T16:19:19.968050+00:00' + } + ``` + + +Merkle-DAG Topics +----------------- + +These topics are for the core of the |swh| archive; see the :ref:`data model +` for more details. + + +`swh.journal.objects.snapshot` +++++++++++++++++++++++++++++++ + +Topic for :py:class:`Snapshot` objects. + +Message format: + +- `branches` [dict] branches present in this snapshot, +- `id` [bytes] the intrinsic identifier of the :py:class:`Snapshot` object + +with `branches` being a dict whose keys are branch names [bytes] and whose values are dicts of: + +- `target` [bytes] intrinsic identifier of the targeted object +- `target_type` [string] the type of the targeted object (can be "content", "directory", + "revision", "release", "snapshot" or "alias"). 
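A consumer will often want to follow "alias" branches to the object they finally point at. The following is only a minimal sketch: it assumes, as the example for this topic suggests, that the `target` of an "alias" branch holds the name of another branch of the same snapshot. The function name `resolve_branch` is hypothetical and not part of any |swh| API.

```python
from typing import Dict, Optional


def resolve_branch(branches: Dict[bytes, dict], name: bytes) -> Optional[dict]:
    """Follow "alias" branches until a non-alias branch is reached.

    Returns the resolved branch dict, or None if the name is unknown
    or the aliases form a cycle.
    """
    seen = set()
    while name not in seen:
        seen.add(name)
        branch = branches.get(name)
        if branch is None:
            # Unknown branch name: nothing to resolve.
            return None
        if branch["target_type"] != "alias":
            return branch
        # Assumption: for an alias, `target` is the name of another branch.
        name = branch["target"]
    # We came back to a branch already visited: alias cycle.
    return None
```

The `seen` set guards against aliases that point at each other, which would otherwise make the loop run forever.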
+ +Example: + +``` +{ + 'branches': {b'refs/pull/1/head': {'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c', + 'target_type': 'revision'}, + b'refs/pull/2/head': {'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ', + 'target_type': 'revision'}, + b'refs/heads/master': {'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT', + 'target_type': 'revision'}, + b'HEAD': {'target': b'refs/heads/master', + 'target_type': 'alias'}}, + 'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^' +} +``` + + +`swh.journal.objects.release` ++++++++++++++++++++++++++++++ + +Topic for :py:class:`Release` objects. + +Message format: + +- `name` [bytes] name (typically the version) of the release +- `message` [bytes] message of the release +- `target` [bytes] identifier of the target object +- `target_type` [string] type of the target, can be "content", "directory", + "revision", "release" or "snapshot" +- `synthetic` [bool] True if the :py:class:`Release` object has been forged by the loading + process; this flag is not used for the id computation, +- `author` [dict] the author of the release +- `date` [dict] the date of the release +- `id` [bytes] the intrinsic identifier of the :py:class:`Release` object + +Example: + +``` +{ + 'name': b'0.3', + 'message': b'', + 'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d', + 'target_type': 'revision', + 'synthetic': False, + 'author': {'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9', + 'name': None, + 'email': None}, + 'date': {'timestamp': {'seconds': 1480432642, 'microseconds': 0}, + 'offset': 180, + 'negative_utc': False}, + 'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86' +} +``` + + +`swh.journal.objects.revision` +++++++++++++++++++++++++++++++ + +Topic for :py:class:`Revision` objects. 
+ +Message format: + +- `message` [bytes] the commit message for the revision +- `author` [dict] the author of the revision +- `committer` [dict] the committer of the revision +- `date` [dict] the revision date +- `committer_date` [dict] the revision commit date +- `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg") +- `directory` [bytes] the intrinsic identifier of the directory this revision links to +- `synthetic` [bool] whether this :py:class:`Revision` is synthetic or not, +- `metadata` [bytes] the metadata linked to this :py:class:`Revision` (not part of the + intrinsic identifier computation), +- `parents` [list[bytes]] list of parent :py:class:`Revision` intrinsic identifiers +- `id` [bytes] intrinsic identifier of the :py:class:`Revision` +- `extra_headers` [list[(bytes, bytes)]] TODO + + +Example: + +``` +{ + 'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. 
lufylib.red is a place for support code.\n', + 'author': { + 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', + 'name': None, + 'email': None + }, + 'committer': { + 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', + 'name': None, + 'email': None + }, + 'date': { + 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, + 'offset': 0, + 'negative_utc': False + }, + 'committer_date': { + 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, + 'offset': 0, + 'negative_utc': False + }, + 'type': 'svn', + 'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe', + 'synthetic': True, + 'metadata': None, + 'parents': [ + b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c', + ], + 'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8', + 'perms': 33188}, + {'name': b'lib', + 'type': 'dir', + 'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U', + 'perms': 16384}, + {'name': b'package.json', + 'type': 'file', + 'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x', + 'perms': 33188} + ], + 'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P' +} +``` + + +Other Objects Topics +-------------------- + +These topics are for objects of the |swh| archive that are not part of the +Merkle DAG but are essential parts of the archive; see the :ref:`data model +` for more details. + + +`swh.journal.objects.origin` +++++++++++++++++++++++++++++ + +Topic for :py:class:`Origin` objects. + +Message format: + +- `url` [string] URL of the :py:class:`Origin` + +Example: +``` +{ + "url": "https://github.com/vujkovicm/pml" +} +``` + +`swh.journal.objects.origin_visit` +++++++++++++++++++++++++++++++++++ + +Topic for :py:class:`OriginVisit` objects. 
+ +Message format: + +- `origin` [string] URL of the visited :py:class:`Origin` +- `date` [dict] date of the visit +- `type` [string] type of the loader used to perform the visit +- `visit` [int] number of the visit for this `origin` + +Example: + +``` +{ + 'origin': 'https://pypi.org/project/wasp-eureka/', + 'date': {b'swhtype': 'datetime', + b'd': '2020-09-15T16:19:29.554608+00:00'}, + 'type': 'pypi', + 'visit': 505 +} +``` + +`swh.journal.objects.origin_visit_status` ++++++++++++++++++++++++++++++++++++++++++ + +Topic for :py:class:`OriginVisitStatus` objects. + +Message format: + +- `origin` [string] URL of the visited :py:class:`Origin` +- `visit` [int] number of the visit of this `origin` that this status concerns +- `date` [dict] date of the visit status update +- `status` [string] status (can be "created", "ongoing", "full" or "partial"), +- `snapshot` [bytes] identifier of the :py:class:`Snapshot` this visit resulted in (if + `status` is "full" or "partial") +- `metadata`: deprecated + +Example: + +``` +{ + 'origin': 'https://pypi.org/project/stricttype/', + 'visit': 524, + 'date': {b'swhtype': 'datetime', + b'd': '2020-09-15T16:19:19.968050+00:00'}, + 'status': 'full', + 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7", + 'metadata': None +} +``` + +.. _kafka: https://kafka.apache.org +.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms +.. _msgpack: https://msgpack.org/ +.. _`extended types`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
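The extended integer format and the two date encodings described under "Kafka message formats" can be turned back into native Python values roughly as follows. This is a sketch using only the standard library, not the |swh| implementation: all function names are hypothetical, wiring the integer helpers into a msgpack library's extension hook is left out, and the `negative_utc` flag is ignored because `datetime` cannot represent a negative zero offset.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def decode_extended_int(type_code: int, payload: bytes) -> int:
    """Decode an extended integer: type 1 is positive, type 2 is negative."""
    magnitude = int.from_bytes(payload, "big")  # big-endian absolute value
    if type_code == 1:
        return magnitude
    if type_code == 2:
        return -magnitude
    raise ValueError(f"unexpected extended integer type: {type_code}")


def encode_extended_int(value: int) -> Tuple[int, bytes]:
    """Encode an integer as (type code, big-endian bytes of its absolute value)."""
    magnitude = abs(value)
    # Assumption: zero is emitted as a single null byte with type 1.
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return (2 if value < 0 else 1), magnitude.to_bytes(length, "big")


def decode_git_date(date: dict) -> datetime:
    """Convert a git-style date dict into a timezone-aware datetime."""
    ts = date["timestamp"]
    tz = timezone(timedelta(minutes=date["offset"]))  # offset is in minutes
    moment = datetime.fromtimestamp(ts["seconds"], tz)
    return moment + timedelta(microseconds=ts["microseconds"])


def decode_swh_datetime(date: dict) -> datetime:
    """Convert a |swh| datetime dict (bytes keys, string values) into a datetime."""
    if date[b"swhtype"] != "datetime":
        raise ValueError("not a datetime dict")
    return datetime.fromisoformat(date[b"d"])


# Usage with values taken from the examples in this document.
release_date = decode_git_date({
    "timestamp": {"seconds": 1480432642, "microseconds": 0},
    "offset": 180,
    "negative_utc": False,
})
visit_date = decode_swh_datetime({
    b"swhtype": "datetime",
    b"d": "2020-09-15T16:19:19.968050+00:00",
})
minus_42 = decode_extended_int(2, bytes([0x2A]))  # -42
```

Both date helpers return timezone-aware `datetime` objects, so dates from the two encodings can be compared directly.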