
D4633.id16455.diff

diff --git a/docs/journal.rst b/docs/journal.rst
new file mode 100644
--- /dev/null
+++ b/docs/journal.rst
@@ -0,0 +1,426 @@
+.. _journal-specs:
+
+Software Heritage Journal --- Specifications
+============================================
+
+The |swh| journal is a kafka_-based stream of events for every added object in
+the |swh| Archive and some of its related services, especially indexers.
+
+Each topic_ will stream added elements for a given object type according to the
+topic name.
+
+Objects streamed in a topic are serialized versions of objects stored in the
+|swh| Archive specified by the |swh| :ref:`object model <swh-model>`.
+
+This document describes the messages expected in each topic, so that a
+potential consumer can easily cope with the |swh| journal without having to
+read the source code or the |swh| :ref:`data model <swh-model>` in detail (it
+is however recommended to familiarize yourself with the latter).
+
+There are several groups of topics:
+
+- main storage Merkle-DAG related topics,
+- metadata related topics,
+- indexation related topics.
+
+
+Kafka message formats
+---------------------
+
+Each value of a kafka message in a topic is a dictionary-like structure
+encoded as a msgpack_ byte string.
+
+Keys (at first level) are ASCII strings.
+
+Most values are either string, simple integer or bytes, but there are also a
+few custom formats (`extended types`_).
+
+- integers: to prevent overflow while packing some objects, an extended integer
+ format is used. The type information can be:
+
+ - `1` for positive (possibly long) integers,
+ - `2` for negative (possibly long) integers.
+
+ The value is simply the bytes (big endian) representation of the absolute
+ value (always positive).
+
+ For example:
+
+ - `12345` => `[1, [0x30, 0x39]]`
+ - `-42` => `[2, [0x2A]]`
+
+- dates: there are 2 types of date that can be encoded in kafka messages,
+ depending on the object type they belong to. They both are encoded as dict,
+ but with different keys.
+
+ - dates that come from the git data model (present in :py:class:`Revision`
+ and :py:class:`Release` objects) are encoded as:
+
+ - `timestamp` [dict] POSIX timestamp of the date, as a dict with 2 keys
+ ("seconds" and "microseconds")
+ - `offset` [int] offset of the date (in minutes)
+ - `negative_utc` [bool] only True for the very edge case where the date has
+ a zero but negative offset value (which technically the git format
+ permits)
+
+ example:
+
+ ```
+ {
+ 'timestamp': {'seconds': 1480432642, 'microseconds': 0},
+ 'offset': 180,
+ 'negative_utc': False
+ }
+ ```
+
+ - dates that come from the |swh| archiving processes are encoded as a
+ dict **whose keys are bytes** and **whose values are strings**:
+
+ - `swhtype`: "datetime"
+ - `d`: ISO 8601 representation of the date
+
+ example:
+
+ ```
+ {
+ b'swhtype': 'datetime',
+ b'd': '2020-09-15T16:19:19.968050+00:00'
+ }
+ ```
+
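The extended integer format described above can be sketched in Python as follows. This is a minimal illustration, not the journal's actual serializer: the function names `encode_extended_int` and `decode_extended_int` are hypothetical, and treating zero as a "positive" integer with a one-byte payload is an assumption.

```python
from typing import Tuple


def encode_extended_int(value: int) -> Tuple[int, bytes]:
    """Return (type_code, payload) for an integer of any size."""
    # 1 marks a positive integer, 2 a negative one (zero is assumed positive).
    type_code = 1 if value >= 0 else 2
    magnitude = abs(value)
    # Use the shortest big-endian form, but at least one byte (assumption).
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return type_code, magnitude.to_bytes(length, "big")


def decode_extended_int(type_code: int, payload: bytes) -> int:
    """Rebuild the integer from its type code and big-endian payload."""
    magnitude = int.from_bytes(payload, "big")
    if type_code == 1:
        return magnitude
    if type_code == 2:
        return -magnitude
    raise ValueError(f"unknown extended integer type: {type_code}")
```

For instance, `encode_extended_int(-42)` yields the type code `2` with the single payload byte `0x2A`, and decoding that pair gives back `-42`.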
+
+Merkle-DAG Topics
+-----------------
+
+These topics are for the core of the |swh| archive, see the :ref:`data model
+<swh-model>` for more details.
+
+
+`swh.journal.objects.snapshot`
+++++++++++++++++++++++++++++++
+
+Topic for :py:class:`Snapshot` objects.
+
+Message format:
+
+- `branches` [dict] branches present in this snapshot,
+- `id` [bytes] the intrinsic identifier of the :py:class:`Snapshot` object
+
+with `branches` being a dict whose keys are branch names [bytes], and whose values are dicts of:
+
+- `target` [bytes] intrinsic identifier of the targeted object
+- `target_type` [string] the type of the targeted object (can be "content", "directory",
+ "revision", "release", "snapshot" or "alias").
+
+Example:
+
+```
+{
+ 'branches': {b'refs/pull/1/head': {'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c',
+ 'target_type': 'revision'},
+ b'refs/pull/2/head': {'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ',
+ 'target_type': 'revision'},
+ b'refs/heads/master': {'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT',
+ 'target_type': 'revision'},
+ b'HEAD': {'target': b'refs/heads/master',
+ 'target_type': 'alias'}},
+ 'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^'
+}
+```
+
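In the example above, `HEAD` is an "alias" branch whose `target` is the name of another branch rather than an object identifier. A consumer that needs the object behind an alias could follow the chain as in this sketch. The helper `resolve_branch` is hypothetical, and the cycle check and the handling of missing branches are defensive assumptions, not behavior stated by this specification.

```python
from typing import Dict, Optional


def resolve_branch(
    branches: Dict[bytes, Optional[dict]], name: bytes
) -> Optional[dict]:
    """Follow "alias" branches until a branch targeting a real object is found.

    Returns the final branch dict, or None if the chain ends on a missing branch.
    """
    seen = set()
    while True:
        if name in seen:
            raise ValueError(f"alias cycle detected at {name!r}")
        seen.add(name)
        branch = branches.get(name)
        if branch is None:
            return None
        if branch["target_type"] != "alias":
            return branch
        # For an alias, `target` holds the name of another branch.
        name = branch["target"]


# Two branches taken from the snapshot example above.
snapshot_branches = {
    b"refs/heads/master": {
        "target": b"\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT",
        "target_type": "revision",
    },
    b"HEAD": {"target": b"refs/heads/master", "target_type": "alias"},
}
head = resolve_branch(snapshot_branches, b"HEAD")
```

Here `head` is the `refs/heads/master` branch dict, so `head["target"]` is the 20-byte identifier of a revision.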
+
+`swh.journal.objects.release`
++++++++++++++++++++++++++++++
+
+Topic for :py:class:`Release` objects.
+
+Message format:
+
+- `name` [bytes] name (typically the version) of the release
+- `message` [bytes] message of the release
+- `target` [bytes] identifier of the target object
+- `target_type` [string] type of the target, can be "content", "directory",
+ "revision", "release" or "snapshot"
+- `synthetic` [bool] True if the :py:class:`Release` object has been forged by the loading
+ process; this flag is not used for the id computation,
+- `author` [dict] the author of the release
+- `date` [dict] the date of the release
+- `id` [bytes] the intrinsic identifier of the :py:class:`Release` object
+
+Example:
+
+```
+{
+ 'name': b'0.3',
+ 'message': b'',
+ 'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d',
+ 'target_type': 'revision',
+ 'synthetic': False,
+ 'author': {'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9',
+ 'name': None,
+ 'email': None},
+ 'date': {'timestamp': {'seconds': 1480432642, 'microseconds': 0},
+ 'offset': 180,
+ 'negative_utc': False},
+ 'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86'
+}
+```
+
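The `date` dict of a release (and likewise the `date` and `committer_date` of a revision) can be turned into a timezone-aware Python `datetime`, as in the sketch below. The helper name `git_date_to_datetime` is hypothetical. The sketch assumes `offset` counts minutes east of UTC, and it ignores `negative_utc`, since `datetime` cannot represent a negative zero offset.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def git_date_to_datetime(date: dict) -> datetime:
    """Convert a git-style date dict into an aware datetime."""
    ts = date["timestamp"]
    # Build the instant in UTC first, then express it in the recorded offset.
    utc_moment = EPOCH + timedelta(
        seconds=ts["seconds"], microseconds=ts["microseconds"]
    )
    return utc_moment.astimezone(timezone(timedelta(minutes=date["offset"])))


# The date from the release example above.
release_date = {
    "timestamp": {"seconds": 1480432642, "microseconds": 0},
    "offset": 180,
    "negative_utc": False,
}
when = git_date_to_datetime(release_date)
```

Adding a `timedelta` to a fixed epoch, rather than calling `datetime.fromtimestamp`, avoids platform limits on very old or very distant timestamps.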
+
+`swh.journal.objects.revision`
+++++++++++++++++++++++++++++++
+
+Topic for :py:class:`Revision` objects.
+
+Message format:
+
+- `message` [bytes] the commit message for the revision
+- `author` [dict] the author of the revision
+- `committer` [dict] the committer of the revision
+- `date` [dict] the revision date
+- `committer_date` [dict] the revision commit date
+- `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg")
+- `directory` [bytes] the intrinsic identifier of the directory this revision links to
+- `synthetic` [bool] whether this :py:class:`Revision` is synthetic or not,
+- `metadata` [bytes] the metadata linked to this :py:class:`Revision` (not part of the
+ intrinsic identifier computation),
+- `parents` [list[bytes]] list of parent :py:class:`Revision` intrinsic identifiers
+- `id` [bytes] intrinsic identifier of the :py:class:`Revision`
+- `extra_headers` [list[(bytes, bytes)]] TODO
+
+
+Example:
+
+```
+{
+ 'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. lufylib.red is a place for support code.\n',
+ 'author': {
+ 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
+ 'name': None,
+ 'email': None
+ },
+ 'committer': {
+ 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
+ 'name': None,
+ 'email': None
+ },
+ 'date': {
+ 'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
+ 'offset': 0,
+ 'negative_utc': False
+ },
+ 'committer_date': {
+ 'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
+ 'offset': 0,
+ 'negative_utc': False
+ },
+ 'type': 'svn',
+ 'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe',
+ 'synthetic': True,
+ 'metadata': None,
+ 'parents': [
+ b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c',
+ ],
+ 'id': b'\x1e\x1c\x19<l\xaa\xd2~{P\x11jH\x0f\xfd\xb0Y\x86\x99\x08',
+ 'extra_headers': [
+ [b'svn_repo_uuid', b'2bfe0521-f11c-4a00-b80e-6202646ff360'],
+ [b'svn_revision', b'4067']
+ ]
+}
+```
+
+
+`swh.journal.objects.content`
++++++++++++++++++++++++++++++
+
+Topic for :py:class:`Content` objects.
+
+Message format:
+
+- `sha1` [bytes] SHA1 of the :py:class:`Content`
+- `sha1_git` [bytes] SHA1_GIT of the :py:class:`Content`
+- `sha256` [bytes] SHA256 of the :py:class:`Content`
+- `blake2s256` [bytes] Blake2S256 hash of the :py:class:`Content`
+- `length` [int] length of the :py:class:`Content`
+- `status` [string] visibility status of the :py:class:`Content` (can be "visible" or "hidden")
+- `ctime` [dict] creation date of the :py:class:`Content` (i.e. date at which this
+ :py:class:`Content` has been seen for the first time in the |swh| Archive).
+
+Example:
+
+```
+{
+ 'sha1': b'-\xe7\xc1`\x9d\xd7\x7fu+\x05l\x07\xd1}\x95\x16o-u\x1d',
+ 'sha1_git': b'\xb9B\xa7EOW[\xef\x8b\x98\xa6b\xe9\xc7\xf0\x96g\x06`\xa4',
+ 'sha256': b'h{\xda\x8d\xaeG\xa4\xc6\x10\x05\xbc\xc9hca\x0em)\xd3A\x08\xd6\x95~(\xe5\xba\xe4\xaa\xcaT\x19',
+ 'blake2s256': b'\x8cl\xec\xe8S\xcd\xab\x90E\xc2\x8c\xfax\xe3\xbe\xca\x9aJ6\x1a\x9c](6\xc3\xb49\x8b:\xf9\xd8r',
+ 'length': 3220,
+ 'status': 'visible',
+ 'ctime': {b'swhtype': 'datetime',
+ b'd': '2020-09-15T16:19:13.037809+00:00'}
+}
+```
+
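The `ctime` value above uses the archive-side date encoding: a dict with bytes keys and string values. A consumer might parse it as in this sketch. The helper name `parse_swh_datetime` is hypothetical, and rejecting dicts whose `swhtype` is not "datetime" is a defensive assumption.

```python
from datetime import datetime


def parse_swh_datetime(value: dict) -> datetime:
    """Parse a {b'swhtype': 'datetime', b'd': <ISO 8601 string>} dict."""
    if value.get(b"swhtype") != "datetime":
        raise ValueError(f"not a datetime dict: {value!r}")
    # The ISO 8601 string carries its own UTC offset, so the result is aware.
    return datetime.fromisoformat(value[b"d"])


# The ctime from the content example above.
ctime = parse_swh_datetime(
    {b"swhtype": "datetime", b"d": "2020-09-15T16:19:13.037809+00:00"}
)
```

Note that the keys must be looked up as bytes (`b"d"`, not `"d"`).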
+
+`swh.journal.objects.skipped_content`
++++++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`SkippedContent` objects.
+
+
+Message format:
+
+- `sha1` [bytes] SHA1 of the :py:class:`SkippedContent`
+- `sha1_git` [bytes] SHA1_GIT of the :py:class:`SkippedContent`
+- `sha256` [bytes] SHA256 of the :py:class:`SkippedContent`
+- `blake2s256` [bytes] Blake2S256 hash of the :py:class:`SkippedContent`
+- `length` [int] length of the :py:class:`SkippedContent`
+- `status` [string] visibility status of the :py:class:`SkippedContent` (can only be "absent")
+- `reason` [string] message indicating the reason for this content to be a
+ :py:class:`SkippedContent` (rather than a :py:class:`Content`).
+- `ctime` [dict] creation date of the :py:class:`SkippedContent` (i.e. date at which this
+ :py:class:`SkippedContent` has been seen for the first time in the |swh| Archive).
+
+
+Example:
+
+```
+{
+ 'sha1': b'[\x0f\x19I-%+\xec\x9dS\x86\xffz\xcb\xa2\x9f\x15\xcc\xb4&',
+ 'sha1_git': b'\xa9\xff4\xa7\xff\x85\xb3x$Ot\xaa\x91\x0b\xd0ZB!\x04\x8a',
+ 'sha256': b"\xe6\x876\xb2U-\x87\xb8\xe3\x12\xa0L\rq'\x88\xd4\x95\x92\xdf\x86\xfci\xe3E\x82\xe0\x95^\xbf\x1e\xbe",
+ 'blake2s256': b'\xe1 \n\x1d5\x8b\x1f\x98\\\x8e\xaa\x1d?8*\xc1\xf7\xb9\x95\r|\x1e\xee^\x10\x10\x19\xc6\x9c\x11\xedX',
+ 'length': 125146729,
+ 'status': 'absent',
+ 'reason': 'Content too large',
+ 'ctime': {b'swhtype': 'datetime',
+ b'd': '2020-11-24T23:26:47.818260+00:00'}
+}
+```
+
+
+`swh.journal.objects.directory`
++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`Directory` objects.
+
+Message format:
+
+- `entries` [list[dict]] list of directory entries
+- `id` [bytes] intrinsic identifier of this :py:class:`Directory`
+
+with directory entries being dicts:
+
+- `name` [bytes] name of the directory entry
+- `type` [string] type of directory entry (can be "file", "dir" or "rev")
+- `target` [bytes] intrinsic identifier of the targeted object
+- `perms` [int] permissions for this directory entry
+
+
+Example:
+
+```
+{
+ 'entries': [
+ {'name': b'LICENSE',
+ 'type': 'file',
+ 'target': b'b\x03f\xeb\x90\x07\x1cs\xaeib\x8eg\x97]0\xf0\x9dg\x01',
+ 'perms': 33188},
+ {'name': b'README.md',
+ 'type': 'file',
+ 'target': b'\x1e>\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8',
+ 'perms': 33188},
+ {'name': b'lib',
+ 'type': 'dir',
+ 'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U',
+ 'perms': 16384},
+ {'name': b'package.json',
+ 'type': 'file',
+ 'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x',
+ 'perms': 33188}
+ ],
+ 'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P'
+}
+```
+
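The `perms` values in the example are easier to read in octal: `33188` is `0o100644` and `16384` is `0o040000`. Assuming they follow the usual Unix `st_mode` layout (an assumption suggested by these values, not stated by this specification), Python's standard `stat` module can decode them. The helper `describe_perms` is hypothetical.

```python
import stat


def describe_perms(perms: int) -> str:
    """Render a perms integer as octal plus an `ls -l` style mode string."""
    return f"{perms:o} {stat.filemode(perms)}"


# Values taken from the directory example above.
license_perms = describe_perms(33188)
lib_perms = describe_perms(16384)
```

With these inputs, `license_perms` is `"100644 -rw-r--r--"` and `lib_perms` is `"40000 d---------"`.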
+
+Other Objects Topics
+--------------------
+
+These topics are for objects of the |swh| archive that are not part of the
+Merkle DAG but are essential parts of the archive; see the :ref:`data model
+<swh-model>` for more details.
+
+
+`swh.journal.objects.origin`
+++++++++++++++++++++++++++++
+
+Topic for :py:class:`Origin` objects.
+
+Message format:
+
+- `url` [string] URL of the :py:class:`Origin`
+
+Example:
+
+```
+{
+ "url": "https://github.com/vujkovicm/pml"
+}
+```
+
+`swh.journal.objects.origin_visit`
+++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`OriginVisit` objects.
+
+Message format:
+
+- `origin` [string] URL of the visited :py:class:`Origin`
+- `date` [dict] date of the visit
+- `type` [string] type of the loader used to perform the visit
+- `visit` [int] number of the visit for this `origin`
+
+Example:
+
+```
+{
+ 'origin': 'https://pypi.org/project/wasp-eureka/',
+ 'date': {b'swhtype': 'datetime',
+ b'd': '2020-09-15T16:19:29.554608+00:00'},
+ 'type': 'pypi',
+ 'visit': 505
+}
+```
+
+`swh.journal.objects.origin_visit_status`
++++++++++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`OriginVisitStatus` objects.
+
+Message format:
+
+- `origin` [string] URL of the visited :py:class:`Origin`
+- `visit` [int] number of the visit for this `origin` this status concerns
+- `date` [dict] date of the visit status update
+- `status` [string] status (can be "created", "ongoing", "full" or "partial"),
+- `snapshot` [bytes] identifier of the :py:class:`Snapshot` this visit resulted in (if
+ `status` is "full" or "partial")
+- `metadata`: deprecated
+
+Example:
+
+```
+{
+ 'origin': 'https://pypi.org/project/stricttype/',
+ 'visit': 524,
+ 'date': {b'swhtype': 'datetime',
+ b'd': '2020-09-15T16:19:19.968050+00:00'},
+ 'status': 'full',
+ 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7",
+ 'metadata': None
+}
+```
+
+.. _kafka: https://kafka.apache.org
+.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms
+.. _msgpack: https://msgpack.org/
+.. _`extended types`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
