diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -149,6 +149,7 @@
    architecture
    getting-started
    developer-setup
+   journal
    API documentation
    swh.core
    swh.dataset
diff --git a/docs/journal.rst b/docs/journal.rst
new file mode 100644
--- /dev/null
+++ b/docs/journal.rst
@@ -0,0 +1,673 @@
+.. _journal-specs:
+
+Software Heritage Journal --- Specifications
+============================================
+
+The |swh| journal is a kafka_-based stream of events for every added object in
+the |swh| Archive and some of its related services, especially indexers.
+
+Each topic_ streams the added elements of a given object type, as indicated by
+the topic name.
+
+Objects streamed in a topic are serialized versions of objects stored in the
+|swh| Archive specified by the main |swh| :py:mod:`data model ` or
+the :py:mod:`indexer object model `.
+
+
+In this document we describe the expected messages in each topic, so that a
+potential consumer can easily cope with the |swh| journal without having to
+read the source code or the |swh| :ref:`data model ` in detail (it
+is, however, recommended to familiarize yourself with the latter).
+
+Kafka message values are dictionary structures serialized as msgpack_, with a
+few custom encodings. See the section `Kafka message format`_ below for a
+complete description of the serialization format.
+
+Note that each example given below shows the dictionary before it is serialized
+as a msgpack_ chunk.
+
+
+Topics
+------
+
+There are several groups of topics:
+
+- main storage Merkle-DAG related topics,
+- other storage objects (not part of the Merkle DAG),
+- indexer related objects (not yet documented below).
+
+The topic prefix can be either `swh.journal.objects` or
+`swh.journal.objects_privileged` (see below).
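As a minimal sketch of the naming scheme, a topic name can be built from one of the two prefixes and an object type. The helper ``topic_name`` and its constants are hypothetical (they are not part of any |swh| package), and the restriction of the privileged prefix to releases and revisions is taken from the "Anonymized topics" section of this document:

```python
# Hypothetical helper, for illustration only: not part of any swh package.
ANONYMIZED_PREFIX = "swh.journal.objects"
PRIVILEGED_PREFIX = "swh.journal.objects_privileged"

# Per this document, only these object types carry user information and
# therefore exist in both an anonymized and a pristine (privileged) version.
PRIVILEGED_TYPES = {"release", "revision"}


def topic_name(object_type: str, privileged: bool = False) -> str:
    """Build the Kafka topic name for a given object type."""
    if privileged:
        if object_type not in PRIVILEGED_TYPES:
            raise ValueError(f"no privileged topic for {object_type!r}")
        return f"{PRIVILEGED_PREFIX}.{object_type}"
    return f"{ANONYMIZED_PREFIX}.{object_type}"


print(topic_name("snapshot"))
print(topic_name("revision", privileged=True))
```

Reading a privileged topic additionally requires credentials with the matching ACLs, which this sketch does not model.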
+
+Anonymized topics
++++++++++++++++++
+
+For topics that transport messages with user information (name and email
+address), namely `swh.journal.objects.release`_ and
+`swh.journal.objects.revision`_, there are two versions: an anonymized topic,
+in which user information is obfuscated, and a pristine topic with clear
+data.
+
+Access to pristine topics depends on ACLs linked to credentials used to connect
+to the Kafka cluster.
+
+
+List of topics
+++++++++++++++
+
+- `swh.journal.objects.origin`_
+- `swh.journal.objects.origin_visit`_
+- `swh.journal.objects.origin_visit_status`_
+- `swh.journal.objects.snapshot`_
+- `swh.journal.objects.release`_
+- `swh.journal.objects_privileged.release` (pristine version of `swh.journal.objects.release`_)
+- `swh.journal.objects.revision`_
+- `swh.journal.objects_privileged.revision` (pristine version of `swh.journal.objects.revision`_)
+- `swh.journal.objects.directory`_
+- `swh.journal.objects.content`_
+- `swh.journal.objects.skippedcontent`_
+- `swh.journal.objects.metadata_authority`_
+- `swh.journal.objects.metadata_fetcher`_
+- `swh.journal.objects.raw_extrinsic_metadata`_
+
+
+
+Topics for Merkle-DAG objects
+-----------------------------
+
+These topics are for the various objects stored in the |swh| Merkle DAG, see
+the :ref:`data model ` for more details.
+
+
+`swh.journal.objects.snapshot`
+++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.Snapshot` objects.
+
+Message format:
+
+- `branches` [dict] branches present in this snapshot,
+- `id` [bytes] the intrinsic identifier of the
+  :py:class:`swh.model.model.Snapshot` object
+
+with `branches` being a dictionary whose keys are branch names [bytes] and whose values are dictionaries of:
+
+- `target` [bytes] intrinsic identifier of the targeted object
+- `target_type` [string] the type of the targeted object (can be "content",
+  "directory", "revision", "release", "snapshot" or "alias").
+
+Example:
+
+.. code:: python
+
+   {
+       'branches': {
+           b'refs/pull/1/head': {
+               'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c',
+               'target_type': 'revision'
+           },
+           b'refs/pull/2/head': {
+               'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ',
+               'target_type': 'revision'
+           },
+           b'refs/heads/master': {
+               'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT',
+               'target_type': 'revision'
+           },
+           b'HEAD': {
+               'target': b'refs/heads/master',
+               'target_type': 'alias'
+           }
+       },
+       'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^'
+   }
+
+
+
+`swh.journal.objects.release`
++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.Release` objects.
+
+This topic is anonymized. The non-anonymized version of this topic is
+`swh.journal.objects_privileged.release`.
+
+Message format:
+
+- `name` [bytes] name (typically the version) of the release
+- `message` [bytes] message of the release
+- `target` [bytes] identifier of the target object
+- `target_type` [string] type of the target, can be "content", "directory",
+  "revision", "release" or "snapshot"
+- `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has
+  been forged by the loading process; this flag is not used for the id
+  computation,
+- `author` [dict] the author of the release
+- `date` [gitdate] the date of the release
+- `id` [bytes] the intrinsic identifier of the
+  :py:class:`swh.model.model.Release` object
+
+Example:
+
+.. code:: python
+
+   {
+       'name': b'0.3',
+       'message': b'',
+       'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d',
+       'target_type': 'revision',
+       'synthetic': False,
+       'author': {
+           'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9',
+           'name': None,
+           'email': None
+       },
+       'date': {
+           'timestamp': {
+               'seconds': 1480432642,
+               'microseconds': 0
+           },
+           'offset': 180,
+           'negative_utc': False
+       },
+       'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86'
+   }
+
+
+`swh.journal.objects.revision`
+++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.Revision` objects.
+
+This topic is anonymized. The non-anonymized version of this topic is
+`swh.journal.objects_privileged.revision`.
+
+Message format:
+
+- `message` [bytes] the commit message for the revision
+- `author` [dict] the author of the revision
+- `committer` [dict] the committer of the revision
+- `date` [gitdate] the revision date
+- `committer_date` [gitdate] the revision commit date
+- `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg")
+- `directory` [bytes] the intrinsic identifier of the directory this revision links to
+- `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not,
+- `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the
+  intrinsic identifier computation),
+- `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers
+- `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision`
+- `extra_headers` [list[(bytes, bytes)]] TODO
+
+
+Example:
+
+.. code:: python
+
+   {
+       'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. lufylib.red is a place for support code.\n',
+       'author': {
+           'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
+           'name': None,
+           'email': None
+       },
+       'committer': {
+           'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
+           'name': None,
+           'email': None
+       },
+       'date': {
+           'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
+           'offset': 0,
+           'negative_utc': False
+       },
+       'committer_date': {
+           'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
+           'offset': 0,
+           'negative_utc': False
+       },
+       'type': 'svn',
+       'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe',
+       'synthetic': True,
+       'metadata': None,
+       'parents': [
+           b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c',
+       ],
+       'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8',
+            'perms': 33188},
+           {'name': b'lib',
+            'type': 'dir',
+            'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U',
+            'perms': 16384},
+           {'name': b'package.json',
+            'type': 'file',
+            'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x',
+            'perms': 33188}
+       ],
+       'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P'
+   }
+
+
+
+Other Objects Topics
+--------------------
+
+These topics are for objects of the |swh| archive that are not part of the
+Merkle DAG but are essential parts of the archive; see the :ref:`data model
+` for more details.
+
+
+`swh.journal.objects.origin`
+++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.Origin` objects.
+
+Message format:
+
+- `url` [string] URL of the :py:class:`swh.model.model.Origin`
+
+Example:
+
+.. code:: json
+
+   {
+       "url": "https://github.com/vujkovicm/pml"
+   }
+
+
+`swh.journal.objects.origin_visit`
+++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.OriginVisit` objects.
+
+Message format:
+
+- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin`
+- `date` [timestamp] date of the visit
+- `type` [string] type of the loader used to perform the visit
+- `visit` [int] number of the visit for this `origin`
+
+Example:
+
+.. code:: python
+
+   {
+       'origin': 'https://pypi.org/project/wasp-eureka/',
+       'date': Timestamp(seconds=1606260407, nanoseconds=818259954),
+       'type': 'pypi',
+       'visit': 505
+   }
+
+
+`swh.journal.objects.origin_visit_status`
++++++++++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.OriginVisitStatus` objects.
+
+Message format:
+
+- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin`
+- `visit` [int] number of the visit for this `origin` this status concerns
+- `date` [timestamp] date of the visit status update
+- `status` [string] status (can be "created", "ongoing", "full" or "partial"),
+- `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snapshot` this
+  visit resulted in (if `status` is "full" or "partial")
+- `metadata`: deprecated
+
+Example:
+
+.. code:: python
+
+   {
+       'origin': 'https://pypi.org/project/stricttype/',
+       'visit': 524,
+       'date': Timestamp(seconds=1606260407, nanoseconds=818259954),
+       'status': 'full',
+       'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7",
+       'metadata': None
+   }
+
+
+
+Extrinsic Metadata related Topics
+---------------------------------
+
+Extrinsic metadata is information about software that is not part of the source
+code itself but still closely related to the software. See
+:ref:`extrinsic-metadata-specification` for more details on the Extrinsic
+Metadata model.
+
+`swh.journal.objects.metadata_authority`
+++++++++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.MetadataAuthority` objects.
+
+Message format:
+
+- `type` [string]
+- `url` [string]
+- `metadata` [dict]
+
+Examples:
+
+.. code:: python
+
+   {
+       'type': 'forge',
+       'url': 'https://guix.gnu.org/sources.json',
+       'metadata': {}
+   }
+
+   {
+       'type': 'deposit_client',
+       'url': 'https://www.softwareheritage.org',
+       'metadata': {'name': 'swh'}
+   }
+
+
+
+`swh.journal.objects.metadata_fetcher`
+++++++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.MetadataFetcher` objects.
+
+Message format:
+
+- `name` [string]
+- `version` [string]
+- `metadata` [dict]
+
+Example:
+
+.. code:: python
+
+   {
+       'name': 'swh.loader.package.cran.loader.CRANLoader',
+       'version': '0.15.0',
+       'metadata': {}
+   }
+
+
+
+`swh.journal.objects.raw_extrinsic_metadata`
+++++++++++++++++++++++++++++++++++++++++++++
+
+Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects.
+
+Message format:
+
+- `type` [string]
+- `target` [string]
+- `discovery_date` [timestamp]
+- `authority` [dict]
+- `fetcher` [dict]
+- `format` [string]
+- `metadata` [bytes]
+- `origin` [string]
+- `visit` [int]
+- `snapshot` [SWHID]
+- `release` [SWHID]
+- `revision` [SWHID]
+- `path` [bytes]
+- `directory` [SWHID]
+
+Example:
+
+.. code:: python
+
+   {
+       'type': 'snapshot',
+       'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3',
+       'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954),
+       'authority': {
+           'type': 'forge',
+           'url': 'https://pypi.org/',
+           'metadata': {}
+       },
+       'fetcher': {
+           'name': 'swh.loader.package.pypi.loader.PyPILoader',
+           'version': '0.10.0',
+           'metadata': {}
+       },
+       'format': 'pypi-project-json',
+       'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}',
+       'origin': 'https://pypi.org/project/schwurbler/'
+   }
+
+
+
+
+
+Kafka message format
+--------------------
+
+Each value of a Kafka message in a topic is a dictionary-like structure
+encoded as a msgpack_ byte string.
+
+Keys are ASCII strings.
+
+All values are encoded using the default msgpack type system, except for long
+integers, for which we use a custom format based on the msgpack
+`extended type`_ to prevent overflow while packing some objects.
+
+
+Integer
++++++++
+
+For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a
+custom `extended type`_ based encoding scheme is used.
+
+The `type` information can be:
+
+- `1` for positive (possibly long) integers,
+- `2` for negative (possibly long) integers.
+
+The payload is simply the bytes (big endian) representation of the absolute
+value (always positive).
+
+For example (adapted to standard integers for the sake of readability; these
+values are small so they will actually be encoded using the default msgpack
+format for integers):
+
+- `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`)
+- `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`)
+
+
+Datetime
+++++++++
+
+There are two types of dates that can be encoded in a Kafka message:
+
+- dates for git-like objects (:py:class:`swh.model.model.Revision` and
+  :py:class:`swh.model.model.Release`): these dates are part of the hash
+  computation used as identifier in the Merkle DAG. In order to fully support
+  git repositories, a custom encoding is required. These dates (coming from the
+  git data model) are encoded as a dictionary with:
+
+  - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys
+    (`seconds` and `microseconds`)
+
+  - `offset` [int] offset of the date (in minutes)
+
+  - `negative_utc` [bool] only True for the very edge case where the date has a
+    zero but negative offset value (which does not make much sense, but which
+    the git format technically permits)
+
+  Example:
+
+  .. code:: python
+
+     {
+         'timestamp': {'seconds': 1480432642, 'microseconds': 0},
+         'offset': 180,
+         'negative_utc': False
+     }
+
+  These are denoted as `gitdate` in the message formats above.
+
+- other dates (resulting from the |swh| processing stack) are encoded using
+  msgpack's Timestamp_ extended type.
+
+  These are denoted as `timestamp` in the message formats above.
+
+  Note that these dates used to be encoded as a dictionary (beware: keys are bytes):
+
+  .. code:: python
+
+     {
+         b"swhtype": "datetime",
+         b"d": '2020-09-15T16:19:13.037809+00:00'
+     }
+
+
+Person
+++++++
+
+:py:class:`swh.model.model.Person` objects represent a person in the |swh|
+Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer,
+or a :py:class:`swh.model.model.Release` author.
+
+:py:class:`swh.model.model.Person` objects are serialized as a dictionary like:
+
+.. code:: python
+
+   {
+       'fullname': 'John Doe <john.doe@example.com>',
+       'name': 'John Doe',
+       'email': 'john.doe@example.com'
+   }
+
+For anonymized topics, :py:class:`swh.model.model.Person` entities have been
+anonymized prior to being serialized. The anonymized
+:py:class:`swh.model.model.Person` object is a dictionary like:
+
+.. code:: python
+
+   {
+       'fullname': <sha256 of the original fullname>,
+       'name': None,
+       'email': None
+   }
+
+
+where the anonymized `fullname` is computed as the sha256 of the original
+`fullname`.
+
+
+
+
+.. _kafka: https://kafka.apache.org
+.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms
+.. _msgpack: https://msgpack.org/
+.. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
+.. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type
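The custom value types of the "Kafka message format" section can be illustrated with a few standard-library helpers. This is a sketch under stated assumptions, not the |swh| implementation: the function names are hypothetical, the payload length used for long integers (the minimal number of bytes) is an assumption, and the original `fullname` is assumed to be available as bytes.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Tuple


def encode_long_int(value: int) -> Tuple[int, bytes]:
    """Return (extension type, payload): type 1 for positive and 2 for
    negative integers, payload being the big-endian absolute value."""
    ext_type = 1 if value >= 0 else 2
    magnitude = abs(value)
    # Assumption: use the minimal number of bytes (at least one).
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return ext_type, magnitude.to_bytes(length, "big")


def decode_long_int(ext_type: int, payload: bytes) -> int:
    """Inverse of encode_long_int."""
    magnitude = int.from_bytes(payload, "big")
    return magnitude if ext_type == 1 else -magnitude


def gitdate_to_datetime(gitdate: dict) -> datetime:
    """Convert a `gitdate` dictionary to an aware datetime.
    The `negative_utc` flag cannot be represented and is ignored here."""
    ts = gitdate["timestamp"]
    tz = timezone(timedelta(minutes=gitdate["offset"]))
    return datetime.fromtimestamp(ts["seconds"], tz) + timedelta(
        microseconds=ts["microseconds"]
    )


def anonymize_person(person: dict) -> dict:
    """Replace `fullname` by its sha256 digest and drop name and email."""
    return {
        "fullname": hashlib.sha256(person["fullname"]).digest(),
        "name": None,
        "email": None,
    }


print(encode_long_int(12345))  # same values as the 12345 example above
print(gitdate_to_datetime(
    {"timestamp": {"seconds": 1480432642, "microseconds": 0},
     "offset": 180, "negative_utc": False}
).isoformat())
```

The sha256 digest is 32 bytes long, which matches the length of the anonymized `fullname` values shown in the release and revision examples.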