diff --git a/docs/archive-changelog.rst b/docs/archive-changelog.rst index 2ec5e44..a0e2bdb 100644 --- a/docs/archive-changelog.rst +++ b/docs/archive-changelog.rst @@ -1,161 +1,160 @@ .. _archive-changelog: - -Software Heritage --- Archive ChangeLog -======================================= +Archive ChangeLog +================= Below you can find a time-indexed list of notable events and changes to archival policies in the Software Heritage Archive. Each of them might have (had) an impact on how content is archived and explain apparent statistical anomalies or other changes in archival behavior over time. They are collected in this document for historical reasons. 2021 ---- * **2021-09-25** Completed first archival of the `Opam repository _`. Regular crawling for those repositories enabled (tracking: `T3424 `_) * **2021-09-23** Completed first archival of the `Heptapod instance _`. Regular crawling for those repositories enabled (tracking: `T3597 `_) * **2021-09-23** Completed first archival of the `Heptapod instance _`. Regular crawling for those repositories enabled (tracking: `T3600 `_) * **2021-09-22** Completed first archival of the `Heptapod instance _`. This is the first forge with mostly mercurial origins. Regular crawling for those repositories enabled (tracking: `T3581 `_) * **2021-08-03** Completed first archival of SourceForge Mercurial repositories; regular crawling for those repositories enabled (tracking: `T3374 `_) * **2021-07-22** Completed first archival of SourceForge Git and Subversion repositories; regular crawling for those repositories enabled (tracking: `T3374 `_) 2020 ---- * **2020-10-06 - 2020-11-23:** source code crawlers have been paused to avoid an out of disk condition, due to an unexpected delay in the arrival of new storage hardware. Push archival (both deposit_ and `save code now`_) remained in operation. 
(tracking: `T2656 `_) * **2020-09-15:** completed first archival of, and added to regular crawling `GNU Guix System`_ (tracking: `T2594 `_) * **2020-06-11:** completed integration with the IPOL_ journal, allowing paper authors to explicitly deposit_ source code to the archive (`announcement `__) * **2020-05-25:** completed first archival of, and added to regular crawling NixOS_ (tracking: `T2411 `_) 2019 ---- * **2019-09-10:** completed first archival of Bitbucket_ Git repositories and added Bitbucket as a regularly crawled forge (tracking: `T592 `_) * **2019-06-30:** completed first archival of, and added to regular crawling, several GitLab_ instances: `0xacab.org `_, `framagit.org `_, `gite.lirmm.fr `_, `gitlab.common-lisp.net `_, `gitlab.freedesktop.org `_, `gitlab.gnome.org `_, `gitlab.inria.fr `_, `salsa.debian.org `_ * **2019-06-12:** completed first archival of CRAN_ packages and added CRAN as a regularly crawled package repository (tracking: `T1709 `_) * **2019-06-11:** completed a full archival of GNU_ source code releases from `ftp.gnu.org`_, and added it to regular crawling (tracking: `T1722 `_) * **2019-05-27:** completed a full archival of NPM_ packages and added it as a regularly crawled package repository (tracking: `T1378 `_) * **2019-01-10:** enabled the `save code now`_ service, allowing users to explicitly request archival of a specific source code repository (`announcement `__) 2018 ---- * **2018-10-10:** completed first archival of PyPI_ packages and added PyPI as a regularly crawled package repository (`announcement `__) * **2018-09-25:** completed integration with HAL_, allowing paper authors to explicitly deposit_ source code to the archive (`announcement `__) * **2018-08-31:** completed first archival of public GitLab_ repositories from `gitlab.com `_ and added it as a regularly crawled forge (tracking: `T1111 `_) * **2018-03-21:** completed archival of `Google Code`_ Mercurial repositories. 
(tracking: `T682 `_) * **2018-02-20:** completed archival of Debian_ packages and added Debian as a regularly crawled distribution (`announcement `__) 2017 ---- * **2017-10-02:** completed archival of `Google Code`_ Subversion repositories (tracking: `T617 `_) * **2017-06-06:** completed archival of `Google Code`_ Git repositories (tracking: `T673 `_) 2016 ---- * **2016-04-04:** completed archival of Gitorious_ (tracking: `T312 `_) 2015 ---- * **2015-11-06:** archived all GNU_ source code releases from `ftp.gnu.org`_ (tracking: `T90 `_) * **2015-07-28:** started archiving public GitHub_ repositories .. _Bitbucket: https://bitbucket.org .. _CRAN: https://cran.r-project.org .. _Debian: https://www.debian.org .. _GNU Guix System: https://guix.gnu.org/ .. _GNU: https://www.gnu.org .. _GitHub: https://github.com .. _GitLab: https://gitlab.com .. _Gitorious: https://en.wikipedia.org/wiki/Gitorious .. _Google Code: https://en.wikipedia.org/wiki/Google_Code .. _HAL: https://hal.archives-ouvertes.fr .. _IPOL: http://www.ipol.im .. _NPM: https://www.npmjs.com .. _NixOS: https://nixos.org/ .. _PyPI: https://pypi.org .. _deposit: https://deposit.softwareheritage.org .. _ftp.gnu.org: http://ftp.gnu.org .. _save code now: https://save.softwareheritage.org diff --git a/docs/journal.rst b/docs/journal.rst index f8db95a..c5b02a1 100644 --- a/docs/journal.rst +++ b/docs/journal.rst @@ -1,673 +1,673 @@ .. _journal-specs: -Software Heritage Journal --- Specifications -============================================ +Journal Specification +===================== The |swh| journal is a Kafka_-based stream of events for every added object in the |swh| Archive and some of its related services, especially indexers. Each topic_ will stream added elements for a given object type according to the topic name. 
Objects streamed in a topic are serialized versions of objects stored in the |swh| Archive specified by the main |swh| :py:mod:`data model ` or the :py:mod:`indexer object model `. This document describes the messages expected in each topic, so that a potential consumer can use the |swh| journal without having to read the source code or the |swh| :ref:`data model ` in detail (it is, however, recommended to familiarize yourself with the latter). Kafka message values are dictionary structures serialized as msgpack_, with a few custom encodings. See the section `Kafka message format`_ below for a complete description of the serialization format. Note that each example given below shows the dictionary before it is serialized as a msgpack_ chunk. Topics ------ There are several groups of topics: - main storage Merkle-DAG related topics, - other storage objects (not part of the Merkle DAG), - indexer related objects (not yet documented below). The topic prefix can be either `swh.journal.objects` or `swh.journal.objects_privileged` (see below). Anonymized topics +++++++++++++++++ Topics that transport messages with user information (name and email address), namely `swh.journal.objects.release`_ and `swh.journal.objects.revision`_, exist in two versions: an anonymized topic, in which user information is obfuscated, and a pristine topic, which carries the data in the clear. Access to pristine topics depends on the ACLs linked to the credentials used to connect to the Kafka cluster.
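As an illustration of the naming scheme described above, here is a minimal sketch of how a consumer might derive topic names. The helper `topic_name` and the constant names are hypothetical and not part of any |swh| package; only the two prefixes and the two object types with privileged variants come from the text above.

```python
# Prefixes quoted from the text above.
ANONYMIZED_PREFIX = "swh.journal.objects"
PRIVILEGED_PREFIX = "swh.journal.objects_privileged"

# Object types whose messages carry user information, per the text above.
TYPES_WITH_USER_INFO = frozenset({"release", "revision"})


def topic_name(object_type: str, privileged: bool = False) -> str:
    """Hypothetical helper: build the topic name for an object type.

    Only object types carrying user information have a privileged variant;
    for every other type the regular prefix is returned.
    """
    if privileged and object_type in TYPES_WITH_USER_INFO:
        return f"{PRIVILEGED_PREFIX}.{object_type}"
    return f"{ANONYMIZED_PREFIX}.{object_type}"
```

For example, `topic_name("revision", privileged=True)` yields `swh.journal.objects_privileged.revision`, while `topic_name("origin", privileged=True)` falls back to `swh.journal.objects.origin`. Whether a consumer can actually read a privileged topic still depends on the ACLs attached to its credentials.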
List of topics ++++++++++++++ - `swh.journal.objects.origin`_ - `swh.journal.objects.origin_visit`_ - `swh.journal.objects.origin_visit_status`_ - `swh.journal.objects.snapshot`_ - `swh.journal.objects.release`_ - `swh.journal.objects_privileged.release `_ - `swh.journal.objects.revision`_ - `swh.journal.objects_privileged.revision `_ - `swh.journal.objects.directory`_ - `swh.journal.objects.content`_ - `swh.journal.objects.skipped_content`_ - `swh.journal.objects.metadata_authority`_ - `swh.journal.objects.metadata_fetcher`_ - `swh.journal.objects.raw_extrinsic_metadata`_ Topics for Merkle-DAG objects ----------------------------- These topics are for the various objects stored in the |swh| Merkle DAG; see the :ref:`data model ` for more details. `swh.journal.objects.snapshot` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Snapshot` objects. Message format: - `branches` [dict] branches present in this snapshot, - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Snapshot` object with `branches` being a dictionary whose keys are branch names [bytes] and whose values are dictionaries of: - `target` [bytes] intrinsic identifier of the targeted object - `target_type` [string] the type of the targeted object (can be "content", "directory", "revision", "release", "snapshot" or "alias"). Example: .. 
code:: python { 'branches': { b'refs/pull/1/head': { 'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c', 'target_type': 'revision' }, b'refs/pull/2/head': { 'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ', 'target_type': 'revision' }, b'refs/heads/master': { 'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT', 'target_type': 'revision' }, b'HEAD': { 'target': b'refs/heads/master', 'target_type': 'alias' } }, 'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^' } `swh.journal.objects.release` +++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Release` objects. This topic is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.release`. Message format: - `name` [bytes] name (typically the version) of the release - `message` [bytes] message of the release - `target` [bytes] identifier of the target object - `target_type` [string] type of the target, can be "content", "directory", "revision", "release" or "snapshot" - `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has been forged by the loading process; this flag is not used for the id computation, - `author` [dict] the author of the release - `date` [gitdate] the date of the release - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Release` object Example: .. 
code:: python { 'name': b'0.3', 'message': b'', 'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d', 'target_type': 'revision', 'synthetic': False, 'author': { 'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9', 'name': None, 'email': None }, 'date': { 'timestamp': { 'seconds': 1480432642, 'microseconds': 0 }, 'offset': 180, 'negative_utc': False }, 'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86' } `swh.journal.objects.revision` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Revision` objects. This topic is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.revision`. Message format: - `message` [bytes] the commit message for the revision - `author` [dict] the author of the revision - `committer` [dict] the committer of the revision - `date` [gitdate] the revision date - `committer_date` [gitdate] the revision commit date - `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg") - `directory` [bytes] the intrinsic identifier of the directory this revision links to - `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not, - `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the intrinsic identifier computation), - `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers - `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision` - `extra_headers` [list[(bytes, bytes)]] TODO Example: .. code:: python { 'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. 
lufylib.red is a place for support code.\n', 'author': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'committer': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'committer_date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'type': 'svn', 'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe', 'synthetic': True, 'metadata': None, 'parents': [ b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c', ], 'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8', 'perms': 33188}, {'name': b'lib', 'type': 'dir', 'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U', 'perms': 16384}, {'name': b'package.json', 'type': 'file', 'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x', 'perms': 33188} ], 'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P' } Other Objects Topics -------------------- These topics are for objects of the |swh| archive that are not part of the Merkle DAG but are essential parts of the archive; see the :ref:`data model ` for more details. `swh.journal.objects.origin` ++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Origin` objects. Message format: - `url` [string] URL of the :py:class:`swh.model.model.Origin` Example: .. code:: python { "url": "https://github.com/vujkovicm/pml" } `swh.journal.objects.origin_visit` ++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisit` objects. 
Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `date` [timestamp] date of the visit - `type` [string] type of the loader used to perform the visit - `visit` [int] number of the visit for this `origin` Example: .. code:: python { 'origin': 'https://pypi.org/project/wasp-eureka/', 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'type': 'pypi', 'visit': 505 } `swh.journal.objects.origin_visit_status` +++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisitStatus` objects. Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `visit` [int] number of the visit of this `origin` that this status concerns - `date` [timestamp] date of the visit status update - `status` [string] status (can be "created", "ongoing", "full" or "partial"), - `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snapshot` this visit resulted in (if `status` is "full" or "partial") - `metadata`: deprecated Example: .. code:: python { 'origin': 'https://pypi.org/project/stricttype/', 'visit': 524, 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'status': 'full', 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7", 'metadata': None } Extrinsic Metadata related Topics --------------------------------- Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. See :ref:`extrinsic-metadata-specification` for more details on the Extrinsic Metadata model. `swh.journal.objects.metadata_authority` ++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataAuthority` objects. Message format: - `type` [string] - `url` [string] - `metadata` [dict] Examples: .. 
code:: python { 'type': 'forge', 'url': 'https://guix.gnu.org/sources.json', 'metadata': {} } { 'type': 'deposit_client', 'url': 'https://www.softwareheritage.org', 'metadata': {'name': 'swh'} } `swh.journal.objects.metadata_fetcher` ++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataFetcher` objects. Message format: - `name` [string] - `version` [string] - `metadata` [dict] Example: .. code:: python { 'name': 'swh.loader.package.cran.loader.CRANLoader', 'version': '0.15.0', 'metadata': {} } `swh.journal.objects.raw_extrinsic_metadata` ++++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects. Message format: - `type` [string] - `target` [string] - `discovery_date` [timestamp] - `authority` [dict] - `fetcher` [dict] - `format` [string] - `metadata` [bytes] - `origin` [string] - `visit` [int] - `snapshot` [SWHID] - `release` [SWHID] - `revision` [SWHID] - `path` [bytes] - `directory` [SWHID] Example: .. code:: python { 'type': 'snapshot', 'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3', 'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'authority': { 'type': 'forge', 'url': 'https://pypi.org/', 'metadata': {} }, 'fetcher': { 'name': 'swh.loader.package.pypi.loader.PyPILoader', 'version': '0.10.0', 'metadata': {} }, 'format': 'pypi-project-json', 'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}', 'origin': 'https://pypi.org/project/schwurbler/' } Kafka message format -------------------- Each value of a Kafka message in a topic is a dictionary-like structure encoded as a msgpack_ byte string. Keys are ASCII strings. All values are encoded using the default msgpack type system, except for long integers, for which a custom format based on the msgpack `extended type`_ is used to prevent overflow while packing some objects. 
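The custom long-integer encoding mentioned above (detailed in the Integer section) can be sketched in pure Python. This is only an illustration under stated assumptions: the function and constant names are hypothetical, and the sketch computes just the extension type code and payload, without wrapping them in an actual msgpack extension value.

```python
from typing import Tuple

# Extension type codes, as listed in the Integer section.
POSITIVE_INT_TYPE = 1
NEGATIVE_INT_TYPE = 2


def encode_long_int(value: int) -> Tuple[int, bytes]:
    """Return (type code, payload) for an integer (hypothetical helper).

    The payload is the big-endian byte representation of the absolute value.
    """
    code = NEGATIVE_INT_TYPE if value < 0 else POSITIVE_INT_TYPE
    magnitude = abs(value)
    # Use at least one byte so that zero still has a payload.
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return code, magnitude.to_bytes(length, "big")


def decode_long_int(code: int, payload: bytes) -> int:
    """Inverse of encode_long_int (hypothetical helper)."""
    magnitude = int.from_bytes(payload, "big")
    if code == POSITIVE_INT_TYPE:
        return magnitude
    if code == NEGATIVE_INT_TYPE:
        return -magnitude
    raise ValueError(f"unexpected extension type code: {code}")
```

With this sketch, `encode_long_int(12345)` gives `(1, b"\x30\x39")` and `encode_long_int(-42)` gives `(2, b"\x2a")`, matching the two examples in the Integer section. In practice only integers outside the range supported by msgpack would go through this path.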
Integer +++++++ For long integers (that do not fit in the `[-(2**63), 2**64 - 1]` range), a custom `extended type`_ based encoding scheme is used. The `type` information can be: - `1` for positive (possibly long) integers, - `2` for negative (possibly long) integers. The payload is simply the big-endian byte representation of the absolute value (always positive). For example (adapted to standard integers for the sake of readability; these values are small so they will actually be encoded using the default msgpack format for integers): - `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`) - `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`) Datetime ++++++++ There are two types of dates that can be encoded in a Kafka message: - dates for git-like objects (:py:class:`swh.model.model.Revision` and :py:class:`swh.model.model.Release`): these dates are part of the hash computation used as identifier in the Merkle DAG. In order to fully support git repositories, a custom encoding is required. These dates (coming from the git data model) are encoded as a dictionary with: - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys (`seconds` and `microseconds`) - `offset` [int] offset of the date (in minutes) - `negative_utc` [bool] only True for the very edge case where the date has a zero but negative offset value (which does not make much sense, but is technically permitted by the git format) Example: .. code:: python { 'timestamp': {'seconds': 1480432642, 'microseconds': 0}, 'offset': 180, 'negative_utc': False } These are denoted as `gitdate` in the message formats above. - other dates (resulting from the |swh| processing stack) are encoded using msgpack's Timestamp_ extended type. These are denoted as `timestamp` in the message formats above. Note that these dates used to be encoded as a dictionary (beware: keys are bytes): .. 
code:: python { b"swhtype": "datetime", b"d": '2020-09-15T16:19:13.037809+00:00' } Person ++++++ :py:class:`swh.model.model.Person` objects represent a person in the |swh| Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer, or a :py:class:`swh.model.model.Release` author. :py:class:`swh.model.model.Person` objects are serialized as a dictionary like: .. code:: python { 'fullname': 'John Doe <john.doe@example.com>', 'name': 'John Doe', 'email': 'john.doe@example.com' } For anonymized topics, :py:class:`swh.model.model.Person` entities have been anonymized prior to being serialized. The anonymized :py:class:`swh.model.model.Person` object is a dictionary like: .. code:: python { 'fullname': <hashed value>, 'name': None, 'email': None } where the `<hashed value>` is computed as the sha256 of the original `fullname`. .. _Kafka: https://kafka.apache.org .. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms .. _msgpack: https://msgpack.org/ .. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types .. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type
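The anonymization of :py:class:`swh.model.model.Person` entries described in the Person section can be sketched as follows. This is a minimal illustration, not the actual |swh| implementation: the function name `anonymize_person` is hypothetical, and the sketch assumes that `fullname` is available as bytes and that the raw 32-byte digest (rather than its hexadecimal form) is stored, which is consistent with the anonymized examples shown for the release and revision topics.

```python
import hashlib
from typing import Dict, Optional


def anonymize_person(person: Dict[str, Optional[bytes]]) -> Dict[str, Optional[bytes]]:
    """Hypothetical helper: return an anonymized copy of a person dictionary.

    Assumption: `fullname` is a bytes value, and the anonymized `fullname`
    is the raw sha256 digest of the original one.
    """
    fullname = person["fullname"] or b""
    return {
        "fullname": hashlib.sha256(fullname).digest(),
        "name": None,
        "email": None,
    }
```

Because the digest depends only on the original `fullname`, two messages written by the same person still carry the same anonymized `fullname`, so consumers of the anonymized topics can group objects by author without learning the name or email address.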