diff --git a/docs/faq/index.rst b/docs/faq/index.rst index 633aa87..3d66d1b 100644 --- a/docs/faq/index.rst +++ b/docs/faq/index.rst @@ -1,268 +1,268 @@ .. _faq: Frequently Asked Questions ************************** .. contents:: :depth: 3 :local: .. .. _faq_prerequisites: Prerequisites for code contributions ==================================== What are the skills required to be a code contributor? ------------------------------------------------------ Generally, only Python and basic Git knowledge are required to contribute. Other than that, it really depends on what technical areas you want to work on. For student internships, the `internships`_ page details specific prerequisites needed to pick up a topic. Feel free to contact us via our `development channels `__ to inquire about specific skills needed to work on any topic of your interest. What are the minimum system requirements (hardware/software) to run SWH locally? -------------------------------------------------------------------------------- Python 3.7 or newer is required. See the :ref:`developer setup documentation ` for more details. .. _faq_getting_started: Getting Started =============== What are the must read docs before I start contributing? -------------------------------------------------------- We recommend you read the top links listed on the :ref:`documentation home page ` in order: getting started, contributing, and architecture overview, as well as the data model. Where can I see the getting started guide for developers? --------------------------------------------------------- For hacking on the Software Heritage code base you should start from the :ref:`developer-setup` tutorial. How do I find an easy task to get started? ------------------------------------------ We maintain a `list of easy tickets `__ to work on, see the `Easy hacks page `__ for more details. I am skilled in one specific technology, can I find tickets requiring that skill? 
--------------------------------------------------------------------------------- Unfortunately, not at the moment. But you can look at the `internships`_ list to look for something matching this skill, and this may allow you to find topics to search for in the `bug tracking system`_. Either way, feel free to contact our developers through any of the `development channels`_, we would love to work with you. Where should I ask for technical help? -------------------------------------- You can choose one of the following: * `development channels`_ * `contact form`_ for any enquiries .. _faq_run_swh: Running an SWH instance locally =============================== How do I run a local "toy version" of the archive? -------------------------------------------------- The :ref:`getting-started` tutorial shows how to run a local instance of the Software Heritage software infrastructure, using Docker. I have SWH stack running in my local. How do I get some initial data to play around? ------------------------------------------------------------------------------------ You can setup a job on your local machine, for this you can :ref:`schedule a listing task ` for example. Doing so on small forge, will allow you to load some repositories. Or you can also trigger directly :ref:`loading from the cli `. I have a SWH stack running in local, How do I setup a lister/loader job? ------------------------------------------------------------------------ See the :ref:`"Managing tasks" chapter ` in the Docker environment documentation. How can I create a user in my local instance? --------------------------------------------- We cannot right now. Stay either anonymous or use the user "test" (password "test") or the user ambassador (password "ambassador"). Should I run/test the web app in any particular browser? -------------------------------------------------------- We expect the web app to work on all major browsers. 
It uses mostly straightforward HTML/CSS and a little Javascript for search and source code highlighting, so testing in a single browser is usually enough. .. _faq_dataset: Getting sample datasets ======================= Is there a way to connect to SWH archived (production) database from my local machine? -------------------------------------------------------------------------------------- We provide the archive as a dataset on public clouds, see the :ref:`swh-dataset documentation `. We can also provide read access to one of the main databases on request, `contact us`_. .. _faq_error_bugs: Errors and bugs =============== I found a bug/improvement in the system, where should I report it? ------------------------------------------------------------------ Please report it on our `bug tracking system`_. First create an account, then create a bug report using the "Create task" button. You should get some feedback within a week (at least someone triaging your issue). If not, `get in touch with us `_ to make sure we did not miss it. .. _faq_legal: Legal matters ============= Do I need to sign a form to contribute code? -------------------------------------------- Yes, on your first diff, you will have to sign such document. As long as it's not signed, your diff content won't be visible. Will my name be added to a CONTRIBUTORS file? --------------------------------------------- You will be asked during review to add yourself. .. _faq_code_review: Code Review =========== I found a straightforward typo fix, should my fix go through the entire code review process? -------------------------------------------------------------------------------------------- You are welcome to drop us a message at one of the `development channels`_, we will pick it up and fix it so you don't have to follow the whole :ref:`code review process `. What tests I should run before committing the code? 
--------------------------------------------------- -Mostly run `tox` (or `pytest`) to run the unit tests suite. When you will propose a -patch in our forge, the continuous integration factory will trigger a build (using `tox` +Mostly run ``tox`` (or ``pytest``) to run the unit tests suite. When you will propose a +patch in our forge, the continuous integration factory will trigger a build (using ``tox`` as well). I am getting errors while trying to commit. What is going wrong? ---------------------------------------------------------------- Ensure you followed the proper guide to :ref:`setup your environment ` and try again. If the error persists, you are welcome to drop us a message at one of the `development channels`_ Is there a format/guideline for writing commit messages? -------------------------------------------------------- See the :ref:`git-style-guide` Is there some recommended git branching strategy? ------------------------------------------------- It's left at the developer's discretion. Mostly people hack on their feature, then propose a diff from a git branch or directly from the master branch. There is no imperative. The only imperative is that for a feature to be packaged and deployed, it needs to land first in the master branch. how should I document the code I contribute to SWH? --------------------------------------------------- Any new feature should include documentation in the form of comments and/or docstrings. -Ideally, they should also be documented in plain English in the repository's `docs/` -folder if relevant to a single package, or in the main `swh-docs` repository if it is a +Ideally, they should also be documented in plain English in the repository's :file:`docs/` +folder if relevant to a single package, or in the main ``swh-docs`` repository if it is a transversal feature. .. _faq_api: Software Heritage API ===================== How do I generate API usage credentials? 
---------------------------------------- See the :ref:`Authentication guide `. Is there a page where I can see all the API endpoints? ------------------------------------------------------ See the :swh_web:`API endpoint listing page `. What are the usage limits for SWH APIs? --------------------------------------- Maximum number of permitted requests per hour: * 120 for anonymous users * 1200 for authenticated users It's described in the :swh_web:`rate limit documentation page `. .. It's temporarily here but it should be moved into its own sphinx instance at some point in the future. .. _faq_sysadm: System Administration ===================== How does SWH release? --------------------- Release is mostly done: - first in docker (somewhat as part of the development process) - secondly packaged and deployed on staging (mostly) - thirdly the same package is deployed on production Is there a release cycle? ------------------------- When a functionality is ready (tests ok, landed in master, docker run ok), the module is tagged. The tag is pushed. This triggers a packaging build process. When the package is ready, depending on the module [1], sysadms deploy the package with the help of puppet. [1] swh-web module is mostly automatic. Other modules are not yet automatic as some internal state migration (dbs) often enters the release cycle and due to the data volume, that may need human intervention. .. _bug tracking system: https://forge.softwareheritage.org/ .. _contact form: https://www.softwareheritage.org/contact/ .. _contact us: https://www.softwareheritage.org/contact/ .. _development channels: https://www.softwareheritage.org/community/developers/ .. _internships: https://wiki.softwareheritage.org/wiki/Internships diff --git a/docs/glossary.rst b/docs/glossary.rst index 97a2bad..3a40473 100644 --- a/docs/glossary.rst +++ b/docs/glossary.rst @@ -1,213 +1,213 @@ :orphan: .. _glossary: Glossary ======== .. glossary:: archive An instance of the |swh| data store. 
ark `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is a multi-purpose persistent identifier for information objects of any type. artifact software artifact An artifact is one of many kinds of tangible by-products produced during the development of software. content blob A (specific version of a) file stored in the archive, identified by its cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also known as: :term:`blob`. Note: it is incorrect to refer to Contents as "files", because files are usually considered to be named, whereas Contents are nameless. It is only in the context of specific :term:`directories ` that :term:`contents ` acquire (local) names. deposit A :term:`software artifact` that was pushed to the Software Heritage archive (unlike :term:`loaders `, which pull artifacts). A deposit is useful when you want to ensure a software release's source code is archived in SWH even if it is not published anywhere else. See also: the :ref:`swh-deposit` component, which implements a deposit client and server. directory A set of named pointers to contents (file entries), directories (directory entries) and revisions (revision entries). All entries are associated to the local name of the entry (i.e., a relative path without any path separator) and permission metadata (e.g., ``chmod`` value or equivalent). doi A Digital Object Identifier or DOI_ is a persistent identifier or handle used to uniquely identify objects, standardized by the International Organization for Standardization (ISO). extid external identifier An identifier used by a system that does not fit the |swh| :ref:`data model `, such as Mercurial's ``nodeid``, or the hash of a tarball from a package manager. They may be stored in the |swh| archive independently of the identified object, to quickly match an external object (a changeset or tarball) to an object in the archive without downloading it. 
extrinsic metadata Metadata about software that is not shipped as part of the software source code, but is available instead via out-of-band means. For example, homepage, maintainer contact information, and popularity information ("stars") as listed on GitHub/GitLab repository pages. See also: :term:`intrinsic metadata` :ref:`architecture-metadata`. journal The :ref:`journal ` is the persistent logger of the |swh| architecture in charge of logging changes of the archive, with publish-subscribe_ support. lister A :ref:`lister ` is a component of the |swh| architecture that is in charge of enumerating the :term:`software origin` (e.g., VCS, packages, etc.) available at a source code distribution place. loader A :ref:`loader ` is a component of the |swh| architecture responsible for reading a source code :term:`origin` (typically a git repository) and import or update its content in the :term:`archive` (ie. add new file contents int :term:`object storage` and repository structure in the :term:`storage database`). hash cryptographic hash checksum digest A fixed-size "summary" of a stream of bytes that is easy to compute, and hard to reverse. (Cryptographic hash function Wikipedia article) also known as: :term:`checksum`, :term:`digest`. indexer A component of the |swh| architecture dedicated to producing metadata linked to the known :term:`blobs ` in the :term:`archive`. intrinsic identifier A short character string that uniquely identifies an object, that can be generated deterministically, using only the content of the object, usually a :term:`cryptographic hash`. This excludes network interaction and central authority. Examples of intrinsic identifiers are: checksums (for files/strings only), git hashes, and :ref:`SWHIDs ` intrinsic metadata Metadata about software that is shipped as part of the source code of the software itself or as part of related artifacts (e.g., revisions, releases, etc). 
For example, metadata that is shipped in `PKG-INFO` files - for Python packages, `pom.xml` for Maven-based Java projects, - `debian/control` for Debian packages, `metadata.json` for NPM, etc. + for Python packages, :file:`pom.xml` for Maven-based Java projects, + :file:`debian/control` for Debian packages, :file:`metadata.json` for NPM, etc. See also: :term:`extrinsic metadata`, :ref:`architecture-metadata`. objstore objstorage object store object storage Content-addressable object storage. It is the place where actual object :term:`blobs ` objects are stored. origin software origin data source A location from which a coherent set of sources has been obtained, like a git repository, a directory containing tarballs, etc. person An entity referenced by a revision as either the author or the committer of the corresponding change. A person is associated to a full name and/or an email address. release tag milestone a revision that has been marked as noteworthy with a specific name (e.g., a version number), together with associated development metadata (e.g., author, timestamp, etc). revision commit changeset A point in time snapshot of the content of a directory, together with associated development metadata (e.g., author, timestamp, log message, etc). scheduler The component of the |swh| architecture dedicated to the management and the prioritization of the many tasks. snapshot the state of all visible branches during a specific visit of an origin storage storage database The main database of the |swh| platform in which the all the elements of the :ref:`data-model` but the :term:`content` are stored as a :ref:`Merkle DAG `. type of origin Information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, an homepage publishing tarball, or a one shot source code repository. For all kind of repositories please specify which VCS system is in use (Git, SVN, CVS, etc.) object. 
vault vault service User-facing service that allows retrieving parts of the :term:`archive` as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.) visit The passage of |swh| on a given :term:`origin`, to retrieve all source code and metadata available there at the time. A visit object stores the state of all visible branches (if any) available at the origin at visit time; each of them points to a revision object in the archive. Future visits of the same origin will create new visit objects, without removing previous ones. .. _blob: https://en.wikipedia.org/wiki/Binary_large_object .. _DOI: https://www.doi.org .. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers .. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html .. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern diff --git a/docs/journal.rst b/docs/journal.rst index 83e554a..2c1eb1a 100644 --- a/docs/journal.rst +++ b/docs/journal.rst @@ -1,673 +1,673 @@ .. _journal-specs: Journal Specification ===================== The |swh| journal is a Kafka_-based stream of events for every added object in the |swh| Archive and some of its related services, especially indexers. Each topic_ will stream added elements for a given object type according to the topic name. Objects streamed in a topic are serialized versions of objects stored in the |swh| Archive specified by the main |swh| :py:mod:`data model ` or the :py:mod:`indexer object model `. In this document we will describe expected messages in each topic, so a potential consumer can easily cope with the |swh| journal without having to read the source code or the |swh| :ref:`data model ` in detail (it is however recommended to familiarize yourself with the latter). Kafka message values are dictionary structures serialized as msgpack_, with a few custom encodings. 
See the section `Kafka message format`_ below for a complete description of the serialization format. Note that each example given below shows the dictionary before being serialized as a msgpack_ chunk. Topics ------ There are several groups of topics: - main storage Merkle-DAG related topics, - other storage objects (not part of the Merkle DAG), - indexer related objects (not yet documented below). The topic prefix can be either `swh.journal.objects` or `swh.journal.objects_privileged` (see below). Anonymized topics +++++++++++++++++ For topics that transport messages with user information (name and email address), namely `swh.journal.objects.release`_ and `swh.journal.objects.revision`_, there are 2 versions of those: one is an anonymized topic, in which user information is obfuscated, and a pristine version with clear data. Access to pristine topics depends on ACLs linked to credentials used to connect to the Kafka cluster. List of topics ++++++++++++++ - `swh.journal.objects.origin`_ - `swh.journal.objects.origin_visit`_ - `swh.journal.objects.origin_visit_status`_ - `swh.journal.objects.snapshot`_ - `swh.journal.objects.release`_ - `swh.journal.objects.privileged_release `_ - `swh.journal.objects.revision`_ - `swh.journal.objects.privileged_revision `_ - `swh.journal.objects.directory`_ - `swh.journal.objects.content`_ - `swh.journal.objects.skipped_content`_ - `swh.journal.objects.metadata_authority`_ - `swh.journal.objects.metadata_fetcher`_ - `swh.journal.objects.raw_extrinsic_metadata`_ Topics for Merkle-DAG objects ----------------------------- These topics are for the various objects stored in the |swh| Merkle DAG, see the :ref:`data model ` for more details. `swh.journal.objects.snapshot` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Snapshot` objects. 
Message format: - `branches` [dict] branches present in this snapshot, - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Snapshot` object with `branches` being a dictionary which keys are branch names [bytes], and values a dictionary of: - `target` [bytes] intrinsic identifier of the targeted object - `target_type` [string] the type of the targeted object (can be "content", "directory", "revision", "release", "snapshot" or "alias"). Example: .. code:: python { 'branches': { b'refs/pull/1/head': { 'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c', 'target_type': 'revision' }, b'refs/pull/2/head': { 'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ', 'target_type': 'revision' }, b'refs/heads/master': { 'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT', 'target_type': 'revision' }, b'HEAD': { 'target': b'refs/heads/master', 'target_type': 'alias' } }, 'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^' } `swh.journal.objects.release` +++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Release` objects. This topics is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.release`. Message format: - `name` [bytes] name (typically the version) of the release - `message` [bytes] message of the release - `target` [bytes] identifier of the target object - `target_type` [string] type of the target, can be "content", "directory", "revision", "release" or "snapshot" - `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has been forged by the loading process; this flag is not used for the id computation, - `author` [dict] the author of the release - `date` [gitdate] the date of the release - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Release` object Example: .. 
code:: python { 'name': b'0.3', 'message': b'', 'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d', 'target_type': 'revision', 'synthetic': False, 'author': { 'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9', 'name': None, 'email': None }, 'date': { 'timestamp': { 'seconds': 1480432642, 'microseconds': 0 }, 'offset': 180, 'negative_utc': False }, 'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86' } `swh.journal.objects.revision` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Revision` objects. This topics is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.revision`. Message format: -- `message` [bytes] the commit message for the revision -- `author` [dict] the author of the revision -- `committer` [dict] the committer of the revision -- `date` [gitdate] the revision date -- `committer_date` [gitdate] the revision commit date -- `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg") -- `directory` [bytes] the intrinsic identifier of the directory this revision links to -- `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not, -- `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the +- ``message`` [bytes] the commit message for the revision +- ``author`` [dict] the author of the revision +- ``committer`` [dict] the committer of the revision +- ``date`` [gitdate] the revision date +- ``committer_date`` [gitdate] the revision commit date +- ``type`` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg") +- ``directory`` [bytes] the intrinsic identifier of the directory this revision links to +- ``synthetic`` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not, +- ``metadata`` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of 
the intrinsic identifier computation), -- `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers -- `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision` -- `extra_headers` [list[(bytes, bytes)]] TODO +- ``parents`` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers +- ``id`` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision` +- ``extra_headers`` [list[(bytes, bytes)]] TODO Example: .. code:: python { 'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. lufylib.red is a place for support code.\n', 'author': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'committer': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'committer_date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'type': 'svn', 'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe', 'synthetic': True, 'metadata': None, 'parents': [ b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c', ], 'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8', 'perms': 33188}, {'name': b'lib', 'type': 'dir', 'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U', 'perms': 16384}, {'name': b'package.json', 'type': 'file', 'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x', 'perms': 33188} ], 'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P' } Other 
Objects Topics -------------------- These topics are for objects of the |swh| archive that are not part of the Merkle DAG but are essential parts of the archive; see the :ref:`data model ` for more details. -`swh.journal.objects.origin` -++++++++++++++++++++++++++++ +``swh.journal.objects.origin`` +++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Origin` objects. Message format: -- `url` [string] URL of the :py:class:`swh.model.model.Origin` +- ``url`` [string] URL of the :py:class:`swh.model.model.Origin` Example: .. code:: python { "url": "https://github.com/vujkovicm/pml" } -`swh.journal.objects.origin_visit` -++++++++++++++++++++++++++++++++++ +``swh.journal.objects.origin_visit`` +++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisit` objects. Message format: -- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` -- `date` [timestamp] date of the visit -- `type` [string] type of the loader used to perform the visit -- `visit` [int] number of the visit for this `origin` +- ``origin`` [string] URL of the visited :py:class:`swh.model.model.Origin` +- ``date`` [timestamp] date of the visit +- ``type`` [string] type of the loader used to perform the visit +- ``visit`` [int] number of the visit for this ``origin`` Example: .. code:: python { 'origin': 'https://pypi.org/project/wasp-eureka/', 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'type': 'pypi', 'visit': 505} } -`swh.journal.objects.origin_visit_status` -+++++++++++++++++++++++++++++++++++++++++ +``swh.journal.objects.origin_visit_status`` ++++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisitStatus` objects. 
Message format: -- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` -- `visit` [int] number of the visit for this `origin` this status concerns -- `date` [timestamp] date of the visit status update -- `status` [string] status (can be "created", "ongoing", "full" or "partial"), -- `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snaphot` this - visit resulted in (if `status` is "full" or "partial") -- `metadata`: deprecated +- ``origin`` [string] URL of the visited :py:class:`swh.model.model.Origin` +- ``visit`` [int] number of the visit for this ``origin`` this status concerns +- ``date`` [timestamp] date of the visit status update +- ``status`` [string] status (can be "created", "ongoing", "full" or "partial"), +- ``snapshot`` [bytes] identifier of the :py:class:`swh.model.model.Snaphot` this + visit resulted in (if ``status`` is "full" or "partial") +- ``metadata``: deprecated Example: .. code:: python { 'origin': 'https://pypi.org/project/stricttype/', 'visit': 524, 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'status': 'full', 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7", 'metadata': None } Extrinsic Metadata related Topics --------------------------------- Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. See :ref:`extrinsic-metadata-specification` for more details on the Extrinsic Metadata model. -`swh.journal.objects.metadata_authority` -++++++++++++++++++++++++++++++++++++++++ +``swh.journal.objects.metadata_authority`` +++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataAuthority` objects. Message format: -- `type` [string] -- `url` [string] -- `metadata` [dict] +- ``type`` [string] +- ``url`` [string] +- ``metadata`` [dict] Examples: .. 
code:: python { 'type': 'forge', 'url': 'https://guix.gnu.org/sources.json', 'metadata': {} } { 'type': 'deposit_client', 'url': 'https://www.softwareheritage.org', 'metadata': {'name': 'swh'} } -`swh.journal.objects.metadata_fetcher` -++++++++++++++++++++++++++++++++++++++ +``swh.journal.objects.metadata_fetcher`` +++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataFetcher` objects. Message format: -- `type` [string] -- `version` [string] -- `metadata` [dict] +- ``type`` [string] +- ``version`` [string] +- ``metadata`` [dict] Example: .. code:: python { 'name': 'swh.loader.package.cran.loader.CRANLoader', 'version': '0.15.0', 'metadata': {} } -`swh.journal.objects.raw_extrinsic_metadata` -++++++++++++++++++++++++++++++++++++++++++++ +``swh.journal.objects.raw_extrinsic_metadata`` +++++++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects. Message format: -- `type` [string] -- `target` [string] -- `discovery_date` [timestamp] -- `authority` [dict] -- `fetcher` [dict] -- `format` [string] -- `metadata` [bytes] -- `origin` [string] -- `visit` [int] -- `snapshot` [SWHID] -- `release` [SWHID] -- `revision` [SWHID] -- `path` [bytes] -- `directory` [SWHID] +- ``type`` [string] +- ``target`` [string] +- ``discovery_date`` [timestamp] +- ``authority`` [dict] +- ``fetcher`` [dict] +- ``format`` [string] +- ``metadata`` [bytes] +- ``origin`` [string] +- ``visit`` [int] +- ``snapshot`` [SWHID] +- ``release`` [SWHID] +- ``revision`` [SWHID] +- ``path`` [bytes] +- ``directory`` [SWHID] Example: .. 
code:: python { 'type': 'snapshot', 'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3', 'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'authority': { 'type': 'forge', 'url': 'https://pypi.org/', 'metadata': {} }, 'fetcher': { 'name': 'swh.loader.package.pypi.loader.PyPILoader', 'version': '0.10.0', 'metadata': {} }, 'format': 'pypi-project-json', 'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}', 'origin': 'https://pypi.org/project/schwurbler/' } Kafka message format -------------------- Each value of a Kafka message in a topic is a dictionary-like structure encoded as a msgpack_ byte string. Keys are ASCII strings. All values are encoded using default msgpack type system except for long integers for which we use a custom format using msgpack `extended type`_ to prevent overflow while packing some objects. Integer +++++++ -For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a +For long integers (that do not fit in the ``[-(2**63), 2 ** 64 - 1]`` range), a custom `extended type`_ based encoding scheme is used. -The `type` information can be: +The ``type`` information can be: -- `1` for positive (possibly long) integers, -- `2` for negative (possibly long) integers. +- ``1`` for positive (possibly long) integers, +- ``2`` for negative (possibly long) integers. The payload is simply the bytes (big endian) representation of the absolute value (always positive). 
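As a minimal sketch of the scheme described above, the following pure-Python helpers compute the extension type code and payload for an integer, and reverse the operation. The names ``encode_long_int`` and ``decode_long_int`` are hypothetical (they are not taken from the |swh| code base), and the msgpack framing of the extension value is deliberately left out.

```python
from typing import Tuple


def encode_long_int(value: int) -> Tuple[int, bytes]:
    """Return the (extension type code, payload) pair for an integer.

    Hypothetical helper: code 1 marks positive integers, code 2 negative
    ones; the payload is the big-endian encoding of the absolute value.
    """
    code = 1 if value >= 0 else 2
    magnitude = abs(value)
    # Use the smallest number of bytes that can hold the magnitude (at least 1).
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return code, magnitude.to_bytes(length, "big")


def decode_long_int(code: int, payload: bytes) -> int:
    """Rebuild the integer from its extension type code and payload."""
    magnitude = int.from_bytes(payload, "big")
    if code == 1:
        return magnitude
    if code == 2:
        return -magnitude
    raise ValueError(f"unexpected extension type code: {code}")


# Round trip for a value just outside the range msgpack handles natively.
code, payload = encode_long_int(2**64)
assert decode_long_int(code, payload) == 2**64
```

Keeping the sign in the type code means the payload is always an unsigned big-endian number, so a decoder only needs ``int.from_bytes`` and a sign flip.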
For example (adapted to standard integers for the sake of readability; these values are small so they will actually be encoded using the default msgpack format for integers): -- `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`) -- `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`) +- ``12345`` would be encoded as the extension value ``[1, [0x30, 0x39]]`` (aka ``0xd5013039``) +- ``-42`` would be encoded as the extension value ``[2, [0x2A]]`` (aka ``0xd4022a``) Datetime ++++++++ There are 2 types of dates that can be encoded in a Kafka message: - dates for git-like objects (:py:class:`swh.model.model.Revision` and :py:class:`swh.model.model.Release`): these dates are part of the hash computation used as identifier in the Merkle DAG. In order to fully support git repositories, a custom encoding is required. These dates (coming from the git data model) are encoded as a dictionary with: - - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys - (`seconds` and `microseconds`) + - ``timestamp`` [dict] POSIX timestamp of the date, as a dictionary with 2 keys + (``seconds`` and ``microseconds``) - - `offset` [int] offset of the date (in minutes) + - ``offset`` [int] offset of the date (in minutes) - - `negative_utc` [bool] only True for the very edge case where the date has a + - ``negative_utc`` [bool] only True for the very edge case where the date has a zero but negative offset value (which does not make much sense, but technically the git format permits) Example: .. code:: python { 'timestamp': {'seconds': 1480432642, 'microseconds': 0}, 'offset': 180, 'negative_utc': False } - These are denoted as `gitdate` below. + These are denoted as ``gitdate`` below. - other dates (resulting from the |swh| processing stack) are encoded using msgpack's Timestamp_ extended type. - These are denoted as `timestamp` below. + These are denoted as ``timestamp`` below. 
Note that these dates used to be encoded as a dictionary (beware: keys are bytes): .. code:: python { b"swhtype": "datetime", b"d": '2020-09-15T16:19:13.037809+00:00' } Person ++++++ :py:class:`swh.model.model.Person` objects represent a person in the |swh| Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer, or a :py:class:`swh.model.model.Release` author. :py:class:`swh.model.model.Person` objects are serialized as a dictionary like: .. code:: python { 'fullname': 'John Doe ', 'name': 'John Doe', 'email': 'john.doe@example.com' } For anonymized topics, :py:class:`swh.model.model.Person` entities have been anonymized prior to being serialized. The anonymized :py:class:`swh.model.model.Person` object is a dictionary like: .. code:: python { 'fullname': , 'name': null, 'email': null } -where the `` is computed from original values as a sha256 of the -original's `fullname`. +where the ```` is computed from original values as a sha256 of the +original's ``fullname``. .. _Kafka: https://kafka.apache.org .. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms .. _msgpack: https://msgpack.org/ .. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types .. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type
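The anonymization described in the Person section can be sketched as follows. This is an illustration only: ``anonymize_person`` is a hypothetical helper, not the function used by the |swh| journal writer, and it assumes the original ``fullname`` is available as bytes and that the raw SHA-256 digest (32 bytes) is what gets stored.

```python
import hashlib
from typing import Dict, Optional

PersonDict = Dict[str, Optional[bytes]]


def anonymize_person(person: PersonDict) -> PersonDict:
    """Return an anonymized copy of a serialized Person dictionary.

    Hypothetical helper: the fullname is replaced by its SHA-256 digest,
    while name and email are dropped.
    """
    fullname = person["fullname"]
    if fullname is None:
        raise ValueError("a Person dictionary needs a fullname to anonymize")
    return {
        "fullname": hashlib.sha256(fullname).digest(),
        "name": None,
        "email": None,
    }


original = {
    "fullname": b"John Doe <john.doe@example.com>",
    "name": b"John Doe",
    "email": b"john.doe@example.com",
}
anonymized = anonymize_person(original)
```

Because the digest depends only on the original ``fullname``, two messages written by the same person carry the same anonymized ``fullname``, so a consumer of the anonymized topics can still group objects by author without seeing the clear data.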