diff --git a/docs/faq/index.rst b/docs/faq/index.rst
index 633aa87..3d66d1b 100644
--- a/docs/faq/index.rst
+++ b/docs/faq/index.rst
@@ -1,268 +1,268 @@
.. _faq:
Frequently Asked Questions
**************************
.. contents::
:depth: 3
:local:
..
.. _faq_prerequisites:
Prerequisites for code contributions
====================================
What are the skills required to be a code contributor?
------------------------------------------------------
Generally, only Python and basic Git knowledge are required to contribute.
Other than that, it really depends on what technical areas you want to work on.
For student internships, the `internships`_ page details specific prerequisites
needed to pick up a topic.
Feel free to contact us via our `development channels
`__ to inquire about
specific skills needed to work on any topic of your interest.
What are the minimum system requirements (hardware/software) to run SWH locally?
--------------------------------------------------------------------------------
Python 3.7 or newer is required. See the :ref:`developer setup documentation
` for more details.
.. _faq_getting_started:
Getting Started
===============
What are the must read docs before I start contributing?
--------------------------------------------------------
We recommend you read the top links listed on the :ref:`documentation home page
` in order: getting started,
contributing, and architecture overview, as well as the data model.
Where can I see the getting started guide for developers?
---------------------------------------------------------
For hacking on the Software Heritage code base you should start from the
:ref:`developer-setup` tutorial.
How do I find an easy task to get started?
------------------------------------------
We maintain a `list of easy tickets
`__ to work on, see
the `Easy hacks page `__ for more
details.
I am skilled in one specific technology, can I find tickets requiring that skill?
---------------------------------------------------------------------------------
Unfortunately, not at the moment. However, you can browse the `internships`_
list for topics matching this skill, which may give you keywords to search for
in the `bug tracking system`_.
Either way, feel free to contact our developers through any of the
`development channels`_; we would love to work with you.
Where should I ask for technical help?
--------------------------------------
You can choose one of the following:
* `development channels`_
* `contact form`_ for any enquiries
.. _faq_run_swh:
Running an SWH instance locally
===============================
How do I run a local "toy version" of the archive?
--------------------------------------------------
The :ref:`getting-started` tutorial shows how to run a local instance of the
Software Heritage software infrastructure, using Docker.
My SWH stack is running locally. How do I get some initial data to play with?
------------------------------------------------------------------------------------
You can set up a job on your local machine; for example, you can
:ref:`schedule a listing task `.
Doing so on a small forge will allow you to load some repositories.
You can also directly trigger :ref:`loading from the cli `.
I have a SWH stack running locally. How do I set up a lister/loader job?
------------------------------------------------------------------------
See the :ref:`"Managing tasks" chapter `
in the Docker environment documentation.
How can I create a user in my local instance?
---------------------------------------------
This is not possible right now. Either stay anonymous, or use the user "test"
(password "test") or the user "ambassador" (password "ambassador").
Should I run/test the web app in any particular browser?
--------------------------------------------------------
We expect the web app to work on all major browsers. It uses mostly straightforward
HTML/CSS and a little JavaScript for search and source code highlighting, so testing in
a single browser is usually enough.
.. _faq_dataset:
Getting sample datasets
=======================
Is there a way to connect to SWH archived (production) database from my local machine?
--------------------------------------------------------------------------------------
We provide the archive as a dataset on public clouds, see the :ref:`swh-dataset
documentation `. We can
also provide read access to one of the main databases on request, `contact us`_.
.. _faq_error_bugs:
Errors and bugs
===============
I found a bug/improvement in the system, where should I report it?
------------------------------------------------------------------
Please report it on our `bug tracking system`_.
First create an account, then create a bug report using the "Create task" button. You
should get some feedback within a week (at least someone triaging your issue). If not,
`get in touch with us `_ to
make sure we did not miss it.
.. _faq_legal:
Legal matters
=============
Do I need to sign a form to contribute code?
--------------------------------------------
Yes. On your first diff, you will have to sign such a document.
Until it is signed, your diff content won't be visible.
Will my name be added to a CONTRIBUTORS file?
---------------------------------------------
You will be asked during review to add yourself.
.. _faq_code_review:
Code Review
===========
I found a straightforward typo fix, should my fix go through the entire code review process?
--------------------------------------------------------------------------------------------
You are welcome to drop us a message at one of the `development
channels`_; we will pick it up
and fix it so you don't have to follow the whole :ref:`code review process `.
What tests should I run before committing the code?
---------------------------------------------------
-Mostly run `tox` (or `pytest`) to run the unit tests suite. When you will propose a
-patch in our forge, the continuous integration factory will trigger a build (using `tox`
+Mostly run ``tox`` (or ``pytest``) to run the unit tests suite. When you will propose a
+patch in our forge, the continuous integration factory will trigger a build (using ``tox``
as well).
I am getting errors while trying to commit. What is going wrong?
----------------------------------------------------------------
Ensure you followed the proper guide to :ref:`setup your
environment `
and try again. If the error persists, you are welcome to drop us a message at one of the
`development channels`_.
Is there a format/guideline for writing commit messages?
--------------------------------------------------------
See the :ref:`git-style-guide`.
Is there some recommended git branching strategy?
-------------------------------------------------
It's left at the developer's discretion. Mostly people hack on their feature, then
propose a diff from a git branch or directly from the master branch. There is no
imperative. The only imperative is that for a feature to be packaged and deployed, it
needs to land first in the master branch.
How should I document the code I contribute to SWH?
---------------------------------------------------
Any new feature should include documentation in the form of comments and/or docstrings.
-Ideally, they should also be documented in plain English in the repository's `docs/`
-folder if relevant to a single package, or in the main `swh-docs` repository if it is a
+Ideally, they should also be documented in plain English in the repository's :file:`docs/`
+folder if relevant to a single package, or in the main ``swh-docs`` repository if it is a
transversal feature.
.. _faq_api:
Software Heritage API
=====================
How do I generate API usage credentials?
----------------------------------------
See the :ref:`Authentication guide `.
Is there a page where I can see all the API endpoints?
------------------------------------------------------
See the :swh_web:`API endpoint listing page `.
What are the usage limits for SWH APIs?
---------------------------------------
Maximum number of permitted requests per hour:
* 120 for anonymous users
* 1200 for authenticated users
It's described in the :swh_web:`rate limit documentation page `.
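
As a rough illustration of what these limits mean for a client, the sketch below computes the average delay between requests that stays within an hourly quota. The function name and the even-spacing strategy are assumptions made for this example; they are not part of the SWH API.

```python
# Hourly quotas quoted above.
ANONYMOUS_LIMIT = 120
AUTHENTICATED_LIMIT = 1200


def min_delay_seconds(requests_per_hour: int) -> float:
    """Average delay between requests that keeps a client under the quota."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return 3600 / requests_per_hour


print(min_delay_seconds(ANONYMOUS_LIMIT))      # 30.0
print(min_delay_seconds(AUTHENTICATED_LIMIT))  # 3.0
```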
.. It's temporarily here but it should be moved into its own sphinx instance at some
point in the future.
.. _faq_sysadm:
System Administration
=====================
How does SWH release?
---------------------
Release is mostly done:
- first in docker (somewhat as part of the development process)
- secondly packaged and deployed on staging (mostly)
- thirdly the same package is deployed on production
Is there a release cycle?
-------------------------
When a functionality is ready (tests ok, landed in master, docker run ok), the module is
tagged. The tag is pushed. This triggers a packaging build process. When the package is
ready, depending on the module [1], sysadms deploy the package with the help of puppet.
[1] swh-web module is mostly automatic. Other modules are not yet automatic as some
internal state migration (dbs) often enters the release cycle and due to the data
volume, that may need human intervention.
.. _bug tracking system: https://forge.softwareheritage.org/
.. _contact form: https://www.softwareheritage.org/contact/
.. _contact us: https://www.softwareheritage.org/contact/
.. _development channels: https://www.softwareheritage.org/community/developers/
.. _internships: https://wiki.softwareheritage.org/wiki/Internships
diff --git a/docs/glossary.rst b/docs/glossary.rst
index 97a2bad..3a40473 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -1,213 +1,213 @@
:orphan:
.. _glossary:
Glossary
========
.. glossary::
archive
An instance of the |swh| data store.
ark
`Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is
a multi-purpose persistent identifier for information objects of any type.
artifact
software artifact
An artifact is one of many kinds of tangible by-products produced during
the development of software.
content
blob
A (specific version of a) file stored in the archive, identified by its
cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also
known as: :term:`blob`. Note: it is incorrect to refer to Contents as
"files", because files are usually considered to be named, whereas
Contents are nameless. It is only in the context of specific
:term:`directories ` that :term:`contents ` acquire
(local) names.
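
The hashes listed above can be computed with the Python standard library alone. The sketch below is only illustrative: the function name and the dictionary keys are chosen for this example, and the "git-like" SHA1 is assumed to be the hash git itself computes for a blob (a ``blob <length>\0`` header followed by the raw bytes).

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute the identifiers listed above for a blob of bytes."""
    # git hashes a blob as b"blob <length>\0" followed by the raw bytes.
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "length": len(data),
    }


hashes = content_hashes(b"hello\n")
for key, value in hashes.items():
    print(key, value)
```

Note that no file name is involved anywhere: two files with different names but identical bytes yield the same content.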
deposit
A :term:`software artifact` that was pushed to the Software Heritage
archive (unlike :term:`loaders `, which pull artifacts).
A deposit is useful when you want to ensure a software release's source
code is archived in SWH even if it is not published anywhere else.
See also: the :ref:`swh-deposit` component, which implements a deposit
client and server.
directory
A set of named pointers to contents (file entries), directories (directory
entries) and revisions (revision entries). All entries are associated to
the local name of the entry (i.e., a relative path without any path
separator) and permission metadata (e.g., ``chmod`` value or equivalent).
doi
A Digital Object Identifier or DOI_ is a persistent identifier or handle
used to uniquely identify objects, standardized by the International
Organization for Standardization (ISO).
extid
external identifier
An identifier used by a system that does not fit the |swh|
:ref:`data model `, such as Mercurial's ``nodeid``,
or the hash of a tarball from a package manager.
They may be stored in the |swh| archive independently of the identified object,
to quickly match an external object (a changeset or tarball) to an object
in the archive without downloading it.
extrinsic metadata
Metadata about software that is not shipped as part of the software source
code, but is available instead via out-of-band means. For example,
homepage, maintainer contact information, and popularity information
("stars") as listed on GitHub/GitLab repository pages.
See also: :term:`intrinsic metadata` :ref:`architecture-metadata`.
journal
The :ref:`journal ` is the persistent logger of the |swh| architecture in charge
of logging changes of the archive, with publish-subscribe_ support.
lister
A :ref:`lister ` is a component of the |swh| architecture that is in charge of
enumerating the :term:`software origin` (e.g., VCS, packages, etc.)
available at a source code distribution place.
loader
A :ref:`loader ` is a component of the |swh| architecture
responsible for reading a source code :term:`origin` (typically a git
repository) and importing or updating its content in the :term:`archive` (i.e.,
adding new file contents into :term:`object storage` and repository structure
in the :term:`storage database`).
hash
cryptographic hash
checksum
digest
A fixed-size "summary" of a stream of bytes that is easy to compute, and
hard to reverse (see the Wikipedia article on cryptographic hash functions).
Also known as: :term:`checksum`, :term:`digest`.
indexer
A component of the |swh| architecture dedicated to producing metadata
linked to the known :term:`blobs ` in the :term:`archive`.
intrinsic identifier
A short character string that uniquely identifies an object,
that can be generated deterministically, using only the content of the object,
usually a :term:`cryptographic hash`.
This excludes network interaction and central authority.
Examples of intrinsic identifiers are: checksums (for files/strings only),
git hashes, and :ref:`SWHIDs `.
intrinsic metadata
Metadata about software that is shipped as part of the source code of the
software itself or as part of related artifacts (e.g., revisions,
releases, etc). For example, metadata that is shipped in `PKG-INFO` files
- for Python packages, `pom.xml` for Maven-based Java projects,
- `debian/control` for Debian packages, `metadata.json` for NPM, etc.
+ for Python packages, :file:`pom.xml` for Maven-based Java projects,
+ :file:`debian/control` for Debian packages, :file:`metadata.json` for NPM, etc.
See also: :term:`extrinsic metadata`, :ref:`architecture-metadata`.
objstore
objstorage
object store
object storage
Content-addressable object storage. It is the place where actual object
:term:`blobs ` are stored.
origin
software origin
data source
A location from which a coherent set of sources has been obtained, like a
git repository, a directory containing tarballs, etc.
person
An entity referenced by a revision as either the author or the committer
of the corresponding change. A person is associated to a full name and/or
an email address.
release
tag
milestone
A revision that has been marked as noteworthy with a specific name (e.g.,
a version number), together with associated development metadata (e.g.,
author, timestamp, etc).
revision
commit
changeset
A point in time snapshot of the content of a directory, together with
associated development metadata (e.g., author, timestamp, log message,
etc).
scheduler
The component of the |swh| architecture dedicated to the management and
the prioritization of the many tasks.
snapshot
The state of all visible branches during a specific visit of an origin.
storage
storage database
The main database of the |swh| platform in which all the elements of
the :ref:`data-model` but the :term:`content` are stored as a :ref:`Merkle
DAG `.
type of origin
Information about the kind of hosting, e.g., whether it is a forge, a
collection of repositories, a homepage publishing tarballs, or a one-shot
source code repository. For all kinds of repositories, please specify which
VCS is in use (Git, SVN, CVS, etc.).
vault
vault service
User-facing service that allows retrieving parts of the :term:`archive`
as self-contained bundles (e.g., individual releases, entire repository
snapshots, etc.)
visit
The passage of |swh| on a given :term:`origin`, to retrieve all source
code and metadata available there at the time. A visit object stores the
state of all visible branches (if any) available at the origin at visit
time; each of them points to a revision object in the archive. Future
visits of the same origin will create new visit objects, without removing
previous ones.
.. _blob: https://en.wikipedia.org/wiki/Binary_large_object
.. _DOI: https://www.doi.org
.. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
.. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
diff --git a/docs/journal.rst b/docs/journal.rst
index 83e554a..2c1eb1a 100644
--- a/docs/journal.rst
+++ b/docs/journal.rst
@@ -1,673 +1,673 @@
.. _journal-specs:
Journal Specification
=====================
The |swh| journal is a Kafka_-based stream of events for every added object in
the |swh| Archive and some of its related services, especially indexers.
Each topic_ will stream added elements for a given object type according to the
topic name.
Objects streamed in a topic are serialized versions of objects stored in the
|swh| Archive specified by the main |swh| :py:mod:`data model ` or
the :py:mod:`indexer object model `.
In this document we will describe expected messages in each topic, so a
potential consumer can easily cope with the |swh| journal without having to
read the source code or the |swh| :ref:`data model ` in detail (it
is however recommended to familiarize yourself with the latter).
Kafka message values are dictionary structures serialized as msgpack_, with a
few custom encodings. See the section `Kafka message format`_ below for a
complete description of the serialization format.
Note that each example given below shows the dictionary before being serialized
as a msgpack_ chunk.
Topics
------
There are several groups of topics:
- main storage Merkle-DAG related topics,
- other storage objects (not part of the Merkle DAG),
- indexer related objects (not yet documented below).
The topic prefix can be either `swh.journal.objects` or
`swh.journal.objects_privileged` (see below).
Anonymized topics
+++++++++++++++++
For topics that transport messages with user information (name and email
address), namely `swh.journal.objects.release`_ and
`swh.journal.objects.revision`_, there are 2 versions: an
anonymized topic, in which user information is obfuscated, and a pristine
version with clear data.
Access to pristine topics depends on ACLs linked to credentials used to connect
to the Kafka cluster.
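
The naming scheme above can be sketched as a small helper. The function name and the ``ANONYMIZED_TYPES`` set are hypothetical and only restate what this section says: release and revision topics exist in two versions, and the pristine one uses the ``swh.journal.objects_privileged`` prefix.

```python
# Object types whose messages carry user information, per the text above.
ANONYMIZED_TYPES = {"release", "revision"}


def topic_name(object_type: str, privileged: bool = False) -> str:
    """Build the topic name for an object type."""
    if privileged and object_type not in ANONYMIZED_TYPES:
        raise ValueError(f"no privileged topic for {object_type!r}")
    prefix = "swh.journal.objects_privileged" if privileged else "swh.journal.objects"
    return f"{prefix}.{object_type}"


print(topic_name("revision"))
print(topic_name("revision", privileged=True))
```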
List of topics
++++++++++++++
- `swh.journal.objects.origin`_
- `swh.journal.objects.origin_visit`_
- `swh.journal.objects.origin_visit_status`_
- `swh.journal.objects.snapshot`_
- `swh.journal.objects.release`_
- `swh.journal.objects.privileged_release `_
- `swh.journal.objects.revision`_
- `swh.journal.objects.privileged_revision `_
- `swh.journal.objects.directory`_
- `swh.journal.objects.content`_
- `swh.journal.objects.skipped_content`_
- `swh.journal.objects.metadata_authority`_
- `swh.journal.objects.metadata_fetcher`_
- `swh.journal.objects.raw_extrinsic_metadata`_
Topics for Merkle-DAG objects
-----------------------------
These topics are for the various objects stored in the |swh| Merkle DAG, see
the :ref:`data model ` for more details.
`swh.journal.objects.snapshot`
++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Snapshot` objects.
Message format:
- `branches` [dict] branches present in this snapshot,
- `id` [bytes] the intrinsic identifier of the
:py:class:`swh.model.model.Snapshot` object
with `branches` being a dictionary whose keys are branch names [bytes], and values a dictionary of:
- `target` [bytes] intrinsic identifier of the targeted object
- `target_type` [string] the type of the targeted object (can be "content",
"directory", "revision", "release", "snapshot" or "alias").
Example:
.. code:: python
{
'branches': {
b'refs/pull/1/head': {
'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c',
'target_type': 'revision'
},
b'refs/pull/2/head': {
'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ',
'target_type': 'revision'
},
b'refs/heads/master': {
'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT',
'target_type': 'revision'
},
b'HEAD': {
'target': b'refs/heads/master',
'target_type': 'alias'
}
},
'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^'
}
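
In the example above, ``HEAD`` is an alias whose target is another branch name rather than an object identifier. A consumer that wants the object behind an alias has to follow it; the sketch below shows one way to do so. The function name, the depth limit and the shortened placeholder identifiers are assumptions made for this illustration.

```python
from typing import Optional


def resolve_branch(branches: dict, name: bytes, max_depth: int = 10) -> Optional[dict]:
    """Follow "alias" branches until reaching a non-alias target."""
    for _ in range(max_depth):
        branch = branches.get(name)
        if branch is None:
            return None  # unknown branch or dangling alias
        if branch["target_type"] != "alias":
            return branch
        name = branch["target"]  # an alias targets another branch name
    raise ValueError("too many alias indirections (possible cycle)")


# Placeholder 20-byte identifier, not a real archive object.
branches = {
    b"refs/heads/master": {"target": b"\x7f\xc4" + bytes(18), "target_type": "revision"},
    b"HEAD": {"target": b"refs/heads/master", "target_type": "alias"},
}
print(resolve_branch(branches, b"HEAD")["target_type"])  # revision
```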
`swh.journal.objects.release`
+++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Release` objects.
This topic is anonymized. The non-anonymized version of this topic is
`swh.journal.objects_privileged.release`.
Message format:
- `name` [bytes] name (typically the version) of the release
- `message` [bytes] message of the release
- `target` [bytes] identifier of the target object
- `target_type` [string] type of the target, can be "content", "directory",
"revision", "release" or "snapshot"
- `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has
been forged by the loading process; this flag is not used for the id
computation,
- `author` [dict] the author of the release
- `date` [gitdate] the date of the release
- `id` [bytes] the intrinsic identifier of the
:py:class:`swh.model.model.Release` object
Example:
.. code:: python
{
'name': b'0.3',
'message': b'',
'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d',
'target_type': 'revision',
'synthetic': False,
'author': {
'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9',
'name': None,
'email': None
},
'date': {
'timestamp': {
'seconds': 1480432642,
'microseconds': 0
},
'offset': 180,
'negative_utc': False
},
'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86'
}
`swh.journal.objects.revision`
++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Revision` objects.
This topic is anonymized. The non-anonymized version of this topic is
`swh.journal.objects_privileged.revision`.
Message format:
-- `message` [bytes] the commit message for the revision
-- `author` [dict] the author of the revision
-- `committer` [dict] the committer of the revision
-- `date` [gitdate] the revision date
-- `committer_date` [gitdate] the revision commit date
-- `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg")
-- `directory` [bytes] the intrinsic identifier of the directory this revision links to
-- `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not,
-- `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the
+- ``message`` [bytes] the commit message for the revision
+- ``author`` [dict] the author of the revision
+- ``committer`` [dict] the committer of the revision
+- ``date`` [gitdate] the revision date
+- ``committer_date`` [gitdate] the revision commit date
+- ``type`` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg")
+- ``directory`` [bytes] the intrinsic identifier of the directory this revision links to
+- ``synthetic`` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not,
+- ``metadata`` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the
intrinsic identifier computation),
-- `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers
-- `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision`
-- `extra_headers` [list[(bytes, bytes)]] TODO
+- ``parents`` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers
+- ``id`` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision`
+- ``extra_headers`` [list[(bytes, bytes)]] TODO
Example:
.. code:: python
{
'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. lufylib.red is a place for support code.\n',
'author': {
'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
'name': None,
'email': None
},
'committer': {
'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z',
'name': None,
'email': None
},
'date': {
'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
'offset': 0,
'negative_utc': False
},
'committer_date': {
'timestamp': {'seconds': 1495977610, 'microseconds': 334267},
'offset': 0,
'negative_utc': False
},
'type': 'svn',
'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe',
'synthetic': True,
'metadata': None,
'parents': [
b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c',
],
'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8',
'perms': 33188},
{'name': b'lib',
'type': 'dir',
'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U',
'perms': 16384},
{'name': b'package.json',
'type': 'file',
'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x',
'perms': 33188}
],
'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P'
}
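
The ``perms`` integers in the entries above are easier to read in octal. Assuming they follow the usual POSIX/git mode encoding (an assumption of this sketch, as is the helper name), the standard ``stat`` module can decode them:

```python
import stat


def describe_perms(perms: int) -> tuple:
    """Return the octal form and an ls-style rendering of a mode value."""
    return oct(perms), stat.filemode(perms)


for perms in (33188, 16384):
    print(perms, *describe_perms(perms))
```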
Other Objects Topics
--------------------
These topics are for objects of the |swh| archive that are not part of the
Merkle DAG but are essential parts of the archive; see the :ref:`data model
` for more details.
-`swh.journal.objects.origin`
-++++++++++++++++++++++++++++
+``swh.journal.objects.origin``
+++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.Origin` objects.
Message format:
-- `url` [string] URL of the :py:class:`swh.model.model.Origin`
+- ``url`` [string] URL of the :py:class:`swh.model.model.Origin`
Example:
.. code:: python
{
"url": "https://github.com/vujkovicm/pml"
}
-`swh.journal.objects.origin_visit`
-++++++++++++++++++++++++++++++++++
+``swh.journal.objects.origin_visit``
+++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.OriginVisit` objects.
Message format:
-- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin`
-- `date` [timestamp] date of the visit
-- `type` [string] type of the loader used to perform the visit
-- `visit` [int] number of the visit for this `origin`
+- ``origin`` [string] URL of the visited :py:class:`swh.model.model.Origin`
+- ``date`` [timestamp] date of the visit
+- ``type`` [string] type of the loader used to perform the visit
+- ``visit`` [int] number of the visit for this ``origin``
Example:
.. code:: python
{
'origin': 'https://pypi.org/project/wasp-eureka/',
'date': Timestamp(seconds=1606260407, nanoseconds=818259954),
'type': 'pypi',
'visit': 505
}
-`swh.journal.objects.origin_visit_status`
-+++++++++++++++++++++++++++++++++++++++++
+``swh.journal.objects.origin_visit_status``
++++++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.OriginVisitStatus` objects.
Message format:
-- `origin` [string] URL of the visited :py:class:`swh.model.model.Origin`
-- `visit` [int] number of the visit for this `origin` this status concerns
-- `date` [timestamp] date of the visit status update
-- `status` [string] status (can be "created", "ongoing", "full" or "partial"),
-- `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snaphot` this
- visit resulted in (if `status` is "full" or "partial")
-- `metadata`: deprecated
+- ``origin`` [string] URL of the visited :py:class:`swh.model.model.Origin`
+- ``visit`` [int] number of the visit for this ``origin`` this status concerns
+- ``date`` [timestamp] date of the visit status update
+- ``status`` [string] status (can be "created", "ongoing", "full" or "partial"),
+- ``snapshot`` [bytes] identifier of the :py:class:`swh.model.model.Snapshot` this
+ visit resulted in (if ``status`` is "full" or "partial")
+- ``metadata``: deprecated
Example:
.. code:: python
{
'origin': 'https://pypi.org/project/stricttype/',
'visit': 524,
'date': Timestamp(seconds=1606260407, nanoseconds=818259954),
'status': 'full',
'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7",
'metadata': None
}
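
A consumer of this topic typically wants the snapshot only once a visit has completed. The sketch below applies the rule stated in the message format: the ``snapshot`` field is only meaningful when ``status`` is "full" or "partial". The helper name is hypothetical, and the identifier in the usage lines is a placeholder.

```python
from typing import Optional

KNOWN_STATUSES = {"created", "ongoing", "full", "partial"}
SNAPSHOT_STATUSES = {"full", "partial"}


def snapshot_of(message: dict) -> Optional[bytes]:
    """Return the snapshot id of a visit status message, if it has one."""
    status = message["status"]
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unexpected status: {status!r}")
    if status not in SNAPSHOT_STATUSES:
        return None  # visit still in progress: no snapshot yet
    return message.get("snapshot")


done = {"origin": "https://example.org/repo", "visit": 1,
        "status": "full", "snapshot": bytes(20)}
print(snapshot_of(done) is not None)              # True
print(snapshot_of({**done, "status": "ongoing"}))  # None
```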
Extrinsic Metadata related Topics
---------------------------------
Extrinsic metadata is information about software that is not part of the source
code itself but still closely related to the software. See
:ref:`extrinsic-metadata-specification` for more details on the Extrinsic
Metadata model.
-`swh.journal.objects.metadata_authority`
-++++++++++++++++++++++++++++++++++++++++
+``swh.journal.objects.metadata_authority``
+++++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.MetadataAuthority` objects.
Message format:
-- `type` [string]
-- `url` [string]
-- `metadata` [dict]
+- ``type`` [string]
+- ``url`` [string]
+- ``metadata`` [dict]
Examples:
.. code:: python
{
'type': 'forge',
'url': 'https://guix.gnu.org/sources.json',
'metadata': {}
}
{
'type': 'deposit_client',
'url': 'https://www.softwareheritage.org',
'metadata': {'name': 'swh'}
}
-`swh.journal.objects.metadata_fetcher`
-++++++++++++++++++++++++++++++++++++++
+``swh.journal.objects.metadata_fetcher``
+++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.MetadataFetcher` objects.
Message format:
-- `type` [string]
-- `version` [string]
-- `metadata` [dict]
+- ``name`` [string]
+- ``version`` [string]
+- ``metadata`` [dict]
Example:
.. code:: python
{
'name': 'swh.loader.package.cran.loader.CRANLoader',
'version': '0.15.0',
'metadata': {}
}
-`swh.journal.objects.raw_extrinsic_metadata`
-++++++++++++++++++++++++++++++++++++++++++++
+``swh.journal.objects.raw_extrinsic_metadata``
+++++++++++++++++++++++++++++++++++++++++++++++
Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects.
Message format:
-- `type` [string]
-- `target` [string]
-- `discovery_date` [timestamp]
-- `authority` [dict]
-- `fetcher` [dict]
-- `format` [string]
-- `metadata` [bytes]
-- `origin` [string]
-- `visit` [int]
-- `snapshot` [SWHID]
-- `release` [SWHID]
-- `revision` [SWHID]
-- `path` [bytes]
-- `directory` [SWHID]
+- ``type`` [string]
+- ``target`` [string]
+- ``discovery_date`` [timestamp]
+- ``authority`` [dict]
+- ``fetcher`` [dict]
+- ``format`` [string]
+- ``metadata`` [bytes]
+- ``origin`` [string]
+- ``visit`` [int]
+- ``snapshot`` [SWHID]
+- ``release`` [SWHID]
+- ``revision`` [SWHID]
+- ``path`` [bytes]
+- ``directory`` [SWHID]
Example:
.. code:: python
{
'type': 'snapshot',
'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3',
'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954),
'authority': {
'type': 'forge',
'url': 'https://pypi.org/',
'metadata': {}
},
'fetcher': {
'name': 'swh.loader.package.pypi.loader.PyPILoader',
'version': '0.10.0',
'metadata': {}
},
'format': 'pypi-project-json',
'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}',
'origin': 'https://pypi.org/project/schwurbler/'
}
Kafka message format
--------------------
Each value of a Kafka message in a topic is a dictionary-like structure
encoded as a msgpack_ byte string.
Keys are ASCII strings.
All values are encoded using default msgpack type system except for long
integers for which we use a custom format using msgpack `extended type`_ to
prevent overflow while packing some objects.
Integer
+++++++
-For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a
+For long integers (that do not fit in the ``[-(2**63), 2 ** 64 - 1]`` range), a
custom `extended type`_ based encoding scheme is used.
-The `type` information can be:
+The ``type`` information can be:
-- `1` for positive (possibly long) integers,
-- `2` for negative (possibly long) integers.
+- ``1`` for positive (possibly long) integers,
+- ``2`` for negative (possibly long) integers.
The payload is simply the bytes (big endian) representation of the absolute
value (always positive).
For example (adapted to standard integers for the sake of readability; these
values are small so they will actually be encoded using the default msgpack
format for integers):
-- `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`)
-- `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`)
+- ``12345`` would be encoded as the extension value ``[1, [0x30, 0x39]]`` (aka ``0xd5013039``)
+- ``-42`` would be encoded as the extension value ``[2, [0x2A]]`` (aka ``0xd4022a``)
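
The scheme above can be sketched in plain Python, without any msgpack library. The function names are hypothetical; only the (type, payload) pair is produced here, and wrapping it into an actual msgpack extension value is left to the msgpack implementation in use.

```python
def encode_long_int(value: int) -> tuple:
    """Return the (type, payload) pair described above."""
    ext_type = 1 if value >= 0 else 2
    magnitude = abs(value)
    # At least one byte, so that zero still has a payload.
    length = max(1, (magnitude.bit_length() + 7) // 8)
    return ext_type, magnitude.to_bytes(length, "big")


def decode_long_int(ext_type: int, payload: bytes) -> int:
    """Inverse of encode_long_int."""
    magnitude = int.from_bytes(payload, "big")
    if ext_type == 1:
        return magnitude
    if ext_type == 2:
        return -magnitude
    raise ValueError(f"unknown extension type: {ext_type}")


print(encode_long_int(12345))  # (1, b'09'), i.e. bytes 0x30 0x39
print(encode_long_int(-42))    # (2, b'*'), i.e. byte 0x2a
```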
Datetime
++++++++
There are 2 types of dates that can be encoded in a Kafka message:
- dates for git-like objects (:py:class:`swh.model.model.Revision` and
:py:class:`swh.model.model.Release`): these dates are part of the hash
computation used as identifier in the Merkle DAG. In order to fully support
git repositories, a custom encoding is required. These dates (coming from the
git data model) are encoded as a dictionary with:
- - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys
- (`seconds` and `microseconds`)
+ - ``timestamp`` [dict] POSIX timestamp of the date, as a dictionary with 2 keys
+ (``seconds`` and ``microseconds``)
- - `offset` [int] offset of the date (in minutes)
+ - ``offset`` [int] offset of the date (in minutes)
- - `negative_utc` [bool] only True for the very edge case where the date has a
+ - ``negative_utc`` [bool] only True for the very edge case where the date has a
zero but negative offset value (which does not make much sense, but which
the git format technically permits)
Example:
.. code:: python
{
'timestamp': {'seconds': 1480432642, 'microseconds': 0},
'offset': 180,
'negative_utc': False
}
- These are denoted as `gitdate` below.
+ These are denoted as ``gitdate`` below.
- other dates (resulting from the |swh| processing stack) are encoded using
msgpack's Timestamp_ extended type.
- These are denoted as `timestamp` below.
+ These are denoted as ``timestamp`` below.
Note that these dates used to be encoded as a dictionary (beware: keys are bytes):
.. code:: python
{
b"swhtype": "datetime",
b"d": '2020-09-15T16:19:13.037809+00:00'
}
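
Both dictionary forms above can be turned into timezone-aware ``datetime`` objects with the standard library. This is a minimal sketch with hypothetical function names. It ignores ``negative_utc``, which ``datetime`` cannot represent, and it assumes the offset is within the range Python accepts (strictly less than 24 hours in either direction).

```python
from datetime import datetime, timedelta, timezone


def gitdate_to_datetime(gitdate: dict) -> datetime:
    """Convert a gitdate dictionary to an aware datetime."""
    ts = gitdate["timestamp"]
    tz = timezone(timedelta(minutes=gitdate["offset"]))
    utc = datetime.fromtimestamp(ts["seconds"], tz=timezone.utc)
    return (utc + timedelta(microseconds=ts["microseconds"])).astimezone(tz)


def legacy_date_to_datetime(value: dict) -> datetime:
    """Convert the former dictionary encoding (note the bytes keys)."""
    return datetime.fromisoformat(value[b"d"])


gitdate = {
    "timestamp": {"seconds": 1480432642, "microseconds": 0},
    "offset": 180,
    "negative_utc": False,
}
print(gitdate_to_datetime(gitdate).isoformat())  # 2016-11-29T18:17:22+03:00
```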
Person
++++++
:py:class:`swh.model.model.Person` objects represent a person in the |swh|
Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer,
or a :py:class:`swh.model.model.Release` author.
:py:class:`swh.model.model.Person` objects are serialized as a dictionary like:
.. code:: python
{
'fullname': 'John Doe ',
'name': 'John Doe',
'email': 'john.doe@example.com'
}
For anonymized topics, :py:class:`swh.model.model.Person` entities have been
anonymized prior to being serialized. The anonymized
:py:class:`swh.model.model.Person` object is a dictionary like:
.. code:: python
{
'fullname': ,
'name': null,
'email': null
}
-where the `` is computed from original values as a sha256 of the
-original's `fullname`.
+where the ```` is computed from original values as a sha256 of the
+original's ``fullname``.
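
A minimal sketch of that anonymization, assuming ``fullname`` is available as bytes and that the raw 32-byte digest (rather than its hexadecimal form) is stored, as the anonymized examples earlier in this document suggest. The function name is hypothetical.

```python
import hashlib


def anonymize_person(person: dict) -> dict:
    """Replace personal data with a sha256 digest of the original fullname."""
    return {
        "fullname": hashlib.sha256(person["fullname"]).digest(),
        "name": None,
        "email": None,
    }


anonymized = anonymize_person({
    "fullname": b"John Doe",
    "name": b"John Doe",
    "email": b"john.doe@example.com",
})
print(len(anonymized["fullname"]))  # 32
```

Because the digest is deterministic, the same person keeps the same anonymized ``fullname`` across messages, so authorship can still be grouped without exposing the name or email.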
.. _Kafka: https://kafka.apache.org
.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms
.. _msgpack: https://msgpack.org/
.. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
.. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type