diff --git a/docs/archive-changelog.rst b/docs/archive-changelog.rst index d24ca8e..115d583 100644 --- a/docs/archive-changelog.rst +++ b/docs/archive-changelog.rst @@ -1,132 +1,132 @@ .. _archive-changelog: Software Heritage --- Archive ChangeLog ======================================= Below you can find a time-indexed list of notable events and changes to archival policies in the Software Heritage Archive. Each of them might have (had) an impact on how content is archived and explain apparent statistical anomalies or other changes in archival behavior over time. They are collected in this document for historical reasons. 2020 ---- * **2020-10-06 - 2020-11-23:** source code crawlers have been paused to avoid an out of disk condition, due to an unexpected delay in the arrival of new storage hardware. Push archival (both deposit_ and `save code now`_) remained in operation. (tracking: `T2656 `_) * **2020-09-15:** completed first archival of, and added to regular crawling `GNU Guix System`_ (tracking: `T2594 `_) * **2020-06-11:** completed integration with the IPOL_ journal, allowing paper authors to explicitly deposit_ source code to the archive (`announcement `_) * **2020-05-25:** completed first archival of, and added to regular crawling NixOS_ (tracking: `T2411 `_) 2019 ---- * **2019-09-10:** completed first archival of Bitbucket_ Git repositories and added Bitbucket as a regularly crawled forge (tracking: `T592 `_) * **2019-06-30:** completed first archival of, and added to regular crawling, several GitLab_ instances: `0xacab.org `_, `framagit.org `_, `gite.lirmm.fr `_, `gitlab.common-lisp.net `_, `gitlab.freedesktop.org `_, `gitlab.gnome.org `_, `gitlab.inria.fr `_, `salsa.debian.org `_ * **2019-06-12:** completed first archival of CRAN_ packages and added CRAN as a regularly crawled package repository (tracking: `T1709 `_) * **2019-06-11:** completed a full archival of GNU_ source code releases from `ftp.gnu.org`_, and added it to regular crawling (tracking: `T1722 `_) -* **2019-05-27:** completed a full archival of NPM_ packages andded it as a +* **2019-05-27:** completed a full archival of NPM_ packages and added it as a regularly crawled package repository (tracking: `T1378 `_) * **2019-01-10:** enabled the `save code now`_ service, allowing users to explicitly request archival of a specific source code repository (`announcement `_) 2018 ---- * **2018-10-10:** completed first archival of PyPI_ packages and added PyPI as a regularly crawled package repository (`announcement `_) * **2018-09-25:** completed integration with HAL_, allowing paper authors to explicitly deposit_ source code to the archive (`announcement `_) * **2018-08-31:** completed first archival of public GitLab_ repositories from `gitlab.com `_ and added it as a regularly crawled forge (tracking: `T1111 `_) * **2018-03-21:** completed archival of `Google Code`_ Mercurial repositories. (tracking: `T682 `_) * **2018-02-20:** completed archival of Debian_ packages and added Debian as a regularly crawled distribution (`announcement `_) 2017 ---- * **2017-10-02:** completed archival of `Google Code`_ Subversion repositories (tracking: `T617 `_) * **2017-06-06:** completed archival of `Google Code`_ Git repositories (tracking: `T673 `_) 2016 ---- * **2016-04-04:** completed archival of the Gitorious_ (tracking: `T312 `_) 2015 ---- * **2015-11-06:** archived all GNU_ source code releases from `ftp.gnu.org`_ (tracking: `T90 `_) * **2015-07-28:** started archiving public GitHub_ repositories .. 
_Bitbucket: https://bitbucket.org .. _CRAN: https://cran.r-project.org .. _Debian: https://www.debian.org .. _GNU Guix System: https://guix.gnu.org/ .. _GNU: https://en.wikipedia.org/wiki/Google_Code .. _GitHub: https://github.com .. _GitLab: https://gitlab.com .. _Gitorious: https://en.wikipedia.org/wiki/Gitorious .. _Google Code: https://en.wikipedia.org/wiki/Google_Code .. _HAL: https://hal.archives-ouvertes.fr .. _IPOL: http://www.ipol.im .. _NPM: https://www.npmjs.com .. _NixOS: https://nixos.org/ .. _PyPI: https://pypi.org .. _deposit: https://deposit.softwareheritage.org .. _ftp.gnu.org: http://ftp.gnu.org .. _save code now: https://save.softwareheritage.org diff --git a/docs/contributing/phabricator.rst b/docs/contributing/phabricator.rst index 313bf0b..cca5d4d 100644 --- a/docs/contributing/phabricator.rst +++ b/docs/contributing/phabricator.rst @@ -1,289 +1,289 @@ .. highlight:: bash .. _patch-submission: Submitting patches ================== `Phabricator`_ is the tool that Software Heritage uses as its coding/collaboration forge. Software Heritage's Phabricator instance can be found at https://forge.softwareheritage.org/ .. _Phabricator: http://phabricator.org/ Code Review in Phabricator -------------------------- We use the Differential application of Phabricator to perform :ref:`code reviews ` in the context of Software Heritage. * we use Git and ``history.immutable=true`` (but beware as that is partly a Phabricator misnomer, read on) * when code reviews are required, developers will be allowed to push directly to master once an accepted Differential diff exists Configuration +++++++++++++ Arcanist configuration ^^^^^^^^^^^^^^^^^^^^^^ Authentication ~~~~~~~~~~~~~~ First, you should install Arcanist and authenticate it to Phabricator:: sudo apt-get install arcanist arc set-config default https://forge.softwareheritage.org/ arc install-certificate arc will prompt you to login into Phabricator via web (which will ask your personal Phabricator credentials). You will then have to copy paste the API token from the web page to arc, and hit Enter to complete the certificate installation. Immutability ~~~~~~~~~~~~ When using git, Arcanist by default mess with the local history, rewriting commits at the time of first submission. To avoid that we use so called `history immutability`_ .. _history immutability: https://secure.phabricator.com/book/phabricator/article/arcanist_new_project/#history-mutability-git To that end, you shall configure your ``arc`` accordingly:: arc set-config history.immutable true Note that this does **not** mean that you are forbidden to rewrite your local branches (e.g., with ``git rebase``). Quite the contrary: you are encouraged to locally rewrite branches before pushing to ensure that commits are logically separated and your commit history easy to bisect. The above setting just means that *arc* will not rewrite commit history under your nose. Enabling ``git push`` to our forge ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The way we've configured our review setup for continuous integration needs you to configure git to allow pushes to our forge. There's two ways you can do this : setting a ssh key to push over ssh, or setting a specific password for git pushes over https. SSH key for pushes ~~~~~~~~~~~~~~~~~~ In your forge User settings page (On the top right, click on your avatar, then click *Settings*), you have access to a *Authentication* > *SSH Public Keys* section (Direct link: ``hxxps://forge.softwareheritage.org/settings/user//page/ssh/``). 
You then have the option to upload a SSH public key, which will authenticate your pushes. You then need to configure ssh/git to use that key pair, for instance by editing the ``~/.ssh/config`` file. Finally, you should configure git to push over ssh when pushing to https://forge.softwareheritage.org, by running the following command:: git config --global url.git@forge.softwareheritage.org:.pushInsteadOf https://forge.softwareheritage.org This lets git know that it should use ``git@forge.softwareheritage.org:`` as a base url when pushing repositories cloned from forge.softwareheritage.org over https. VCS password for pushes ~~~~~~~~~~~~~~~~~~~~~~~ If you're not comfortable setting up SSH to upload your changes, you have the option of setting a VCS password. This password, *separate from your account password*, allows Phabricator to authenticate your uploads over HTTPS. In your forge User settings page (On the top right, click on your avatar, then click *Settings*), you need to use the *Authentication* > *VCS Password* section to set your VCS password (Direct link: ``hxxps://forge.softwareheritage.org/settings/user//page/vcspassword/``). If you still get a 403 error on push, this means you need a forge administrator to enable HTTPS pushes for the repository (which wasn't done by default in historical repositories). Please drop by on IRC and let us know! Workflow ++++++++ * work in a feature branch: ``git checkout -b my-feat`` * initial review request: hack/commit/hack/commit ; ``arc diff origin/master`` * react to change requests: hack/commit/hack/commit ; ``arc diff --update Dxx origin/master`` * landing change: ``git checkout master ; git merge my-feat ; git push`` Starting a new feature and submit it for review ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use a **one branch per feature** workflow, with well-separated **logical commits** (:ref:`following those conventions `). Please open one diff per logical commit to keep the diff size to a minimum. .. code-block:: git checkout -b my-shiny-feature ... hack hack hack ... git commit -m 'architecture skeleton for my-shiny-feature' ... hack hack hack ... git commit -m 'my-shiny-feature: implement module foo' ... etc ... Please, follow the To **submit your code for review** the first time:: arc diff origin/master arc will prompt for a **code review message**. Provide the following information: * first line: *short description* of the overall work (i.e., the feature you're working on). This will become the title of the review * *Summary* field (optional): *long description* of the overall work; the field can continue in subsequent lines, up to the next field. This will become the "Summary" section of the review * *Test Plan* field (optional): write here if something special is needed to test your change * *Reviewers* field (optional): the (Phabricator) name(s) of desired reviewers. If you don't specify one (recommended) the default reviewers will be chosen * *Subscribers* field (optional): the (Phabricator) name(s) of people that will be notified about changes to this review request. In most cases it should be left empty For example:: mercurial loader Summary: first stab at a mercurial loader (T329) The implementation follows the plan detailed in F2F discussion with @foo. Performances seem decent enough for a first trial (XXX seconds for YYY repository that contains ZZZ patches). Test plan: Reviewers: Subscribers: foo After completing the message arc will submit the review request and tell you its number and URL:: [...] 
Created a new Differential revision: Revision URI: https://forge.softwareheritage.org/Dxx .. _arc-update: Updating your branch to reflect requested changes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Your feature might get accepted as is, YAY! Or, reviewers might request changes; no big deal! Use the Differential web UI to follow-up to received comments, if needed. To implement requested changes in the code, hack on your branch as usual by: * adding new commits, and/or * rewriting old commits with git rebase (to preserve a nice, easy to bisect history) * pulling on master and rebasing your branch against it if meanwhile someone landed commits on master: .. code-block:: git checkout master git pull git checkout my-shiny-feature git rebase master When you're ready to **update your review request**:: arc diff --update Dxx HEAD~ Arc will prompt you for a message: describe what you've changed w.r.t. the previous review request, free form. Your message will become the changelog entry in Differential for this new version of the diff. Differential only care about the code diff, and not about the commits or their order. Therefore each "update" can be a completely different series of commits, possibly rewritten from the previous submission. Dependencies between diffs ^^^^^^^^^^^^^^^^^^^^^^^^^^ Note that you can manage diff dependencies within the same module with the following keyword in the diff description:: Depends on Dxx That allows to keep a logical view in your diff. It's not strictly necessary (because the tooling now deals with it properly) but it might help reviewers or yourself to do so. Landing your change onto master ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once your change has been approved in Differential, you will be able to land it onto the master branch. Before doing so, you're encouraged to **clean up your git commit history**, reordering/splitting/merging commits as needed to have separate logical commits and an easy to bisect history. Update the diff :ref:`following the prior section ` -(It'd be good to let the ci build finish to make sure everything is still green). +(It'd be good to let the CI build finish to make sure everything is still green). Once you're happy you can **push to origin/master** directly, e.g.:: git checkout master git merge --ff-only my-shiny-feature git push ``--ff-only`` is optional, and makes sure you don't unintentionally create a merge commit. Optionally you can then delete your local feature branch:: git branch -d my-shiny-feature Reviewing locally / landing someone else's changes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can do local reviews of code with arc patch:: arc patch Dxyz This will create a branch **arcpatch-Dxyz** containing the changes on your local checkout. You can then merge those changes upstream with:: git checkout master git merge --ff arcpatch-Dxyz git push origin master or, alternatively:: arc land --squash See also -------- * :ref:`code-review` for guidelines on how code is reviewed when developing for Software Heritage diff --git a/docs/glossary.rst b/docs/glossary.rst index d78bdea..c7a6bca 100644 --- a/docs/glossary.rst +++ b/docs/glossary.rst @@ -1,193 +1,193 @@ :orphan: .. _glossary: Glossary ======== .. glossary:: archive An instance of the |swh| data store. ark `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is a multi-purpose persistent identifier for information objects of any type. 
artifact
software artifact
    An artifact is one of many kinds of tangible by-products produced during the development of software.

content
blob
    A (specific version of a) file stored in the archive, identified by its cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also known as: :term:`blob`. Note: it is incorrect to refer to Contents as "files", because files are usually considered to be named, whereas Contents are nameless. It is only in the context of specific :term:`directories ` that :term:`contents ` acquire (local) names.

deposit
    A :term:`software artifact` that was pushed to the Software Heritage archive (unlike :term:`loaders `, which pull artifacts). A deposit is useful when you want to ensure a software release's source code is archived in SWH even if it is not published anywhere else. See also: the :ref:`swh-deposit` component, which implements a deposit client and server.

directory
    A set of named pointers to contents (file entries), directories (directory entries) and revisions (revision entries). All entries are associated to the local name of the entry (i.e., a relative path without any path separator) and permission metadata (e.g., ``chmod`` value or equivalent).

doi
    A Digital Object Identifier or DOI_ is a persistent identifier or handle used to uniquely identify objects, standardized by the International Organization for Standardization (ISO).

extrinsic metadata
    Metadata about software that is not shipped as part of the software source code, but is available instead via out-of-band means. For example, homepage, maintainer contact information, and popularity information ("stars") as listed on GitHub/GitLab repository pages. See also: :term:`intrinsic metadata`.

journal
    The :ref:`journal ` is the persistent logger of the |swh| architecture in charge of logging changes of the archive, with publish-subscribe_ support.

lister
    A :ref:`lister ` is a component of the |swh| architecture that is in charge of enumerating the :term:`software origin` (e.g., VCS, packages, etc.) available at a source code distribution place.

loader
    A :ref:`loader ` is a component of the |swh| architecture responsible for reading a source code :term:`origin` (typically a git
-   reposiitory) and import or update its content in the :term:`archive` (ie.
+   repository) and import or update its content in the :term:`archive` (i.e.
    add new file contents in the :term:`object storage` and the repository structure in the :term:`storage database`).

hash
cryptographic hash
checksum
digest
    A fixed-size "summary" of a stream of bytes that is easy to compute, and hard to reverse. (Cryptographic hash function Wikipedia article) Also known as: :term:`checksum`, :term:`digest`.

indexer
    A component of the |swh| architecture dedicated to producing metadata linked to the known :term:`blobs ` in the :term:`archive`.

intrinsic metadata
    Metadata about software that is shipped as part of the source code of the software itself or as part of related artifacts (e.g., revisions, releases, etc). For example, metadata that is shipped in `PKG-INFO` files for Python packages, `pom.xml` for Maven-based Java projects, `debian/control` for Debian packages, `metadata.json` for NPM, etc. See also: :term:`extrinsic metadata`.

objstore
objstorage
object store
object storage
    Content-addressable object storage. It is the place where the actual object :term:`blobs ` are stored.

origin
software origin
data source
    A location from which a coherent set of sources has been obtained, like a git repository, a directory containing tarballs, etc.

person
    An entity referenced by a revision as either the author or the committer of the corresponding change. A person is associated to a full name and/or an email address.

release
tag
milestone
    A revision that has been marked as noteworthy with a specific name (e.g., a version number), together with associated development metadata (e.g., author, timestamp, etc).

revision
commit
changeset
    A point in time snapshot of the content of a directory, together with associated development metadata (e.g., author, timestamp, log message, etc).

scheduler
    The component of the |swh| architecture dedicated to the management and the prioritization of the many tasks.

snapshot
    The state of all visible branches during a specific visit of an origin.

storage
storage database
    The main database of the |swh| platform in which all the elements of the :ref:`data-model` except the :term:`content` are stored as a :ref:`Merkle DAG `.

type of origin
    Information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, a homepage publishing tarballs, or a one-shot source code repository. For all kinds of repositories, please specify which VCS system is in use (Git, SVN, CVS, etc.).

vault
vault service
    User-facing service that allows retrieving parts of the :term:`archive` as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)

visit
    The passage of |swh| on a given :term:`origin`, to retrieve all source code and metadata available there at the time. A visit object stores the state of all visible branches (if any) available at the origin at visit time; each of them points to a revision object in the archive. Future visits of the same origin will create new visit objects, without removing previous ones.

.. _blob: https://en.wikipedia.org/wiki/Binary_large_object
.. _DOI: https://www.doi.org
.. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
.. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern

diff --git a/docs/journal.rst b/docs/journal.rst
index d1d3983..25746c4 100644
--- a/docs/journal.rst
+++ b/docs/journal.rst
@@ -1,673 +1,673 @@

.. _journal-specs:

Software Heritage Journal --- Specifications
============================================

-The |swh| journal is a kafka_-based stream of events for every added object in
+The |swh| journal is a Kafka_-based stream of events for every added object in
the |swh| Archive and some of its related services, especially indexers.

Each topic_ will stream added elements for a given object type according to the topic name.

Objects streamed in a topic are serialized versions of objects stored in the |swh| Archive specified by the main |swh| :py:mod:`data model ` or the :py:mod:`indexer object model `.

In this document we will describe the expected messages in each topic, so a potential consumer can easily cope with the |swh| journal without having to read the source code or the |swh| :ref:`data model ` in detail (it is however recommended to familiarize yourself with the latter).

Kafka message values are dictionary structures serialized as msgpack_, with a few custom encodings. See the section `Kafka message format`_ below for a complete description of the serialization format.
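As an illustration, here is a minimal consumer sketch using the ``confluent-kafka`` and ``msgpack`` Python packages; the broker address and group id are placeholders, and a real deployment would more likely rely on the ``swh.journal`` client. It also decodes the custom long-integer extension described in the `Kafka message format`_ section below.

.. code:: python

   import msgpack
   from confluent_kafka import Consumer

   def decode_ext(code, data):
       # Custom extension types used by the journal for long integers
       # (see "Kafka message format" below): 1 = positive, 2 = negative,
       # payload is the big-endian bytes of the absolute value.
       if code == 1:
           return int.from_bytes(data, "big")
       if code == 2:
           return -int.from_bytes(data, "big")
       return msgpack.ExtType(code, data)

   consumer = Consumer({
       "bootstrap.servers": "broker.example.org:9092",  # placeholder address
       "group.id": "my-journal-client",                 # placeholder group id
       "auto.offset.reset": "earliest",
   })
   consumer.subscribe(["swh.journal.objects.origin"])

   while True:
       message = consumer.poll(timeout=1.0)
       if message is None or message.error():
           continue
       # Each message value is a msgpack-serialized dictionary,
       # as shown in the examples below.
       origin = msgpack.unpackb(message.value(), raw=False, ext_hook=decode_ext)
       print(origin["url"])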
Note that each example given below show the dictionary before being serialized as a msgpack_ chunk. Topics ------ There are several groups of topics: - main storage Merkle-DAG related topics, - other storage objects (not part of the Merkle DAG), - indexer related objects (not yet documented below). Topics prefix can be either `swh.journal.objects` or `swh.journal.objects_privileged` (see below). Anonymized topics +++++++++++++++++ For topics that transport messages with user information (name and email address), namely `swh.journal.objects.release`_ and `swh.journal.objects.revision`_, there are 2 versions of those: one is an anonymized topic, in which user information are obfuscated, and a pristine version with clear data. Access to pristine topics depends on ACLs linked to credentials used to connect to the Kafka cluster. List of topics ++++++++++++++ - `swh.journal.objects.origin`_ - `swh.journal.objects.origin_visit`_ - `swh.journal.objects.origin_visit_status`_ - `swh.journal.objects.snapshot`_ - `swh.journal.objects.release`_ - `swh.journal.objects.privileged_release `_ - `swh.journal.objects.revision`_ - `swh.journal.objects.privileged_revision `_ - `swh.journal.objects.directory`_ - `swh.journal.objects.content`_ - `swh.journal.objects.skippedcontent`_ - `swh.journal.objects.metadata_authority`_ - `swh.journal.objects.metadata_fetcher`_ - `swh.journal.objects.raw_extrinsic_metadata`_ -Topics for Merkel-DAG objects +Topics for Merkle-DAG objects ----------------------------- These topics are for the various objects stored in the |swh| Merkle DAG, see the :ref:`data model ` for more details. `swh.journal.objects.snapshot` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Snapshot` objects. Message format: - `branches` [dict] branches present in this snapshot, - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Snapshot` object with `branches` being a dictionary which keys are branch names [bytes], and values a dictionary of: - `target` [bytes] intrinsic identifier of the targeted object - `target_type` [string] the type of the targeted object (can be "content", "directory", "revision", "release", "snapshot" or "alias"). Example: .. code:: python { 'branches': { b'refs/pull/1/head': { 'target': b'\x07\x10\\\xfc\xae\x1f\xb1\xf9\xb5\xad\x8bI\xf1G\x10\x9a\xba>8\x0c', 'target_type': 'revision' }, b'refs/pull/2/head': { 'target': b'\x1a\x868-\x9b\x1d\x00\xfbd\xeaH\xc88\x9c\x94\xa1\xe0U\x9bJ', 'target_type': 'revision' }, b'refs/heads/master': { 'target': b'\x7f\xc4\xfe4f\x7f\xda\r\x0e[\xba\xbc\xd7\x12d#\xf7&\xbfT', 'target_type': 'revision' }, b'HEAD': { 'target': b'refs/heads/master', 'target_type': 'alias' } }, 'id': b'\x10\x00\x06\x08\xe9E^\x0c\x9bS\xa5\x05\xa8\xdf\xffw\x88\xb8\x93^' } `swh.journal.objects.release` +++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Release` objects. This topics is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.release`. 
Message format: - `name` [bytes] name (typically the version) of the release - `message` [bytes] message of the release - `target` [bytes] identifier of the target object - `target_type` [string] type of the target, can be "content", "directory", "revision", "release" or "snapshot" - `synthetic` [bool] True if the :py:class:`swh.model.model.Release` object has been forged by the loading process; this flag is not used for the id computation, - `author` [dict] the author of the release - `date` [gitdate] the date of the release - `id` [bytes] the intrinsic identifier of the :py:class:`swh.model.model.Release` object Example: .. code:: python { 'name': b'0.3', 'message': b'', 'target': b'<\xd6\x15\xd9\xef@\xe0[\xe7\x11=\xa1W\x11h%\xcc\x13\x96\x8d', 'target_type': 'revision', 'synthetic': False, 'author': { 'fullname': b'\xf5\x8a\x95k\xffKgN\x82\xd0f\xbf\x12\xe8w\xc8a\xf79\x9e\xf4V\x16\x8d\xa4B\x84\x15\xea\x83\x92\xb9', 'name': None, 'email': None }, 'date': { 'timestamp': { 'seconds': 1480432642, 'microseconds': 0 }, 'offset': 180, 'negative_utc': False }, 'id': b'\xd0\x00\x06u\x05uaK`.\x0c\x03R%\xca,\xe1x\xd7\x86' } `swh.journal.objects.revision` ++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Revision` objects. This topics is anonymized. The non-anonymized version of this topic is `swh.journal.objects_privileged.revision`. Message format: - `message` [bytes] the commit message for the revision - `author` [dict] the author of the revision - `committer` [dict] the committer of the revision - `date` [gitdate] the revision date - `committer_date` [gitdate] the revision commit date - `type` [string] the type of the revision (can be "git", "tar", "dsc", "svn", "hg") - `directory` [bytes] the intrinsic identifier of the directory this revision links to - `synthetic` [bool] whether this :py:class:`swh.model.model.Revision` is synthetic or not, - `metadata` [bytes] the metadata linked to this :py:class:`swh.model.model.Revision` (not part of the intrinsic identifier computation), - `parents` [list[bytes]] list of parent :py:class:`swh.model.model.Revision` intrinsic identifiers - `id` [bytes] intrinsic identifier of the :py:class:`swh.model.model.Revision` - `extra_headers` [list[(bytes, bytes)]] TODO Example: .. code:: python { 'message': b'I now arrange to be able to create a prettyprinted version of the Pascal\ncode to make review of translation of it easier, and I have thought a bit\nmore about coping with Pastacl variant records and the like, but have yet to\nimplement everything. 
lufylib.red is a place for support code.\n', 'author': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'committer': { 'fullname': b'\xf3\xa7\xde7[\x8b#=\xe48\\/\xa1 \xed\x05NA\xa6\xf8\x9c\n\xad5\xe7\xe0"\xc4\xd5[\xc9z', 'name': None, 'email': None }, 'date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'committer_date': { 'timestamp': {'seconds': 1495977610, 'microseconds': 334267}, 'offset': 0, 'negative_utc': False }, 'type': 'svn', 'directory': b'\x815\xf0\xd9\xef\x94\x0b\xbf\x86<\xa4j^\xb65\xe9\xf4\xd1\xc3\xfe', 'synthetic': True, 'metadata': None, 'parents': [ b'D\xb1\xc8\x0f&\xdc\xd4 \x92J\xaf\xab\x19V\xad\xe7~\x18\n\x0c', ], 'id': b'\x1e\x1c\x19\xb56x\xbc\xe5\xba\xa4\xed\x03\xae\x83\xdb@\xd0@0\xed\xc8', 'perms': 33188}, {'name': b'lib', 'type': 'dir', 'target': b'-\xb2(\x95\xe46X\x9f\xed\x1d\xa6\x95\xec`\x10\x1a\x89\xc3\x01U', 'perms': 16384}, {'name': b'package.json', 'type': 'file', 'target': b'Z\x91N\x9bw\xec\xb0\xfbN\xe9\x18\xa2E-%\x8fxW\xa1x', 'perms': 33188} ], 'id': b'eS\x86\xcf\x16n\xeb\xa96I\x90\x10\xd0\xe9&s\x9a\x82\xd4P' } Other Objects Topics -------------------- These topics are for objects of the |swh| archive that are not part of the Merkle DAG but are essential parts of the archive; see the :ref:`data model ` for more details. `swh.journal.objects.origin` ++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.Origin` objects. Message format: - `url` [string] URL of the :py:class:`swh.model.model.Origin` Example: .. code:: python { "url": "https://github.com/vujkovicm/pml" } `swh.journal.objects.origin_visit` ++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisit` objects. Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `date` [timestamp] date of the visit - `type` [string] type of the loader used to perform the visit - `visit` [int] number of the visit for this `origin` Example: .. code:: python { 'origin': 'https://pypi.org/project/wasp-eureka/', 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'type': 'pypi', 'visit': 505} } `swh.journal.objects.origin_visit_status` +++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.OriginVisitStatus` objects. Message format: - `origin` [string] URL of the visited :py:class:`swh.model.model.Origin` - `visit` [int] number of the visit for this `origin` this status concerns - `date` [timestamp] date of the visit status update - `status` [string] status (can be "created", "ongoing", "full" or "partial"), - `snapshot` [bytes] identifier of the :py:class:`swh.model.model.Snaphot` this visit resulted in (if `status` is "full" or "partial") - `metadata`: deprecated Example: .. code:: python { 'origin': 'https://pypi.org/project/stricttype/', 'visit': 524, 'date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'status': 'full', 'snapshot': b"\x85\x8f\xcb\xec\xbd\xd3P;Z\xb0~\xe7\xa2(\x0b\x11'\x05i\xf7", 'metadata': None } Extrinsic Metadata related Topics --------------------------------- Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. See :ref:`extrinsic-metadata-specification` for more details on the Extrinsic Metadata model. `swh.journal.objects.metadata_authority` ++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataAuthority` objects. 
Message format: - `type` [string] - `url` [string] - `metadata` [dict] Examples: .. code:: python { 'type': 'forge', 'url': 'https://guix.gnu.org/sources.json', 'metadata': {} } { 'type': 'deposit_client', 'url': 'https://www.softwareheritage.org', 'metadata': {'name': 'swh'} } `swh.journal.objects.metadata_fetcher` ++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.MetadataFetcher` objects. Message format: - `type` [string] - `version` [string] - `metadata` [dict] Example: .. code:: python { 'name': 'swh.loader.package.cran.loader.CRANLoader', 'version': '0.15.0', 'metadata': {} } `swh.journal.objects.raw_extrinsic_metadata` ++++++++++++++++++++++++++++++++++++++++++++ Topic for :py:class:`swh.model.model.RawExtrinsicMetadata` objects. Message format: - `type` [string] - `target` [string] - `discovery_date` [timestamp] - `authority` [dict] - `fetcher` [dict] - `format` [string] - `metadata` [bytes] - `origin` [string] - `visit` [int] - `snapshot` [SWHID] - `release` [SWHID] - `revision` [SWHID] - `path` [bytes] - `directory` [SWHID] Example: .. code:: python { 'type': 'snapshot', 'id': 'swh:1:snp:f3b180979283d4931d3199e6171840a3241829a3', 'discovery_date': Timestamp(seconds=1606260407, nanoseconds=818259954), 'authority': { 'type': 'forge', 'url': 'https://pypi.org/', 'metadata': {} }, 'fetcher': { 'name': 'swh.loader.package.pypi.loader.PyPILoader', 'version': '0.10.0', 'metadata': {} }, 'format': 'pypi-project-json', 'metadata': b'{"info":{"author":"Signaltonsalat","author_email":"signaltonsalat@gmail.com"}]}', 'origin': 'https://pypi.org/project/schwurbler/' } Kafka message format -------------------- -Each value of a kafka message in a topic is a dictionary-like structure +Each value of a Kafka message in a topic is a dictionary-like structure encoded as a msgpack_ byte string. Keys are ASCII strings. All values are encoded using default msgpack type system except for long integers for which we use a custom format using msgpack `extended type`_ to prevent overflow while packing some objects. Integer +++++++ For long integers (that do not fit in the `[-(2**63), 2 ** 64 - 1]` range), a custom `extended type`_ based encoding scheme is used. The `type` information can be: - `1` for positive (possibly long) integers, - `2` for negative (possibly long) integers. The payload is simply the bytes (big endian) representation of the absolute value (always positive). For example (adapted to standard integers for the sake of readability; these values are small so they will actually be encoded using the default msgpack format for integers): - `12345` would be encoded as the extension value `[1, [0x30, 0x39]]` (aka `0xd5013039`) - `-42` would be encoded as the extension value `[2, [0x2A]]` (aka `0xd4022a`) Datetime ++++++++ -There are 2 type of date that can be encoded in a kafka message: +There are 2 type of date that can be encoded in a Kafka message: - dates for git-like objects (:py:class:`swh.model.model.Revision` and :py:class:`swh.model.model.Release`): these dates are part of the hash computation used as identifier in the Merkle DAG. In order to fully support git repositories, a custom encoding is required. 
These dates (coming from the git data model) are encoded as a dictionary with:

  - `timestamp` [dict] POSIX timestamp of the date, as a dictionary with 2 keys (`seconds` and `microseconds`)
  - `offset` [int] offset of the date (in minutes)
  - `negative_utc` [bool] only True for the very edge case where the date has a zero but negative offset value (which does not make much sense, but the git format technically permits it)

  Example:

  .. code:: python

     {
       'timestamp': {'seconds': 1480432642, 'microseconds': 0},
       'offset': 180,
       'negative_utc': False
     }

  These are denoted as `gitdate` below.

- other dates (resulting from the |swh| processing stack) are encoded using msgpack's Timestamp_ extended type. These are denoted as `timestamp` below.

Note that these dates used to be encoded as a dictionary (beware: keys are bytes):

.. code:: python

   {
     b"swhtype": "datetime",
     b"d": '2020-09-15T16:19:13.037809+00:00'
   }

Person
++++++

:py:class:`swh.model.model.Person` objects represent a person in the |swh| Merkle DAG, namely a :py:class:`swh.model.model.Revision` author or committer, or a :py:class:`swh.model.model.Release` author.

:py:class:`swh.model.model.Person` objects are serialized as a dictionary like:

.. code:: python

   {
     'fullname': 'John Doe ',
     'name': 'John Doe',
     'email': 'john.doe@example.com'
   }

For anonymized topics, :py:class:`swh.model.model.Person` entities have been anonymized prior to being serialized. The anonymized :py:class:`swh.model.model.Person` object is a dictionary like:

.. code:: python

   {
     'fullname': ,
     'name': null,
     'email': null
   }

where the `` is computed from original values as a sha256 of the
-orignal's `fullname`.
+original's `fullname`.

-.. _kafka: https://kafka.apache.org
+.. _Kafka: https://kafka.apache.org
.. _topic: https://kafka.apache.org/documentation/#intro_concepts_and_terms
.. _msgpack: https://msgpack.org/
.. _`extended type`: https://github.com/msgpack/msgpack/blob/master/spec.md#extension-types
.. _`Timestamp`: https://github.com/msgpack/msgpack/blob/master/spec.md#timestamp-extension-type

diff --git a/docs/mirror.rst b/docs/mirror.rst
index b2fd491..eea9942 100644
--- a/docs/mirror.rst
+++ b/docs/mirror.rst
@@ -1,132 +1,132 @@

.. _mirror:

Mirroring
=========

Description
-----------

A mirror is a full copy of the |swh| archive, operated independently from the Software Heritage initiative.

A minimal mirror consists of two parts:

- the graph storage (typically an instance of :ref:`swh.storage `), which contains the Merkle DAG structure of the archive, *except* the actual content of source code files (AKA blobs),
- the object storage (typically an instance of :ref:`swh.objstorage `), which contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible by others. As such, a proper mirror should also allow you to:

- navigate the archive copy using a Web browser and/or the Web API (typically using :ref:`the web application `),
- retrieve data from the copy of the archive (typically using :ref:`the vault service `)

A mirror is initially populated and kept up to date by consuming data from the |swh| Kafka-based :ref:`journal ` and retrieving the blob objects (file content) from the |swh| :ref:`object storage `.

.. note:: It is not required that a mirror is deployed using the |swh| software stack. Other technologies, including different storage methods, can be used. But in this documentation we will focus on mirror deployments that use the |swh| software stack.

.. thumbnail:: images/mirror-architecture.svg

   General view of the |swh| mirroring architecture.

Mirroring the Graph Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The replication of the graph is based on a journal using Kafka_ as the event streaming platform.

On the Software Heritage side, every addition made to the archive consists of the addition of a :ref:`data-model` object. The new object is also serialized as a msgpack_ bytestring which is used as the value of a message added to a Kafka topic dedicated to the object type.

The main Kafka topics for the |swh| :ref:`data-model` are:

- `swh.journal.objects.content`
- `swh.journal.objects.directory`
- `swh.journal.objects.metadata_authority`
- `swh.journal.objects.metadata_fetcher`
- `swh.journal.objects.origin_visit_status`
- `swh.journal.objects.origin_visit`
- `swh.journal.objects.origin`
- `swh.journal.objects.raw_extrinsic_metadata`
- `swh.journal.objects.release`
- `swh.journal.objects.revision`
- `swh.journal.objects.skipped_content`
- `swh.journal.objects.snapshot`

In order to set up a mirror of the graph, one needs to deploy a stack capable of retrieving all these topics and storing their content reliably. For example, a
-kafka cluster configured as a replica of the main kafka broker hosted by |swh|
+Kafka cluster configured as a replica of the main Kafka broker hosted by |swh|
would do the job (albeit not in a very useful manner by itself).

A more useful mirror can be set up using the :ref:`storage ` component with the help of the special service named `replayer` provided by the :doc:`apidoc/swh.storage.replay` module.

.. TODO: replace this previous link by a link to the 'swh storage replay' command once available, and ideally once https://github.com/sphinx-doc/sphinx/issues/880 is fixed

Mirroring the Object Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

File contents (blobs) are *not* directly stored in messages of the `swh.journal.objects.content` Kafka topic, which only contains metadata about them, such as various kinds of cryptographic hashes.

A separate component is in charge of replicating blob objects from the archive and storing them in the local object storage instance. A separate `swh-journal` client should subscribe to the `swh.journal.objects.content` topic to get the stream of blob object identifiers, then retrieve corresponding blobs from the main Software Heritage object storage, and store them in the local object storage.

A reference implementation for this component is available in :ref:`content replayer `.

Installation
------------

When using the |swh| software stack to deploy a mirror, a number of |swh| software components must be installed (cf. architecture diagram above):

- a database to store the graph of the |swh| archive,
- the :ref:`swh-storage` component,
- an object storage solution (can be cloud-based or on a local filesystem like ZFS pools),
- the :ref:`swh-objstorage` component,
- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage` package),
- the :ref:`swh.objstorage.replayer.replay` service (from the :ref:`swh-objstorage-replayer` package).

A `docker-swarm `_ based deployment solution is provided as a working example of the mirror stack: https://forge.softwareheritage.org/source/swh-docker

It is strongly recommended to start from there before planning a production-like deployment. See the `README `_ file of the `swh-docker `_ repository for details.

-.. _kafka: https://kafka.apache.org/
+.. _Kafka: https://kafka.apache.org/
.. _msgpack: https://msgpack.org

diff --git a/docs/tutorials/issue-debugging-monitoring.md b/docs/tutorials/issue-debugging-monitoring.md
index aa4fad9..218197a 100644
--- a/docs/tutorials/issue-debugging-monitoring.md
+++ b/docs/tutorials/issue-debugging-monitoring.md
@@ -1,143 +1,143 @@

# Issue debugging and monitoring guide

In order to debug issues happening in production, you need to get as much information as possible on the issue. It helps reproduce or directly fix the issue. In addition, you want to monitor it to see how it evolves or if it is fixed for good. The tools used at SWH to get insights on issues happening in production are Sentry and Kibana.

## Sentry overview

SWH instance URL:

The service requires a login password pair to access, but does not require SWH VPN access. To sign up, click "Request to join" and provide your SWH developer email address for the admins to create the account.

Official documentation:

Sentry is specifically geared towards debugging production issues. In the "Issues" pane, it presents issues grouped by similarity with statistics about their occurrence.

Issues can be filtered by:

- project (i.e. SWH service repository), e.g. "swh-loader-core" or "swh-vault";
- environment, e.g. "production" or "staging";
- time range.

Viewing a particular issue, you can access:

- the execution trace at the point of error, with pretty-printed local variables at each stack frame, as you would get in a post-mortem debugging session;
- contextual metadata about the running environment, which includes:
  - the first and last occurrence as detected by Sentry,
  - corresponding component versions,
  - installed packages,
  - entrypoint parameters,
  - runtime environment such as the interpreter version, the hostname, or the logging configuration.
- the breadcrumbs view, which shows several event log lines produced in the same run prior to the error. These are not the logs produced by the application, but events gathered through Sentry integrations.

## Debugging SWH services with Sentry

Here we show a specific type of issue that is characteristic of microservice architectures as implemented at SWH. One difficulty may arise in finding where an issue originates, because the execution is split between multiple services. It results in a chain of linked issues, potentially one for each service involved.

Errors of type `RemoteException` encapsulate an error occurring in the service called through an RPC mechanism. If the information encapsulated in this top-level error is not sufficient, one would search for complementary traces by filtering the "Issues" view by the linked service's project name.

Example:

Sentry issue:

The error appears as ``

A request from a vault cooker to the storage service had a network error. Thanks to Sentry, we also see which specific storage was requested: ``

Upon searching in the storage service issues, we find a corresponding `HttpResponseError`:

We skip through the error reporting logic in the trace to get to the operation that was performed. We see that this error comes in turn from an RPC call to the objstorage service:

HttpResponseError: "Download stream interrupted." at `swh/storage/objstorage.py` in `content_get` at line 41

This is a transient network error: it should not persist when retrying. So a solution might be to add a retrying mechanism somewhere in this chain of RPC calls.

## Issue monitoring with Sentry

Aggregated error traces as shown in the "Issues" pane are the primary source of information for monitoring. This includes the statistics of occurrence for a given period of time.

Sentry also comes with issue management features that notably let you silence or resolve errors. Silencing means the issue will still be recorded but not notified. Resolving means the issue will be hidden from the default view, and any new occurrence of it will specifically notify the issue owner that the issue still arises and is in fact not resolved. Make sure an owner is associated with the issue, typically through ownership rules set in the project settings.

For more info on monitoring issues, refer to:

## Kibana overview

SWH instance URL:

Access to the SWH VPN is needed, but credentials are not.

Related wiki page:

Official documentation:

-Kibana is a vizualization UI for searching through indexed logs. You can search through
+Kibana is a visualization UI for searching through indexed logs. You can search through
different sources of logs in the "Discover" pane. The sources configured include application logs for SWH services and system logs. You can also access dashboards shared by others on a particular topic or create your own from a saved search.

There are 2 query languages which are quite similar: Lucene or KQL. Whichever one you choose, you will have the same querying capabilities. A query tries to match values for specific keys, and supports many predicates and combinations of them. See the documentation for KQL: https://www.elastic.co/guide/en/kibana/current/kuery-query.html

To get logs for a particular service, you have to know the name of its systemd unit and the hostname of the production server providing this service. For a worker, switch the index pattern to "swh_workers-*", for another SWH service switch it to "systemlogs-*".

Example for getting swh-vault production logs: With the index pattern set to "systemlogs-*", enter the KQL query: `systemd_unit:"gunicorn-swh-vault.service" AND hostname:"vangogh"`

Upon expanding a log entry with the leading arrow icon, you can inspect the entry in a structured way. You can filter on particular values or fields, using the icons to the left of the desired field. Fields including "message", "hostname" or "systemd_unit" are often the most informational. You can also view the entry in context, several entries before and after chronologically.

## Issue monitoring with Kibana

You can use Kibana saved searches and dashboards to follow issues based on associated logs. Of course, we need to have logs produced that are related to the issue we want to track.

You can save a search, as opposed to only a query, to easily get back to it or include it in a dashboard. Just click "Save" in the top toolbar above the search bar. It includes the query, filters, selected columns, sorting and index pattern.

Now you may want to have a customizable view of these logs, along with graphical presentations. In the "Dashboard" pane, create a new dashboard. Click "add" in the top
-toolbar and select your saved search. It will appear in resizeable panel. Now doing a
-search will restrict the search to the dataset cinfigured for the panels.
+toolbar and select your saved search. It will appear in a resizable panel. Now doing a
+search will restrict the search to the dataset configured for the panels.
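For instance, a saved search tracking a recurring worker error could use the "swh_workers-*" index pattern with a KQL query such as `systemd_unit:"swh-worker@loader-git.service" AND message:"error"` (the unit name here is only illustrative), and then be added as a panel alongside related searches.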
-To create more complete vizualizations including graphs, refer to:
+To create more complete visualizations including graphs, refer to:

diff --git a/docs/tutorials/testing.rst b/docs/tutorials/testing.rst
index 4673482..fd368ce 100644
--- a/docs/tutorials/testing.rst
+++ b/docs/tutorials/testing.rst
@@ -1,123 +1,123 @@

.. _testing-guide:

Software testing guide
======================

Tools landscape
---------------

The testing framework we use is pytest_. It provides many facilities to write tests efficiently. It is complemented by hypothesis_, a library for property-based testing, in some of our test suites. Its usage is a more advanced topic.

We also use tox_, the automation framework, to run the tests along with other quality checks in isolated environments.

The main quality checking tools in use are:

* mypy_, a static type checker. We gradually type-annotate all additions or refactorings to the codebase;
* flake8_, a simple code style checker (aka linter);
* black_, an uncompromising code formatter.

They are run automatically through ``tox`` or as ``pre-commit`` hooks in our Git repositories.

The SWH testing framework
-------------------------

This section shows specifics about our usage of pytest and custom helpers.

The pytest fixture system makes it easy to write, share and plug setup and teardown code. Fixtures are automatically loaded from the project ``conftest`` or ``pytest_plugin`` modules into any test function by giving its name as an argument.

| Several pytest plugins have been defined across SWH projects:
| ``core``, ``core.db``, ``storage``, ``scheduler``, ``loader``, ``journal``.
| Many others, provided by the community, are in use:
| ``flask``, ``django``, ``aiohttp``, ``postgresql``, ``mock``, ``requests-mock``, ``cov``, etc.

We make use of various mocking helpers:

* ``unittest.mock``: ``Mock`` classes, ``patch`` function;
* ``mocker`` fixture from the ``mock`` plugin: adaptation of ``unittest.mock`` to the fixture system, with a bonus ``spy`` function to audit without modifying objects;
* ``monkeypatch`` builtin fixture: modify object attributes or the environment, with automatic teardown.

Other notable helpers include:

* ``datadir``: to compute the path to the current test's ``data`` directory. Available in the ``core`` plugin.
* ``requests_mock_datadir``: to load network responses from the datadir. Available in the ``core`` plugin.
* ``swh_rpc_client``: for testing SWH RPC clients and servers without incurring IO. Available in the ``core`` plugin.
* ``postgresql_fact``: for testing database-backend interactions. Available in the ``core.db`` plugin, adapted for performance from the ``postgresql`` plugin.
* ``click.testing.CliRunner``: to simplify testing of Click command-line interfaces. It allows testing commands with some level of isolation from the execution environment. https://click.palletsprojects.com/en/7.x/api/#click.testing.CliRunner

Testing guidelines
------------------

General considerations
^^^^^^^^^^^^^^^^^^^^^^

-We mostly do functional tests, and unit-testing when more ganularity is needed. By this,
+We mostly do functional tests, and unit-testing when more granularity is needed. By this,
we mean that we test each functionality and invariants of a component, without systematically isolating it from its dependencies. The goal is to strike a balance between test effectiveness and test maintenance. However, the most critical parts, like the storage service, get more extensive unit-testing.
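As an illustration of this style, here is a minimal, self-contained sketch: the function under test and the URL are made up for the example, and the ``requests_mock`` fixture comes from the ``requests-mock`` plugin listed above.

.. code:: python

   import pytest
   import requests

   def get_origin_count(base_url):
       """Hypothetical function under test: query a (fake) HTTP API."""
       response = requests.get(f"{base_url}/origins/count")
       response.raise_for_status()
       return response.json()["count"]

   def test_get_origin_count(requests_mock):
       # requests_mock intercepts the HTTP call, so no real IO happens.
       requests_mock.get("https://api.example.org/origins/count", json={"count": 42})
       assert get_origin_count("https://api.example.org") == 42

   def test_get_origin_count_error(requests_mock):
       # Also exercise the error path of the same functionality.
       requests_mock.get("https://api.example.org/origins/count", status_code=500)
       with pytest.raises(requests.HTTPError):
           get_origin_count("https://api.example.org")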
Organize tests
^^^^^^^^^^^^^^

* In order to test a component (module, class), one must start by identifying its sets of functionalities and invariants (or properties).
* One test may check multiple properties or commonly combined functionalities, if it can fit in a short descriptive name.
* Organize tests in multiple modules, one for each aspect or subcomponent tested.
-  e.g.: initialization/configuration, db/backend, service API, utils, cli, etc.
+  e.g.: initialization/configuration, db/backend, service API, utils, CLI, etc.

Test data
^^^^^^^^^

Each repository has its own ``tests`` directory; some, such as listers, even have one for each lister type.

* Put any non-trivial test data, used for setup or mocking, in (potentially compressed) files in a ``data`` directory under the local testing directory.
* Use ``datadir`` fixtures to load them.

Faking dependencies
^^^^^^^^^^^^^^^^^^^

* Make use of temporary directories for testing code relying on filesystem paths.
* Mock only already tested and expensive operations, typically IO with external services.
* Use the ``monkeypatch`` fixture when updating the environment or when mocking is overkill.
* Mock HTTP requests with ``requests_mock`` or ``requests_mock_datadir``.

Final words
^^^^^^^^^^^

If testing is difficult, the tested design may need reconsideration.

Other SWH resources on software quality
---------------------------------------

| https://wiki.softwareheritage.org/wiki/Python_style_guide
| https://wiki.softwareheritage.org/wiki/Git_style_guide
| https://wiki.softwareheritage.org/wiki/Arcanist_setup
| https://wiki.softwareheritage.org/wiki/Code_review
| https://wiki.softwareheritage.org/wiki/Jenkins
| https://wiki.softwareheritage.org/wiki/Testing_the_archive_features

.. _pytest: https://pytest.org
.. _tox: https://tox.readthedocs.io
.. _hypothesis: https://hypothesis.readthedocs.io
.. _mypy: https://mypy.readthedocs.io
.. _flake8: https://flake8.pycqa.org
.. _black: https://black.readthedocs.io