diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst
index 5311ca9..215c7de 100644
--- a/docs/architecture/index.rst
+++ b/docs/architecture/index.rst
@@ -1,13 +1,14 @@
 .. _architecture:
 
 Software Architecture
 =====================
 
 
 .. toctree::
    :maxdepth: 2
    :titlesonly:
 
    overview
    mirror
+   metadata
    ../keycloak/index
diff --git a/docs/architecture/metadata.rst b/docs/architecture/metadata.rst
new file mode 100644
index 0000000..cd46e53
--- /dev/null
+++ b/docs/architecture/metadata.rst
@@ -0,0 +1,98 @@
+.. _architecture-metadata:
+
+Metadata workflow and architecture
+==================================
+
+|swh| calls "metadata" the information it collects or extracts that describes source code or provides additional information about it.
+
+This metadata is partitioned into three types:
+
+1. development metadata, which is part of the :ref:`data-model`, such as authorship
+   and date of revisions and releases,
+2. :term:`intrinsic metadata`, which is extracted from a source code repository itself,
+   usually mined from metadata files like :file:`package.json` or :file:`Gemfile`.
+   It is intrinsically part of the software origin, because both are distributed
+   together from the origin's VCS repository or release tarballs.
+3. :term:`extrinsic metadata`, which is collected or deposited from external sources.
+   It can have a straightforward relationship with the repository (e.g., number of stars
+   of GitHub origins or checksums of release tarballs),
+   or be more distant (provided by a third party like Wikidata).
+
+This document is only about the latter two.
+
+
+Raw metadata storage
+--------------------
+
+As an archive, |swh| chooses to store original metadata objects unmodified
+in its long-term storage databases (:ref:`swh-storage ` and
+:ref:`swh-objstorage `).
+
+For intrinsic metadata, this simply means it is treated like any other source code content;
+i.e., there is no difference between a metadata file like :file:`package.json`
+and a source code file like :file:`index.js` from the loaders' and the database's
+points of view.
+
+Extrinsic metadata, however, is stored in a :ref:`dedicated storage service
+` (in practice, this is currently in the same database
+as the :ref:`data-model`'s Merkle DAG, but in separate tables).
+
+As both kinds are stored verbatim, they come in various formats depending on their source,
+and are not directly usable.
+
+
+Indexed metadata storages
+-------------------------
+
+|swh| also stores metadata in indexed databases, which are directly usable
+for searching and querying.
+Currently, there are two:
+
+1. the "indexer storage", a PostgreSQL database that acts as a cache, and provides
+   limited search functionality
+2. :ref:`swh-search `, an advanced search service backed by
+   `ElasticSearch`_.
+
+Each of these databases has a consistent schema for ease of use.
+
+
+Differences between raw and indexed metadata
+--------------------------------------------
+
+The raw metadata is the authentic piece of metadata while the indexed metadata
+is a processed version, where the raw metadata is translated to a uniform vocabulary.
+
+Both intrinsic and extrinsic metadata can be indexed and translated.
+
+Therefore, most metadata is stored twice in |swh|: raw and indexed.
+The reason for this apparent duplication is robustness and future-proofing.
+
+Indeed, indexing metadata is a complex process.
+By keeping the raw metadata, we retain the ability to recompute the indexed metadata
+in the future with other vocabularies.
+Furthermore, if we did not store the raw metadata, this would mean bugs in indexers
+could easily lose data, forever.
+Thanks to this redundant architecture, bugs can be fixed and indexers re-run
+from the raw metadata to fix the indexed metadata.
+
+This also makes it easier to add metadata mining features or change schemas
+in the future: instead of re-loading
+from original sources (which may have disappeared since!), new indexers can simply
+read stored metadata into new indexed storages.
+
+
+Metadata mining
+---------------
+
+Some of the stored raw metadata is read and interpreted by worker processes known
+as :ref:`indexers `.
+Currently, they convert this metadata into a common format, `CodeMeta`_.
+
+Some indexers also read source code files to generate metadata about these files,
+such as their license, language, etc.
+
+Then, they either send their results directly to a caller, or write them to an
+indexed metadata storage (either directly or through :ref:`swh-journal `).
+
+.. _CodeMeta: https://codemeta.github.io/
+.. _ElasticSearch: https://www.elastic.co/elasticsearch/
diff --git a/docs/glossary.rst b/docs/glossary.rst
index 66383cd..97a2bad 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -1,213 +1,213 @@
 :orphan:
 
 .. _glossary:
 
 Glossary
 ========
 
 .. glossary::
 
    archive
 
      An instance of the |swh| data store.
 
    ark
 
      `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is a
      multi-purpose persistent identifier for information objects of any type.
 
    artifact
    software artifact
 
      An artifact is one of many kinds of tangible by-products produced during
      the development of software.
 
    content
    blob
 
      A (specific version of a) file stored in the archive, identified by its
      cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also known
      as: :term:`blob`. Note: it is incorrect to refer to Contents as "files",
      because files are usually considered to be named, whereas Contents are
      nameless. It is only in the context of specific :term:`directories ` that
      :term:`contents ` acquire (local) names.
 
    deposit
 
      A :term:`software artifact` that was pushed to the Software Heritage archive
      (unlike :term:`loaders `, which pull artifacts).
      A deposit is useful when you want to ensure a software release's source code
      is archived in SWH even if it is not published anywhere else.
 
      See also: the :ref:`swh-deposit` component, which implements a deposit client
      and server.
 
    directory
 
      A set of named pointers to contents (file entries), directories (directory
      entries) and revisions (revision entries). All entries are associated to the
      local name of the entry (i.e., a relative path without any path separator) and
      permission metadata (e.g., ``chmod`` value or equivalent).
 
    doi
 
      A Digital Object Identifier or DOI_ is a persistent identifier or handle used
      to uniquely identify objects, standardized by the International Organization
      for Standardization (ISO).
 
    extid
    external identifier
 
      An identifier used by a system that does not fit the |swh| :ref:`data model `,
      such as Mercurial's ``nodeid``, or the hash of a tarball from a package manager.
      They may be stored in the |swh| archive independently of the identified object,
      to quickly match an external object (a changeset or tarball) to an object in the
      archive without downloading it.
 
    extrinsic metadata
 
      Metadata about software that is not shipped as part of the software source
      code, but is available instead via out-of-band means. For example, homepage,
      maintainer contact information, and popularity information ("stars") as listed
      on GitHub/GitLab repository pages.
 
-     See also: :term:`intrinsic metadata`.
+     See also: :term:`intrinsic metadata`, :ref:`architecture-metadata`.
 
    journal
 
      The :ref:`journal ` is the persistent logger of the |swh| architecture in
      charge of logging changes of the archive, with publish-subscribe_ support.
 
    lister
 
      A :ref:`lister ` is a component of the |swh| architecture that is in charge of
      enumerating the :term:`software origin` (e.g., VCS, packages, etc.) available
      at a source code distribution place.
 
    loader
 
      A :ref:`loader ` is a component of the |swh| architecture responsible for
      reading a source code :term:`origin` (typically a git repository) and import
      or update its content in the :term:`archive` (ie. add new file contents int
      :term:`object storage` and repository structure in the :term:`storage
      database`).
 
    hash
    cryptographic hash
    checksum
    digest
 
      A fixed-size "summary" of a stream of bytes that is easy to compute, and hard
      to reverse. (Cryptographic hash function Wikipedia article) also known as:
      :term:`checksum`, :term:`digest`.
 
    indexer
 
      A component of the |swh| architecture dedicated to producing metadata linked
      to the known :term:`blobs ` in the :term:`archive`.
 
    intrinsic identifier
 
      A short character string that uniquely identifies an object, that can be
      generated deterministically, using only the content of the object, usually
      a :term:`cryptographic hash`. This excludes network interaction and central
      authority.
 
      Examples of intrinsic identifiers are: checksums (for files/strings only),
      git hashes, and :ref:`SWHIDs `
 
    intrinsic metadata
 
      Metadata about software that is shipped as part of the source code of the
      software itself or as part of related artifacts (e.g., revisions, releases,
      etc). For example, metadata that is shipped in `PKG-INFO` files for Python
      packages, `pom.xml` for Maven-based Java projects, `debian/control` for Debian
      packages, `metadata.json` for NPM, etc.
 
-     See also: :term:`extrinsic metadata`.
+     See also: :term:`extrinsic metadata`, :ref:`architecture-metadata`.
 
    objstore
    objstorage
    object store
    object storage
 
      Content-addressable object storage. It is the place where actual object
      :term:`blobs ` objects are stored.
 
    origin
    software origin
    data source
 
      A location from which a coherent set of sources has been obtained, like a git
      repository, a directory containing tarballs, etc.
 
    person
 
      An entity referenced by a revision as either the author or the committer of
      the corresponding change.
      A person is associated to a full name and/or an email address.
 
    release
    tag
    milestone
 
      a revision that has been marked as noteworthy with a specific name (e.g., a
      version number), together with associated development metadata (e.g., author,
      timestamp, etc).
 
    revision
    commit
    changeset
 
      A point in time snapshot of the content of a directory, together with
      associated development metadata (e.g., author, timestamp, log message, etc).
 
    scheduler
 
      The component of the |swh| architecture dedicated to the management and the
      prioritization of the many tasks.
 
    snapshot
 
      the state of all visible branches during a specific visit of an origin
 
    storage
    storage database
 
      The main database of the |swh| platform in which the all the elements of the
      :ref:`data-model` but the :term:`content` are stored as a :ref:`Merkle DAG `.
 
    type of origin
 
      Information about the kind of hosting, e.g., whether it is a forge, a
      collection of repositories, an homepage publishing tarball, or a one shot
      source code repository. For all kind of repositories please specify which VCS
      system is in use (Git, SVN, CVS, etc.) object.
 
    vault
    vault service
 
      User-facing service that allows to retrieve parts of the :term:`archive` as
      self-contained bundles (e.g., individual releases, entire repository
      snapshots, etc.)
 
    visit
 
      The passage of |swh| on a given :term:`origin`, to retrieve all source code
      and metadata available there at the time. A visit object stores the state of
      all visible branches (if any) available at the origin at visit time; each of
      them points to a revision object in the archive. Future visits of the same
      origin will create new visit objects, without removing previous ones.
 
 .. _blob: https://en.wikipedia.org/wiki/Binary_large_object
 .. _DOI: https://www.doi.org
 .. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
 .. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
 .. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
diff --git a/docs/index.rst b/docs/index.rst
index bfe1101..c223f2e 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,239 +1,241 @@
 .. _swh-docs:
 
 Software Heritage - Development Documentation
 =============================================
 
 Getting started
 ---------------
 
 * :ref:`getting-started` → deploy a local copy of the Software Heritage
   software stack in less than 5 minutes, or
 * :ref:`developer-setup` → get a working development setup that allows to hack
   on the Software Heritage software stack
 
 Contributing
 ------------
 
 * :ref:`patch-submission` → learn how to submit your patches to the
   Software Heritage codebase
 * :ref:`code-review` → rules and guidelines to review code in
   Software Heritage
 * :ref:`python-style-guide` → how to format the Python code you write
 
 Architecture
 ------------
 
 * :ref:`architecture-overview` → get a glimpse of the Software Heritage software
   architecture
 * :ref:`mirror` → learn what a Software Heritage mirror is and how to set up one
+* :ref:`Metadata workflow ` → learn how Software Heritage
+  stores and handles metadata
 * :ref:`Keycloak ` → learn how to use Keycloak, the authentication system
   used by |swh|'s web interface and public APIs
 
 Data Model and Specifications
 -----------------------------
 
 * :ref:`persistent-identifiers`
   Specifications of the SoftWare Heritage persistent IDentifiers (SWHID).
 * :ref:`data-model`
   Documentation of the main |swh| archive data model.
 * :ref:`journal-specs`
   Documentation of the Kafka journal of the |swh| archive.
 
 Tutorials
 ---------
 
 * :ref:`testing-guide`
 * :doc:`/tutorials/issue-debugging-monitoring`
 * :ref:`Listing the content of your favorite forge `
   and :ref:`running a lister in Docker `
 * :ref:`Add a new swh package `
 
 Frequently Asked Questions
 --------------------------
 
 .. toctree::
    :maxdepth: 2
 
    faq/index
 
 Roadmap
 -------
 
 * :ref:`roadmap-2021`
 
 Engineering
 -----------
 
 * :ref:`infrastructure`
 
 Components
 ----------
 
 Here is brief overview of the most relevant software components in the Software
 Heritage stack, in alphabetical order.
 For a better introduction to the architecture, see the :ref:`architecture-overview`,
 which presents each of them in a didactical order.
 
 Each component name is linked to the development documentation
 of the corresponding Python module.
 
 :ref:`swh.auth `
     low-level library used by modules needing keycloak authentication
 
 :ref:`swh.core `
     low-level utilities and helpers used by almost all other modules in the
     stack
 
 :ref:`swh.counters `
     service providing efficient estimates of the number of objects in the SWH archive,
     using Redis's Hyperloglog
 
 :ref:`swh.dataset `
     public datasets and periodic data dumps of the archive released by Software
     Heritage
 
 :ref:`swh.deposit `
     push-based deposit of software artifacts to the archive
 
 swh.docs
     developer documentation (used to generate this doc you are reading)
 
 :ref:`swh.fuse `
     Virtual file system to browse the Software Heritage archive, based on
     `FUSE `_
 
 :ref:`swh.graph `
     Fast, compressed, in-memory representation of the archive, with tooling to
     generate and query it.
 
 :ref:`swh.indexer `
     tools and workers used to crawl the content of the archive and extract
     derived information from any artifact stored in it
 
 :ref:`swh.journal `
     persistent logger of changes to the archive, with publish-subscribe support
 
 :ref:`swh.lister `
     collection of listers for all sorts of source code hosting and distribution
     places (forges, distributions, package managers, etc.)
 
 :ref:`swh.loader-core `
     low-level loading utilities and helpers used by all other loaders
 
 :ref:`swh.loader-git `
     loader for `Git `_ repositories
 
 :ref:`swh.loader-mercurial `
     loader for `Mercurial `_ repositories
 
 :ref:`swh.loader-svn `
     loader for `Subversion `_ repositories
 
 :ref:`swh.loader-cvs `
     loader for `CVS `_ repositories
 
 :ref:`swh.model `
     implementation of the :ref:`data-model` to archive source code artifacts
 
 :ref:`swh.objstorage `
     content-addressable object storage
 
 :ref:`swh.objstorage.replayer `
     Object storage replication tool
 
 :ref:`swh.scanner `
     source code scanner to analyze code bases and compare them with source code
     artifacts archived by Software Heritage
 
 :ref:`swh.scheduler `
     task manager for asynchronous/delayed tasks, used for recurrent (e.g.,
     listing a forge, loading new stuff from a Git repository) and one-off
     activities (e.g., loading a specific version of a source package)
 
 :ref:`swh.search `
     search engine for the archive
 
 :ref:`swh.storage `
     abstraction layer over the archive, allowing to access all stored source
     code artifacts as well as their metadata
 
 :ref:`swh.vault `
     implementation of the vault service, allowing to retrieve parts of the
     archive as self-contained bundles (e.g., individual releases, entire
     repository snapshots, etc.)
 
 :ref:`swh.web `
     Web application(s) to browse the archive, for both interactive (HTML UI)
     and mechanized (REST API) use
 
 :ref:`swh.web.client `
     Python client for :ref:`swh.web `
 
 Dependencies
 ------------
 
 The dependency relationships among the various modules are depicted below.
 
 .. _py-deps-swh:
 .. figure:: images/py-deps-swh.svg
    :width: 1024px
    :align: center
 
    Dependencies among top-level Python modules (click to zoom).
 
 Archive
 -------
 
 * :ref:`Archive ChangeLog `: notable changes to the archive over time
 
 Indices and tables
 ==================
 
 * :ref:`genindex`
 * :ref:`modindex`
 * `URLs index `_
 * :ref:`search`
 * :ref:`glossary`
 
 .. ensure sphinx does not complain about index files not being included
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
    :titlesonly:
    :hidden:
 
    getting-started/index
    architecture/index
    contributing/index
    tutorials/index
    faq/index
    roadmap/roadmap-2021.rst
    infrastructure/index
    swh.auth 
    swh.core 
    swh.counters 
    swh.dataset 
    swh.deposit 
    swh.fuse 
    swh.graph 
    swh.indexer 
    swh.journal 
    swh.lister 
    swh.loader 
    swh.model 
    swh.objstorage 
    swh.objstorage.replayer 
    swh.scanner 
    swh.scheduler 
    swh.search 
    swh.storage 
    swh.vault 
    swh.web 
    swh.web.client 
    archive-changelog
    journal
    Python modules autodocumentation 