diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst
index 5311ca9..215c7de 100644
--- a/docs/architecture/index.rst
+++ b/docs/architecture/index.rst
@@ -1,13 +1,14 @@
 .. _architecture:
 
 Software Architecture
 =====================
 
 
 .. toctree::
    :maxdepth: 2
    :titlesonly:
 
    overview
    mirror
+   metadata
    ../keycloak/index
diff --git a/docs/architecture/metadata.rst b/docs/architecture/metadata.rst
new file mode 100644
index 0000000..cd46e53
--- /dev/null
+++ b/docs/architecture/metadata.rst
@@ -0,0 +1,98 @@
+.. _architecture-metadata:
+
+Metadata workflow and architecture
+==================================
+
+|swh| uses "metadata" to mean the information it collects or extracts that describes source code or provides additional information about it.
+
+This metadata is partitioned into three types:
+
+1. development metadata, which is part of the :ref:`data-model`, such as authorship
+   and date of revisions and releases,
+2. :term:`intrinsic metadata`, which is extracted from a source code repository itself,
+   usually mined from metadata files like :file:`package.json` or :file:`Gemfile`.
+   It is intrinsically part of the software origin, because both are distributed
+   together from the origin's VCS repository or release tarballs.
+3. :term:`extrinsic metadata`, which is collected or deposited from external sources.
+   It can have a straightforward relationship with the repository (e.g., number of stars
+   of GitHub origins or checksums of release tarballs),
+   or be more distant (provided by a third party like Wikidata).
+
+This document is only about the latter two.
+
+
+Raw metadata storage
+--------------------
+
+As an archive, |swh| chooses to store original metadata objects unmodified
+in its long-term storage databases (:ref:`swh-storage <swh-storage>` and
+:ref:`swh-objstorage <swh-objstorage>`).
+
+For intrinsic metadata, this simply means it is treated like any other source code content;
+i.e., there is no difference between a metadata file like :file:`package.json`
+and a source code file like :file:`index.js` from the loaders' and the database's
+points of view.
+
+Extrinsic metadata, however, is stored in a :ref:`dedicated storage service
+<extrinsic-metadata-specification>` (in practice, this is currently in the same database
+as the :ref:`data-model`'s Merkle DAG, but in separate tables).
+
+As they are both stored verbatim, they are in various formats depending on their source,
+and are not directly usable.
+
+
+Indexed metadata storages
+-------------------------
+
+|swh| also stores metadata in indexed databases, which are directly usable
+for searching and querying.
+Currently, there are two:
+
+1. the "indexer storage", a PostgreSQL database that acts as a cache and provides
+   limited search functionality,
+2. :ref:`swh-search <swh-search>`, an advanced search service backed by
+   `Elasticsearch`_.
+
+Each of these databases has a consistent schema for ease of use.
+
+
+Differences between raw and indexed metadata
+--------------------------------------------
+
+The raw metadata is the authentic piece of metadata while the indexed metadata
+is a processed version, where the raw metadata is translated to a uniform vocabulary.
+
+Both intrinsic and extrinsic metadata can be indexed and translated.
+
+Therefore, most metadata is stored twice in |swh|: raw and indexed.
+The reason for this apparent duplication is robustness and future-proofing.
+
+Indeed, indexing metadata is a complex process.
+By keeping the raw metadata, we retain the ability to recompute the indexed metadata
+in the future with other vocabularies.
+Furthermore, if we did not store the raw metadata, this would mean bugs in indexers
+could easily lose data, forever.
+Thanks to this redundant architecture, bugs can be fixed and indexers re-run
+from the raw metadata to fix the indexed metadata.
+
+This also makes it easier to add metadata mining features or change the schema
+in the future: instead of re-loading
+from original sources (which may have disappeared since!), new indexers can simply
+read stored metadata into new indexed storages.
+
+
+Metadata mining
+---------------
+
+Some of the stored raw metadata is read and interpreted by worker processes known
+as :ref:`indexers <swh-indexer>`.
+Currently, they convert this metadata into a common format, `CodeMeta`_.
+
+Some indexers also read source code files to generate metadata about these files,
+such as their license, language, etc.
+
+Then, they either send their results directly to a caller, or write them to an
+indexed metadata storage (either directly or through :ref:`swh-journal <swh-journal>`).
+
+.. _CodeMeta: https://codemeta.github.io/
+.. _ElasticSearch: https://www.elastic.co/elasticsearch/
diff --git a/docs/glossary.rst b/docs/glossary.rst
index 66383cd..97a2bad 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -1,213 +1,213 @@
 :orphan:
 
 .. _glossary:
 
 Glossary
 ========
 
 .. glossary::
 
    archive
 
      An instance of the |swh| data store.
 
    ark
 
      `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is
      a multi-purpose persistent identifier for information objects of any type.
 
    artifact
    software artifact
 
      An artifact is one of many kinds of tangible by-products produced during
      the development of software.
 
    content
    blob
 
      A (specific version of a) file stored in the archive, identified by its
      cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also
      known as: :term:`blob`. Note: it is incorrect to refer to Contents as
      "files", because files are usually considered to be named, whereas
      Contents are nameless. It is only in the context of specific
      :term:`directories <directory>` that :term:`contents <content>` acquire
      (local) names.
 
    deposit
 
      A :term:`software artifact` that was pushed to the Software Heritage
      archive (unlike :term:`loaders <loader>`, which pull artifacts).
      A deposit is useful when you want to ensure a software release's source
      code is archived in SWH even if it is not published anywhere else.
 
      See also: the :ref:`swh-deposit` component, which implements a deposit
      client and server.
 
    directory
 
      A set of named pointers to contents (file entries), directories (directory
      entries) and revisions (revision entries). All entries are associated to
      the local name of the entry (i.e., a relative path without any path
      separator) and permission metadata (e.g., ``chmod`` value or equivalent).
 
    doi
 
      A Digital Object Identifier or DOI_ is a persistent identifier or handle
      used to uniquely identify objects, standardized by the International
      Organization for Standardization (ISO).
 
    extid
    external identifier
 
      An identifier used by a system that does not fit the |swh|
      :ref:`data model <data-model>`, such as Mercurial's ``nodeid``,
      or the hash of a tarball from a package manager.
      They may be stored in the |swh| archive independently of the identified object,
      to quickly match an external object (a changeset or tarball) to an object
      in the archive without downloading it.
 
    extrinsic metadata
 
      Metadata about software that is not shipped as part of the software source
      code, but is available instead via out-of-band means. For example,
      homepage, maintainer contact information, and popularity information
      ("stars") as listed on GitHub/GitLab repository pages.
 
-     See also: :term:`intrinsic metadata`.
+     See also: :term:`intrinsic metadata`, :ref:`architecture-metadata`.
 
    journal
 
      The :ref:`journal <swh-journal>` is the persistent logger of the |swh| architecture in charge
      of logging changes of the archive, with publish-subscribe_ support.
 
    lister
 
      A :ref:`lister <swh-lister>` is a component of the |swh| architecture that is in charge of
      enumerating the :term:`software origin` (e.g., VCS, packages, etc.)
      available at a source code distribution place.
 
    loader
 
      A :ref:`loader <swh-loader-core>` is a component of the |swh| architecture
      responsible for reading a source code :term:`origin` (typically a git
      repository) and import or update its content in the :term:`archive` (ie.
      add new file contents int :term:`object storage` and repository structure
      in the :term:`storage database`).
 
    hash
    cryptographic hash
    checksum
    digest
 
      A fixed-size "summary" of a stream of bytes that is easy to compute, and
      hard to reverse. (Cryptographic hash function Wikipedia article) also
      known as: :term:`checksum`, :term:`digest`.
 
    indexer
 
      A component of the |swh| architecture dedicated to producing metadata
      linked to the known :term:`blobs <blob>` in the :term:`archive`.
 
    intrinsic identifier
 
      A short character string that uniquely identifies an object,
      that can be generated deterministically, using only the content of the object,
      usually a :term:`cryptographic hash`.
      This excludes network interaction and central authority.
 
      Examples of intrinsic identifiers are: checksums (for files/strings only),
      git hashes, and :ref:`SWHIDs <persistent-identifiers>`
 
    intrinsic metadata
 
      Metadata about software that is shipped as part of the source code of the
      software itself or as part of related artifacts (e.g., revisions,
      releases, etc). For example, metadata that is shipped in `PKG-INFO` files
      for Python packages, `pom.xml` for Maven-based Java projects,
      `debian/control` for Debian packages, `metadata.json` for NPM, etc.
 
-     See also: :term:`extrinsic metadata`.
+     See also: :term:`extrinsic metadata`, :ref:`architecture-metadata`.
 
    objstore
    objstorage
    object store
    object storage
 
      Content-addressable object storage. It is the place where actual object
      :term:`blobs <blob>` objects are stored.
 
    origin
    software origin
    data source
 
      A location from which a coherent set of sources has been obtained, like a
      git repository, a directory containing tarballs, etc.
 
    person
 
      An entity referenced by a revision as either the author or the committer
      of the corresponding change. A person is associated to a full name and/or
      an email address.
 
    release
    tag
    milestone
 
      a revision that has been marked as noteworthy with a specific name (e.g.,
      a version number), together with associated development metadata (e.g.,
      author, timestamp, etc).
 
    revision
    commit
    changeset
 
      A point in time snapshot of the content of a directory, together with
      associated development metadata (e.g., author, timestamp, log message,
      etc).
 
    scheduler
 
      The component of the |swh| architecture dedicated to the management and
      the prioritization of the many tasks.
 
    snapshot
 
      the state of all visible branches during a specific visit of an origin
 
    storage
    storage database
 
      The main database of the |swh| platform in which the all the elements of
      the :ref:`data-model` but the :term:`content` are stored as a :ref:`Merkle
      DAG <swh-merkle-dag>`.
 
    type of origin
 
      Information about the kind of hosting, e.g., whether it is a forge, a
      collection of repositories, an homepage publishing tarball, or a one shot
      source code repository. For all kind of repositories please specify which
      VCS system is in use (Git, SVN, CVS, etc.) object.
 
    vault
    vault service
 
      User-facing service that allows to retrieve parts of the :term:`archive`
      as self-contained bundles (e.g., individual releases, entire repository
      snapshots, etc.)
 
    visit
 
      The passage of |swh| on a given :term:`origin`, to retrieve all source
      code and metadata available there at the time. A visit object stores the
      state of all visible branches (if any) available at the origin at visit
      time; each of them points to a revision object in the archive. Future
      visits of the same origin will create new visit objects, without removing
      previous ones.
 
 
 
 .. _blob: https://en.wikipedia.org/wiki/Binary_large_object
 .. _DOI: https://www.doi.org
 .. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
 .. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
 .. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
diff --git a/docs/index.rst b/docs/index.rst
index bfe1101..c223f2e 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,239 +1,241 @@
 .. _swh-docs:
 
 Software Heritage - Development Documentation
 =============================================
 
 Getting started
 ---------------
 
 * :ref:`getting-started` → deploy a local copy of the Software Heritage
   software stack in less than 5 minutes, or
 * :ref:`developer-setup` → get a working development setup that allows to hack
   on the Software Heritage software stack
 
 Contributing
 ------------
 
 * :ref:`patch-submission` → learn how to submit your patches to the
   Software Heritage codebase
 * :ref:`code-review` → rules and guidelines to review code in
   Software Heritage
 * :ref:`python-style-guide` → how to format the Python code you write
 
 Architecture
 ------------
 
 * :ref:`architecture-overview` → get a glimpse of the Software Heritage software
   architecture
 * :ref:`mirror` → learn what a Software Heritage mirror is and how to set up
   one
+* :ref:`Metadata workflow <architecture-metadata>` → learn how Software Heritage
+  stores and handles metadata
 * :ref:`Keycloak <keycloak>` → learn how to use Keycloak,
   the authentication system used by |swh|'s web interface and public APIs
 
 Data Model and Specifications
 -----------------------------
 
 * :ref:`persistent-identifiers` Specifications of the SoftWare Heritage persistent IDentifiers (SWHID).
 * :ref:`data-model` Documentation of the main |swh| archive data model.
 * :ref:`journal-specs` Documentation of the Kafka journal of the |swh| archive.
 
 Tutorials
 ---------
 
 * :ref:`testing-guide`
 * :doc:`/tutorials/issue-debugging-monitoring`
 * :ref:`Listing the content of your favorite forge <lister-tutorial>`
   and :ref:`running a lister in Docker <run-lister-tutorial>`
 * :ref:`Add a new swh package <tutorial-new-package>`
 
 Frequently Asked Questions
 --------------------------
 
 .. toctree::
    :maxdepth: 2
 
    faq/index
 
 Roadmap
 -------
 
 * :ref:`roadmap-2021`
 
 Engineering
 -----------
 
 * :ref:`infrastructure`
 
 Components
 ----------
 
 Here is brief overview of the most relevant software components in the Software
 Heritage stack, in alphabetical order.
 For a better introduction to the architecture, see the :ref:`architecture-overview`,
 which presents each of them in a didactical order.
 
 Each component name is linked to the development documentation
 of the corresponding Python module.
 
 :ref:`swh.auth <swh-auth>`
     low-level library used by modules needing keycloak authentication
 
 :ref:`swh.core <swh-core>`
     low-level utilities and helpers used by almost all other modules in the
     stack
 
 :ref:`swh.counters <swh-counters>`
     service providing efficient estimates of the number of objects in the SWH archive,
     using Redis's Hyperloglog
 
 :ref:`swh.dataset <swh-dataset>`
     public datasets and periodic data dumps of the archive released by Software
     Heritage
 
 :ref:`swh.deposit <swh-deposit>`
     push-based deposit of software artifacts to the archive
 
 swh.docs
     developer documentation (used to generate this doc you are reading)
 
 :ref:`swh.fuse <swh-fuse>`
     Virtual file system to browse the Software Heritage archive, based on
     `FUSE <https://github.com/libfuse/libfuse>`_
 
 :ref:`swh.graph <swh-graph>`
     Fast, compressed, in-memory representation of the archive, with tooling to
     generate and query it.
 
 :ref:`swh.indexer <swh-indexer>`
     tools and workers used to crawl the content of the archive and extract
     derived information from any artifact stored in it
 
 :ref:`swh.journal <swh-journal>`
     persistent logger of changes to the archive, with publish-subscribe support
 
 :ref:`swh.lister <swh-lister>`
     collection of listers for all sorts of source code hosting and distribution
     places (forges, distributions, package managers, etc.)
 
 :ref:`swh.loader-core <swh-loader-core>`
     low-level loading utilities and helpers used by all other loaders
 
 :ref:`swh.loader-git <swh-loader-git>`
     loader for `Git <https://git-scm.com/>`_ repositories
 
 :ref:`swh.loader-mercurial <swh-loader-mercurial>`
     loader for `Mercurial <https://www.mercurial-scm.org/>`_ repositories
 
 :ref:`swh.loader-svn <swh-loader-svn>`
     loader for `Subversion <https://subversion.apache.org/>`_ repositories
 
 :ref:`swh.loader-cvs <swh-loader-cvs>`
     loader for `CVS <https://savannah.nongnu.org/projects/cvs>`_ repositories
 
 :ref:`swh.model <swh-model>`
     implementation of the :ref:`data-model` to archive source code artifacts
 
 :ref:`swh.objstorage <swh-objstorage>`
     content-addressable object storage
 
 :ref:`swh.objstorage.replayer <swh-objstorage-replayer>`
     Object storage replication tool
 
 :ref:`swh.scanner <swh-scanner>`
     source code scanner to analyze code bases and compare them with source code
     artifacts archived by Software Heritage
 
 :ref:`swh.scheduler <swh-scheduler>`
     task manager for asynchronous/delayed tasks, used for recurrent (e.g.,
     listing a forge, loading new stuff from a Git repository) and one-off
     activities (e.g., loading a specific version of a source package)
 
 :ref:`swh.search <swh-search>`
     search engine for the archive
 
 :ref:`swh.storage <swh-storage>`
     abstraction layer over the archive, allowing to access all stored source
     code artifacts as well as their metadata
 
 :ref:`swh.vault <swh-vault>`
     implementation of the vault service, allowing to retrieve parts of the
     archive as self-contained bundles (e.g., individual releases, entire
     repository snapshots, etc.)
 
 :ref:`swh.web <swh-web>`
     Web application(s) to browse the archive, for both interactive (HTML UI)
     and mechanized (REST API) use
 
 :ref:`swh.web.client <swh-web-client>`
     Python client for :ref:`swh.web <swh-web>`
 
 
 Dependencies
 ------------
 
 The dependency relationships among the various modules are depicted below.
 
 .. _py-deps-swh:
 .. figure:: images/py-deps-swh.svg
    :width: 1024px
    :align: center
 
    Dependencies among top-level Python modules (click to zoom).
 
 
 Archive
 -------
 
 * :ref:`Archive ChangeLog <archive-changelog>`: notable changes to the archive
   over time
 
 
 Indices and tables
 ==================
 
 * :ref:`genindex`
 * :ref:`modindex`
 * `URLs index <http-routingtable.html>`_
 * :ref:`search`
 * :ref:`glossary`
 
 
 .. ensure sphinx does not complain about index files not being included
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
    :titlesonly:
    :hidden:
 
    getting-started/index
    architecture/index
    contributing/index
    tutorials/index
    faq/index
    roadmap/roadmap-2021.rst
    infrastructure/index
    swh.auth <swh-auth/index>
    swh.core <swh-core/index>
    swh.counters <swh-counters/index>
    swh.dataset <swh-dataset/index>
    swh.deposit <swh-deposit/index>
    swh.fuse <swh-fuse/index>
    swh.graph <swh-graph/index>
    swh.indexer <swh-indexer/index>
    swh.journal <swh-journal/index>
    swh.lister <swh-lister/index>
    swh.loader <swh-loader>
    swh.model <swh-model/index>
    swh.objstorage <swh-objstorage/index>
    swh.objstorage.replayer <swh-objstorage-replayer/index>
    swh.scanner <swh-scanner/index>
    swh.scheduler <swh-scheduler/index>
    swh.search <swh-search/index>
    swh.storage <swh-storage/index>
    swh.vault <swh-vault/index>
    swh.web <swh-web/index>
    swh.web.client <swh-web-client/index>
    archive-changelog
    journal
    Python modules autodocumentation <apidoc/modules>