diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst
--- a/docs/architecture/index.rst
+++ b/docs/architecture/index.rst
@@ -10,4 +10,5 @@
 
    overview
    mirror
+   metadata
    ../keycloak/index
diff --git a/docs/architecture/metadata.rst b/docs/architecture/metadata.rst
new file mode 100644
--- /dev/null
+++ b/docs/architecture/metadata.rst
@@ -0,0 +1,98 @@
+.. _architecture-metadata:
+
+Metadata workflow and architecture
+==================================
+
+|swh| calls "metadata" information it collects and extracts that describes and provides additional information on source code.
+
+This metadata is partitioned into three types:
+
+1. development metadata, which is part of the :ref:`data-model`, such as authorship
+   and date of revisions and releases,
+2. :term:`intrinsic metadata`, which is extracted from a source code repository itself,
+   usually mined from metadata files like :file:`package.json` or :file:`Gemfile`.
+   It is intrinsically part of the software origin, because both are distributed
+   together from the origin's VCS repository or release tarballs.
+3. :term:`extrinsic metadata`, which is collected or deposited from external sources.
+   It can have a straightforward relationship with the repository (eg. number of stars
+   of GitHub origins or checksums of release tarballs),
+   or be more distant (provided by a third-party like Wikidata).
+
+This document is only about the latter two.
+
+
+Raw metadata storage
+--------------------
+
+As an archive, |swh| chooses to store original metadata objects unmodified
+in its long-term storage databases (:ref:`swh-storage <swh-storage>` and
+:ref:`swh-objstorage <swh-objstorage>`).
+
+For intrinsic metadata, this only means it is treated as any other source code content;
+ie. there is no difference between a metadata file like :file:`package.json`
+and a source code file like :file:`index.js` from the loaders' and the database's
+points of view.
+
+Extrinsic metadata, however, are stored in a :ref:`dedicated storage service
+<extrinsic-metadata-specification>` (in practice, this is currently in the same database
+as the :ref:`data-model`'s Merkle DAG; but in separate tables).
+
+As they are both stored verbatim, they are in various formats depending on their source,
+and are not directly usable.
+
+
+Indexed metadata storages
+-------------------------
+
+|swh| also stores metadata in indexed databases, which are directly usable
+for searching and querying.
+Currently, there are two:
+
+1. the "indexer storage", a postgresql database that acts as a cache, and provides
+   limited search functionality
+2. :ref:`swh-search <swh-search>`, an advanced search service backed by
+   `ElasticSearch`_.
+
+Each of these databases has a consistent schema for ease of use.
+
+
+Differences between raw and indexed metadata
+--------------------------------------------
+
+The raw metadata is the authentic piece of metadata while the indexed metadata
+is a processed version, where the raw metadata is translated to a uniform vocabulary.
+
+Both intrinsic and extrinsic metadata can be indexed and translated.
+
+Therefore, most metadata stored twice in |swh|: raw and indexed.
+The reason for this apparent duplication is robustness and future-proofing.
+
+Indeed, indexing metadata is a complex process.
+By keeping the raw metadata we ensure the possibility to re-compute the metadata
+in the future with other vocabularies.
+Furthermore, if we did not store the raw metadata, this would mean bugs in indexers
+could easily lose data, forever.
+Thanks to this redundant architecture, bugs can be fixed and indexers re-ran
+from the raw metadata to fix the indexed metadata.
+
+This also makes it easier to add features on metadata mining or change schema
+in the future: instead of re-loading
+from original sources (which may have disappeared since!), new indexers can simply
+read stored metadata into new indexed storages.
+
+
+Metadata mining
+---------------
+
+Some of the stored raw metadata is read and interpreted by worker processes known
+as :ref:`indexers <swh-indexer>`.
+Currently, they convert this metadata into a common format, `CodeMeta`_.
+
+Some indexers also read source code files to generate metadata about these files,
+such as their license, language, etc.
+
+Then, they either send their results directly to a caller, or write it to an
+indexed metadata storage (either directly or through :ref:`swh-journal <swh-journal>`).
+
+.. _CodeMeta: https://codemeta.github.io/
+.. _ElasticSearch: https://www.elastic.co/elasticsearch/
diff --git a/docs/glossary.rst b/docs/glossary.rst
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -73,7 +73,7 @@
      homepage, maintainer contact information, and popularity information
      ("stars") as listed on GitHub/GitLab repository pages.
 
-     See also: :term:`intrinsic metadata`.
+     See also: :term:`intrinsic metadata` :ref:`architecture-metadata`.
 
    journal
 
@@ -126,7 +126,7 @@
      for Python packages, `pom.xml` for Maven-based Java projects,
      `debian/control` for Debian packages, `metadata.json` for NPM, etc.
 
-     See also: :term:`extrinsic metadata`.
+     See also: :term:`extrinsic metadata`, :ref:`architecture-metadata`.
 
    objstore
    objstorage
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -27,6 +27,8 @@
   architecture
 * :ref:`mirror` → learn what a Software Heritage mirror is and how to set up
   one
+* :ref:`Metadata workflow <architecture-metadata>` → learn how Software Heritage
+  stores and handles metadata
 * :ref:`Keycloak <keycloak>` → learn how to use Keycloak,
   the authentication system used by |swh|'s web interface and public APIs