D668.id2125.diff

diff --git a/docs/architecture.rst b/docs/architecture.rst
new file mode 100644
--- /dev/null
+++ b/docs/architecture.rst
@@ -0,0 +1,74 @@
+.. _architecture:
+
+Software Architecture
+=====================
+
+From an end-user point of view, the |swh| platform consists of the
+:term:`archive`, which can be accessed using the web interface or its REST API.
+Behind the scenes (and the web app) are several components that expose
+different aspects of the |swh| :term:`archive` as internal REST APIs.
+
+Each of these internal APIs has a dedicated (PostgreSQL) database.
+
+A global view of this architecture looks like:
+
+.. figure:: images/general-architecture.*
+
+ General view of the |swh| architecture.
+
+The front API components are:
+
+- :ref:`Storage API <swh-storage>`
+- :ref:`Deposit API <swh-deposit>`
+- :ref:`Vault API <swh-vault>`
+- :ref:`Indexer API <swh-indexer>`
+- :ref:`Scheduler API <swh-scheduler>`
+
+On the back stage of this show, a celery_-based game of tasks and workers
+takes place to perform all the work required to fill, maintain and update the
+|swh| :term:`archive`.
+
+The main components involved in this choreography are:
+
+- :term:`Listers <lister>`: a lister is a type of task aiming at scraping a
+ web site, a forge, etc. to gather all the source code repositories it can
+ find. For each source code repository found, a :term:`loader` task is
+ created.
+
+- :term:`Loaders <loader>`: a loader is a type of task aiming at importing or
+ updating a source code repository. It is the one that inserts :term:`blob`
+ objects in the :term:`object storage`, and inserts nodes and edges in the
+ :ref:`graph <swh-merkle-dag>`.
+
+- :term:`Indexers <indexer>`: an indexer is a type of task aiming at crawling
+ the content of the :term:`archive` to extract derived information (mimetype,
+ etc.).
+
+
+Tasks
+-----
+
+The following sequence diagram shows the interactions between these components
+when a new forge needs to be archived. This example depicts the case of a
+gitlab_ forge, but any other supported source type would be very similar.
+
+.. mermaid:: tasks-lister.mmd
+
+As one can observe in this diagram, the lister task does two things:
+
+- it adds one :term:`origin` object in the :term:`storage` database for each
+ source code repository, and
+
+- it inserts one :term:`loader` task for each source code repository, which
+ will be in charge of importing the content of that repository.
+
+
+The sequence diagram below describes this second step of importing the content
+of a repository. Once again, we take the example of a git repository, but any
+other type of repository would be very similar.
+
+.. mermaid:: tasks-git-loader.mmd
+
+
+.. _celery: https://www.celeryproject.org
+.. _gitlab: https://gitlab.com
diff --git a/docs/getting-started.rst b/docs/getting-started.rst
--- a/docs/getting-started.rst
+++ b/docs/getting-started.rst
@@ -119,7 +119,7 @@
Then you will need a local storage service that will archive and serve source
code artifacts via a REST API. The Software Heritage storage layer comes in two
-parts: a content-addressable object storage on your file system (for file
+parts: a content-addressable :term:`object storage` on your file system (for file
contents) and a Postgres database (for the graph structure of the archive). See
the :ref:`data-model` for more information. The storage layer is configured via
a YAML configuration file, located at
@@ -137,13 +137,13 @@
root: /srv/softwareheritage/objects/
slicing: 0:2/2:4
-Make sure that the object storage root exists on the filesystem and is writable
+Make sure that the :term:`object storage` root exists on the filesystem and is writable
to your user, e.g.::
sudo mkdir -p /srv/softwareheritage/objects
sudo chown "${USER}:" /srv/softwareheritage/objects
-You are done with object storage setup! Let's setup the database::
+You are done with :term:`object storage` setup! Let's set up the database::
swh-db-init storage -d softwareheritage-dev
diff --git a/docs/glossary.rst b/docs/glossary.rst
new file mode 100644
--- /dev/null
+++ b/docs/glossary.rst
@@ -0,0 +1,158 @@
+:orphan:
+
+.. _glossary:
+
+Glossary
+========
+
+.. glossary::
+
+ archive
+
+ An instance of the |swh| data store.
+
+ archiver
+
+ A component dedicated to replicating an :term:`archive`.
+
+ ark
+
+ `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is
+ a multi-purpose persistent identifier for information objects of any type.
+
+ artifact
+ software artifact
+
+ An artifact is one of many kinds of tangible by-products produced during
+ the development of software.
+
+ content
+ blob
+
+ A (specific version of a) file stored in the archive, identified by its
+ cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also
+ known as: :term:`blob`. Note: it is incorrect to refer to Contents as
+ "files", because files are usually considered to be named, whereas
+ Contents are nameless. It is only in the context of specific
+ :term:`directories <directory>` that :term:`contents <content>` acquire
+ (local) names.
+
+ directory
+
+ A set of named pointers to contents (file entries), directories (directory
+ entries) and revisions (revision entries). All entries are associated with
+ the local name of the entry (i.e., a relative path without any path
+ separator) and permission metadata (e.g., ``chmod`` value or equivalent).
+
+ doi
+
+ A Digital Object Identifier or DOI_ is a persistent identifier or handle
+ used to uniquely identify objects, standardized by the International
+ Organization for Standardization (ISO).
+
+ journal
+
+ The journal is the persistent logger of the |swh| architecture in charge
+ of logging changes to the archive, with publish-subscribe_ support.
+
+ lister
+
+ A lister_ is a component of the |swh| architecture that is in charge of
+ enumerating the :term:`software origins <software origin>` (e.g., VCS
+ repositories, packages, etc.) available at a source code distribution place.
+
+ loader
+
+ A loader is a component of the |swh| architecture responsible for importing or updating a source code repository.
+
+ hash
+ cryptographic hash
+ checksum
+ digest
+
+ A fixed-size "summary" of a stream of bytes that is easy to compute, and
+ hard to reverse (see the Wikipedia article on cryptographic hash functions).
+ Also known as: :term:`checksum`, :term:`digest`.
+
+ indexer
+
+ A component of the |swh| architecture dedicated to producing metadata
+ linked to the known :term:`blobs <blob>` in the :term:`archive`.
+
+ objstore
+ objstorage
+ object store
+ object storage
+
+ Content-addressable object storage. It is the place where the actual
+ :term:`blob` objects are stored.
+
+ origin
+ software origin
+ data source
+
+ A location from which a coherent set of sources has been obtained, like a
+ git repository, a directory containing tarballs, etc.
+
+ person
+
+ An entity referenced by a revision as either the author or the committer
+ of the corresponding change. A person is associated with a full name and/or
+ an email address.
+
+ release
+ tag
+ milestone
+
+ A revision that has been marked as noteworthy with a specific name (e.g.,
+ a version number), together with associated development metadata (e.g.,
+ author, timestamp, etc.).
+
+ revision
+ commit
+ changeset
+
+ A point-in-time snapshot of the content of a directory, together with
+ associated development metadata (e.g., author, timestamp, log message,
+ etc.).
+
+ scheduler
+
+ The component of the |swh| architecture dedicated to the management and
+ the prioritization of the many tasks.
+
+ snapshot
+
+ The state of all visible branches during a specific visit of an origin.
+
+ type of origin
+
+ Information about the kind of hosting, e.g., whether it is a forge, a
+ collection of repositories, a homepage publishing tarballs, or a one-shot
+ source code repository. For all kinds of repositories, the VCS in use
+ (Git, SVN, CVS, etc.) should also be specified.
+
+ vault
+ vault service
+
+ User-facing service that allows retrieving parts of the :term:`archive`
+ as self-contained bundles (e.g., individual releases, entire repository
+ snapshots, etc.).
+
+ visit
+
+ The passage of |swh| on a given :term:`origin`, to retrieve all source
+ code and metadata available there at the time. A visit object stores the
+ state of all visible branches (if any) available at the origin at visit
+ time; each of them points to a revision object in the archive. Future
+ visits of the same origin will create new visit objects, without removing
+ previous ones.
+
+
+
+.. _blob: https://en.wikipedia.org/wiki/Binary_large_object
+.. _DOI: https://www.doi.org
+.. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
+.. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
+.. _lister: https://docs.softwareheritage.org/devel/swh-lister/index.html
+.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -15,6 +15,13 @@
stack
+Architecture
+------------
+
+* :ref:`architecture` ← go there to get a glimpse of the Software Heritage software
+ architecture
+
+
Components
----------
@@ -116,6 +123,7 @@
* :ref:`modindex`
* `URLs index <http-routingtable.html>`_
* :ref:`search`
+* :ref:`glossary`
.. ensure sphinx does not complain about index files not being included
@@ -124,5 +132,6 @@
:hidden:
:glob:
+ architecture
getting-started
swh-*/index
diff --git a/docs/swh_substitutions b/docs/swh_substitutions
new file mode 100644
--- /dev/null
+++ b/docs/swh_substitutions
@@ -0,0 +1 @@
+.. |swh| replace:: *Software Heritage*
diff --git a/docs/tasks-git-loader.mmd b/docs/tasks-git-loader.mmd
new file mode 100644
--- /dev/null
+++ b/docs/tasks-git-loader.mmd
@@ -0,0 +1,63 @@
+sequenceDiagram
+ participant SCH_DB as scheduler DB
+ participant SCH_RUN as scheduler runner
+ participant SCH_LS as scheduler listener
+ participant RMQ as Rabbit-MQ
+ participant OBJSTORE as object storage
+ participant STORAGE_DB as storage DB
+ participant STORAGE_API as storage API
+ participant WORK_GIT as worker@loader-git
+ participant GIT as git server
+
+ Note over SCH_DB,RMQ: Task T2 created beforehand by the lister-gitlab task
+ loop Polling
+ SCH_RUN->>SCH_DB: GET TASK set state=scheduled
+ SCH_DB-->>SCH_RUN: TASK id=T2
+ activate SCH_RUN
+ SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git
+ deactivate SCH_RUN
+ activate RMQ
+ end
+
+ RMQ->>+WORK_GIT: Start task CT2
+ deactivate RMQ
+
+ WORK_GIT->>+STORAGE_API: GET origin state
+ STORAGE_API-->>-WORK_GIT: 200
+
+ WORK_GIT->>+GIT: GET refs
+ GIT-->>-WORK_GIT: 200 / refs
+
+ WORK_GIT->>+GIT: GET new_objects
+ GIT-->>-WORK_GIT: 200 / objects
+
+ WORK_GIT->>+GIT: PACKFILE
+ GIT-->>-WORK_GIT: 200 / blobs
+
+ WORK_GIT->>+STORAGE_API: LOAD NEW CONTENT
+ loop For each blob
+ STORAGE_API->>OBJSTORE: ADD BLOB
+ end
+ STORAGE_API-->>-WORK_GIT: 200 / blobs
+
+ WORK_GIT->>+STORAGE_API: NEW DIR
+ STORAGE_API->>STORAGE_DB: INSERT DIR
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT->>+STORAGE_API: NEW REV
+ STORAGE_API->>STORAGE_DB: INSERT REV
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT->>+STORAGE_API: NEW REL
+ STORAGE_API->>STORAGE_DB: INSERT REL
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT->>+STORAGE_API: NEW SNAPSHOT
+ STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT-->>-RMQ: SET CT2 status=eventful
+ activate RMQ
+ RMQ->>+SCH_LS: NOTIFY end of task CT2
+ deactivate RMQ
+ SCH_LS->>-SCH_DB: UPDATE T2 set state=end
diff --git a/docs/tasks-lister.mmd b/docs/tasks-lister.mmd
new file mode 100644
--- /dev/null
+++ b/docs/tasks-lister.mmd
@@ -0,0 +1,43 @@
+sequenceDiagram
+ participant WEB as swh-web
+ participant SCH_API as scheduler API
+ participant SCH_DB as scheduler DB
+ participant SCH_RUN as scheduler runner
+ participant RMQ as Rabbit-MQ
+ participant SCH_LS as scheduler listener
+ participant WORK_GITLAB as worker@gitlab-lister
+ participant GITLAB as gitlab API
+ participant STORAGE_API as storage API
+ participant STORAGE_DB as storage DB
+
+ Note over WEB,SCH_API: Save gitlab forge 0xdeadbeef
+ WEB->>+SCH_API: CREATE TASK lister-gitlab
+ SCH_API->>+SCH_DB: INSERT TASK
+ SCH_API-->>-WEB: 201
+ loop Polling
+ SCH_RUN->>SCH_DB: GET TASK set state=scheduled
+ SCH_DB-->>-SCH_RUN: TASK id=T1
+ activate SCH_RUN
+ SCH_RUN->>RMQ: CREATE Celery Task CT1
+ deactivate SCH_RUN
+ activate RMQ
+ end
+
+ RMQ->>+WORK_GITLAB: Start task CT1
+ deactivate RMQ
+ WORK_GITLAB->>+GITLAB: Get git repos
+ GITLAB-->>-WORK_GITLAB: Known git repos
+ loop For Each Repo
+ WORK_GITLAB->>+STORAGE_API: CREATE ORIGIN
+ WORK_GITLAB->>+SCH_API: CREATE TASK loader-git
+ SCH_API->>SCH_DB: INSERT TASK
+ SCH_API-->>-WORK_GITLAB: 201
+ STORAGE_API->>STORAGE_DB: INSERT ORIGIN
+ STORAGE_API-->>-WORK_GITLAB: 201
+ end
+
+ WORK_GITLAB-->>-RMQ: SET CT1 status=eventful
+ activate RMQ
+ RMQ->>+SCH_LS: NOTIFY end of task CT1
+ deactivate RMQ
+ SCH_LS->>-SCH_DB: UPDATE T1 set state=end
diff --git a/requirements.txt b/requirements.txt
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,4 +4,5 @@
vcversioner
sphinx >= 1.3
sphinxcontrib-httpdomain
+sphinxcontrib-mermaid
recommonmark
diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py
--- a/swh/docs/sphinx/conf.py
+++ b/swh/docs/sphinx/conf.py
@@ -19,7 +19,9 @@
'sphinx.ext.napoleon',
# 'sphinx.ext.intersphinx',
'sphinxcontrib.httpdomain',
- 'sphinx.ext.extlinks']
+ 'sphinx.ext.extlinks',
+ 'sphinxcontrib.mermaid',
+ ]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
@@ -37,6 +39,12 @@
# The master toctree document.
master_doc = 'index'
+# A string of reStructuredText that will be included at the beginning of every
+# source file that is read.
+rst_prolog = '''
+.. include:: /swh_substitutions
+'''
+
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
