D668.id2125.diff
diff --git a/docs/architecture.rst b/docs/architecture.rst
new file mode 100644
--- /dev/null
+++ b/docs/architecture.rst
@@ -0,0 +1,74 @@
+.. _architecture:
+
+Software Architecture
+=====================
+
+From an end-user point of view, the |swh| platform consists of the
+:term:`archive`, which can be accessed using the web interface or its REST API.
+Behind the scenes (and the web app) are several components that expose
+different aspects of the |swh| :term:`archive` as internal REST APIs.
+
+Each of these internal APIs has a dedicated (PostgreSQL) database.
+
+A global view of this architecture looks like:
+
+.. figure:: images/general-architecture.*
+
+ General view of the |swh| architecture.
+
+The front API components are:
+
+- :ref:`Storage API <swh-storage>`
+- :ref:`Deposit API <swh-deposit>`
+- :ref:`Vault API <swh-vault>`
+- :ref:`Indexer API <swh-indexer>`
+- :ref:`Scheduler API <swh-scheduler>`
+
+On the back stage of this show, a celery_-based ensemble of tasks and workers
+performs all the work required to fill, maintain and update the |swh|
+:term:`archive`.
+
+The main components involved in this choreography are:
+
+- :term:`Listers <lister>`: a lister is a type of task aiming at scraping a
+ web site, a forge, etc. to gather all the source code repositories it can
+ find. For each found source code repository, a :term:`loader` task is
+ created.
+
+- :term:`Loaders <loader>`: a loader is a type of task aiming at importing or
+ updating a source code repository. It is the one that inserts :term:`blob`
+ objects in the :term:`object storage`, and inserts nodes and edges in the
+ :ref:`graph <swh-merkle-dag>`.
+
+- :term:`Indexers <indexer>`: an indexer is a type of task aiming at crawling
+ the content of the :term:`archive` to extract derived information (mimetype,
+ etc.)
+
+
+Tasks
+-----
+
+The following sequence diagram shows the interactions between these components
+when a new forge needs to be archived. This example depicts the case of a
+gitlab_ forge, but any other supported source type would be very similar.
+
+.. mermaid:: tasks-lister.mmd
+
+As shown in this diagram, the lister task does two things:
+
+- it adds one :term:`origin` object to the :term:`storage` database for each
+ source code repository, and
+
+- it inserts, for each source code repository, one :term:`loader` task that
+ will be in charge of importing the content of that repository.
+
+
+The sequence diagram below describes this second step of importing the content
+of a repository. Once again, we take the example of a git repository, but any
+other type of repository would be very similar.
+
+.. mermaid:: tasks-git-loader.mmd
+
+
+.. _celery: https://www.celeryproject.org
+.. _gitlab: https://gitlab.com
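
The lister/loader choreography described above runs on Celery tasks. Below is a rough sketch of that pattern, assuming hypothetical task names, a placeholder broker URL and a stand-in repository-listing helper; it is an illustration, not the actual |swh| task code::

    # Sketch of the lister -> loader choreography: a lister task enumerates
    # repositories on a forge and schedules one loader task per repository.
    # Task names, broker URL and list_repositories() are illustrative only.
    from celery import Celery

    app = Celery("swh", broker="amqp://guest:guest@localhost//")

    def list_repositories(forge_url):
        # Placeholder: a real lister would page through the forge's HTTP API.
        return [f"{forge_url}/example/project.git"]

    @app.task(name="loader.git")
    def load_git_repository(origin_url):
        """Import (or update) one git repository: blobs go to the object
        storage, nodes and edges go to the storage database."""
        ...

    @app.task(name="lister.gitlab")
    def list_gitlab_forge(forge_url):
        """Enumerate repositories on a forge, one loader task per repository."""
        for repo_url in list_repositories(forge_url):
            load_git_repository.delay(repo_url)
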
diff --git a/docs/getting-started.rst b/docs/getting-started.rst
--- a/docs/getting-started.rst
+++ b/docs/getting-started.rst
@@ -119,7 +119,7 @@
Then you will need a local storage service that will archive and serve source
code artifacts via a REST API. The Software Heritage storage layer comes in two
-parts: a content-addressable object storage on your file system (for file
+parts: a content-addressable :term:`object storage` on your file system (for file
contents) and a Postgres database (for the graph structure of the archive). See
the :ref:`data-model` for more information. The storage layer is configured via
a YAML configuration file, located at
@@ -137,13 +137,13 @@
root: /srv/softwareheritage/objects/
slicing: 0:2/2:4
-Make sure that the object storage root exists on the filesystem and is writable
+Make sure that the :term:`object storage` root exists on the filesystem and is writable
to your user, e.g.::
sudo mkdir -p /srv/softwareheritage/objects
sudo chown "${USER}:" /srv/softwareheritage/objects
-You are done with object storage setup! Let's setup the database::
+You are done with :term:`object storage` setup! Let's set up the database::
swh-db-init storage -d softwareheritage-dev
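
The ``slicing: 0:2/2:4`` setting above describes a content-addressable layout in which an object's on-disk path is derived from its hash. A minimal sketch of that idea, assuming hex digits [0:2] and [2:4] are used as directory levels (the actual swh.objstorage layout may differ in its details)::

    # Illustration of a content-addressed layout with a "0:2/2:4" slicing:
    # the on-disk path of a blob is derived from the hash of its bytes.
    import hashlib
    from pathlib import Path

    ROOT = Path("/srv/softwareheritage/objects")

    def object_path(content: bytes) -> Path:
        hex_id = hashlib.sha1(content).hexdigest()
        return ROOT / hex_id[0:2] / hex_id[2:4] / hex_id

    print(object_path(b"hello world"))
    # /srv/softwareheritage/objects/2a/ae/2aae6c35c94fcfb415dbe95f408b9ce91ee846ed
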
diff --git a/docs/glossary.rst b/docs/glossary.rst
new file mode 100644
--- /dev/null
+++ b/docs/glossary.rst
@@ -0,0 +1,158 @@
+:orphan:
+
+.. _glossary:
+
+Glossary
+========
+
+.. glossary::
+
+ archive
+
+ An instance of the |swh| data store.
+
+ archiver
+
+ A component dedicated to replicating an :term:`archive`.
+
+ ark
+
+ `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is
+ a multi-purpose persistent identifier for information objects of any type.
+
+ artifact
+ software artifact
+
+ An artifact is one of many kinds of tangible by-products produced during
+ the development of software.
+
+ content
+ blob
+
+ A (specific version of a) file stored in the archive, identified by its
+ cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also
+ known as: :term:`blob`. Note: it is incorrect to refer to Contents as
+ "files", because files are usually considered to be named, whereas
+ Contents are nameless. It is only in the context of specific
+ :term:`directories <directory>` that :term:`contents <content>` acquire
+ (local) names.
+
+ directory
+
+ A set of named pointers to contents (file entries), directories (directory
+ entries) and revisions (revision entries). All entries are associated to
+ the local name of the entry (i.e., a relative path without any path
+ separator) and permission metadata (e.g., ``chmod`` value or equivalent).
+
+ doi
+
+ A Digital Object Identifier or DOI_ is a persistent identifier or handle
+ used to uniquely identify objects, standardized by the International
+ Organization for Standardization (ISO).
+
+ journal
+
+ The journal_ is the persistent logger of the |swh| architecture in charge
+ of logging changes to the archive, with publish-subscribe_ support.
+
+ lister
+
+ A lister_ is a component of the |swh| architecture that is in charge of
+ enumerating the :term:`software origins <software origin>` (e.g., VCS, packages, etc.)
+ available at a source code distribution place.
+
+ loader
+
+ A loader_ is a component of the |swh| architecture responsible for importing
+ or updating a source code repository in the :term:`archive`: it inserts the
+ :term:`blob` objects in the :term:`object storage` and the nodes and edges
+ in the graph.
+
+ hash
+ cryptographic hash
+ checksum
+ digest
+
+ A fixed-size "summary" of a stream of bytes that is easy to compute and
+ hard to reverse (see the Wikipedia article on cryptographic hash functions).
+ Also known as: :term:`checksum`, :term:`digest`.
+
+ indexer
+
+ A component of the |swh| architecture dedicated to producing metadata
+ linked to the known :term:`blobs <blob>` in the :term:`archive`.
+
+ objstore
+ objstorage
+ object store
+ object storage
+
+ Content-addressable object storage. It is the place where the actual
+ :term:`blob <blob>` objects are stored.
+
+ origin
+ software origin
+ data source
+
+ A location from which a coherent set of sources has been obtained, like a
+ git repository, a directory containing tarballs, etc.
+
+ person
+
+ An entity referenced by a revision as either the author or the committer
+ of the corresponding change. A person is associated with a full name and/or
+ an email address.
+
+ release
+ tag
+ milestone
+
+ A revision that has been marked as noteworthy with a specific name (e.g.,
+ a version number), together with associated development metadata (e.g.,
+ author, timestamp, etc).
+
+ revision
+ commit
+ changeset
+
+ A point-in-time snapshot of the content of a directory, together with
+ associated development metadata (e.g., author, timestamp, log message,
+ etc).
+
+ scheduler
+
+ The component of the |swh| architecture dedicated to the management and
+ the prioritization of the many tasks.
+
+ snapshot
+
+ The state of all visible branches during a specific visit of an origin.
+
+ type of origin
+
+ Information about the kind of hosting, e.g., whether it is a forge, a
+ collection of repositories, a homepage publishing tarballs, or a one-shot
+ source code repository. For all kinds of repositories, please specify which
+ VCS system is in use (Git, SVN, CVS, etc.).
+
+ vault
+ vault service
+
+ User-facing service that allows users to retrieve parts of the :term:`archive`
+ as self-contained bundles (e.g., individual releases, entire repository
+ snapshots, etc.).
+
+ visit
+
+ A pass of |swh| over a given :term:`origin`, retrieving all the source
+ code and metadata available there at the time. A visit object stores the
+ state of all visible branches (if any) available at the origin at visit
+ time; each of them points to a revision object in the archive. Future
+ visits of the same origin will create new visit objects, without removing
+ previous ones.
+
+
+
+.. _blob: https://en.wikipedia.org/wiki/Binary_large_object
+.. _DOI: https://www.doi.org
+.. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
+.. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
+.. _lister: https://docs.softwareheritage.org/devel/swh-lister/index.html
+.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
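
The ``content`` and ``hash`` entries above identify a blob by its SHA1, "git-like" SHA1 and SHA256. Here is a small sketch of computing those checksums, assuming the "git-like" SHA1 is the standard git blob hash (SHA1 over a ``blob <length>\0`` header followed by the raw bytes)::

    # Checksums mentioned in the "content" glossary entry. The "git-like"
    # SHA1 is assumed here to be the standard git blob hash.
    import hashlib

    def content_hashes(data: bytes) -> dict:
        git_header = b"blob %d\0" % len(data)
        return {
            "sha1": hashlib.sha1(data).hexdigest(),
            "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
            "sha256": hashlib.sha256(data).hexdigest(),
            "length": len(data),
        }

    print(content_hashes(b"Hello, Software Heritage!\n"))
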
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -15,6 +15,13 @@
stack
+Architecture
+------------
+
+* :ref:`architecture` ← go there for a glimpse of the Software Heritage software
+ architecture
+
+
Components
----------
@@ -116,6 +123,7 @@
* :ref:`modindex`
* `URLs index <http-routingtable.html>`_
* :ref:`search`
+* :ref:`glossary`
.. ensure sphinx does not complain about index files not being included
@@ -124,5 +132,6 @@
:hidden:
:glob:
+ architecture
getting-started
swh-*/index
diff --git a/docs/swh_substitutions b/docs/swh_substitutions
new file mode 100644
--- /dev/null
+++ b/docs/swh_substitutions
@@ -0,0 +1 @@
+.. |swh| replace:: *Software Heritage*
diff --git a/docs/tasks-git-loader.mmd b/docs/tasks-git-loader.mmd
new file mode 100644
--- /dev/null
+++ b/docs/tasks-git-loader.mmd
@@ -0,0 +1,63 @@
+sequenceDiagram
+ participant SCH_DB as scheduler DB
+ participant SCH_RUN as scheduler runner
+ participant SCH_LS as scheduler listener
+ participant RMQ as Rabbit-MQ
+ participant OBJSTORE as object storage
+ participant STORAGE_DB as storage DB
+ participant STORAGE_API as storage API
+ participant WORK_GIT as worker@loader-git
+ participant GIT as git server
+
+ Note over SCH_DB,RMQ: Task T2 created beforehand by the lister-gitlab task
+ loop Polling
+ SCH_RUN->>SCH_DB: GET TASK set state=scheduled
+ SCH_DB-->>SCH_RUN: TASK id=T2
+ activate SCH_RUN
+ SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git
+ deactivate SCH_RUN
+ activate RMQ
+ end
+
+ RMQ->>+WORK_GIT: Start task CT2
+ deactivate RMQ
+
+ WORK_GIT->>+STORAGE_API: GET origin state
+ STORAGE_API-->>-WORK_GIT: 200
+
+ WORK_GIT->>+GIT: GET refs
+ GIT->>-WORK_GIT: 200 / refs
+
+ WORK_GIT->>+GIT: GET new_objects
+ GIT->>-WORK_GIT: 200 / objects
+
+ WORK_GIT->>+GIT: PACKFILE
+ GIT->>-WORK_GIT: 200 / blobs
+
+ WORK_GIT->>+STORAGE_API: LOAD NEW CONTENT
+ loop For each blob
+ STORAGE_API->>OBJSTORE: ADD BLOB
+ end
+ STORAGE_API-->>-WORK_GIT: 200 / blobs
+
+ WORK_GIT->>+STORAGE_API: NEW DIR
+ STORAGE_API->>STORAGE_DB: INSERT DIR
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT->>+STORAGE_API: NEW REV
+ STORAGE_API->>STORAGE_DB: INSERT REV
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT->>+STORAGE_API: NEW REL
+ STORAGE_API->>STORAGE_DB: INSERT REL
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT->>+STORAGE_API: NEW SNAPSHOT
+ STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT
+ STORAGE_API-->>-WORK_GIT: 201
+
+ WORK_GIT-->>-RMQ: SET CT2 status=eventful
+ activate RMQ
+ RMQ->>+SCH_LS: NOTIFY end of task CT2
+ deactivate RMQ
+ SCH_LS->>-SCH_DB: UPDATE T2 set state=end
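
The loader diagram boils down to a fixed order of storage-API calls: blobs first, then directories, revisions, releases, and finally the snapshot. A sketch of that call order follows; the ``StorageClient`` interface and its ``*_add`` methods are assumptions made for illustration, not the verified swh.storage API::

    # Call order shown in the loader sequence diagram. The StorageClient
    # protocol below is an illustrative assumption, not the actual
    # swh.storage interface.
    from typing import Iterable, Protocol

    class StorageClient(Protocol):
        def content_add(self, blobs: Iterable[bytes]) -> None: ...
        def directory_add(self, directories: Iterable[dict]) -> None: ...
        def revision_add(self, revisions: Iterable[dict]) -> None: ...
        def release_add(self, releases: Iterable[dict]) -> None: ...
        def snapshot_add(self, snapshot: dict) -> None: ...

    def load_repository(storage: StorageClient, fetched: dict) -> None:
        storage.content_add(fetched["blobs"])          # -> object storage
        storage.directory_add(fetched["directories"])  # -> storage DB
        storage.revision_add(fetched["revisions"])
        storage.release_add(fetched["releases"])
        storage.snapshot_add(fetched["snapshot"])      # state of all branches
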
diff --git a/docs/tasks-lister.mmd b/docs/tasks-lister.mmd
new file mode 100644
--- /dev/null
+++ b/docs/tasks-lister.mmd
@@ -0,0 +1,43 @@
+sequenceDiagram
+ participant WEB as swh-web
+ participant SCH_API as scheduler API
+ participant SCH_DB as scheduler DB
+ participant SCH_RUN as scheduler runner
+ participant RMQ as Rabbit-MQ
+ participant SCH_LS as scheduler listener
+ participant WORK_GITLAB as worker@gitlab-lister
+ participant GITLAB as gitlab API
+ participant STORAGE_API as storage API
+ participant STORAGE_DB as storage DB
+
+ Note over WEB,SCH_API: Save gitlab forge 0xdeadbeef
+ WEB->>+SCH_API: CREATE TASK lister-gitlab
+ SCH_API->>+SCH_DB: INSERT TASK
+ SCH_API-->>-WEB: 201
+ loop Polling
+ SCH_RUN->>SCH_DB: GET TASK set state=scheduled
+ SCH_DB-->>-SCH_RUN: TASK id=T1
+ activate SCH_RUN
+ SCH_RUN->>RMQ: CREATE Celery Task CT1
+ deactivate SCH_RUN
+ activate RMQ
+ end
+
+ RMQ->>+WORK_GITLAB: Start task CT1
+ deactivate RMQ
+ WORK_GITLAB->>+GITLAB: Get git repos
+ GITLAB-->>-WORK_GITLAB: Known git repos
+ loop For Each Repo
+ WORK_GITLAB->>+STORAGE_API: CREATE ORIGIN
+ WORK_GITLAB->>+SCH_API: CREATE TASK loader-git
+ SCH_API->>SCH_DB: INSERT TASK
+ SCH_API-->>-WORK_GITLAB: 201
+ STORAGE_API->>STORAGE_DB: INSERT ORIGIN
+ STORAGE_API-->>-WORK_GITLAB: 201
+ end
+
+ WORK_GITLAB-->>-RMQ: SET CT1 status=eventful
+ activate RMQ
+ RMQ->>+SCH_LS: NOTIFY end of task CT1
+ deactivate RMQ
+ SCH_LS->>-SCH_DB: UPDATE T1 set state=end
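
In both diagrams the scheduler runner polls the scheduler database and turns each due task into a Celery message on RabbitMQ. A sketch of such a polling loop is shown below; the database access is a placeholder and only Celery's ``send_task`` is a real call::

    # Sketch of the "Polling" loop: fetch due tasks from the scheduler DB,
    # push one Celery message per task (CT1, CT2, ...), then sleep.
    # fetch_due_tasks() is a placeholder for the scheduler DB query.
    import time
    from celery import Celery

    app = Celery("swh", broker="amqp://guest:guest@localhost//")

    def fetch_due_tasks():
        # Placeholder for "GET TASK set state=scheduled".
        return []  # e.g. [{"id": "T1", "type": "lister-gitlab", "args": []}]

    def runner_loop(poll_interval: float = 5.0) -> None:
        while True:
            for task in fetch_due_tasks():
                app.send_task(task["type"], args=task.get("args", []))
            time.sleep(poll_interval)
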
diff --git a/requirements.txt b/requirements.txt
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,4 +4,5 @@
vcversioner
sphinx >= 1.3
sphinxcontrib-httpdomain
+sphinxcontrib-mermaid
recommonmark
diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py
--- a/swh/docs/sphinx/conf.py
+++ b/swh/docs/sphinx/conf.py
@@ -19,7 +19,9 @@
'sphinx.ext.napoleon',
# 'sphinx.ext.intersphinx',
'sphinxcontrib.httpdomain',
- 'sphinx.ext.extlinks']
+ 'sphinx.ext.extlinks',
+ 'sphinxcontrib.mermaid',
+ ]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
@@ -37,6 +39,12 @@
# The master toctree document.
master_doc = 'index'
+# A string of reStructuredText that will be included at the beginning of every
+# source file that is read.
+rst_prolog = '''
+.. include:: /swh_substitutions
+'''
+
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
Attached To
D668: Add the beginning of a top-level architecture document