diff --git a/docs/architecture.rst b/docs/architecture.rst new file mode 100644 --- /dev/null +++ b/docs/architecture.rst @@ -0,0 +1,74 @@ +.. _architecture: + +Software Architecture +===================== + +From an end-user point of view, the |swh| platform consists of the +:term:`archive`, which can be accessed using the web interface or its REST API. +Behind the scenes (and the web app) are several components that expose +different aspects of the |swh| :term:`archive` as internal REST APIs. + +Each of these internal APIs has a dedicated (PostgreSQL) database. + +A global view of this architecture looks like this: + +.. figure:: images/general-architecture.* + + General view of the |swh| architecture. + +The front API components are: + +- :ref:`Storage API ` +- :ref:`Deposit API ` +- :ref:`Vault API ` +- :ref:`Indexer API ` +- :ref:`Scheduler API ` + +Backstage, a celery_-based set of tasks and workers performs all the work +required to fill, maintain and update the |swh| +:term:`archive`. + +The main components involved in this choreography are: + +- :term:`Listers `: a lister is a type of task that scrapes a + web site, a forge, etc. to gather all the source code repositories it can + find. For each source code repository found, a :term:`loader` task is + created. + +- :term:`Loaders `: a loader is a type of task that imports or + updates a source code repository. It is the one that inserts :term:`blob` + objects in the :term:`object storage`, and inserts nodes and edges in the + :ref:`graph `. + +- :term:`Indexers `: an indexer is a type of task that crawls + the content of the :term:`archive` to extract derived information (mimetype, + etc.). + + +Tasks +----- + +The following sequence diagram shows the interactions between these components +when a new forge needs to be archived. This example depicts the case of a +gitlab_ forge, but any other supported source type would be very similar. + +..
mermaid:: tasks-lister.mmd + +As this diagram shows, the lister task does two things: + +- it adds one :term:`origin` object in the :term:`storage` database for each + source code repository, and + +- it inserts one :term:`loader` task for each source code repository that will + be in charge of importing the content of that repository. + + +The sequence diagram below describes this second step of importing the content +of a repository. Once again, we take the example of a git repository, but any +other type of repository would be very similar. + +.. mermaid:: tasks-git-loader.mmd + + +.. _celery: https://www.celeryproject.org +.. _gitlab: https://gitlab.com diff --git a/docs/getting-started.rst b/docs/getting-started.rst --- a/docs/getting-started.rst +++ b/docs/getting-started.rst @@ -119,7 +119,7 @@ Then you will need a local storage service that will archive and serve source code artifacts via a REST API. The Software Heritage storage layer comes in two -parts: a content-addressable object storage on your file system (for file +parts: a content-addressable :term:`object storage` on your file system (for file contents) and a Postgres database (for the graph structure of the archive). See the :ref:`data-model` for more information. The storage layer is configured via a YAML configuration file, located at @@ -137,13 +137,13 @@ root: /srv/softwareheritage/objects/ slicing: 0:2/2:4 -Make sure that the object storage root exists on the filesystem and is writable +Make sure that the :term:`object storage` root exists on the filesystem and is writable to your user, e.g.:: sudo mkdir -p /srv/softwareheritage/objects sudo chown "${USER}:" /srv/softwareheritage/objects -You are done with object storage setup! Let's setup the database:: +You are done with :term:`object storage` setup!
Let's set up the database:: swh-db-init storage -d softwareheritage-dev diff --git a/docs/glossary.rst b/docs/glossary.rst new file mode 100644 --- /dev/null +++ b/docs/glossary.rst @@ -0,0 +1,158 @@ +:orphan: + +.. _glossary: + +Glossary +======== + +.. glossary:: + + archive + + An instance of the |swh| data store. + + archiver + + A component dedicated to replicating an :term:`archive`. + + ark + + `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is + a multi-purpose persistent identifier for information objects of any type. + + artifact + software artifact + + An artifact is one of many kinds of tangible by-products produced during + the development of software. + + content + blob + + A (specific version of a) file stored in the archive, identified by its + cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also + known as: :term:`blob`. Note: it is incorrect to refer to Contents as + "files", because files are usually considered to be named, whereas + Contents are nameless. It is only in the context of specific + :term:`directories ` that :term:`contents ` acquire + (local) names. + + directory + + A set of named pointers to contents (file entries), directories (directory + entries) and revisions (revision entries). All entries are associated with + the local name of the entry (i.e., a relative path without any path + separator) and permission metadata (e.g., ``chmod`` value or equivalent). + + doi + + A Digital Object Identifier or DOI_ is a persistent identifier or handle + used to uniquely identify objects, standardized by the International + Organization for Standardization (ISO). + + journal + + The journal_ is the persistent logger of the |swh| architecture in charge + of logging changes to the archive, with publish-subscribe_ support. + + lister + + A lister_ is a component of the |swh| architecture that is in charge of + enumerating the :term:`software origin` (e.g., VCS, packages, etc.)
+ available at a source code distribution place. + + loader + + A loader_ is a component of the |swh| architecture responsible for + + hash + cryptographic hash + checksum + digest + + A fixed-size "summary" of a stream of bytes that is easy to compute, and + hard to reverse (see the Wikipedia article on cryptographic hash functions). Also + known as: :term:`checksum`, :term:`digest`. + + indexer + + A component of the |swh| architecture dedicated to producing metadata + linked to the known :term:`blobs ` in the :term:`archive`. + + objstore + objstorage + object store + object storage + + Content-addressable object storage. It is the place where the actual + :term:`blobs ` are stored. + + origin + software origin + data source + + A location from which a coherent set of sources has been obtained, like a + git repository, a directory containing tarballs, etc. + + person + + An entity referenced by a revision as either the author or the committer + of the corresponding change. A person is associated with a full name and/or + an email address. + + release + tag + milestone + + A revision that has been marked as noteworthy with a specific name (e.g., + a version number), together with associated development metadata (e.g., + author, timestamp, etc.). + + revision + commit + changeset + + A point-in-time snapshot of the content of a directory, together with + associated development metadata (e.g., author, timestamp, log message, + etc.). + + scheduler + + The component of the |swh| architecture dedicated to the management and + the prioritization of the many tasks. + + snapshot + + The state of all visible branches during a specific visit of an origin. + + type of origin + + Information about the kind of hosting, e.g., whether it is a forge, a + collection of repositories, a homepage publishing tarballs, or a one-shot + source code repository. For all kinds of repositories, please specify which + VCS is in use (Git, SVN, CVS, etc.).
+ + vault + vault service + + User-facing service that allows retrieving parts of the :term:`archive` + as self-contained bundles (e.g., individual releases, entire repository + snapshots, etc.). + + visit + + The passage of |swh| on a given :term:`origin`, to retrieve all source + code and metadata available there at the time. A visit object stores the + state of all visible branches (if any) available at the origin at visit + time; each of them points to a revision object in the archive. Future + visits of the same origin will create new visit objects, without removing + previous ones. + + + +.. _blob: https://en.wikipedia.org/wiki/Binary_large_object +.. _DOI: https://www.doi.org +.. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers +.. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html +.. _lister: https://docs.softwareheritage.org/devel/swh-lister/index.html +.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern diff --git a/docs/index.rst b/docs/index.rst --- a/docs/index.rst +++ b/docs/index.rst @@ -15,6 +15,13 @@ stack +Architecture +------------ + +* :ref:`architecture` ← go there for a glimpse of the Software Heritage software + architecture + + Components ---------- @@ -116,6 +123,7 @@ * :ref:`modindex` * `URLs index `_ * :ref:`search` +* :ref:`glossary` .. ensure sphinx does not complain about index files not being included @@ -124,5 +132,6 @@ :hidden: :glob: + architecture getting-started swh-*/index diff --git a/docs/swh_substitutions b/docs/swh_substitutions new file mode 100644 --- /dev/null +++ b/docs/swh_substitutions @@ -0,0 +1 @@ +..
|swh| replace:: *Software Heritage* diff --git a/docs/tasks-git-loader.mmd b/docs/tasks-git-loader.mmd new file mode 100644 --- /dev/null +++ b/docs/tasks-git-loader.mmd @@ -0,0 +1,63 @@ +sequenceDiagram + participant SCH_DB as scheduler DB + participant SCH_RUN as scheduler runner + participant SCH_LS as scheduler listener + participant RMQ as Rabbit-MQ + participant OBJSTORE as object storage + participant STORAGE_DB as storage DB + participant STORAGE_API as storage API + participant WORK_GIT as worker@loader-git + participant GIT as git server + + Note over SCH_DB,RMQ: Task T2 created beforehand by the lister-gitlab task + loop Polling + SCH_RUN->>SCH_DB: GET TASK set state=scheduled + SCH_DB-->>SCH_RUN: TASK id=T2 + activate SCH_RUN + SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git + deactivate SCH_RUN + activate RMQ + end + + RMQ->>+WORK_GIT: Start task CT2 + deactivate RMQ + + WORK_GIT->>+STORAGE_API: GET origin state + STORAGE_API-->>-WORK_GIT: 200 + + WORK_GIT->>+GIT: GET refs + GIT->>-WORK_GIT: 200 / refs + + WORK_GIT->>+GIT: GET new_objects + GIT->>-WORK_GIT: 200 / objects + + WORK_GIT->>+GIT: PACKFILE + GIT->>-WORK_GIT: 200 / blobs + + WORK_GIT->>+STORAGE_API: LOAD NEW CONTENT + loop For each blob + STORAGE_API->>OBJSTORE: ADD BLOB + end + STORAGE_API-->>-WORK_GIT: 200 / blobs + + WORK_GIT->>+STORAGE_API: NEW DIR + STORAGE_API->>STORAGE_DB: INSERT DIR + STORAGE_API-->>-WORK_GIT: 201 + + WORK_GIT->>+STORAGE_API: NEW REV + STORAGE_API->>STORAGE_DB: INSERT REV + STORAGE_API-->>-WORK_GIT: 201 + + WORK_GIT->>+STORAGE_API: NEW REL + STORAGE_API->>STORAGE_DB: INSERT REL + STORAGE_API-->>-WORK_GIT: 201 + + WORK_GIT->>+STORAGE_API: NEW SNAPSHOT + STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT + STORAGE_API-->>-WORK_GIT: 201 + + WORK_GIT-->>-RMQ: SET CT2 status=eventful + activate RMQ + RMQ->>+SCH_LS: NOTIFY end of task CT2 + deactivate RMQ + SCH_LS->>-SCH_DB: UPDATE T2 set state=end diff --git a/docs/tasks-lister.mmd b/docs/tasks-lister.mmd new file mode 100644 --- 
/dev/null +++ b/docs/tasks-lister.mmd @@ -0,0 +1,43 @@ +sequenceDiagram + participant WEB as swh-web + participant SCH_API as scheduler API + participant SCH_DB as scheduler DB + participant SCH_RUN as scheduler runner + participant RMQ as Rabbit-MQ + participant SCH_LS as scheduler listener + participant WORK_GITLAB as worker@gitlab-lister + participant GITLAB as gitlab API + participant STORAGE_API as storage API + participant STORAGE_DB as storage DB + + Note over WEB,SCH_API: Save gitlab forge 0xdeadbeef + WEB->>+SCH_API: CREATE TASK lister-gitlab + SCH_API->>+SCH_DB: INSERT TASK + SCH_API-->>-WEB: 201 + loop Polling + SCH_RUN->>SCH_DB: GET TASK set state=scheduled + SCH_DB-->>-SCH_RUN: TASK id=T1 + activate SCH_RUN + SCH_RUN->>RMQ: CREATE Celery Task CT1 + deactivate SCH_RUN + activate RMQ + end + + RMQ->>+WORK_GITLAB: Start task CT1 + deactivate RMQ + WORK_GITLAB->>+GITLAB: Get git repos + GITLAB-->>-WORK_GITLAB: Known git repos + loop For Each Repo + WORK_GITLAB->>+STORAGE_API: CREATE ORIGIN + WORK_GITLAB->>+SCH_API: CREATE TASK loader-git + SCH_API->>SCH_DB: INSERT TASK + SCH_API-->>-WORK_GITLAB: 201 + STORAGE_API->>STORAGE_DB: INSERT ORIGIN + STORAGE_API-->>-WORK_GITLAB: 201 + end + + WORK_GITLAB-->>-RMQ: SET CT1 status=eventful + activate RMQ + RMQ->>+SCH_LS: NOTIFY end of task CT1 + deactivate RMQ + SCH_LS->>-SCH_DB: UPDATE T1 set state=end diff --git a/requirements.txt b/requirements.txt --- a/requirements.txt +++ b/requirements.txt @@ -4,4 +4,5 @@ vcversioner sphinx >= 1.3 sphinxcontrib-httpdomain +sphinxcontrib-mermaid recommonmark diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py --- a/swh/docs/sphinx/conf.py +++ b/swh/docs/sphinx/conf.py @@ -19,7 +19,9 @@ 'sphinx.ext.napoleon', # 'sphinx.ext.intersphinx', 'sphinxcontrib.httpdomain', - 'sphinx.ext.extlinks'] + 'sphinx.ext.extlinks', + 'sphinxcontrib.mermaid', + ] # Add any paths that contain templates here, relative to this directory. 
templates_path = ['_templates'] @@ -37,6 +39,12 @@ # The master toctree document. master_doc = 'index' +# A string of reStructuredText that will be included at the beginning of every +# source file that is read. +rst_prolog = ''' +.. include:: /swh_substitutions +''' + # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents.
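
Reviewer note: the lister step described in docs/architecture.rst and drawn in docs/tasks-lister.mmd can be sketched in plain Python. This is only a minimal in-memory simulation under stated assumptions, not Software Heritage code: the names ``Scheduler``, ``Storage``, ``run_lister`` and the example repository URLs are hypothetical, and no Celery, RabbitMQ, or real scheduler or storage API is involved. It only illustrates that, for each repository found, the lister creates one origin and one ``loader-git`` task.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Hypothetical stand-in for the scheduler API and its database."""
    tasks: list = field(default_factory=list)

    def create_task(self, task_type, url):
        # Record one task per call; ids are simple sequence numbers.
        task = {"id": len(self.tasks) + 1, "type": task_type, "url": url}
        self.tasks.append(task)
        return task


@dataclass
class Storage:
    """Hypothetical stand-in for the storage API and its database."""
    origins: list = field(default_factory=list)

    def create_origin(self, url):
        # Keep one origin per repository URL.
        if url not in self.origins:
            self.origins.append(url)


def run_lister(repo_urls, storage, scheduler):
    """For each listed repository, add an origin and schedule a loader task."""
    for url in repo_urls:
        storage.create_origin(url)
        scheduler.create_task("loader-git", url)


storage = Storage()
scheduler = Scheduler()
# Hypothetical repository URLs, standing in for what the forge API returns.
repos = ["https://gitlab.example/a.git", "https://gitlab.example/b.git"]
run_lister(repos, storage, scheduler)
```

After the call, ``storage.origins`` holds one entry per repository and ``scheduler.tasks`` holds one ``loader-git`` task per repository, mirroring the "For Each Repo" loop of the lister diagram.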