D668.id2125.diff
No OneTemporary

Size

14 KB

Subscribers

None


	diff --git a/docs/architecture.rst b/docs/architecture.rst
	new file mode 100644
	--- /dev/null
	+++ b/docs/architecture.rst
	@@ -0,0 +1,74 @@
	+.. _architecture:
	+
	+Software Architecture
	+=====================
	+
	+From an end-user point of view, the |swh| platform consists of the
	+:term:`archive`, which can be accessed using the web interface or its REST API.
	+Behind the scenes (and the web app) are several components that expose
	+different aspects of the |swh| :term:`archive` as internal REST APIs.
	+
	+Each of these internal APIs has a dedicated (PostgreSQL) database.
	+
	+A global view of this architecture looks like:
	+
	+.. figure:: images/general-architecture.*
	+
	+ General view of the |swh| architecture.
	+
	+The front API components are:
	+
	+- :ref:`Storage API <swh-storage>`
	+- :ref:`Deposit API <swh-deposit>`
	+- :ref:`Vault API <swh-vault>`
	+- :ref:`Indexer API <swh-indexer>`
	+- :ref:`Scheduler API <swh-scheduler>`
	+
	+In the back end, a celery_-based set of tasks and workers performs all the
	+work required to fill, maintain and update the |swh|
	+:term:`archive`.
	+
	+The main components involved in this choreography are:
	+
	+- :term:`Listers <lister>`: a lister is a type of task that scrapes a
	+ web site, a forge, etc. to gather all the source code repositories it can
	+ find. For each source code repository found, a :term:`loader` task is
	+ created.
	+
	+- :term:`Loaders <loader>`: a loader is a type of task aiming at importing or
	+ updating a source code repository. It is the one that inserts :term:`blob`
	+ objects in the :term:`object storage`, and inserts nodes and edges in the
	+ :ref:`graph <swh-merkle-dag>`.
	+
	+- :term:`Indexers <indexer>`: an indexer is a type of task aiming at crawling
	+ the content of the :term:`archive` to extract derived information (mimetype,
	+ etc.).
	+
	+
	+Tasks
	+-----
	+
	+The following sequence diagram shows the interactions between these components
	+when a new forge needs to be archived. This example depicts the case of a
	+gitlab_ forge, but any other supported source type would be very similar.
	+
	+.. mermaid:: tasks-lister.mmd
	+
	+As the diagram shows, the lister task does two things:
	+
	+- it adds one :term:`origin` object to the :term:`storage` database for each
	+ source code repository, and
	+
	+- it inserts one :term:`loader` task for each source code repository, which will
	+ be in charge of importing the content of that repository.
	+
	+
	+The sequence diagram below describes this second step of importing the content
	+of a repository. Once again, we take the example of a git repository, but any
	+other type of repository would be very similar.
	+
	+.. mermaid:: tasks-git-loader.mmd
	+
	+
	+.. _celery: https://www.celeryproject.org
	+.. _gitlab: https://gitlab.com
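Note on ``docs/architecture.rst``: the lister behaviour described in the "Tasks" section (one origin and one loader task per repository found) can be sketched as follows. This is only an illustration under stated assumptions: ``FakeStorage``, ``FakeScheduler`` and ``list_forge`` are hypothetical in-memory stand-ins invented for this note, not the actual Software Heritage storage or scheduler APIs, and the example URLs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStorage:
    """Hypothetical stand-in for the storage API (not the real swh API)."""
    origins: list = field(default_factory=list)

    def create_origin(self, url: str) -> None:
        # Corresponds to "CREATE ORIGIN" / "INSERT ORIGIN" in the diagram.
        self.origins.append({"type": "git", "url": url})


@dataclass
class FakeScheduler:
    """Hypothetical stand-in for the scheduler API (not the real swh API)."""
    tasks: list = field(default_factory=list)

    def create_task(self, task_type: str, url: str) -> None:
        # Corresponds to "CREATE TASK loader-git" / "INSERT TASK".
        self.tasks.append({"type": task_type, "url": url})


def list_forge(repo_urls, storage, scheduler):
    """For each repository found, record an origin and queue a loader task."""
    for url in repo_urls:
        storage.create_origin(url)
        scheduler.create_task("loader-git", url)


storage, scheduler = FakeStorage(), FakeScheduler()
list_forge(
    ["https://gitlab.example/a.git", "https://gitlab.example/b.git"],
    storage,
    scheduler,
)
```

After the call, ``storage.origins`` holds one entry per repository and ``scheduler.tasks`` holds one ``loader-git`` task per repository, mirroring the "For Each Repo" loop in ``tasks-lister.mmd``.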
	diff --git a/docs/getting-started.rst b/docs/getting-started.rst
	--- a/docs/getting-started.rst
	+++ b/docs/getting-started.rst
	@@ -119,7 +119,7 @@

	Then you will need a local storage service that will archive and serve source
	code artifacts via a REST API. The Software Heritage storage layer comes in two
	-parts: a content-addressable object storage on your file system (for file
	+parts: a content-addressable :term:`object storage` on your file system (for file
	contents) and a Postgres database (for the graph structure of the archive). See
	the :ref:`data-model` for more information. The storage layer is configured via
	a YAML configuration file, located at
	@@ -137,13 +137,13 @@
	root: /srv/softwareheritage/objects/
	slicing: 0:2/2:4

	-Make sure that the object storage root exists on the filesystem and is writable
	+Make sure that the :term:`object storage` root exists on the filesystem and is writable
	to your user, e.g.::

	sudo mkdir -p /srv/softwareheritage/objects
	sudo chown "${USER}:" /srv/softwareheritage/objects

	-You are done with object storage setup! Let's setup the database::
	+You are done with :term:`object storage` setup! Let's set up the database::

	swh-db-init storage -d softwareheritage-dev

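Note on the ``slicing: 0:2/2:4`` setting shown in ``docs/getting-started.rst``: the sketch below shows one plausible reading of it, where each ``start:end`` pair selects a substring of the object's hexadecimal identifier to use as a directory level, and the full identifier is the file name. That layout is an assumption made for illustration, as is the use of a plain SHA1 as identifier; ``parse_slicing`` and ``object_path`` are hypothetical helpers, not functions from the Software Heritage code base.

```python
import hashlib
import posixpath


def parse_slicing(spec: str):
    """Turn a spec such as '0:2/2:4' into [slice(0, 2), slice(2, 4)]."""
    bounds = []
    for part in spec.split("/"):
        start, end = part.split(":")
        bounds.append(slice(int(start), int(end)))
    return bounds


def object_path(root: str, hex_id: str, spec: str) -> str:
    """Assumed layout: one directory per slice, then the full id as file name."""
    dirs = [hex_id[s] for s in parse_slicing(spec)]
    return posixpath.join(root, *dirs, hex_id)


# Identifier of an empty content, assuming a plain SHA1 is used as object id.
hex_id = hashlib.sha1(b"").hexdigest()
path = object_path("/srv/softwareheritage/objects/", hex_id, "0:2/2:4")
```

Spreading objects over nested directories this way keeps any single directory from growing too large on the file system.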
	diff --git a/docs/glossary.rst b/docs/glossary.rst
	new file mode 100644
	--- /dev/null
	+++ b/docs/glossary.rst
	@@ -0,0 +1,158 @@
	+:orphan:
	+
	+.. _glossary:
	+
	+Glossary
	+========
	+
	+.. glossary::
	+
	+ archive
	+
	+ An instance of the |swh| data store.
	+
	+ archiver
	+
	+ A component dedicated to replicating an :term:`archive`.
	+
	+ ark
	+
	+ `Archival Resource Key`_ (ARK) is a Uniform Resource Locator (URL) that is
	+ a multi-purpose persistent identifier for information objects of any type.
	+
	+ artifact
	+ software artifact
	+
	+ An artifact is one of many kinds of tangible by-products produced during
	+ the development of software.
	+
	+ content
	+ blob
	+
	+ A (specific version of a) file stored in the archive, identified by its
	+ cryptographic hashes (SHA1, "git-like" SHA1, SHA256) and its size. Also
	+ known as: :term:`blob`. Note: it is incorrect to refer to Contents as
	+ "files", because files are usually considered to be named, whereas
	+ Contents are nameless. It is only in the context of specific
	+ :term:`directories <directory>` that :term:`contents <content>` acquire
	+ (local) names.
	+
	+ directory
	+
	+ A set of named pointers to contents (file entries), directories (directory
	+ entries) and revisions (revision entries). All entries are associated with
	+ the local name of the entry (i.e., a relative path without any path
	+ separator) and permission metadata (e.g., ``chmod`` value or equivalent).
	+
	+ doi
	+
	+ A Digital Object Identifier or DOI_ is a persistent identifier or handle
	+ used to uniquely identify objects, standardized by the International
	+ Organization for Standardization (ISO).
	+
	+ journal
	+
	+ The journal is the persistent logger of the |swh| architecture in charge
	+ of logging changes to the archive, with publish-subscribe_ support.
	+
	+ lister
	+
	+ A lister_ is a component of the |swh| architecture that is in charge of
	+ enumerating the :term:`software origins <software origin>` (e.g., VCS, packages, etc.)
	+ available at a source code distribution place.
	+
	+ loader
	+
	+ A loader is a component of the |swh| architecture responsible for importing or updating source code repositories.
	+
	+ hash
	+ cryptographic hash
	+ checksum
	+ digest
	+
	+ A fixed-size "summary" of a stream of bytes that is easy to compute, and
	+ hard to reverse (see the Wikipedia article on cryptographic hash functions). Also
	+ known as: :term:`checksum`, :term:`digest`.
	+
	+ indexer
	+
	+ A component of the |swh| architecture dedicated to producing metadata
	+ linked to the known :term:`blobs <blob>` in the :term:`archive`.
	+
	+ objstore
	+ objstorage
	+ object store
	+ object storage
	+
	+ Content-addressable object storage. It is the place where the actual
	+ :term:`blobs <blob>` are stored.
	+
	+ origin
	+ software origin
	+ data source
	+
	+ A location from which a coherent set of sources has been obtained, like a
	+ git repository, a directory containing tarballs, etc.
	+
	+ person
	+
	+ An entity referenced by a revision as either the author or the committer
	+ of the corresponding change. A person is associated with a full name and/or
	+ an email address.
	+
	+ release
	+ tag
	+ milestone
	+
	+ A revision that has been marked as noteworthy with a specific name (e.g.,
	+ a version number), together with associated development metadata (e.g.,
	+ author, timestamp, etc.).
	+
	+ revision
	+ commit
	+ changeset
	+
	+ A point in time snapshot of the content of a directory, together with
	+ associated development metadata (e.g., author, timestamp, log message,
	+ etc.).
	+
	+ scheduler
	+
	+ The component of the |swh| architecture dedicated to the management and
	+ the prioritization of the many tasks.
	+
	+ snapshot
	+
	+ The state of all visible branches during a specific visit of an origin.
	+
	+ type of origin
	+
	+ Information about the kind of hosting, e.g., whether it is a forge, a
	+ collection of repositories, a homepage publishing tarballs, or a one-shot
	+ source code repository. For all kinds of repositories, please specify which
	+ VCS is in use (Git, SVN, CVS, etc.).
	+
	+ vault
	+ vault service
	+
	+ User-facing service that allows retrieving parts of the :term:`archive`
	+ as self-contained bundles (e.g., individual releases, entire repository
	+ snapshots, etc.).
	+
	+ visit
	+
	+ The passage of |swh| on a given :term:`origin`, to retrieve all source
	+ code and metadata available there at the time. A visit object stores the
	+ state of all visible branches (if any) available at the origin at visit
	+ time; each of them points to a revision object in the archive. Future
	+ visits of the same origin will create new visit objects, without removing
	+ previous ones.
	+
	+
	+
	+.. _blob: https://en.wikipedia.org/wiki/Binary_large_object
	+.. _DOI: https://www.doi.org
	+.. _`persistent identifier`: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
	+.. _`Archival Resource Key`: http://n2t.net/e/ark_ids.html
	+.. _lister: https://docs.softwareheritage.org/devel/swh-lister/index.html
	+.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
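Note on the ``content`` glossary entry in ``docs/glossary.rst``: the identifiers it lists (SHA1, "git-like" SHA1, SHA256 and size) can be computed with the standard library alone. The sketch below assumes that the "git-like" SHA1 is the hash Git uses for blobs, i.e. the SHA1 of the header ``blob <length>\0`` followed by the data; ``content_hashes`` and its dictionary keys are illustrative names, not part of the Software Heritage API.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute the identifiers of a content from its raw bytes."""
    # Git hashes a blob as: b"blob " + decimal length + NUL byte + data.
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "length": len(data),
    }


hashes = content_hashes(b"")
```

Note that the function only takes bytes: no file name is involved, which matches the glossary's point that contents are nameless.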
	diff --git a/docs/index.rst b/docs/index.rst
	--- a/docs/index.rst
	+++ b/docs/index.rst
	@@ -15,6 +15,13 @@
	stack


	+Architecture
	+------------
	+
	+* :ref:`architecture` ← go there for a glimpse of the Software Heritage software
	+ architecture
	+
	+
	Components
	----------

	@@ -116,6 +123,7 @@
	* :ref:`modindex`
	* `URLs index <http-routingtable.html>`_
	* :ref:`search`
	+* :ref:`glossary`


	.. ensure sphinx does not complain about index files not being included
	@@ -124,5 +132,6 @@
	:hidden:
	:glob:

	+ architecture
	getting-started
	swh-*/index
	diff --git a/docs/swh_substitutions b/docs/swh_substitutions
	new file mode 100644
	--- /dev/null
	+++ b/docs/swh_substitutions
	@@ -0,0 +1 @@
	+.. |swh| replace:: Software Heritage
	diff --git a/docs/tasks-git-loader.mmd b/docs/tasks-git-loader.mmd
	new file mode 100644
	--- /dev/null
	+++ b/docs/tasks-git-loader.mmd
	@@ -0,0 +1,63 @@
	+sequenceDiagram
	+ participant SCH_DB as scheduler DB
	+ participant SCH_RUN as scheduler runner
	+ participant SCH_LS as scheduler listener
	+ participant RMQ as Rabbit-MQ
	+ participant OBJSTORE as object storage
	+ participant STORAGE_DB as storage DB
	+ participant STORAGE_API as storage API
	+ participant WORK_GIT as worker@loader-git
	+ participant GIT as git server
	+
	+ Note over SCH_DB,RMQ: Task T2 created beforehand by the lister-gitlab task
	+ loop Polling
	+ SCH_RUN->>SCH_DB: GET TASK set state=scheduled
	+ SCH_DB-->>SCH_RUN: TASK id=T2
	+ activate SCH_RUN
	+ SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git
	+ deactivate SCH_RUN
	+ activate RMQ
	+ end
	+
	+ RMQ->>+WORK_GIT: Start task CT2
	+ deactivate RMQ
	+
	+ WORK_GIT->>+STORAGE_API: GET origin state
	+ STORAGE_API-->>-WORK_GIT: 200
	+
	+ WORK_GIT->>+GIT: GET refs
	+ GIT->>-WORK_GIT: 200 / refs
	+
	+ WORK_GIT->>+GIT: GET new_objects
	+ GIT->>-WORK_GIT: 200 / objects
	+
	+ WORK_GIT->>+GIT: PACKFILE
	+ GIT->>-WORK_GIT: 200 / blobs
	+
	+ WORK_GIT->>+STORAGE_API: LOAD NEW CONTENT
	+ loop For each blob
	+ STORAGE_API->>OBJSTORE: ADD BLOB
	+ end
	+ STORAGE_API-->>-WORK_GIT: 200 / blobs
	+
	+ WORK_GIT->>+STORAGE_API: NEW DIR
	+ STORAGE_API->>STORAGE_DB: INSERT DIR
	+ STORAGE_API-->>-WORK_GIT: 201
	+
	+ WORK_GIT->>+STORAGE_API: NEW REV
	+ STORAGE_API->>STORAGE_DB: INSERT REV
	+ STORAGE_API-->>-WORK_GIT: 201
	+
	+ WORK_GIT->>+STORAGE_API: NEW REL
	+ STORAGE_API->>STORAGE_DB: INSERT REL
	+ STORAGE_API-->>-WORK_GIT: 201
	+
	+ WORK_GIT->>+STORAGE_API: NEW SNAPSHOT
	+ STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT
	+ STORAGE_API-->>-WORK_GIT: 201
	+
	+ WORK_GIT-->>-RMQ: SET CT2 status=eventful
	+ activate RMQ
	+ RMQ->>+SCH_LS: NOTIFY end of task CT2
	+ deactivate RMQ
	+ SCH_LS->>-SCH_DB: UPDATE T2 set state=end
	diff --git a/docs/tasks-lister.mmd b/docs/tasks-lister.mmd
	new file mode 100644
	--- /dev/null
	+++ b/docs/tasks-lister.mmd
	@@ -0,0 +1,43 @@
	+sequenceDiagram
	+ participant WEB as swh-web
	+ participant SCH_API as scheduler API
	+ participant SCH_DB as scheduler DB
	+ participant SCH_RUN as scheduler runner
	+ participant RMQ as Rabbit-MQ
	+ participant SCH_LS as scheduler listener
	+ participant WORK_GITLAB as worker@gitlab-lister
	+ participant GITLAB as gitlab API
	+ participant STORAGE_API as storage API
	+ participant STORAGE_DB as storage DB
	+
	+ Note over WEB,SCH_API: Save gitlab forge 0xdeadbeef
	+ WEB->>+SCH_API: CREATE TASK lister-gitlab
	+ SCH_API->>+SCH_DB: INSERT TASK
	+ SCH_API-->>-WEB: 201
	+ loop Polling
	+ SCH_RUN->>SCH_DB: GET TASK set state=scheduled
	+ SCH_DB-->>-SCH_RUN: TASK id=T1
	+ activate SCH_RUN
	+ SCH_RUN->>RMQ: CREATE Celery Task CT1
	+ deactivate SCH_RUN
	+ activate RMQ
	+ end
	+
	+ RMQ->>+WORK_GITLAB: Start task CT1
	+ deactivate RMQ
	+ WORK_GITLAB->>+GITLAB: Get git repos
	+ GITLAB-->>-WORK_GITLAB: Known git repos
	+ loop For Each Repo
	+ WORK_GITLAB->>+STORAGE_API: CREATE ORIGIN
	+ WORK_GITLAB->>+SCH_API: CREATE TASK loader-git
	+ SCH_API->>SCH_DB: INSERT TASK
	+ SCH_API-->>-WORK_GITLAB: 201
	+ STORAGE_API->>STORAGE_DB: INSERT ORIGIN
	+ STORAGE_API-->>-WORK_GITLAB: 201
	+ end
	+
	+ WORK_GITLAB-->>-RMQ: SET CT1 status=eventful
	+ activate RMQ
	+ RMQ->>+SCH_LS: NOTIFY end of task CT1
	+ deactivate RMQ
	+ SCH_LS->>-SCH_DB: UPDATE T1 set state=end
	diff --git a/requirements.txt b/requirements.txt
	--- a/requirements.txt
	+++ b/requirements.txt
	@@ -4,4 +4,5 @@
	vcversioner
	sphinx >= 1.3
	sphinxcontrib-httpdomain
	+sphinxcontrib-mermaid
	recommonmark
	diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py
	--- a/swh/docs/sphinx/conf.py
	+++ b/swh/docs/sphinx/conf.py
	@@ -19,7 +19,9 @@
	'sphinx.ext.napoleon',
	# 'sphinx.ext.intersphinx',
	'sphinxcontrib.httpdomain',
	- 'sphinx.ext.extlinks']
	+ 'sphinx.ext.extlinks',
	+ 'sphinxcontrib.mermaid',
	+ ]

	# Add any paths that contain templates here, relative to this directory.
	templates_path = ['_templates']
	@@ -37,6 +39,12 @@
	# The master toctree document.
	master_doc = 'index'

	+# A string of reStructuredText that will be included at the beginning of every
	+# source file that is read.
	+rst_prolog = '''
	+.. include:: /swh_substitutions
	+'''
	+
	# The version info for the project you're documenting, acts as replacement for
	# |version| and |release|, also used in various other places throughout the
	# built documents.

File Metadata

Mime Type: text/plain
Expires: Wed, Jul 2, 10:41 AM (2 w, 3 d ago)
Storage Engine: blob
Storage Format: Raw Data
Storage Handle: 3218991
