diff --git a/README.md b/README.md index f3a8cd7..072ceed 100644 --- a/README.md +++ b/README.md @@ -1,75 +1,86 @@ swh-docs ======== This module contains (the logic for generating) the Software Heritage development documentation. Specifically, it contains some general information about Software Heritage internals (stuff that would not fit in any other specific software component of the Software Heritage stack) and bundles it together with component-specific documentation coming from other modules of the stack. All documentation is written and typeset using [Sphinx][1]. General documentation is shipped as part of this module. Module-specific documentation is centralized here via symlinks to the `docs/` dirs of individual modules. Therefore to build the full documentation you need a working and complete [Software Heritage development environment][2]. [1]: http://www.sphinx-doc.org/ [2]: https://forge.softwareheritage.org/source/swh-environment/ How to build the doc -------------------- +Ensure you have the required tools to generate images ([graphviz][3]'s `dot` +and [plantuml][4]). On a Debian system: + + $ sudo apt install plantuml graphviz + +[3]: https://graphviz.org +[4]: http://plantuml.com + + +Then: + $ cd docs $ make html Behind the scenes, this will do two things: ### 1. Generate sphinx-apidoc rst documents for all modules $ cd swh-environment $ make docs-apidoc This will *not* build the documentation in each module (there is `make docs` for that), but will use `sphinx-apidoc` to generate documentation indexes for each (sub)module in the various Software Heritage components. As `sphinx-apidoc` refuses to overwrite old documents, before proceeding you might need to clean up old cruft with: $ cd swh-environment $ make docs-clean ### 2. Build the documentation $ cd swh-docs/docs $ make The HTML documentation is now available starting from `_build/html/index.html`.
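The prerequisite named at the top of this section (graphviz's `dot` and plantuml) can be checked before running `make html`. The following is only a sketch, not part of this change: the helper name `missing_tools` and the `REQUIRED_TOOLS` tuple are hypothetical, and the check assumes both tools are installed as executables on `PATH` under exactly those names.

```python
import shutil

# Executables the README asks for; "dot" is shipped by graphviz.
# The tuple and the function name are illustrative, not part of swh-docs.
REQUIRED_TOOLS = ("dot", "plantuml")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing tools:", ", ".join(missing))
    else:
        print("all image-generation tools found")
```

The `which` parameter is injectable only so the lookup can be exercised without touching the real `PATH`.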
Cleaning up ----------- $ cd docs $ make distclean The former (`make clean`) will only clean the local Sphinx build, without touching other modules. The latter (`make distclean`) will also clean Sphinx builds in all other modules. Publishing the doc ------------------ $ cd docs $ make install $ xdg-open https://docs.softwareheritage.org/devel/ For the above to work you need to have ssh access to the hosting machine (currently `pergamon`), and write access to the document root directory of that virtual host (currently granted to all members of the `swhdev` UNIX group on Software Heritage machines). diff --git a/docs/architecture.rst b/docs/architecture.rst new file mode 100644 index 0000000..81b4d86 --- /dev/null +++ b/docs/architecture.rst @@ -0,0 +1,74 @@ +.. _architecture: + +Software Architecture +===================== + +From an end-user point of view, the |swh| platform consists of the +:term:`archive`, which can be accessed using the web interface or its REST API. +Behind the scenes (and the web app) are several components that expose +different aspects of the |swh| :term:`archive` as internal REST APIs. + +Each of these internal APIs has a dedicated (PostgreSQL) database. + +A global view of this architecture looks like: + +.. thumbnail:: images/general-architecture.svg + + General view of the |swh| architecture. + +The front API components are: + +- :ref:`Storage API ` +- :ref:`Deposit API ` +- :ref:`Vault API ` +- :ref:`Indexer API ` +- :ref:`Scheduler API ` + +On the back stage of this show, a celery_ based game of tasks and workers +takes place to perform all the work required to fill, maintain and update the +|swh| :term:`archive`. + +The main components involved in this choreography are: + +- :term:`Listers `: a lister is a type of task aiming at scraping a + web site, a forge, etc. to gather all the source code repositories it can + find. For each source code repository found, a :term:`loader` task is + created.
+ +- :term:`Loaders `: a loader is a type of task aiming at importing or + updating a source code repository. It is the one that inserts :term:`blob` + objects in the :term:`object storage`, and inserts nodes and edges in the + :ref:`graph `. + +- :term:`Indexers `: an indexer is a type of task aiming at crawling + the content of the :term:`archive` to extract derived information (mimetype, + etc.) + + +Tasks +----- + +The following sequence diagram shows the interactions between these components +when a new forge needs to be archived. This example depicts the case of a +gitlab_ forge, but any other supported source type would be very similar. + +.. thumbnail:: images/tasks-lister.svg + +As can be seen in this diagram, the lister task does two things: + +- it adds one :term:`origin` object in the :term:`storage` database for each + source code repository, and + +- it inserts one :term:`loader` task for each source code repository, which + will be in charge of importing the content of that repository. + + +The sequence diagram below describes this second step of importing the content +of a repository. Once again, we take the example of a git repository, but any +other type of repository would be very similar. + +.. thumbnail:: images/tasks-git-loader.svg + + +.. _celery: https://www.celeryproject.org +.. 
_gitlab: https://gitlab.com diff --git a/docs/images/Makefile b/docs/images/Makefile index 8a4e40c..abb60b3 100644 --- a/docs/images/Makefile +++ b/docs/images/Makefile @@ -1,27 +1,33 @@ PY_REQUIREMENTS = $(wildcard ../../../*/requirements*.txt) DEP_GRAPHS_base = py-deps-all py-deps-swh py-deps-ext DEP_GRAPHS += $(patsubst %,%.dot,$(DEP_GRAPHS_base)) DEP_GRAPHS += $(patsubst %,%.pdf,$(DEP_GRAPHS_base)) DEP_GRAPHS += $(patsubst %,%.svg,$(DEP_GRAPHS_base)) PY_DEPGRAPH = ../bin/py-depgraph -all: $(DEP_GRAPHS) +UML_DIAGS_SRC = $(wildcard *.uml) +UML_DIAGS = $(patsubst %.uml,%.svg,$(UML_DIAGS_SRC)) + +all: $(DEP_GRAPHS) $(UML_DIAGS) py-deps-all.dot: $(PY_DEPGRAPH) $(PY_REQUIREMENTS) cd ../../.. ; $(CURDIR)/$(PY_DEPGRAPH) > $(CURDIR)/$@ py-deps-swh.dot: $(PY_DEPGRAPH) $(PY_REQUIREMENTS) cd ../../.. ; $(CURDIR)/$(PY_DEPGRAPH) --no-external > $(CURDIR)/$@ py-deps-ext.dot: $(PY_DEPGRAPH) $(PY_REQUIREMENTS) cd ../../.. ; $(CURDIR)/$(PY_DEPGRAPH) --no-internal > $(CURDIR)/$@ %.pdf: %.dot dot -T pdf $< > $@ %.svg: %.dot dot -T svg $< > $@ +%.svg: %.uml + plantuml -tsvg $< + clean: - -rm -f $(DEP_GRAPHS) + -rm -f $(DEP_GRAPHS) $(UML_DIAGS) diff --git a/docs/images/general-architecture.svg b/docs/images/general-architecture.svg new file mode 100644 index 0000000..635043f --- /dev/null +++ b/docs/images/general-architecture.svg @@ -0,0 +1,3374 @@ +[SVG markup of the general architecture diagram not reproduced here. Text labels in the diagram: Web App; Scheduler API; Deposit API; Vault API; Indexer API; Storage API; ObjStorage API; Object Storage; Journal; Lister; Celery Broker; Scheduler listener; Scheduler runner; Loader; Indexer; workers.]
diff --git a/docs/images/tasks-git-loader.uml b/docs/images/tasks-git-loader.uml new file mode 100644 index 0000000..ffa45a8 --- /dev/null +++ b/docs/images/tasks-git-loader.uml @@ -0,0 +1,94 @@ +@startuml + participant SCH_DB as "scheduler DB" #B0C4DE + participant SCH_RUN as "scheduler runner" + participant SCH_LS as "scheduler listener" + participant RMQ as "Rabbit-MQ" + participant OBJSTORE as "object storage" + participant STORAGE_DB as "storage DB" #B0C4DE + participant STORAGE_API as "storage API" + participant WORK_GIT as "worker@loader-git" + participant GIT as "git server" + + Note over SCH_DB,SCH_RUN: Task T2 created beforehand \n by the lister-gitlab task + loop Polling + SCH_RUN->>SCH_DB: GET TASK set state=scheduled + SCH_DB-->>SCH_RUN: TASK id=T2 + activate SCH_RUN + SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git + deactivate SCH_RUN + activate RMQ + end + + RMQ->>WORK_GIT: Start task CT2 + deactivate RMQ + activate WORK_GIT + + WORK_GIT->>STORAGE_API: GET origin state + activate STORAGE_API + 
STORAGE_API-->>WORK_GIT: 200 + deactivate STORAGE_API + + WORK_GIT->>GIT: GET refs + activate GIT + GIT->>WORK_GIT: 200 / refs + deactivate GIT + + WORK_GIT->>GIT: GET new_objects + activate GIT + GIT->>WORK_GIT: 200 / objects + deactivate GIT + + WORK_GIT->>GIT: PACKFILE + activate GIT + GIT->>WORK_GIT: 200 / blobs + deactivate GIT + + WORK_GIT->>STORAGE_API: LOAD NEW CONTENT + activate STORAGE_API + loop For each blob + STORAGE_API->>OBJSTORE: ADD BLOB + end + STORAGE_API-->>WORK_GIT: 200 / blobs + deactivate STORAGE_API + + WORK_GIT->>STORAGE_API: NEW DIR + activate STORAGE_API + loop For each DIR + STORAGE_API->>STORAGE_DB: INSERT DIR + end + STORAGE_API-->>WORK_GIT: 201 + deactivate STORAGE_API + + WORK_GIT->>STORAGE_API: NEW REV + activate STORAGE_API + loop For each REV + STORAGE_API->>STORAGE_DB: INSERT REV + end + STORAGE_API-->>WORK_GIT: 201 + deactivate STORAGE_API + + WORK_GIT->>STORAGE_API: NEW REL + activate STORAGE_API + loop For each REL + STORAGE_API->>STORAGE_DB: INSERT REL + end + STORAGE_API-->>WORK_GIT: 201 + deactivate STORAGE_API + + WORK_GIT->>STORAGE_API: NEW SNAPSHOT + activate STORAGE_API + loop For each SNAPSHOT + STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT + end + STORAGE_API-->>WORK_GIT: 201 + deactivate STORAGE_API + + WORK_GIT-->>RMQ: SET CT2 status=eventful + deactivate WORK_GIT + activate RMQ + RMQ->>SCH_LS: NOTIFY end of task CT2 + deactivate RMQ + activate SCH_LS + SCH_LS->>SCH_DB: UPDATE T2 set state=end + deactivate SCH_LS +@enduml diff --git a/docs/images/tasks-lister.uml b/docs/images/tasks-lister.uml new file mode 100644 index 0000000..7d1a953 --- /dev/null +++ b/docs/images/tasks-lister.uml @@ -0,0 +1,61 @@ +@startuml + participant WEB as "swh-web" + participant SCH_API as "scheduler API" #ECECFF + participant SCH_DB as "scheduler DB" #B0C4DE + participant SCH_RUN as "scheduler runner" + participant RMQ as "Rabbit-MQ" + participant SCH_LS as "scheduler listener" + participant WORK_GITLAB as "worker@gitlab-lister" + 
participant GITLAB as "gitlab API" + participant STORAGE_API as "storage API" #ECECFF + participant STORAGE_DB as "storage DB" #B0C4DE + + Note over WEB,SCH_API: Save gitlab forge 0xdeadbeef + WEB->>SCH_API: CREATE TASK lister-gitlab + activate WEB + activate SCH_API + SCH_API->>SCH_DB: INSERT TASK + activate SCH_DB + SCH_API-->>WEB: 201 + deactivate SCH_API + deactivate WEB + loop Polling + SCH_RUN->>SCH_DB: GET TASK set state=scheduled + SCH_DB-->>SCH_RUN: TASK id=T1 + deactivate SCH_DB + activate SCH_RUN + SCH_RUN->>RMQ: CREATE Celery Task CT1 + deactivate SCH_RUN + activate RMQ + end + + RMQ->>WORK_GITLAB: Start task CT1 + deactivate RMQ + activate WORK_GITLAB + WORK_GITLAB->>GITLAB: Get git repos + activate GITLAB + GITLAB-->>WORK_GITLAB: Known git repos + deactivate GITLAB + + loop For Each Repo + WORK_GITLAB->>STORAGE_API: CREATE ORIGIN + activate STORAGE_API + WORK_GITLAB->>SCH_API: CREATE TASK loader-git + activate SCH_API + STORAGE_API->>STORAGE_DB: INSERT ORIGIN + STORAGE_API-->>WORK_GITLAB: 201 + deactivate STORAGE_API + SCH_API->>SCH_DB: INSERT TASK + SCH_API-->>WORK_GITLAB: 201 + deactivate SCH_API + end + + WORK_GITLAB-->>RMQ: SET CT1 status=eventful + deactivate WORK_GITLAB + activate RMQ + RMQ->>SCH_LS: NOTIFY end of task CT1 + activate SCH_LS + deactivate RMQ + SCH_LS->>SCH_DB: UPDATE T1 set state=end + deactivate SCH_LS +@enduml diff --git a/docs/index.rst b/docs/index.rst index b572672..6ed977c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,129 +1,137 @@ .. _swh-docs: Software Heritage - Development Documentation ============================================= .. 
toctree:: :maxdepth: 2 :caption: Contents: Getting started --------------- * :ref:`getting-started` ← start here to hack on the Software Heritage software stack +Architecture +------------ + +* :ref:`architecture` ← go there for a glimpse of the Software Heritage software + architecture + + Components ---------- Here is a brief overview of the most relevant software components in the Software Heritage stack. Each component name is linked to the development documentation of the corresponding Python module. :ref:`swh.archiver ` orchestrator in charge of guaranteeing that object storage content is pristine and available in a sufficient number of copies :ref:`swh.core ` low-level utilities and helpers used by almost all other modules in the stack :ref:`swh.deposit ` push-based deposit of software artifacts to the archive swh.docs developer documentation (used to generate this doc you are reading) :ref:`swh.indexer ` tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it :ref:`swh.journal ` persistent logger of changes to the archive, with publish-subscribe support :ref:`swh.lister ` collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.) 
:ref:`swh.loader-core ` low-level loading utilities and helpers used by all other loaders :ref:`swh.loader-debian ` loader for `Debian `_ source packages :ref:`swh.loader-dir ` loader for source directories (e.g., expanded tarballs) :ref:`swh.loader-git ` loader for `Git `_ repositories :ref:`swh.loader-mercurial ` loader for `Mercurial `_ repositories :ref:`swh.loader-pypi ` loader for `PyPI `_ source code releases :ref:`swh.loader-svn ` loader for `Subversion `_ repositories :ref:`swh.loader-tar ` loader for source tarballs (including Tar, ZIP and other archive formats) :ref:`swh.model ` implementation of the :ref:`data-model` to archive source code artifacts :ref:`swh.objstorage ` content-addressable object storage :ref:`swh.scheduler ` task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package) :ref:`swh.storage ` abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata :ref:`swh.vault ` implementation of the vault service, allowing to retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.) :ref:`swh.web ` Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use Dependencies ------------ The dependency relationships among the various modules are depicted below. .. _py-deps-swh: .. figure:: images/py-deps-swh.svg :width: 1024px :align: center Dependencies among top-level Python modules (click to zoom). Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * `URLs index `_ * :ref:`search` * :ref:`glossary` .. ensure sphinx does not complain about index files not being included .. 
toctree:: :hidden: :glob: + architecture getting-started swh-*/index diff --git a/requirements.txt b/requirements.txt index b53259f..8409a3d 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,8 @@ # Add here external Python modules dependencies, one per line. Module names # should match https://pypi.python.org/pypi names. For the full spec or # dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html vcversioner sphinx >= 1.3 sphinxcontrib-httpdomain +sphinxcontrib-images recommonmark diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py index b7b5624..41e8c49 100755 --- a/swh/docs/sphinx/conf.py +++ b/swh/docs/sphinx/conf.py @@ -1,143 +1,144 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- # import django import os # General information about the project. project = 'Software Heritage - Development Documentation' copyright = '2015-2018, the Software Heritage developers' author = 'the Software Heritage developers' # -- General configuration ------------------------------------------------ # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon', - # 'sphinx.ext.intersphinx', 'sphinxcontrib.httpdomain', - 'sphinx.ext.extlinks'] + 'sphinx.ext.extlinks', + 'sphinxcontrib.images', + ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # source_suffix = ['.rst', '.md'] # source_suffix = '.rst' source_parsers = { '.md': 'recommonmark.parser.CommonMarkParser', } # The master toctree document. master_doc = 'index' # A string of reStructuredText that will be included at the beginning of every # source file that is read. rst_prolog = ''' .. 
include:: /swh_substitutions ''' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '' # The full version, including alpha/beta/rc tags. release = '' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = 'en' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This patterns also effect to html_static_path and html_extra_path exclude_patterns = ['_build'] # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = True # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # html_theme = 'alabaster' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. # html_theme_options = { 'logo': 'software-heritage-logo-title-motto-vertical.svg', 'font_family': "'Alegreya Sans', sans-serif", 'head_font_family': "'Alegreya', serif", # equivalent of alabaster's: 'gray_1': '#5b5e6f', # dark gray 'gray_2': '#efeff2', # light gray 'gray_3': '#b1b5ae', # medium gray 'pink_1': '#e5d4cf', # light pink 'pink_2': '#bd9f97', # medium pink 'fixed_sidebar': 'true', } # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". 
html_static_path = ['_static'] # make logo actually appear, avoiding gotcha due to alabaster default conf. # https://github.com/bitprophet/alabaster/issues/97#issuecomment-303722935 html_sidebars = { '**': [ 'about.html', 'localtoc.html', 'relations.html', 'sourcelink.html', 'searchbox.html', ] } # refer to the Python standard library. intersphinx_mapping = {'python': ('https://docs.python.org/3', None)} # -- autodoc configuration ---------------------------------------------- autodoc_default_flags = ['members', 'undoc-members'] autodoc_member_order = 'bysource' autodoc_mock_imports = ['rados'] # for the extlinks extension, sub-projects should fill that dict extlinks = {} # hack to set the adequate django settings when building global swh doc # to avoid build errors def source_read_handler(app, docname, source): if 'swh-deposit' in docname: os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'swh.deposit.settings.development') django.setup() elif 'swh-web' in docname: os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'swh.web.settings.development') django.setup() def setup(app): app.connect('source-read', source_read_handler)
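The branching in `source_read_handler` above amounts to a lookup from a document name to a Django settings module. A minimal sketch of that lookup in isolation, assuming only the two component prefixes and settings paths that appear in the handler; the names `SETTINGS_BY_COMPONENT` and `django_settings_for` are hypothetical and do not exist in `conf.py`:

```python
# Component marker in the docname -> Django settings module, in the same
# order as the if/elif chain of source_read_handler. Names are illustrative.
SETTINGS_BY_COMPONENT = {
    "swh-deposit": "swh.deposit.settings.development",
    "swh-web": "swh.web.settings.development",
}


def django_settings_for(docname):
    """Return the settings module for `docname`, or None if no Django setup
    is needed for that document."""
    for component, settings in SETTINGS_BY_COMPONENT.items():
        if component in docname:
            return settings
    return None


if __name__ == "__main__":
    for name in ("swh-deposit/index", "swh-web/uri-scheme", "architecture"):
        print(name, "->", django_settings_for(name))
```

One point to keep in mind when reading the handler: `os.environ.setdefault` leaves an already-set variable untouched, so whichever branch runs first during a build decides the value of `DJANGO_SETTINGS_MODULE` for the rest of that process.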