diff --git a/docs/README.md b/docs/README.md
deleted file mode 120000
--- a/docs/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../README.md
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -5,6 +5,7 @@
    :caption: Contents:
 
    package-loader-tutorial
+   vcs-loader-overview
 
 
 Reference Documentation
diff --git a/docs/vcs-loader-overview.rst b/docs/vcs-loader-overview.rst
new file mode 100644
--- /dev/null
+++ b/docs/vcs-loader-overview.rst
@@ -0,0 +1,144 @@
+.. _vcs-loader-tutorial:
+
+VCS Loader Overview
+===================
+
+In this overview, we will see how to write a loader for |swh| that loads
+:term:`artifacts ` from a Version Control System, such as Git,
+Mercurial, or Subversion.
+
+First, you should be familiar with Python, unit testing,
+|swh|'s :ref:`data-model` and :ref:`architecture`,
+and have gone through the :ref:`developer-setup`.
+
+As seen in the :ref:`swh-loader-core homepage `,
+SWH loaders can be sorted into two large categories:
+Package Loaders and VCS loaders.
+
+This page is an overview of how to write a VCS loader. It is not a tutorial,
+because VCS loaders are hooked deeply into their respective VCS' internals,
+unlike :ref:`Package Loaders `, which are somewhat uniform
+(list tarballs, download tarballs, load content of tarball, done).
+
+
+Architecture
+------------
+
+A loader is a Python package, usually a subpackage of ``swh.loader``
+but in its own directory (e.g. ``swh-loader-git/swh/loader/git``, as ``swh.loader``
+is a :pep:`namespace package <420>`), based on the `swh-py-template`_ repository.
+
+It has at least one `entrypoint`_, declared in ``setup.py`` to be recognized
+by ``swh-loader-core``::
+
+    entry_points="""
+        [swh.workers]
+        loader.newloader=swh.loader.newloader:register
+    """,
+
+This entrypoint declares the task name (to be run by SWH Celery workers) and the loader class.
+For example, for the Subversion loader::
+
+    from typing import Any, Dict
+
+    def register() -> Dict[str, Any]:
+        from swh.loader.svn.loader import SvnLoader
+
+        return {
+            "task_modules": ["%s.tasks" % __name__],  # modules defining the Celery tasks
+            "loader": SvnLoader,  # class that does the actual loading
+        }
+
+The bulk of the work is done by the returned ``loader`` class: it loads
+artifacts from the upstream VCS and writes them to the |swh| archive.
+Because of the heterogeneity of VCS loaders, it has a lot of freedom in how to
+achieve this. Once the initial setup is done (see the next section), its ``load``
+method is called, and it is expected to do all this work as a black box.
+
+.. _swh-py-template: https://forge.softwareheritage.org/source/swh-py-template/
+.. _entrypoint: https://setuptools.readthedocs.io/en/latest/userguide/entry_point.html
+
+Base classes
+------------
+
+All loaders inherit from :class:`swh.loader.core.loader.BaseLoader`, which takes care of
+all the SWH-specific setup and finalization:
+
+* Reading the configuration
+* Connecting to the :term:`storage database`
+* Storing :term:`origin` and :term:`visit` objects
+
+It also provides a default implementation of the ``load`` method, which takes care of:
+
+* calling its ``fetch_data`` (from the VCS) and ``store_data`` (to SWH) in a loop
+* on error, notifying swh-storage that the loading failed, reporting the error to
+  the monitoring infrastructure (Sentry), and cleaning up
+* on success, cleaning up and notifying swh-storage that the loading succeeded
+
+See :meth:`its documentation ` for details.
+
+Distributed VCS loaders will usually want to inherit from its child,
+:class:`swh.loader.core.loader.DVCSLoader`, which takes care of implementing ``store_data``.
+Classes inheriting from ``DVCSLoader`` only need to implement ``fetch_data``, and
+a method for each object type: ``get_contents``, ``get_directories``, ``get_revisions``,
+``get_releases``, and ``get_snapshot``, each returning an iterable of the corresponding
+objects from :mod:`swh.model.model`
+(except ``get_snapshot``, which returns a single one).
+
+If you are writing a DVCS loader, this allows your loader to fetch all the objects
+locally, then return them lazily on demand.
+
+
+Incremental loading
+-------------------
+
+Loading a repository from scratch can be costly, so ``swh-storage`` provides
+ways to remember which objects in the repository were already loaded,
+through :term:`extids `.
+They are represented by :class:`swh.model.model.ExtID`,
+which is essentially a 3-tuple that contains a SWHID, an id internal to the VCS type
+(which is the actual "extid" itself), and the type of this id (e.g. ``hg-nodeid``).
+
+When your loader is done loading, it can store extids for some of its objects
+(e.g. the heads/tips of each branch of the :term:`snapshot` and some intermediate
+revisions in the history),
+with :meth:`swh.storage.interface.StorageInterface.extid_add`.
+
+When it starts loading a known repository, it fetches the previous snapshot
+using :func:`swh.storage.algos.snapshot.snapshot_get_latest`, then retrieves the extids
+it stored, using :meth:`swh.storage.interface.StorageInterface.extid_get_from_target`
+for each of the branch targets.
+This way, it can find which objects from the origin were already loaded,
+without having to download them first.
+
+.. note::
+
+   For legacy reasons, the Subversion loader uses an alternative to ExtID,
+   which is to encode the repository UUID and the revision ID (an incremental integer)
+   directly in :attr:`swh.model.model.Revision.extra_headers`.
+
+   This is discouraged because it prevents deduplication across repositories,
+   and ``extra_headers`` does not have a well-defined schema.
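
The extid round-trip described above can be sketched in a few lines. This is only an illustration under stated assumptions: ``ExtID`` and ``InMemoryStorage`` below are simplified, hypothetical stand-ins written for this example, and their signatures are not those of :class:`swh.model.model.ExtID` or of the real ``swh-storage`` interface. The helper names ``already_loaded`` and ``heads_to_load`` are likewise invented.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class ExtID:
    """Simplified stand-in for the 3-tuple described in the text."""

    extid_type: str  # type of the id, e.g. "hg-nodeid"
    extid: bytes     # id internal to the VCS
    target: str      # SWHID of the archived object, kept as a plain string here


class InMemoryStorage:
    """Toy storage offering only the two extid operations used below."""

    def __init__(self) -> None:
        self._by_target: Dict[str, List[ExtID]] = {}

    def extid_add(self, extids: Iterable[ExtID]) -> None:
        for extid in extids:
            self._by_target.setdefault(extid.target, []).append(extid)

    def extid_get_from_target(self, targets: Iterable[str]) -> List[ExtID]:
        return [e for t in targets for e in self._by_target.get(t, [])]


def already_loaded(
    storage: InMemoryStorage, branch_targets: Iterable[str], extid_type: str
) -> Set[bytes]:
    """Return the VCS ids recorded for the branch targets of the last snapshot."""
    extids = storage.extid_get_from_target(branch_targets)
    return {e.extid for e in extids if e.extid_type == extid_type}


def heads_to_load(remote_heads: Iterable[bytes], known: Set[bytes]) -> List[bytes]:
    """Keep only the remote heads that have no extid in the archive yet."""
    return [head for head in remote_heads if head not in known]


# First visit: the loader archives one head and records its extid.
old_head, new_head = b"\x01" * 20, b"\x02" * 20
old_swhid = "swh:1:rev:" + "a" * 40
storage = InMemoryStorage()
storage.extid_add([ExtID("hg-nodeid", old_head, old_swhid)])

# Second visit: the previous snapshot's branch targets yield the known ids,
# so only the new head has to be fetched from the origin.
known = already_loaded(storage, [old_swhid], "hg-nodeid")
todo = heads_to_load([old_head, new_head], known)
```

Keying the lookup on the snapshot's branch targets is what lets the loader decide what to skip before downloading anything from the origin.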
+
+Integrity
+---------
+
+Loaders may be interrupted at any point, for various reasons (unhandled crash,
+out of memory, hardware failure, blocking IO, system or daemon restart, etc.).
+
+Therefore, they must take great care that if a load was interrupted, the next load
+will finish loading all objects. If they don't, this may happen:
+
+1. loader loads revision ``R``, pointing to directory ``D``
+2. loader starts loading ``D``, but crashes before it finishes
+3. [loader restarts]
+4. loader sees ``R`` is already loaded, so it doesn't load its children
+
+And ``D`` will never be loaded.
+
+The solution to this is to load objects in topological order of the DAG,
+so that an object is written only after every object it references.
+
+Another reason to load objects in topological order is that it avoids having "holes" in
+the graph (a.k.a. dangling references), even temporarily. Holes in the graph cause a bad
+user experience: users click a link from an existing object and get a "not found" error.
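
As a rough illustration of the ordering argument above, the sketch below uses the standard library's ``graphlib`` module (Python 3.9 or later) to order a tiny object graph so that referenced objects are written first. The graph, the object names and the ``has_holes`` helper are hypothetical and exist only for this example; a real loader works on :mod:`swh.model.model` objects instead of strings.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical object graph: each key references the objects in its value.
references: Dict[str, Set[str]] = {
    "R": {"D"},  # revision R points to directory D
    "D": {"C"},  # directory D contains content C
    "C": set(),  # content C references nothing
}


def load_order(refs: Dict[str, Set[str]]) -> List[str]:
    """Order objects so that each one comes after everything it references."""
    # TopologicalSorter treats the mapped values as predecessors, so
    # referenced objects are yielded before the objects pointing to them.
    return list(TopologicalSorter(refs).static_order())


def has_holes(stored: Set[str], refs: Dict[str, Set[str]]) -> bool:
    """Tell whether any stored object references an object not stored yet."""
    return any(dep not in stored for obj in stored for dep in refs[obj])


order = load_order(references)

# Simulate the writes one by one: a crash after any of them must leave the
# archive without dangling references.
stored: Set[str] = set()
hole_seen = False
for obj in order:
    stored.add(obj)
    hole_seen = hole_seen or has_holes(stored, references)
```

With this order, ``R`` is only written once ``D`` and ``C`` are in the archive, so a restarted loader that finds ``R`` already loaded can safely skip its children.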