diff --git a/docs/getting-started.rst b/docs/getting-started.rst
index 08a048a..9525897 100644
--- a/docs/getting-started.rst
+++ b/docs/getting-started.rst
@@ -1,58 +1,163 @@
 .. _getting-started:
 
 .. highlight:: bash
 
+
 Run your own Software Heritage
 ==============================
 
 This walkthrough will guide from the basic step of obtaining the source code of
 the Software Heritage stack to running a local copy of it in which you can
 ingest source code of existing repositories and browse them using the archive
 web application.
 
 
-Step 0 - get the code
----------------------
+Step 0 --- get the code
+-----------------------
 
 The `swh-environment `_ Git (meta) repository orchestrates the Git
 repositories of all Software Heritage modules. Clone it::
 
     git clone https://forge.softwareheritage.org/source/swh-environment.git
 
 then recursively clone all Python module repositories. For this step you will
-need the `mr `_ tool, see the `README` file of
-swh-environment for more information::
+need the `mr `_ tool; see the ``README`` file
+of swh-environment for more information::
 
     cd swh-environment
     readlink -f .mrconfig >> ~/.mrtrust
     mr up
 
-For periodic code updates in the future you can use the following helper::
+For periodic code updates you can use the following helper::
 
     cd swh-environment
     bin/update
 
+From now on you will need a ``PYTHONPATH`` environment variable that
+lets Python find modules in the ``swh`` namespace. To that end you can
+source the ``pythonpath.sh`` snippet from swh-environment::
 
-Step 1 - set up storage
------------------------
+    source pythonpath.sh
 
-Then you will need a local storage to archive source code artifacts. It comes
-in two parts: a content-addressable object storage on your file system (for
-file contents) and a Postgres database (for the graph structure of the
-archive). See the :ref:`data-model` for more information.
+To make setting ``PYTHONPATH`` easier in the future, you might want to define a
+shell alias, e.g.::
 
-**TO BE WRITTEN**
+    alias swh-pythonpath='cd /path/to/swh-environment/ ; source pythonpath.sh ; cd - > /dev/null'
 
 
-Step 2 - ingest repositories
-----------------------------
+Step 1 --- install software dependencies
+----------------------------------------
 
 **TO BE WRITTEN**
 
 
-Step 3 - browse the archive
----------------------------
+Step 2 --- set up storage
+-------------------------
+
+Next you will need a local storage service that archives and serves source
+code artifacts via a REST API. The Software Heritage storage layer comes in two
+parts: a content-addressable object storage on your file system (for file
+contents) and a Postgres database (for the graph structure of the archive). See
+the :ref:`data-model` for more information. The storage layer is configured via
+a YAML configuration file, located at
+``~/.config/swh/storage/storage.yml``. Create it with content like:
+
+.. code-block:: yaml
+
+    storage:
+      cls: local
+      args:
+        db: "host=localhost port=5432 dbname=softwareheritage-dev user=swhdev password=foobar"
+        objstorage:
+          cls: pathslicing
+          args:
+            root: /srv/softwareheritage/objects/
+            slicing: 0:2/2:4
+
+Make sure that the object storage root exists on the filesystem and is writable
+by your user, e.g.::
+
+    sudo mkdir -p /srv/softwareheritage/objects
+    sudo chown "${USER}:" /srv/softwareheritage/objects
+
+You are done with object storage setup! Let's set up the database::
+
+    cd swh-environment/swh-storage/sql/
+    sudo -u postgres bin/db-init 5432 softwareheritage-dev swhdev
+
+Let's unpack the second line. You need Postgres administrator privileges
+to be able to create databases, hence the ``sudo -u postgres``; if your user
+has Postgres admin privileges, you can omit ``sudo`` here. ``5432`` is the
+default port of the main Postgres cluster; adapt as needed.
+``softwareheritage-dev`` is the name of the DB that will be created; it should
+match the ``db`` line in ``storage.yml``. The same goes for ``swhdev``, the DB user
+name. You will be interactively asked for a password for the DB user; you
+should provide one that matches the ``db`` line value.
+
+To check that you can successfully connect to the DB (you will be interactively
+asked for the DB password)::
+
+    psql -h localhost -p 5432 -U swhdev softwareheritage-dev
+
+Note that you can simplify interactive use and reduce configuration clutter
+using Postgres `password
+`_ and
+`service
+`_
+configuration files. Any valid `libpq connection string
+`_
+can be used as the value of the ``db`` line in ``storage.yml``.
+
+You can now run the storage server like this::
+
+    python3 -m swh.storage.api.server --host localhost --port 5002 ~/.config/swh/storage/storage.yml
+
+
+Step 3 --- ingest repositories
+------------------------------
+
+You are now ready to ingest your first repository into your local Software
+Heritage. For the sake of example, we will ingest a few Git repositories. The
+module in charge of ingesting Git repositories is the *Git loader*, Python
+module ``swh.loader.git``. Its configuration file is at
+``~/.config/swh/loader/git-updater.yml``. Create it with content like:
+
+.. code-block:: yaml
+
+    storage:
+      cls: remote
+      args:
+        url: http://localhost:5002
+
+It simply tells the Git loader to use the storage server running on your
+machine. The ``url`` line should match the host and port given on the command
+line used to run the storage server.
+
+You can now ingest a Git repository from the command line using the command::
+
+    python3 -m swh.loader.git.updater --origin-url GIT_CLONE_URL
+
+For instance, you can try ingesting the following repositories, in increasing
+size order (note that the last two might take a few hours to complete and will
+occupy several GB on both the Postgres DB and the object storage)::
+
+    python3 -m swh.loader.git.updater --origin-url https://github.com/SoftwareHeritage/swh-storage.git
+    python3 -m swh.loader.git.updater --origin-url https://github.com/hylang/hy.git
+    python3 -m swh.loader.git.updater --origin-url https://github.com/ocaml/ocaml.git
+
+    # WARNING: next repo is big
+    python3 -m swh.loader.git.updater --origin-url https://github.com/torvalds/linux.git
+
+Congratulations, you have just archived your first source code repositories!
+
+To re-archive the same repositories later on you can rerun the same commands:
+only objects *added* since the previous visit will be archived upon the next
+one.
+
+
+Step 4 --- browse the archive
+-----------------------------
 
 **TO BE WRITTEN**
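
Review note, not part of the patch: the per-repository commands in Step 3 could be batched with a small shell helper. The sketch below is hypothetical. The function name ``ingest_all`` and the ``SWH_LOADER`` variable are assumptions made for illustration; only the ``python3 -m swh.loader.git.updater --origin-url`` invocation comes from the walkthrough itself.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with swh-environment): ingest several Git
# origins one after the other using the loader command from Step 3.

# Command run for each origin. SWH_LOADER is an assumed variable name; set it
# to "echo" to preview the arguments without running the loader.
SWH_LOADER="${SWH_LOADER:-python3 -m swh.loader.git.updater}"

ingest_all() {
    for url in "$@"; do
        echo "Ingesting ${url}" >&2
        # SWH_LOADER is left unquoted on purpose so that the command string
        # is split into words ("python3", "-m", ...). Stop at first failure.
        $SWH_LOADER --origin-url "$url" || return 1
    done
}
```

For example, ``ingest_all https://github.com/hylang/hy.git https://github.com/ocaml/ocaml.git`` would run the loader on each origin in turn and stop at the first failure. It assumes ``PYTHONPATH`` is set as in Step 0 and the storage server from Step 2 is running.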