diff --git a/docs/dev-info.rst b/docs/dev-info.rst
new file mode 100644
index 0000000..23cf957
--- /dev/null
+++ b/docs/dev-info.rst
@@ -0,0 +1,211 @@
+Hacking on swh-indexer
+======================
+
+This tutorial will guide you through hacking on swh-indexer.
+If you do not have a local copy of the Software Heritage archive, go to the
+`getting started tutorial
+`_
+
+Configuration files
+-------------------
+You will need the following YAML configuration files to run the swh-indexer
+commands:
+
+- Orchestrator at
+  ``~/.config/swh/indexer/orchestrator.yml``
+
+.. code-block:: yaml
+
+  indexers:
+    mimetype:
+      check_presence: false
+      batch_size: 100
+
+- Orchestrator-text at
+  ``~/.config/swh/indexer/orchestrator-text.yml``
+
+.. code-block:: yaml
+
+  indexers:
+    # language:
+    #   batch_size: 10
+    #   check_presence: false
+    fossology_license:
+      batch_size: 10
+      check_presence: false
+    # ctags:
+    #   batch_size: 2
+    #   check_presence: false
+
+- Mimetype indexer at
+  ``~/.config/swh/indexer/mimetype.yml``
+
+.. code-block:: yaml
+
+  # storage to read sha1's metadata (path)
+  # storage:
+  #   cls: local
+  #   args:
+  #     db: "service=swh-dev"
+  #     objstorage:
+  #       cls: pathslicing
+  #       args:
+  #         root: /home/storage/swh-storage/
+  #         slicing: 0:1/1:5
+
+  storage:
+    cls: remote
+    args:
+      url: http://localhost:5002/
+
+  indexer_storage:
+    cls: remote
+    args:
+      url: http://localhost:5007/
+
+  # storage to read sha1's content
+  # adapt this to your need
+  # locally: this needs to match your storage's setup
+  objstorage:
+    cls: pathslicing
+    args:
+      slicing: 0:1/1:5
+      root: /home/storage/swh-storage/
+
+  destination_queue: swh.indexer.tasks.SWHOrchestratorTextContentsTask
+  rescheduling_task: swh.indexer.tasks.SWHContentMimetypeTask
+
+
+- Fossology indexer at
+  ``~/.config/swh/indexer/fossology_license.yml``
+
+.. code-block:: yaml
+
+  # storage to read sha1's metadata (path)
+  # storage:
+  #   cls: local
+  #   args:
+  #     db: "service=swh-dev"
+  #     objstorage:
+  #       cls: pathslicing
+  #       args:
+  #         root: /home/storage/swh-storage/
+  #         slicing: 0:1/1:5
+
+  storage:
+    cls: remote
+    url: http://localhost:5002/
+
+  indexer_storage:
+    cls: remote
+    args:
+      url: http://localhost:5007/
+
+  # storage to read sha1's content
+  # adapt this to your need
+  # locally: this needs to match your storage's setup
+  objstorage:
+    cls: pathslicing
+    args:
+      slicing: 0:1/1:5
+      root: /home/storage/swh-storage/
+
+  workdir: /tmp/swh/worker.indexer/license/
+
+  tools:
+    name: 'nomos'
+    version: '3.1.0rc2-31-ga2cbb8c'
+    configuration:
+      command_line: 'nomossa '
+
+
+- Worker at
+  ``~/.config/swh/worker.yml``
+
+.. code-block:: yaml
+
+  task_broker: amqp://guest@localhost//
+  task_modules:
+    - swh.loader.svn.tasks
+    - swh.loader.tar.tasks
+    - swh.loader.git.tasks
+    - swh.storage.archiver.tasks
+    - swh.indexer.tasks
+    - swh.indexer.orchestrator
+  task_queues:
+    - swh_loader_svn
+    - swh_loader_tar
+    - swh_reader_git_to_azure_archive
+    - swh_storage_archive_worker_to_backend
+    - swh_indexer_orchestrator_content_all
+    - swh_indexer_orchestrator_content_text
+    - swh_indexer_content_mimetype
+    - swh_indexer_content_language
+    - swh_indexer_content_ctags
+    - swh_indexer_content_fossology_license
+    - swh_loader_svn_mount_and_load
+    - swh_loader_git_express
+    - swh_loader_git_archive
+    - swh_loader_svn_archive
+  task_soft_time_limit: 0
+
+- [1] P233 - ~/.config/swh/loader/git-updater.yml
+- [2] P232 - list-sha1.sh
+
+
+
+
+Database
+--------
+
+swh-indexer uses a database to store the indexed content. The default
+database is expected to be called ``swh-indexer-dev``.
+
+Add entries for ``swh-dev`` and ``swh-indexer-dev`` to
+the ``~/.pg_service.conf`` and ``~/.pgpass`` files, which are PostgreSQL's
+client configuration files.
+
+Add data to local DB
+--------------------
+From within ``swh-environment``, run the following command::
+
+  make rebuild-testdata
+
+Then fetch some real data to work with, using::
+
+  python3 -m swh.loader.git.updater --origin-url
+
+Then you can list all content files using this script::
+
+  #!/usr/bin/env bash
+
+  psql service=swh-dev -c "copy (select sha1 from content) to stdout" | sed -e 's/^\\\\x//g'
+
+Run the indexers
+----------------
+Use the list of contents to feed the indexers with the
+following command::
+
+  ./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all
+
+Activate the workers
+--------------------
+To send messages to the different queues using RabbitMQ
+(which should already be installed as a dependency),
+run the following command in a dedicated terminal::
+
+  python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
+      --pool=prefork \
+      --concurrency=1 \
+      -Ofair \
+      --loglevel=info \
+      --without-gossip \
+      --without-mingle \
+      --without-heartbeat 2>&1
+
+With this command, the Celery worker consumes messages from RabbitMQ using the worker
+configuration file.
+
+Note: for the fossology_license indexer, you need the fossology-nomossa package,
+which is available in our `public debian repository
+`_.
diff --git a/docs/index.rst b/docs/index.rst
index 78a5071..498f7df 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,20 +1,86 @@
 .. _swh-indexer:
 
 Software Heritage - Indexer
 ===========================
 
 Tools and workers used to mine the content of the archive and extract
 derived information from archive source code artifacts.
 
+Workers
+-------
+There are two types of workers:
+  - orchestrators (orchestrator, orchestrator-text)
+  - indexers (mimetype, language, ctags, fossology-license)
+
+Orchestrator
+************
+The orchestrator is in charge of dispatching a batch of sha1 hashes to
+different indexers.
+
+There are two types of orchestrators:
+  - orchestrator (swh_indexer_orchestrator_content_all): receives and
+    broadcasts sha1 ids (of contents) to indexers (currently only the
+    mimetype indexer)
+  - orchestrator-text (swh_indexer_orchestrator_content_text): receives
+    batches of sha1 ids (of textual contents) and broadcasts them to
+    indexers (currently language, ctags, and fossology-license
+    indexers).
+
+Orchestration procedure:
+  - receive batch of sha1s
+  - split into small batches
+  - broadcast batches to indexers
+
+
+
+Indexers
+********
+An indexer is in charge of retrieving content and indexing the
+extracted information in the swh-indexer db.
+
+There are two types of indexers:
+  - content indexer: works with content sha1 hashes
+  - revision indexer: works with revision sha1 hashes
+
+Indexation procedure:
+  - receive batch of ids
+  - retrieve the associated data depending on object type
+  - compute some index for that object
+  - store the result in swh's storage
+  - (and possibly do some broadcasting itself)
+
+
+Current content indexers:
+-------------------------
+  - mimetype: computes the mimetype,
+    keeps only the textual contents and broadcasts that list to the
+    orchestrator-text
+
+  - language : detects the programming language with pygments
+
+  - ctags : tries to compute tags
+    information
+
+  - fossology-license : tries to compute the license
+
+  - metadata : translates a file into a translated_metadata dict
+
+Current revision indexers:
+--------------------------
+  - metadata : detects files containing metadata and creates a minimal
+    metadata set kept with the revision.
+
+
+
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: Contents:
 
-   README
+   dev-info.rst
+
 
 Indices and tables
 ==================
 
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`