Paste P238

How to load and index data

Authored by ardumont on Mar 14 2018, 12:19 PM.
I usually do something like this:
```
make -C $SWH_ENVIRONMENT_HOME rebuild-testdata
# As my PYTHONPATH is set correctly, I can do this (yours is set as well, according to prior comments)
# The db is empty at this point, so we need some contents in it. For this I use one of the loaders, for example loader-git
#
# The loader needs some configuration first, see ~/.config/swh/loader/git-updater.yml [1] below
# Then:
python3 -m swh.loader.git.updater --origin-url https://github.com/ardumont/org-trello
# loader-git loads the repository at the given url (you can change it, of course ;)
# You now have local contents in the softwareheritage-dev db
# They are not indexed yet though, so the next step is to index them (softwareheritage-indexer-dev db)
# I use something like this:
./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all
# [2] list-sha1.sh lists the current sha1s (content identifiers) in the form the indexer expects and writes them to stdout
# The indexer producer reads those sha1s from stdin and sends them to the orchestrator queue
# The orchestrator sends those sha1s in batches to the indexers set up in its configuration file
# [3] ~/.config/swh/indexer/orchestrator.yml
# The orchestrator is set up to send to all indexers (mimetype only in this setup).
# The mimetype indexer indexes the contents and stores the results, then sends only the sha1s
# with a text/* mimetype to the orchestrator text for the next round of indexing
# [5] ~/.config/swh/indexer/mimetype.yml
# Finally, the orchestrator text sends them on to the next indexers (fossology_license in this setup)
# [4] ~/.config/swh/orchestrator_text.yml
# The fossology_license indexer then indexes those sha1s and stores the results in the db
# [6] ~/.config/swh/indexer/fossology_license.yml
# I forgot one important part: the workers. The steps above send messages to different queues (rabbitmq,
# which should already be installed as part of the dependencies installation)
# So something has to actually consume those messages. In a dedicated terminal (or tmux pane), run:
python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
--pool=prefork \
--concurrency=1 \
-Ofair \
--loglevel=info \
--without-gossip \
--without-mingle \
--without-heartbeat 2>&1
# This consumes the messages and actually does everything I mentioned above (asynchronously).
# The worker needs its own configuration file, ~/.config/swh/worker.yml [7]
```
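The `list-sha1.sh` script used above lives in a separate paste (P232) and is not reproduced here. As a rough idea of what such a script could look like, here is a sketch. It assumes the `softwareheritage-dev` db has a `content` table with a `sha1` column of type `bytea`, and that the producer wants one plain hex digest per line. Both are assumptions, and the function names are made up for illustration:

```shell
# Hypothetical stand-in for list-sha1.sh (the real script is in paste P232).
# Assumption: the producer reads one hex-encoded sha1 per line on stdin.

# PostgreSQL prints bytea values as \x<hex> by default; drop that prefix
# when it is present and leave other lines untouched.
strip_bytea_prefix() {
    sed -e 's/^\\x//'
}

# Assumption: a "content" table with a bytea "sha1" column exists in the
# softwareheritage-dev db. SWH_DB is a made-up override for the db name.
list_sha1() {
    psql --no-psqlrc --tuples-only --no-align \
         --dbname "${SWH_DB:-softwareheritage-dev}" \
         --command "select sha1 from content" \
    | strip_bytea_prefix
}
```

With those functions defined in the current shell, `list_sha1 | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all` would take the place of the `./list-sha1.sh` step.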
Source:
- [1] P233 - ~/.config/swh/loader/git-updater.yml
- [2] P232 - list-sha1.sh
- [3] P234 - ~/.config/swh/indexer/orchestrator.yml
- [4] P235 - ~/.config/swh/indexer/orchestrator-text.yml
- [5] P236 - ~/.config/swh/indexer/mimetype.yml
- [6] P237 - ~/.config/swh/indexer/fossology_license.yml
- [7] P136 - ~/.config/swh/worker.yml
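Every step relies on one of these configuration files, so a quick check that they are all in place can save a confusing failure later. The following is only a sketch: `check_swh_configs` is a made-up helper name, and the paths are the ones from the Source list (adjust them if your setup differs):

```shell
# Hypothetical helper: report which of the configuration files from the
# Source list are missing. Paths are relative to the config root, which
# defaults to ~/.config/swh. Returns non-zero if any file is missing.
check_swh_configs() {
    root="${1:-$HOME/.config/swh}"
    missing=0
    for f in \
        loader/git-updater.yml \
        indexer/orchestrator.yml \
        indexer/orchestrator-text.yml \
        indexer/mimetype.yml \
        indexer/fossology_license.yml \
        worker.yml
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_swh_configs` with no argument checks `~/.config/swh`. Passing another directory as the first argument checks that one instead.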
At the end of it all, you should have indexed data in softwareheritage-indexer-dev (tables content_mimetype and content_fossology_license).
Note: the fossology_license indexer needs the fossology-nomossa package, which is available in our [[ https://wiki.softwareheritage.org/index.php?title=Debian_packaging#Package_repository | public Debian repository]].

Event Timeline

ardumont changed the title of this paste from untitled to Draft: How to load and index data.
ardumont added a project: Indexer.
ardumont changed the title of this paste from Draft: How to load and index data to How to load and index data. Mar 14 2018, 12:23 PM