I usually do something like this:

```
make -C $SWH_ENVIRONMENT_HOME rebuild-testdata
# As PYTHONPATH is correctly set, I can do this (yours is set as well according to prior comments)

# Now we need some contents in the empty db. For this I use one of the loaders, for example loader-git.
#
# You need some configuration first, check ~/.config/swh/loader/git-updater.yml [1] below
# then:
python3 -m swh.loader.git.updater --origin-url https://github.com/ardumont/org-trello
# loader-git will load the repository at the url mentioned (you can change this of course ;)

# Now you have local contents in the softwareheritage-dev db.
# They are not yet indexed though, so we need to index those contents (softwareheritage-indexer-dev db).
# I use something like this:
./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all
# [2] list-sha1.sh lists the current sha1s (identifiers) as expected by the indexer and dumps them to stdout
# The indexer producer reads those sha1s from stdin and sends them to the orchestrator queue
# The orchestrator sends those sha1s in batches to the indexers set up in its configuration file
# [3] ~/.config/swh/indexer/orchestrator.yml
# The orchestrator is set up to send to all indexers (mimetype in this current setup).
# The mimetype indexer indexes and stores the results, then sends only the sha1s with a text/* mimetype
# to the orchestrator text for the next indexing step
# [5] ~/.config/swh/indexer/mimetype.yml
# Finally, the orchestrator text sends to the next indexers (fossology_license in the current setup)
# [4] ~/.config/swh/orchestrator_text.yml
# The fossology license indexer goes on to index those sha1s and stores the results in the db
# [6] ~/.config/swh/indexer/fossology_license.yml

# I forgot one important part, the workers: you send messages to different queues (rabbitmq,
# which should already be installed through the dependencies installation),
# so we need to actually consume those messages.
# In a dedicated terminal (or tmux pane), run:
python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
    --pool=prefork \
    --concurrency=1 \
    -Ofair \
    --loglevel=info \
    --without-gossip \
    --without-mingle \
    --without-heartbeat 2>&1
# This will consume the messages and actually do all that I mentioned above (async).
# This needs another configuration file, which is at ~/.config/swh/worker.yml [7]
```

Source:
- [1] P233 - ~/.config/swh/loader/git-updater.yml
- [2] P232 - list-sha1.sh
- [3] P234 - ~/.config/swh/indexer/orchestrator.yml
- [4] P235 - ~/.config/swh/indexer/orchestrator-text.yml
- [5] P236 - ~/.config/swh/indexer/mimetype.yml
- [6] P237 - ~/.config/swh/indexer/fossology_license.yml
- [7] P136 - ~/.config/swh/worker.yml

At the end of it all, you should have indexed data in softwareheritage-indexer-dev (tables content_mimetype and content_fossology_license).

Note: for the fossology_license indexer, you need the package fossology-nomossa, which is in our [[ https://wiki.softwareheritage.org/index.php?title=Debian_packaging#Package_repository | public debian repository]].
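
To make the `--batch 100` step in the pipeline above a bit more concrete, here is a rough sketch of what reading sha1s from stdin and grouping them into batches could look like. This is not the actual `swh.indexer.producer` code; the helper names `read_sha1s` and `batched` are hypothetical, and the sketch only prints what it would send instead of talking to a queue.

```python
import sys
from itertools import islice
from typing import Iterable, Iterator, List


def read_sha1s(lines: Iterable[str]) -> Iterator[str]:
    """Yield stripped, non-empty identifiers, one per input line."""
    for line in lines:
        sha1 = line.strip()
        if sha1:
            yield sha1


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Assumption: one sha1 per line on stdin, as produced by something like list-sha1.sh.
    for batch in batched(read_sha1s(sys.stdin), 100):
        # A real producer would send the batch to the orchestrator queue here;
        # this sketch only reports what it would send.
        print(f"would send {len(batch)} sha1s, first: {batch[0]}")
```

Saved under a name of your choice, it can sit at the end of the same pipe (`./list-sha1.sh | python3 <your-file>.py`) to check what the listing script emits before involving the queues.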