How to load and index data

Authored by ardumont on Mar 14 2018, 12:19 PM.

	I usually do something like this:

	```
	make -C $SWH_ENVIRONMENT_HOME rebuild-testdata

	# PYTHONPATH is already set correctly (yours is too, according to the earlier comments), so the commands below work as-is.
	# The db is empty at this point, so load some contents with one of the loaders, for example loader-git.
	#
	# The loader needs some configuration first: see ~/.config/swh/loader/git-updater.yml [1] below,
	# then:
	python3 -m swh.loader.git.updater --origin-url https://github.com/ardumont/org-trello

	# loader-git loads the repository at the given URL (you can change it, of course ;)
	# You now have local contents in the softwareheritage-dev db

	# They are not yet indexed though, so we need to index those contents (softwareheritage-indexer-dev db)

	# I use something like this:
	./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all

	# [2] list-sha1.sh lists the current sha1s (identifiers) in the form the indexer expects and writes them to stdout
	# The indexer producer reads those sha1s from stdin and sends them to the orchestrator queue
	# The orchestrator sends those sha1s in batches to the indexers set up in its configuration file
	# [3] ~/.config/swh/indexer/orchestrator.yml

	# The orchestrator is set up to send to all indexers (only mimetype in this setup).
	# The mimetype indexer indexes the contents and stores the results, then sends only the sha1s
	# with a text/* mimetype to the text orchestrator for the next round of indexing
	# [4] ~/.config/swh/indexer/mimetype.yml

	# Finally, the text orchestrator sends them to the next indexers (fossology_license in the current setup)
	# [5] ~/.config/swh/orchestrator_text.yml

	# The fossology_license indexer then indexes those sha1s and stores the results in the db
	# [6] ~/.config/swh/indexer/fossology_license.yml

	# One important part I forgot: the workers. The steps above send messages to different queues (rabbitmq,
	# which should already be installed as part of the dependencies installation),
	# so we need to actually consume those messages. In a dedicated terminal (or tmux pane), run:
	python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
	--pool=prefork \
	--concurrency=1 \
	-Ofair \
	--loglevel=info \
	--without-gossip \
	--without-mingle \
	--without-heartbeat 2>&1

	# This consumes the messages and actually does everything mentioned above (asynchronously).
	# It needs another configuration file, ~/.config/swh/worker.yml [7]

	```
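
To make the `--batch 100` step a bit more concrete, here is a small shell sketch of how a stream of sha1s (one per line on stdin) can be grouped into fixed-size batches. It is only an illustration and not the code of swh.indexer.producer; the function name `batch_lines` and the space-separated output format are assumptions made for this example.

```shell
# Hypothetical helper, not part of swh: group the lines read on stdin into
# batches of at most $1 items and print each batch on one line.
batch_lines() {
    size=$1
    count=0
    batch=""
    while IFS= read -r line; do
        # skip empty lines so they never end up in a batch
        [ -n "$line" ] || continue
        batch="${batch:+$batch }$line"
        count=$((count + 1))
        if [ "$count" -eq "$size" ]; then
            printf '%s\n' "$batch"
            batch=""
            count=0
        fi
    done
    # flush the last, possibly shorter, batch
    if [ -n "$batch" ]; then
        printf '%s\n' "$batch"
    fi
}
```

With this helper, `./list-sha1.sh | batch_lines 100` prints one line per batch of up to 100 sha1s, which is a quick way to see how many batches a given db would produce.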
	Source:
	- [1] P233 - ~/.config/swh/loader/git-updater.yml
	- [2] P232 - list-sha1.sh
	- [3] P234 - ~/.config/swh/indexer/orchestrator.yml
	- [4] P236 - ~/.config/swh/indexer/mimetype.yml
	- [5] P235 - ~/.config/swh/indexer/orchestrator-text.yml
	- [6] P237 - ~/.config/swh/indexer/fossology_license.yml
	- [7] P136 - ~/.config/swh/worker.yml
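
Since every step depends on one of the configuration files listed above, a quick preflight check can save some debugging. The sketch below is an assumption-laden example, not something taken from the pastes: `missing_files` is a hypothetical helper, and the orchestrator-text file is left out of the usage because its path differs between the comment in the block and the list.

```shell
# Hypothetical helper: print every path given as argument that does not
# exist as a regular file.
missing_files() {
    for path in "$@"; do
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}

# Usage with the paths mentioned in this paste (anything printed is missing).
cfg="${HOME}/.config/swh"
missing_files \
    "$cfg/loader/git-updater.yml" \
    "$cfg/indexer/orchestrator.yml" \
    "$cfg/indexer/mimetype.yml" \
    "$cfg/indexer/fossology_license.yml" \
    "$cfg/worker.yml"
```

No output means all the listed files are present.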

	At the end of it all, you should have indexed data in softwareheritage-indexer-dev (tables content_mimetype and content_fossology_license).
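
One way to check that the workers actually indexed something is to count the rows in those two tables. This is a minimal sketch, assuming the `psql` client can reach the local softwareheritage-indexer-dev db with its default connection settings; the `count_rows` and `report_indexed` helpers and the `PSQL` override variable are hypothetical names introduced for the example.

```shell
# Hypothetical helper: print the number of rows of table $2 in database $1.
# PSQL may be set to another client command (assumption); defaults to psql.
count_rows() {
    "${PSQL:-psql}" --no-align --tuples-only --dbname "$1" \
        --command "select count(*) from $2;"
}

# Report the row counts of the two tables mentioned in this paste.
report_indexed() {
    for table in content_mimetype content_fossology_license; do
        printf '%s: %s\n' "$table" \
            "$(count_rows softwareheritage-indexer-dev "$table")"
    done
}
```

Running `report_indexed` once the workers have drained their queues should show non-zero counts if everything is wired correctly.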

	Note: the fossology_license indexer requires the fossology-nomossa package, which is available in our [[ https://wiki.softwareheritage.org/index.php?title=Debian_packaging#Package_repository | public Debian repository]].

Event Timeline

ardumont created this paste. Mar 14 2018, 12:19 PM

ardumont changed the title of this paste from untitled to Draft: How to load and index data.

ardumont added a project: Indexer.

ardumont changed the title of this paste from Draft: How to load and index data to How to load and index data. Mar 14 2018, 12:23 PM

ardumont mentioned this in T782: Web API: make endpoints that expose extracted metadata return *lists* of factual information. Mar 14 2018, 12:25 PM

ardumont mentioned this in T1230: Indexers: Improve readme to be more explicit on how to run locally. Oct 3 2018, 12:19 PM
