Differential D7711 Diff 27913 docs/export.rst

				.. _swh-graph-export:

	===================			===================
	Exporting a dataset			Exporting a dataset
	===================			===================

	This repository aims to contain various pipelines to generate datasets of			This repository aims to contain various pipelines to generate datasets of
	Software Heritage data, so that they can be used internally or by external			Software Heritage data, so that they can be used internally or by external
	researchers.			researchers.

	Graph dataset			Graph dataset
	=============			=============

				Exporting the full dataset
				--------------------------

	Right now, the only supported export pipeline is the Graph Dataset, a set of			Right now, the only supported export pipeline is the Graph Dataset, a set of
	relational tables representing the Software Heritage Graph, as documented in			relational tables representing the Software Heritage Graph, as documented in
	:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``			:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
	command.			command.

	This dataset can be exported in two different formats: ``orc`` and ``edges``.			This dataset can be exported in two different formats: ``orc`` and ``edges``.
	To export a graph, you need to provide a comma-separated list of formats to			To export a graph, you need to provide a comma-separated list of formats to
	export with the ``--formats`` option. You also need an export ID, a unique			export with the ``--formats`` option. You also need an export ID, a unique
	identifier used by the Kafka server to store the current progress of the			identifier used by the Kafka server to store the current progress of the
	export.			export.

				Note: exporting in the ``edges`` format is discouraged, as it is redundant
				and can easily be generated from the ORC format.

	Here is an example command to start a graph dataset export::			Here is an example command to start a graph dataset export::

	    swh dataset -C graph_export_config.yml graph export \			    swh dataset -C graph_export_config.yml graph export \
	        --formats orc \			        --formats orc \
	        --export-id seirl-2022-04-25 \			        --export-id 2022-04-25 \
	        -p 64 \			        -p 64 \
	        /srv/softwareheritage/hdd/graph/2022-04-25			        /srv/softwareheritage/hdd/graph/2022-04-25

	This command usually takes more than a week for a full export, so it is			This command usually takes more than a week for a full export, so it is
	advisable to run it in a service or a tmux session.			advisable to run it in a service or a tmux session.
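
As a minimal sketch of the tmux option, the helper below starts a command in a detached session so that it keeps running if the SSH connection drops. The function name ``run_detached`` and the session name ``graph-export`` are illustrative assumptions, not part of ``swh dataset``.

```shell
# Hypothetical helper: run a long command inside a detached tmux session.
# $1 is the session name, $2 is the full command line as a single string.
run_detached() {
    session_name="$1"
    command_line="$2"
    tmux new-session -d -s "$session_name" "$command_line"
}

# Example usage (not executed here):
#   run_detached graph-export \
#       "swh dataset -C graph_export_config.yml graph export --formats orc --export-id 2022-04-25 -p 64 /srv/softwareheritage/hdd/graph/2022-04-25"
#   tmux attach -t graph-export   # reattach later to check on progress
```

Passing the command as a single string keeps the quoting explicit and leaves it to tmux to hand the string to a shell.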

	The configuration file should contain the configuration for the swh-journal			The configuration file should contain the configuration for the swh-journal
	clients, as well as various configuration options for the exporters. Here is an			clients, as well as various configuration options for the exporters. Here is an
	[13 unchanged lines not shown]


	The following configuration options can be used for the export:			The following configuration options can be used for the export:

	- ``remove_pull_requests``: remove all edges from origin to snapshot matching			- ``remove_pull_requests``: remove all edges from origin to snapshot matching
	  ``refs/`` but not matching ``refs/heads/`` or ``refs/tags/*``. This removes			  ``refs/`` but not matching ``refs/heads/`` or ``refs/tags/*``. This removes
	  all the pull requests that are present in Software Heritage (archived with			  all the pull requests that are present in Software Heritage (archived with
	  ``git clone --mirror``).			  ``git clone --mirror``).
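
The filtering rule for ``remove_pull_requests`` can be sketched as a small shell predicate. This only illustrates the rule as described above; the function name ``is_removed_ref`` is hypothetical, ``refs/pull/123/head`` is just an example name, and the exporter's actual implementation may differ.

```shell
# Hypothetical predicate: succeeds (exit status 0) when a branch name would be
# dropped by remove_pull_requests, i.e. it is under refs/ but is neither a
# branch head nor a tag.
is_removed_ref() {
    case "$1" in
        refs/heads/*|refs/tags/*) return 1 ;;  # regular branches and tags are kept
        refs/*) return 0 ;;                    # e.g. refs/pull/123/head is removed
        *) return 1 ;;                         # names outside refs/ are kept
    esac
}
```

For example, ``is_removed_ref refs/pull/123/head && echo removed`` prints ``removed``, while the same test on ``refs/heads/master`` prints nothing.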


				Uploading on S3 & on the annex
				------------------------------

				The dataset should then be made publicly available by uploading it to S3 and to
				the public annex.

				For S3::

				    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc

				For the annex::

				    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
				    ssh saam.internal.softwareheritage.org
				    cd /srv/softwareheritage/annex/public/dataset/graph
				    git annex add 2022-04-25
				    git annex sync --content


				Documenting the new dataset
				---------------------------

				In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
				to document the availability of the new dataset. You should usually mention:

				- the name of the dataset version (e.g., 2022-04-25)
				- the number of nodes
				- the number of edges
				- the available formats (notably whether the graph is also available in its
				  compressed representation)
				- the total on-disk size of the dataset
				- the buckets/URIs to obtain the graph from S3 and from the annex
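
One of these figures, the total on-disk size, can be read directly from the export directory. The helper below is a minimal sketch; the function name ``dataset_size`` is an assumption made for illustration.

```shell
# Hypothetical helper: print the total on-disk size of a dataset directory
# in human-readable form (the first tab-separated field of `du -sh`).
dataset_size() {
    du -sh "$1" | cut -f1
}

# Example usage (not executed here):
#   dataset_size /srv/softwareheritage/hdd/graph/2022-04-25/orc
```

The node and edge counts are not covered by this sketch, since how they are obtained depends on the export tooling.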