export.rst

	.. _swh-graph-export:

	===================
	Exporting a dataset
	===================

	This repository contains pipelines that generate datasets of Software Heritage
	data, for use internally or by external researchers.

	Graph dataset
	=============

	Exporting the full dataset
	--------------------------

	Right now, the only supported export pipeline is the Graph Dataset, a set of
	relational tables representing the Software Heritage Graph, as documented in
	:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
	command.

	This dataset can be exported in two different formats: ``orc`` and ``edges``.
	To export a graph, you need to provide a comma-separated list of formats to
	export with the ``--formats`` option. You also need an export ID, a unique
	identifier used by the Kafka server to store the current progress of the
	export.

	Note: exporting in the ``edges`` format is discouraged, as it is redundant
	and can easily be generated directly from the ORC format.

	Here is an example command to start a graph dataset export::

	    swh dataset -C graph_export_config.yml graph export \
	        --formats orc \
	        --export-id 2022-04-25 \
	        -p 64 \
	        /srv/softwareheritage/hdd/graph/2022-04-25

	A full export usually takes more than a week, so it is advisable to run this
	command in a service or a tmux session.
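
	One way to follow this advice is to launch the export in a detached tmux
	session. The sketch below is only an illustration: the session name and the
	``run`` dry-run helper are assumptions, not part of swh-dataset, and by
	default the script only prints the tmux command instead of executing it.

```shell
#!/bin/sh
# Sketch: start the export in a detached tmux session so it survives logout.
# The session name and the DRY_RUN switch are illustrative assumptions.
export_id=2022-04-25
session="graph-export-$export_id"
export_cmd="swh dataset -C graph_export_config.yml graph export --formats orc --export-id $export_id -p 64 /srv/softwareheritage/hdd/graph/$export_id"

# Print the command unless DRY_RUN=0 is set, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# -d: do not attach; -s: name the session so it can be re-attached later.
run tmux new-session -d -s "$session" "$export_cmd"
```

	With ``DRY_RUN=0`` the session is actually created, and
	``tmux attach -t graph-export-2022-04-25`` re-attaches to it.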

	The configuration file should contain the configuration for the swh-journal
	clients, as well as various configuration options for the exporters. Here is an
	example configuration file::

	    journal:
	      brokers:
	        - kafka1.internal.softwareheritage.org:9094
	        - kafka2.internal.softwareheritage.org:9094
	        - kafka3.internal.softwareheritage.org:9094
	        - kafka4.internal.softwareheritage.org:9094
	      security.protocol: SASL_SSL
	      sasl.mechanisms: SCRAM-SHA-512
	      max.poll.interval.ms: 1000000

	    remove_pull_requests: true


	The following configuration options can be used for the export:

	- ``remove_pull_requests``: remove all edges from origin to snapshot matching
	  ``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
	  all the pull requests that are present in Software Heritage (archived with
	  ``git clone --mirror``).
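
	The matching rule can be illustrated with a small shell sketch. It is not the
	code used by swh-dataset: ``is_pull_request_ref`` is a hypothetical helper
	that only mirrors the rule as described here (names under ``refs/`` that are
	neither under ``refs/heads/`` nor under ``refs/tags/``), applied to a few
	made-up ref names.

```shell
#!/bin/sh
# Hypothetical helper mirroring the remove_pull_requests rule described above.
# Returns 0 (true) when the ref name would be dropped, 1 otherwise.
is_pull_request_ref() {
    case "$1" in
        refs/heads/*|refs/tags/*) return 1 ;;  # regular branches and tags are kept
        refs/*) return 0 ;;                    # any other name under refs/ is dropped
        *) return 1 ;;                         # names outside refs/ are kept
    esac
}

# Made-up ref names, for illustration only.
for ref in refs/heads/master refs/tags/v1.0 refs/pull/42/head HEAD; do
    if is_pull_request_ref "$ref"; then
        echo "drop $ref"
    else
        echo "keep $ref"
    fi
done
```

	Placing the ``refs/heads/*`` and ``refs/tags/*`` patterns first matters,
	because ``case`` stops at the first matching pattern.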


	Uploading on S3 & on the annex
	------------------------------

	The dataset should then be made publicly available by uploading it to S3 and
	to the public annex.

	For S3::

	    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc

	For the annex::

	    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
	    ssh saam.internal.softwareheritage.org
	    cd /srv/softwareheritage/annex/public/dataset/graph
	    git annex add 2022-04-25
	    git annex sync --content


	Documenting the new dataset
	---------------------------

	In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
	to document the availability of the new dataset. You should usually mention:

	- the name of the dataset version (e.g., 2022-04-25)
	- the number of nodes
	- the number of edges
	- the available formats (notably whether the graph is also available in its
	  compressed representation)
	- the total on-disk size of the dataset
	- the buckets/URIs to obtain the graph from S3 and from the annex
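
	Some of these items follow directly from the export ID. The sketch below
	prints them for copy-pasting into the documentation. ``dataset_summary`` is a
	hypothetical helper, the paths are simply the ones used in the examples on
	this page (the annex entry is the path on the annex host, not a public URL),
	and node and edge counts are not computed.

```shell
#!/bin/sh
# Hypothetical helper: print the documentation items that can be derived
# from an export ID. Paths mirror the examples used on this page.
dataset_summary() {
    export_id=$1
    graph_root=${2:-/srv/softwareheritage/hdd/graph}
    local_dir="$graph_root/$export_id/orc"

    echo "version: $export_id"
    echo "S3: s3://softwareheritage/graph/$export_id/orc"
    echo "annex: /srv/softwareheritage/annex/public/dataset/graph/$export_id/orc"
    # Report the on-disk size only when the export is present locally.
    if [ -d "$local_dir" ]; then
        echo "on-disk size: $(du -sh "$local_dir" | cut -f1)"
    fi
}

dataset_summary 2022-04-25
```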
