docs/export.rst
.. _swh-graph-export:

===================
Exporting a dataset
===================

This repository aims to contain various pipelines to generate datasets of
Software Heritage data, so that they can be used internally or by external
researchers.

Graph dataset
=============

Exporting the full dataset
--------------------------

Right now, the only supported export pipeline is the *Graph Dataset*, a set of
relational tables representing the Software Heritage Graph, as documented in
:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
command.

This dataset can be exported in two different formats: ``orc`` and ``edges``.
To export a graph, you need to provide a comma-separated list of formats to
export with the ``--formats`` option. You also need an export ID, a unique
identifier used by the Kafka server to store the current progress of the
export.

**Note**: exporting as the ``edges`` format is discouraged, as it is redundant
and can easily be generated directly from the ORC format.

Here is an example command to start a graph dataset export::

    swh dataset -C graph_export_config.yml graph export \
        --formats orc \
        --export-id seirl-2022-04-25 \
        -p 64 \
        /srv/softwareheritage/hdd/graph/2022-04-25

A full export usually takes more than a week, so it is advisable to run this
command in a service or a tmux session.

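As a minimal sketch of the tmux approach, the export command shown earlier can
be wrapped in a detached session so that it keeps running after the SSH
connection is closed. The session name ``graph-export`` and the function name
``start_export`` are hypothetical choices for illustration, not part of the
``swh`` tooling.

```shell
# Hypothetical session name; pick any name not already in use.
SESSION=graph-export

start_export() {
    # -d starts the session detached; -s gives it a name to reattach to later.
    tmux new-session -d -s "$SESSION" \
        "swh dataset -C graph_export_config.yml graph export \
            --formats orc \
            --export-id seirl-2022-04-25 \
            -p 64 \
            /srv/softwareheritage/hdd/graph/2022-04-25"
}
```

Calling ``start_export`` launches the export in the background, and
``tmux attach -t graph-export`` reattaches to it to check on progress.
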
The configuration file should contain the configuration for the swh-journal
clients, as well as various configuration options for the exporters. Here is an
[... 13 lines not shown in this diff view ...]

The following configuration options can be used for the export:

- ``remove_pull_requests``: remove all edges from origin to snapshot matching
  ``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
  all the pull requests that are present in Software Heritage (archived with
  ``git clone --mirror``).

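To illustrate the matching rule described for ``remove_pull_requests``, here is
a small shell sketch that classifies a branch name. It assumes the patterns are
plain glob-style prefix matches. The function name ``is_pull_request_ref`` is
hypothetical, and this is not the exporter's actual implementation.

```shell
# Succeeds (exit status 0) when the branch name would be dropped by the rule:
# it matches refs/* but matches neither refs/heads/* nor refs/tags/*.
is_pull_request_ref() {
    case "$1" in
        refs/heads/*|refs/tags/*) return 1 ;;  # regular branches and tags are kept
        refs/*) return 0 ;;                    # any other ref, e.g. refs/pull/42/head
        *) return 1 ;;                         # names outside refs/ are kept
    esac
}
```

For example, ``is_pull_request_ref refs/pull/42/head`` succeeds, while
``is_pull_request_ref refs/heads/master`` fails.
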
Uploading on S3 & on the annex
------------------------------

The dataset should then be made publicly available by uploading it to S3 and to
the public annex.

For S3::

    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc

For the annex::

    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
    ssh saam.internal.softwareheritage.org
    cd /srv/softwareheritage/annex/public/dataset/graph
    git annex add 2022-04-25
    git annex sync --content

Documenting the new dataset
---------------------------

In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
to document the availability of the new dataset. You should usually mention:

- the name of the dataset version (e.g., 2022-04-25)
- the number of nodes
- the number of edges
- the available formats (notably whether the graph is also available in its
  compressed representation)
- the total on-disk size of the dataset
- the buckets/URIs to obtain the graph from S3 and from the annex
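
One possible way to obtain the total on-disk size for this list is ``du``. The
helper below is a sketch; the function name ``dataset_size`` is hypothetical,
and the path in the usage note is simply the export directory used in the
examples above.

```shell
# Print the total on-disk size of a dataset directory in human-readable units.
dataset_size() {
    # -s prints a single summary line and -h uses human-readable units;
    # du separates size and path with a tab, so the first field is the size.
    du -sh -- "$1" | cut -f1
}
```

For instance, ``dataset_size /srv/softwareheritage/hdd/graph/2022-04-25/orc``
prints the size of the ORC export.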