D7693.diff
diff --git a/docs/export.rst b/docs/export.rst
new file mode 100644
--- /dev/null
+++ b/docs/export.rst
@@ -0,0 +1,56 @@
+===================
+Exporting a dataset
+===================
+
+This repository aims to contain various pipelines to generate datasets of
+Software Heritage data, so that they can be used internally or by external
+researchers.
+
+Graph dataset
+=============
+
+Right now, the only supported export pipeline is the *Graph Dataset*, a set of
+relational tables representing the Software Heritage Graph, as documented in
+:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
+command.
+
+This dataset can be exported in two different formats: ``orc`` and ``edges``.
+To export a graph, pass a comma-separated list of formats to the ``--formats``
+option. You also need to pass an export ID with ``--export-id``: a unique
+identifier used by the Kafka server to store the current progress of the
+export.
+
+Here is an example command to start a graph dataset export::
+
+    swh dataset -C graph_export_config.yml graph export \
+        --formats orc \
+        --export-id seirl-2022-04-25 \
+        -p 64 \
+        /srv/softwareheritage/hdd/graph/2022-04-25
+
+A full export usually takes more than a week, so it is advisable to run
+this command in a service or a tmux session.
+
+The configuration file should contain the configuration for the swh-journal
+clients, as well as various configuration options for the exporters. Here is an
+example configuration file::
+
+    journal:
+        brokers:
+            - kafka1.internal.softwareheritage.org:9094
+            - kafka2.internal.softwareheritage.org:9094
+            - kafka3.internal.softwareheritage.org:9094
+            - kafka4.internal.softwareheritage.org:9094
+        security.protocol: SASL_SSL
+        sasl.mechanisms: SCRAM-SHA-512
+        max.poll.interval.ms: 1000000
+
+    remove_pull_requests: true
+
+
+The following configuration options can be used for the export:
+
+- ``remove_pull_requests``: remove all edges from origin to snapshot matching
+  ``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
+  all the pull requests that are present in Software Heritage (archived with
+  ``git clone --mirror``).
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -12,3 +12,4 @@
:titlesonly:
graph/index
+ export
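The ``remove_pull_requests`` rule documented in ``docs/export.rst`` can be illustrated with a short Python sketch. This is not the exporter's actual implementation: the function names ``is_pull_request_like`` and ``filter_branches`` are hypothetical, and branch names are assumed to be plain strings. It only shows which names the documented patterns would drop.

```python
from fnmatch import fnmatchcase


def is_pull_request_like(branch_name: str) -> bool:
    """Return True for names under refs/* that are neither heads nor tags.

    Hypothetical helper mirroring the documented rule; the real exporter
    may implement the check differently.
    """
    in_refs = fnmatchcase(branch_name, "refs/*")
    is_head_or_tag = (
        fnmatchcase(branch_name, "refs/heads/*")
        or fnmatchcase(branch_name, "refs/tags/*")
    )
    return in_refs and not is_head_or_tag


def filter_branches(branch_names):
    """Keep only the branch names that the rule would not remove."""
    return [name for name in branch_names if not is_pull_request_like(name)]


# Illustrative branch names, as a mirror clone might expose them.
branches = [
    "HEAD",
    "refs/heads/master",
    "refs/tags/v1.0",
    "refs/pull/42/head",
    "refs/merge-requests/7/head",
]
kept = filter_branches(branches)
print(kept)
```

Names outside ``refs/*``, such as ``HEAD``, are kept because the documented rule only targets entries under ``refs/``.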
Attached To
D7693: docs: document graph dataset export