diff --git a/docs/export.rst b/docs/export.rst
--- a/docs/export.rst
+++ b/docs/export.rst
@@ -1,3 +1,5 @@
+.. _swh-graph-export:
+
 ===================
 Exporting a dataset
 ===================
@@ -9,6 +11,9 @@
 Graph dataset
 =============
 
+Exporting the full dataset
+--------------------------
+
 Right now, the only supported export pipeline is the *Graph Dataset*, a set of
 relational tables representing the Software Heritage Graph, as documented in
 :ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
@@ -20,11 +25,14 @@
 identifier used by the Kafka server to store the current progress of the
 export.
 
+**Note**: exporting in the ``edges`` format is discouraged, as it is redundant
+and can easily be generated directly from the ORC format.
+
 Here is an example command to start a graph dataset export::
 
     swh dataset -C graph_export_config.yml graph export \
         --formats orc \
-        --export-id seirl-2022-04-25 \
+        --export-id 2022-04-25 \
         -p 64 \
         /srv/softwareheritage/hdd/graph/2022-04-25
 
@@ -54,3 +62,37 @@
 ``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
 all the pull requests that are present in Software Heritage (archived with
 ``git clone --mirror``).
+
+
+Uploading to S3 and to the annex
+--------------------------------
+
+The dataset should then be made publicly available by uploading it to S3 and to
+the public annex.
+
+For S3::
+
+    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc
+
+For the annex::
+
+    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
+    ssh saam.internal.softwareheritage.org
+    cd /srv/softwareheritage/annex/public/dataset/graph
+    git annex add 2022-04-25
+    git annex sync --content
+
+
+Documenting the new dataset
+---------------------------
+
+In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
+to document the availability of the new dataset. You should usually mention:
+
+- the name of the dataset version (e.g., 2022-04-25)
+- the number of nodes
+- the number of edges
+- the available formats (notably whether the graph is also available in its
+  compressed representation)
+- the total on-disk size of the dataset
+- the buckets/URIs to obtain the graph from S3 and from the annex
diff --git a/docs/generate_subdataset.rst b/docs/generate_subdataset.rst
new file mode 100644
--- /dev/null
+++ b/docs/generate_subdataset.rst
@@ -0,0 +1,84 @@
+.. _swh-graph-export-subdataset:
+
+======================
+Exporting a subdataset
+======================
+
+.. highlight:: bash
+
+Because the entire graph is often too big to be practical for many research use
+cases, notably for prototyping, it is generally useful to publish "subdatasets"
+which only contain a subset of the entire graph.
+An example of a very useful subdataset is the graph containing only the top
+1000 most popular GitHub repositories (sorted by number of stars).
+
+This page details the various steps required to export a graph subdataset using
+swh-graph and Amazon Athena.
+
+
+Step 1. Obtain the list of origins
+----------------------------------
+
+You first need to obtain a list of origins that you want to include in the
+subdataset. Depending on the type of subdataset you want to create, this can be
+done in various ways, either manual or automated. The following is an example
+of how to get the list of the 1000 most popular GitHub repositories written in
+Python, sorted by number of stars::
+
+    for i in $( seq 1 10 ); do \
+        curl -G https://api.github.com/search/repositories \
+             -d "page=$i" \
+             -d "sort=stars" -d "order=desc" -d "q=language:python" -d 'per_page=100' | \
+        jq --raw-output '.items[].html_url'; \
+        sleep 6; \
+    done > origins.txt
+
+
+Step 2. Build the list of SWHIDs
+--------------------------------
+
+To generate a subdataset from an existing dataset, you need to generate the
+list of all the SWHIDs to include in the subdataset. The best way to achieve
+that is to use the compressed graph to perform a full visit starting from the
+origin nodes, and to return the list of all the SWHIDs that are reachable from
+these origins.
+
+Unfortunately, there is currently no endpoint in the HTTP API to start a
+traversal from multiple nodes. The current best way to achieve this is
+therefore to visit the graph starting from each origin, one by one, and then to
+merge all the resulting lists of SWHIDs into a single sorted list of unique
+SWHIDs.
+
+If you use the internal graph API, you might need to convert the origin URLs to
+the extended SWHID format (``swh:1:ori:``) to query the API.
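+
+For instance, here is a minimal sketch of such a traversal, assuming a
+compressed graph server is available locally (here on the default port, 5009;
+adjust to your deployment) and that ``origins.txt`` was produced as in step 1.
+The inline Python computes each origin's extended SWHID, which is
+``swh:1:ori:`` followed by the SHA1 hash of the origin URL::
+
+    # Sketch only: host, port and file names are assumptions.
+    while read -r url; do
+        # Extended SWHID of an origin: "swh:1:ori:" + SHA1 of the origin URL.
+        swhid=$(python3 -c "import hashlib, sys; print('swh:1:ori:' + hashlib.sha1(sys.argv[1].encode()).hexdigest())" "$url")
+        # Full visit from this origin, printing one reachable SWHID per line.
+        curl -s "http://localhost:5009/graph/visit/nodes/$swhid"
+    done < origins.txt | sort -u > swhids.csv
+
+The resulting ``swhids.csv`` file can be passed directly to the next step.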
+
+
+Step 3. Generate the subdataset on Athena
+-----------------------------------------
+
+Once you have obtained a text file containing all the SWHIDs to be included in
+the new dataset, it is possible to use AWS Athena to JOIN this list of SWHIDs
+with the tables of an existing dataset, and write the output as a new ORC
+dataset.
+
+First, make sure that your base dataset containing the entire graph is
+available as a database on AWS Athena, which can be set up by following the
+steps described in :ref:`swh-graph-athena`.
+
+The subdataset can then be generated with the ``swh dataset athena
+gensubdataset`` command::
+
+    swh dataset athena gensubdataset \
+        --swhids swhids.csv \
+        --database swh_20210323 \
+        --subdataset-database swh_20210323_popular3kpython \
+        --subdataset-location s3://softwareheritage/graph/2021-03-23-popular-3k-python/
+
+
+Step 4. Upload and document the newly generated subdataset
+----------------------------------------------------------
+
+After having executed the previous step, there should now be a new dataset
+located at the S3 path given as the parameter to ``--subdataset-location``.
+You can upload, publish and document this new subdataset by following the
+procedure described in :ref:`swh-graph-export`.
diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
--- a/docs/graph/athena.rst
+++ b/docs/graph/athena.rst
@@ -1,3 +1,5 @@
+.. _swh-graph-athena:
+
 Setup on Amazon Athena
 ======================
 
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -13,3 +13,4 @@
 
    graph/index
    export
+   generate_subdataset