.. _swh-graph-export-subdataset:

======================
Exporting a subdataset
======================

.. highlight:: bash

Because the entire graph is often too big to be practical for many research use
cases, notably for prototyping, it is generally useful to publish "subdatasets"
that contain only a subset of the entire graph.

An example of a very useful subdataset is the graph containing only the 1000
most popular GitHub repositories (sorted by number of stars).

This page details the various steps required to export a graph subdataset using
swh-graph and Amazon Athena.

Step 1. Obtain the list of origins
----------------------------------

You first need to obtain a list of origins that you want to include in the
subdataset. Depending on the type of subdataset you want to create, this can be
done in various ways, either manually or automatically. The following is an
example of how to get the list of the 1000 most popular GitHub repositories
written in Python, sorted by number of stars::

    for i in $( seq 1 10 ); do \
        curl -G https://api.github.com/search/repositories \
            -d "page=$i" \
            -d "sort=stars" -d "order=desc" -d "q=language:python" -d 'per_page=100' | \
        jq --raw-output '.items[].html_url'; \
        sleep 6; \
    done > origins.txt
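
The ``sleep 6`` between requests keeps the loop under the rate limit of the
GitHub search API, which only allows 10 requests per minute for unauthenticated
clients; authenticating with a token raises that limit.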

Step 2. Build the list of SWHIDs
--------------------------------

To generate a subdataset from an existing dataset, you need to generate the
list of all the SWHIDs to include in the subdataset. The best way to achieve
that is to perform a full visit of the compressed graph starting from the
origin nodes, and to return the list of all the SWHIDs that are reachable from
these origins.

Unfortunately, there is currently no endpoint in the HTTP API to start a
traversal from multiple nodes. The current best way to achieve this is
therefore to visit the graph starting from each origin, one by one, and then to
merge all the resulting lists of SWHIDs into a single sorted list of unique
SWHIDs.
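
As a rough illustration, here is a minimal sketch of this per-origin visit and
merge, assuming a local compressed graph server (e.g. started with ``swh graph
rpc-serve``) listening on the default port 5009, and a file
``origin_swhids.txt`` containing one origin SWHID per line (see the note below
on converting origin URLs)::

    # Sketch only: assumes a swh-graph HTTP API at localhost:5009 and a file
    # origin_swhids.txt with one "swh:1:ori:..." identifier per line.
    while read -r swhid; do
        curl -s "http://localhost:5009/graph/visit/nodes/$swhid"
    done < origin_swhids.txt | sort -u > swhids.csv

``sort -u`` takes care of deduplicating the SWHIDs that are reachable from more
than one origin.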

If you use the internal graph API, you might need to convert the origin URLs to
the Extended SWHID format (``swh:1:ori:<sha1(url)>``) to query the API.
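
For example, here is a minimal sketch of this conversion (the origin hash is
the SHA1 of the URL itself, with no trailing newline)::

    # Build extended origin SWHIDs (swh:1:ori:<sha1(url)>) from origins.txt.
    while read -r url; do
        printf 'swh:1:ori:%s\n' "$(printf '%s' "$url" | sha1sum | awk '{print $1}')"
    done < origins.txt > origin_swhids.txt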

Step 3. Generate the subdataset on Athena
-----------------------------------------

Once you have obtained a text file containing all the SWHIDs to be included in
the new dataset, it is possible to use AWS Athena to JOIN this list of SWHIDs
with the tables of an existing dataset, and write the output as a new ORC
dataset.

First, make sure that your base dataset containing the entire graph is
available as a database on AWS Athena, which can be set up by following the
steps described in :ref:`swh-graph-athena`.
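
If the base database does not exist yet, it can be created from the ORC export
with the ``swh dataset athena create`` subcommand; the exact flags and S3 paths
below are indicative only, check :ref:`swh-graph-athena` for the authoritative
invocation::

    # Indicative sketch: option names and paths are assumptions and may
    # differ between swh-dataset versions.
    swh dataset athena create \
        --database-name swh_20210323 \
        --location-prefix s3://softwareheritage/graph/2021-03-23/orc \
        --output-location s3://softwareheritage/graph/tmp/athena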

The subdataset can then be generated with the ``swh dataset athena
gensubdataset`` command::

    swh dataset athena gensubdataset \
        --swhids swhids.csv \
        --database swh_20210323 \
        --subdataset-database swh_20210323_popular3kpython \
        --subdataset-location s3://softwareheritage/graph/2021-03-23-popular-3k-python/
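
You can then check that the ORC files of the subdataset were written as
expected, for instance with::

    aws s3 ls --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/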

Step 4. Upload and document the newly generated subdataset
-----------------------------------------------------------

Once the previous step has completed, there should be a new dataset located at
the S3 path given as the parameter to ``--subdataset-location``. You can
upload, publish and document this new subdataset by following the procedure
described in :ref:`swh-graph-export`.