diff --git a/docs/export.rst b/docs/export.rst
index 56a17d3..c7711d4 100644
--- a/docs/export.rst
+++ b/docs/export.rst
@@ -1,56 +1,98 @@
+.. _swh-graph-export:
+
===================
Exporting a dataset
===================

This repository aims to contain various pipelines to generate datasets of
Software Heritage data, so that they can be used internally or by external
researchers.

Graph dataset
=============

+Exporting the full dataset
+--------------------------
+
Right now, the only supported export pipeline is the *Graph Dataset*, a set of
relational tables representing the Software Heritage Graph, as documented in
:ref:`swh-graph-dataset`.

It can be run using the ``swh dataset graph export`` command.

This dataset can be exported in two different formats: ``orc`` and ``edges``.
To export a graph, you need to provide a comma-separated list of formats to
export with the ``--formats`` option. You also need an export ID, a unique
identifier used by the Kafka server to store the current progress of the
export.

+**Note**: exporting in the ``edges`` format is discouraged, as it is redundant
+and can easily be generated directly from the ORC format.
+
Here is an example command to start a graph dataset export::

    swh dataset -C graph_export_config.yml graph export \
        --formats orc \
-        --export-id seirl-2022-04-25 \
+        --export-id 2022-04-25 \
        -p 64 \
        /srv/softwareheritage/hdd/graph/2022-04-25

This command usually takes more than a week for a full export; it is therefore
advised to run it as a service or in a tmux session.

The configuration file should contain the configuration for the swh-journal
clients, as well as various configuration options for the exporters. Here is
an example configuration file::

    journal:
      brokers:
        - kafka1.internal.softwareheritage.org:9094
        - kafka2.internal.softwareheritage.org:9094
        - kafka3.internal.softwareheritage.org:9094
        - kafka4.internal.softwareheritage.org:9094
      security.protocol: SASL_SSL
      sasl.mechanisms: SCRAM-SHA-512
      max.poll.interval.ms: 1000000

    remove_pull_requests: true

The following configuration options can be used for the export:

- ``remove_pull_requests``: remove all edges from origin to snapshot matching
  ``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
  all the pull requests that are present in Software Heritage (archived with
  ``git clone --mirror``).
+
+
+Uploading on S3 & on the annex
+------------------------------
+
+The dataset should then be made available publicly by uploading it to S3 and to
+the public annex.
+
+For S3::
+
+    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc
+
+For the annex::
+
+    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
+    ssh saam.internal.softwareheritage.org
+    cd /srv/softwareheritage/annex/public/dataset/graph
+    git annex add 2022-04-25
+    git annex sync --content
+
+
+Documenting the new dataset
+---------------------------
+
+In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
+to document the availability of the new dataset. You should usually mention:
+
+- the name of the dataset version (e.g., 2022-04-25)
+- the number of nodes
+- the number of edges
+- the available formats (notably whether the graph is also available in its
+  compressed representation)
+- the total on-disk size of the dataset
+- the buckets/URIs to obtain the graph from S3 and from the annex
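+
+The on-disk size figures can be collected with standard tools. The following is
+only a sketch (the paths reuse the example export above; any equivalent method
+works)::
+
+    # Total size of the local ORC export
+    du -sh /srv/softwareheritage/hdd/graph/2022-04-25/orc
+
+    # Per-table breakdown
+    du -sh /srv/softwareheritage/hdd/graph/2022-04-25/orc/*
+
+    # Size of the copy uploaded to S3
+    aws s3 ls --recursive --human-readable --summarize \
+        s3://softwareheritage/graph/2022-04-25/orc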
diff --git a/docs/generate_subdataset.rst b/docs/generate_subdataset.rst
new file mode 100644
index 0000000..2d729e6
--- /dev/null
+++ b/docs/generate_subdataset.rst
@@ -0,0 +1,84 @@
+.. _swh-graph-export-subdataset:
+
+======================
+Exporting a subdataset
+======================
+
+.. highlight:: bash
+
+Because the entire graph is often too big to be practical for many research use
+cases, notably for prototyping, it is generally useful to publish "subdatasets"
+that contain only a subset of the entire graph.
+An example of a very useful subdataset is the graph containing only the top
+1000 most popular GitHub repositories (sorted by number of stars).
+
+This page details the various steps required to export a graph subdataset using
+swh-graph and Amazon Athena.
+
+
+Step 1. Obtain the list of origins
+----------------------------------
+
+You first need to obtain a list of origins that you want to include in the
+subdataset. Depending on the type of subdataset you want to create, this can be
+done in various ways, either manual or automated. The following is an example
+of how to get the list of the 1000 most popular GitHub repositories in the
+Python language, sorted by number of stars::
+
+    for i in $( seq 1 10 ); do \
+        curl -G https://api.github.com/search/repositories \
+            -d "page=$i" \
+            -d "sort=stars" -d "order=desc" -d "q=language:python" -d 'per_page=100' | \
+        jq --raw-output '.items[].html_url'; \
+        sleep 6; \
+    done > origins.txt
+
+
+Step 2. Build the list of SWHIDs
+--------------------------------
+
+To generate a subdataset from an existing dataset, you need to generate the
+list of all the SWHIDs to include in the subdataset. The best way to achieve
+that is to use the compressed graph to perform a full visit starting from the
+origin nodes, and to return the list of all the SWHIDs that are reachable from
+these origins.
+
+Unfortunately, there is currently no endpoint in the HTTP API to start a
+traversal from multiple nodes. The current best way to achieve this is
+therefore to visit the graph starting from each origin, one by one, and then to
+merge all the resulting lists of SWHIDs into a single sorted list of unique
+SWHIDs.
+
+If you use the internal graph API, you might need to convert the origin URLs to
+the Extended SWHID format (``swh:1:ori:``) to query the API.
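+
+As an illustration, a possible way to do this against a running compressed
+graph server is sketched below. The endpoint name, the default port and the
+computation of origin SWHIDs are assumptions and should be checked against the
+version of swh-graph you are using::
+
+    # For each origin URL, run a full visit from its origin node, collect all
+    # reachable SWHIDs, then merge and deduplicate the results.
+    # Assumptions: a local swh-graph server on port 5009 exposing
+    # /graph/visit/nodes/, and origin SWHIDs of the form swh:1:ori:<sha1(URL)>.
+    while read -r url; do
+        swhid="swh:1:ori:$(printf '%s' "$url" | sha1sum | cut -d' ' -f1)"
+        curl -s "http://localhost:5009/graph/visit/nodes/$swhid"
+    done < origins.txt | sort -u > swhids.csv
+
+The resulting ``swhids.csv`` file can then be fed directly to step 3 below.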
+
+
+Step 3. Generate the subdataset on Athena
+-----------------------------------------
+
+Once you have obtained a text file containing all the SWHIDs to be included in
+the new dataset, it is possible to use AWS Athena to JOIN this list of SWHIDs
+with the tables of an existing dataset, and write the output as a new ORC
+dataset.
+
+First, make sure that your base dataset containing the entire graph is
+available as a database on AWS Athena, which can be set up by
+following the steps described in :ref:`swh-graph-athena`.
+
+The subdataset can then be generated with the
+``swh dataset athena gensubdataset`` command::
+
+    swh dataset athena gensubdataset \
+        --swhids swhids.csv \
+        --database swh_20210323 \
+        --subdataset-database swh_20210323_popular3kpython \
+        --subdataset-location s3://softwareheritage/graph/2021-03-23-popular-3k-python/
+
+
+Step 4. Upload and document the newly generated subdataset
+------------------------------------------------------------
+
+After having executed the previous step, there should now be a new dataset
+located at the S3 path given as the parameter to ``--subdataset-location``.
+You can upload, publish and document this new subdataset by following the
+procedure described in :ref:`swh-graph-export`.
diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
index 158e5a8..2e6f551 100644
--- a/docs/graph/athena.rst
+++ b/docs/graph/athena.rst
@@ -1,110 +1,112 @@
+.. _swh-graph-athena:
+
Setup on Amazon Athena
======================

The Software Heritage Graph Dataset is available as a public dataset in
`Amazon Athena `_. Athena uses `presto `_, a distributed SQL query engine, to
automatically scale queries on large datasets.

The pricing of Athena depends on the amount of data scanned by each query,
generally at a cost of $5 per TiB of data scanned. Full pricing details are
available `here `_.

Note that because the Software Heritage Graph Dataset is available as a public
dataset, you **do not have to pay for the storage, only for the queries**
(except for the data you store on S3 yourself, like query results).

Loading the tables
------------------

.. highlight:: bash

AWS account
~~~~~~~~~~~

In order to use Amazon Athena, you will first need to `create an AWS account
and setup billing `_.

You will also need to create an **output S3 bucket**: this is the place where
Athena will store your query results, so that you can retrieve them and
analyze them afterwards. To do that, go to the `S3 console `_ and create a new
bucket.

Setup
~~~~~

Athena needs to be made aware of the location and the schema of the ORC files
available as a public dataset. Unfortunately, since Athena does not support
queries that contain multiple commands, it is not as simple as pasting an
installation script in the console.

Instead, you can use the ``swh dataset athena`` command on your local machine,
which will query Athena to create the tables automatically with the
appropriate schema.

First, install the ``swh.dataset`` Python module from PyPI::

    pip install swh.dataset

Once the dependencies are installed, run::

    aws configure

This will ask for an AWS Access Key ID and an AWS Secret Access Key in order
to give the Boto3 library access to your AWS account. These keys can be
generated at `this address `_.

It will also ask for the region in which you want to run the queries. We
recommend using ``us-east-1``, since that is where the public dataset is
located.

Creating the tables
~~~~~~~~~~~~~~~~~~~

The ``swh dataset athena create`` command can be used to create the tables on
your Athena instance. For example, to create the tables of the 2021-03-23
graph::

    swh dataset athena create \
        --database-name swh_graph_2021_03_23 \
        --location-prefix s3://softwareheritage/graph/2021-03-23/orc \
        --output-location s3://YOUR_OUTPUT_BUCKET/

To check that the tables have been successfully created in your account, you
can open your `Amazon Athena console `_. You should be able to select the
database corresponding to your dataset, and see the tables:

.. image:: _images/athena_tables.png
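+
+Alternatively, you can check the result from the command line with the
+``swh dataset athena query`` subcommand described in the next section. This is
+only a quick sanity check, assuming the subcommand accepts ``SHOW TABLES`` like
+any other query::
+
+    echo "SHOW TABLES;" | swh dataset athena query \
+        --database-name swh_graph_2021_03_23 \
+        --output-location s3://YOUR_OUTPUT_BUCKET/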
Running queries
---------------

From the console, once you have selected the database of your dataset, you can
run SQL queries directly from the Query Editor.

Try for instance this query that computes the most frequent file names in the
archive:

.. code-block:: sql

    SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
    FROM directory_entry
    GROUP BY name
    ORDER BY cnt DESC
    LIMIT 10;

Other examples are available in the preprint of our article: `The Software
Heritage Graph Dataset: Public software development under one roof. `_

It is also possible to query Athena directly from the command line, using the
``swh dataset athena query`` command::

    echo "select message from revision limit 10;" | swh dataset athena query \
        --database-name swh_graph_2021_03_23 \
        --output-location s3://YOUR_OUTPUT_BUCKET/
diff --git a/docs/index.rst b/docs/index.rst
index dbc4026..79fca1b 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,15 +1,16 @@
.. _swh-dataset:

.. include:: README.rst

:ref:`The Software Heritage Graph Dataset <swh-graph-dataset>`: the entire
graph of Software Heritage in a fully-deduplicated Merkle DAG representation.

.. toctree::
   :maxdepth: 2
   :caption: Contents:
   :titlesonly:

   graph/index
   export
+   generate_subdataset