diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
index 15e20f2..25ee257 100644
--- a/docs/graph/athena.rst
+++ b/docs/graph/athena.rst
@@ -1,115 +1,125 @@
 Setup on Amazon Athena
 ======================

 The Software Heritage Graph Dataset is available as a public dataset in
 `Amazon Athena `_. Athena uses `presto `_, a distributed SQL query engine, to
 automatically scale queries on large datasets.

 The pricing of Athena depends on the amount of data scanned by each query,
 generally at a cost of $5 per TiB of data scanned. Full pricing details are
 available `here `_.

 Note that because the Software Heritage Graph Dataset is available as a public
 dataset, you **do not have to pay for the storage, only for the queries**
 (except for the data you store on S3 yourself, like query results).


 Loading the tables
 ------------------

 .. highlight:: bash

 AWS account
 ~~~~~~~~~~~

 In order to use Amazon Athena, you will first need to `create an AWS account
 and set up billing `_.

+You will also need to create an **output S3 bucket**: this is the place where
+Athena will store your query results, so that you can retrieve them and analyze
+them afterwards. To do that, go to the `S3 console
+`_ and create a new bucket.
+
 Setup
 ~~~~~

 Athena needs to be made aware of the location and the schema of the Parquet
 files available as a public dataset. Unfortunately, since Athena does not
 support queries that contain multiple commands, it is not as simple as pasting
 an installation script in the console. Instead, we provide a Python script that
 can be run locally on your machine and that communicates with Athena to create
 the tables automatically with the appropriate schema.

 To run this script, you will need to install a few dependencies on your
 machine:

 - For **Ubuntu** and **Debian**::

     sudo apt install python3 python3-boto3 awscli

 - For **Archlinux**::

     sudo pacman -S --needed python python-boto3 aws-cli

 Once the dependencies are installed, run::

     aws configure

 This will ask for an AWS Access Key ID and an AWS Secret Access Key in order
 to give Python access to your AWS account. These keys can be generated at
 `this address `_.

 It will also ask for the region in which you want to run the queries. We
 recommend using ``us-east-1``, since that's where the public dataset is
 located.

 Creating the tables
 ~~~~~~~~~~~~~~~~~~~

 Download and run the Python script that will create the tables on your
 account:

 .. tabs::

   .. group-tab:: full

     ::

-      wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/tables.py
-      wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/gen_schema.py
-      ./gen_schema.py
+      wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
+      python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/'

   .. group-tab:: teaser: popular-4k

-    This teaser is not available on Athena yet.
+    ::
+
+      wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
+      python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular4k -l 's3://softwareheritage/teasers/popular-4k'

   .. group-tab:: teaser: popular-3k-python

-    This teaser is not available on Athena yet.
+    ::
+
+      wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
+      python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular3kpython -l 's3://softwareheritage/teasers/popular-3k-python'

 To check that the tables have been successfully created in your account, you
 can open your `Amazon Athena console `_. You should be able to select the
 database corresponding to your dataset, and see the tables:

 .. image:: _images/athena_tables.png

 Running queries
 ---------------

 .. highlight:: sql

 From the console, once you have selected the database of your dataset, you can
 run SQL queries directly from the Query Editor. Try for instance this query
 that computes the most frequent file names in the archive::

     SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
     FROM directory_entry_file
     GROUP BY name
     ORDER BY cnt DESC
     LIMIT 10;

 Other examples are available in the preprint of our article: `The Software
 Heritage Graph Dataset: Public software development under one roof. `_
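
The file-name query can also be submitted from a script instead of the Query Editor. The following is a minimal sketch and is not part of the dataset tooling. It assumes the ``boto3`` package and the credentials configured with ``aws configure``. The helper name ``run_query`` and the database name ``YOUR_DATABASE`` are placeholders chosen for illustration, and ``YOUR_OUTPUT_BUCKET`` stands for the output bucket created earlier.

```python
import time

# The query from the section above, without the trailing semicolon.
QUERY = """
SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10
"""


def run_query(client, query, database, output_location, poll_seconds=1.0):
    """Submit a query to Athena, wait for it to finish, and return its rows.

    `client` is expected to behave like boto3.client("athena").
    Only the first page of results is fetched, which is enough for LIMIT 10.
    """
    started = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} {status['State']}: {reason}")

    result = client.get_query_results(QueryExecutionId=query_id)
    # For SELECT queries the first returned row holds the column names.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]


# Example use (needs AWS credentials; both names below are placeholders):
#   import boto3
#   client = boto3.client("athena", region_name="us-east-1")
#   for row in run_query(client, QUERY, "YOUR_DATABASE", "s3://YOUR_OUTPUT_BUCKET/"):
#       print(row)
```

Passing the client as an argument keeps the helper independent of how credentials and the region are configured. Every run is billed like a console query, according to the amount of data scanned.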