diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
--- a/docs/graph/athena.rst
+++ b/docs/graph/athena.rst
@@ -39,28 +39,21 @@
 Athena needs to be made aware of the location and the schema of the Parquet
 files available as a public dataset. Unfortunately, since Athena does not
 support queries that contain multiple commands, it is not as simple as pasting
-an installation script in the console. Instead, we provide a Python script that
-can be run locally on your machine, that will communicate with Athena to create
+an installation script in the console. Instead, you can use the ``swh dataset
+athena`` command on your local machine, which will query Athena to create
 the tables automatically with the appropriate schema.
 
-To run this script, you will need to install a few dependencies on your
-machine:
+First, install the ``swh.dataset`` Python module from PyPI::
 
-- For **Ubuntu** and **Debian**::
-
-    sudo apt install python3 python3-boto3 awscli
-
-- For **Archlinux**::
-
-    sudo pacman -S --needed python python-boto3 aws-cli
+    pip install swh.dataset
 
 Once the dependencies are installed, run::
 
-  aws configure
+    aws configure
 
 This will ask for an AWS Access Key ID and an AWS Secret Access Key in
-order to give Python access to your AWS account. These keys can be generated at
-`this address
+order to give the Boto3 library access to your AWS account. These keys can be
+generated at `this address
 `_.
 
 It will also ask for the region in which you want to run the queries. We
@@ -70,30 +63,14 @@
 Creating the tables
 ~~~~~~~~~~~~~~~~~~~
 
-Download and run the Python script that will create the tables on your account:
-
-.. tabs::
-
-   .. group-tab:: full
-
-      ::
-
-         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
-         python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/'
+The ``swh dataset athena create`` command can be used to create the tables on
+your Athena instance. For example, to create the tables of the 2021-03-23
+graph::
 
-   .. group-tab:: teaser: popular-4k
-
-      ::
-
-         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
-         python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular4k -l 's3://softwareheritage/teasers/popular-4k'
-
-   .. group-tab:: teaser: popular-3k-python
-
-      ::
-
-         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
-         python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular3kpython -l 's3://softwareheritage/teasers/popular-3k-python'
+    swh dataset athena create \
+        --database-name swh_graph_2021_03_23 \
+        --location-prefix s3://softwareheritage/graph/2021-03-23/orc \
+        --output-location s3://YOUR_OUTPUT_BUCKET/
 
 To check that the tables have been successfully created in your account, you
 can open your `Amazon Athena console
@@ -115,7 +92,7 @@
 archive::
 
     SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
-    FROM directory_entry_file
+    FROM directory_entry
     GROUP BY name
     ORDER BY cnt DESC
     LIMIT 10;
@@ -123,3 +100,11 @@
 Other examples are available in the preprint of our article: `The Software
 Heritage Graph Dataset: Public software development under one roof.
 `_
+
+It is also possible to query Athena directly from the command line, using the
+``swh dataset athena query`` command::
+
+    echo "select message from revision limit 10;" |
+    swh dataset athena query \
+        --database-name swh_graph_2021_03_23 \
+        --output-location s3://YOUR_OUTPUT_BUCKET/
diff --git a/docs/graph/databricks.rst b/docs/graph/databricks.rst
--- a/docs/graph/databricks.rst
+++ b/docs/graph/databricks.rst
@@ -37,7 +37,6 @@
 
     [FileInfo(path='abfss://.../swh/content/', name='content/', size=0),
      FileInfo(path='abfss://.../swh/directory/', name='directory/', size=0),
-     FileInfo(path='abfss://.../swh/directory_entry_dir/', name='directory_entry_dir/', size=0),
      ...]
 
 Loading the tables
@@ -55,19 +54,16 @@
 tables = [
     'content',
     'directory',
-    'directory_entry_dir',
-    'directory_entry_file',
-    'directory_entry_rev',
+    'directory_entry',
     'origin',
     'origin_visit',
-    'person',
+    'origin_visit_status',
     'release',
     'revision',
     'revision_history',
     'skipped_content',
     'snapshot',
     'snapshot_branch',
-    'snapshot_branches'
 ]
 
 for table in tables:
diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst
--- a/docs/graph/schema.rst
+++ b/docs/graph/schema.rst
@@ -3,12 +3,14 @@
 
 The Merkle DAG of the Software Heritage archive is encoded in the dataset as a
 set of relational tables.
+
+This page documents the relational schema of the **latest version** of the
+graph dataset.
+
 A simplified view of the corresponding database schema is shown here:
 
 .. image:: _images/dataset-schema.svg
 
-This page documents the details of the schema.
-
 **Note**: To limit abuse, some columns containing personal information are
 pseudonimized in the dataset using a hash algorithm. Individual authors may be
 retrieved by querying the Software Heritage API.