diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
index 188334e..158e5a8 100644
--- a/docs/graph/athena.rst
+++ b/docs/graph/athena.rst
@@ -1,125 +1,110 @@
Setup on Amazon Athena
======================
The Software Heritage Graph Dataset is available as a public dataset in `Amazon
Athena <https://aws.amazon.com/athena/>`_. Athena uses `Presto
<https://prestodb.io/>`_, a distributed SQL query engine, to automatically
scale queries on large datasets.
The pricing of Athena depends on the amount of data scanned by each query,
generally at a cost of $5 per TiB of data scanned. Full pricing details are
available `here <https://aws.amazon.com/athena/pricing/>`_.
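For example, a single query that scans 200 GiB of data costs roughly $1.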
Note that because the Software Heritage Graph Dataset is available as a public
dataset, you **do not have to pay for the storage, only for the queries**
(except for the data you store on S3 yourself, like query results).
Loading the tables
------------------
.. highlight:: bash
AWS account
~~~~~~~~~~~
In order to use Amazon Athena, you will first need to `create an AWS account
and set up billing
<https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/>`_.
You will also need to create an **output S3 bucket**: this is the place where
Athena will store your query results, so that you can retrieve them and analyze
them afterwards. To do that, go to the `S3 console
<https://console.aws.amazon.com/s3/>`_ and create a new bucket.
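If you prefer working from the command line, the output bucket can also be
created with the AWS CLI (the bucket name below is a placeholder; S3 bucket
names must be globally unique)::

    aws s3 mb s3://YOUR_OUTPUT_BUCKET --region us-east-1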
Setup
~~~~~
Athena needs to be made aware of the location and the schema of the Parquet
files available as a public dataset. Unfortunately, since Athena does not
support queries that contain multiple commands, it is not as simple as pasting
-an installation script in the console. Instead, we provide a Python script that
-can be run locally on your machine, that will communicate with Athena to create
+an installation script in the console. Instead, you can use the ``swh dataset
+athena`` command on your local machine, which will query Athena to create
the tables automatically with the appropriate schema.
-To run this script, you will need to install a few dependencies on your
-machine:
+First, install the ``swh.dataset`` Python module from PyPI, along with the AWS
+CLI (used in the next step to configure your credentials)::
-- For **Ubuntu** and **Debian**::
-
- sudo apt install python3 python3-boto3 awscli
-
-- For **Archlinux**::
-
- sudo pacman -S --needed python python-boto3 aws-cli
+    pip install swh.dataset awscli
Once the dependencies are installed, run::
- aws configure
+ aws configure
This will ask for an AWS Access Key ID and an AWS Secret Access Key in
-order to give Python access to your AWS account. These keys can be generated at
-`this address
+order to give the Boto3 library access to your AWS account. These keys can be
+generated at `this address
<https://console.aws.amazon.com/iam/home#/security_credentials>`_.
It will also ask for the region in which you want to run the queries. We
recommend using ``us-east-1``, since that's where the public dataset is
located.
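For reference, ``aws configure`` stores these settings under ``~/.aws/``. A
typical result looks like this (the key values below are placeholders)::

    # ~/.aws/config
    [default]
    region = us-east-1

    # ~/.aws/credentials
    [default]
    aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
    aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx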
Creating the tables
~~~~~~~~~~~~~~~~~~~
-Download and run the Python script that will create the tables on your account:
-
-.. tabs::
-
- .. group-tab:: full
-
- ::
-
- wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
- python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/'
-
- .. group-tab:: teaser: popular-4k
+The ``swh dataset athena create`` command can be used to create the tables on
+your Athena instance. For example, to create the tables of the 2021-03-23
+graph::
- ::
-
- wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
- python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular4k -l 's3://softwareheritage/teasers/popular-4k'
-
- .. group-tab:: teaser: popular-3k-python
-
- ::
-
- wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
- python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular3kpython -l 's3://softwareheritage/teasers/popular-3k-python'
+ swh dataset athena create \
+       --database-name swh_graph_2021_03_23 \
+       --location-prefix s3://softwareheritage/graph/2021-03-23/orc \
+ --output-location s3://YOUR_OUTPUT_BUCKET/
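+
+Under the hood, the command issues one ``CREATE EXTERNAL TABLE`` statement per
+table of the dataset. As a rough sketch (the real statements are generated by
+``swh.dataset`` and may differ in detail), the one for the ``origin`` table
+could look like:
+
+.. code-block:: sql
+
+    CREATE EXTERNAL TABLE IF NOT EXISTS swh_graph_2021_03_23.origin (
+        url binary
+    )
+    STORED AS ORC
+    LOCATION 's3://softwareheritage/graph/2021-03-23/orc/origin/';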
To check that the tables have been successfully created in your account, you
can open your `Amazon Athena console
<https://console.aws.amazon.com/athena/home>`_. You should be able to select
the database corresponding to your dataset, and see the tables:
.. image:: _images/athena_tables.png
Running queries
---------------
-.. highlight:: sql
-
From the console, once you have selected the database of your dataset, you can
run SQL queries directly from the Query Editor.
Try for instance this query that computes the most frequent file names in the
-archive::
+archive:
+
+.. code-block:: sql
- SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
- FROM directory_entry_file
- GROUP BY name
- ORDER BY cnt DESC
- LIMIT 10;
+ SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
+ FROM directory_entry
+ GROUP BY name
+ ORDER BY cnt DESC
+ LIMIT 10;
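+
+As another example, the following query counts revisions by the year they were
+committed (a simple sketch; ``year()`` is a standard Presto function that
+extracts the year from a timestamp):
+
+.. code-block:: sql
+
+    SELECT year(committer_date) AS year, COUNT(*) AS cnt
+    FROM revision
+    GROUP BY year(committer_date)
+    ORDER BY year DESC;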
Other examples are available in the preprint of our article: `The Software
Heritage Graph Dataset: Public software development under one roof.
`_
+
+It is also possible to query Athena directly from the command line, using the
+``swh dataset athena query`` command::
+
+ echo "select message from revision limit 10;" |
+ swh dataset athena query \
+       --database-name swh_graph_2021_03_23 \
+ --output-location s3://YOUR_OUTPUT_BUCKET/
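+
+If you would rather script queries yourself, the same query can be submitted
+directly with the Boto3 library. The snippet below is a minimal sketch,
+reusing the database name and output bucket from above:
+
+.. code-block:: python
+
+    import boto3
+
+    athena = boto3.client("athena", region_name="us-east-1")
+
+    # Submit the query; Athena writes the results as CSV files
+    # under the given output location.
+    response = athena.start_query_execution(
+        QueryString="SELECT message FROM revision LIMIT 10;",
+        QueryExecutionContext={"Database": "swh_graph_2021_03_23"},
+        ResultConfiguration={"OutputLocation": "s3://YOUR_OUTPUT_BUCKET/"},
+    )
+    print(response["QueryExecutionId"])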
diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst
index 3b3ede2..8905106 100644
--- a/docs/graph/schema.rst
+++ b/docs/graph/schema.rst
@@ -1,134 +1,136 @@
Relational schema
=================
The Merkle DAG of the Software Heritage archive is encoded in the dataset as a
set of relational tables.
+
+This page documents the relational schema of the **latest version** of the
+graph dataset.
+
A simplified view of the corresponding database schema is shown here:
.. image:: _images/dataset-schema.svg
-This page documents the details of the schema.
-
**Note**: To limit abuse, some columns containing personal information are
pseudonymized in the dataset using a hash algorithm. Individual authors may be
retrieved by querying the Software Heritage API.
- **content**: contains information on the contents stored in
the archive.
- ``sha1`` (string): the SHA-1 of the content (hexadecimal)
- ``sha1_git`` (string): the Git SHA-1 of the content (hexadecimal)
- ``sha256`` (string): the SHA-256 of the content (hexadecimal)
- ``blake2s256`` (string): the BLAKE2s-256 of the content (hexadecimal)
- ``length`` (integer): the length of the content
- ``status`` (string): the visibility status of the content
- **skipped_content**: contains information on the contents that were not
archived for various reasons.
- ``sha1`` (string): the SHA-1 of the skipped content (hexadecimal)
- ``sha1_git`` (string): the Git SHA-1 of the skipped content (hexadecimal)
- ``sha256`` (string): the SHA-256 of the skipped content (hexadecimal)
- ``blake2s256`` (string): the BLAKE2s-256 of the skipped content
(hexadecimal)
- ``length`` (integer): the length of the skipped content
- ``status`` (string): the visibility status of the skipped content
- ``reason`` (string): the reason why the content was skipped
- **directory**: contains the directories stored in the archive.
- ``id`` (string): the intrinsic hash of the directory (hexadecimal),
recursively computed with the Git SHA-1 algorithm
- **directory_entry**: contains the entries in directories.
- ``directory_id`` (string): the Git SHA-1 of the directory
containing the entry (hexadecimal).
- ``name`` (bytes): the name of the file (basename of its path)
- ``type`` (string): the type of object the entry points to (either
``revision``, ``directory`` or ``content``).
- ``target`` (string): the Git SHA-1 of the object this
entry points to (hexadecimal).
- ``perms`` (integer): the permissions of the object
- **revision**: contains the revisions stored in the archive.
- ``id`` (string): the intrinsic hash of the revision (hexadecimal),
recursively computed with the Git SHA-1 algorithm. For Git repositories,
this corresponds to the commit hash.
- ``message`` (bytes): the revision message
- ``author`` (string): an anonymized hash of the author of the revision.
- ``date`` (timestamp): the date the revision was authored
- ``date_offset`` (integer): the offset of the timezone of ``date``
- ``committer`` (string): an anonymized hash of the committer of the revision.
- ``committer_date`` (timestamp): the date the revision was committed
- ``committer_date_offset`` (integer): the offset of the timezone of
``committer_date``
- ``directory`` (string): the Git SHA-1 of the directory the revision points
to (hexadecimal). Every revision points to the root directory of the
project source tree to which it corresponds.
- **revision_history**: contains the ordered set of parents of each revision.
Each revision has an ordered set of parents (0 for the initial commit of a
repository, 1 for a regular commit, 2 for a regular merge commit and 3 or
more for octopus-style merge commits).
- ``id`` (string): the Git SHA-1 identifier of the revision (hexadecimal)
- ``parent_id`` (string): the Git SHA-1 identifier of the parent (hexadecimal)
- ``parent_rank`` (integer): the rank of the parent, which defines the
ordering between the parents of the revision
- **release**: contains the releases stored in the archive.
- ``id`` (string): the intrinsic hash of the release (hexadecimal),
recursively computed with the Git SHA-1 algorithm
- ``target`` (string): the Git SHA-1 of the object the release points to
(hexadecimal)
- ``date`` (timestamp): the date the release was created
- ``author`` (string): an anonymized hash of the author of the release.
- ``name`` (bytes): the release name
- ``message`` (bytes): the release message
- **snapshot**: contains the list of snapshots stored in the archive.
- ``id`` (string): the intrinsic hash of the snapshot (hexadecimal),
recursively computed with the Git SHA-1 algorithm.
- **snapshot_branch**: contains the list of branches associated with
each snapshot.
- ``snapshot_id`` (string): the intrinsic hash of the snapshot (hexadecimal)
- ``name`` (bytes): the name of the branch
- ``target`` (string): the intrinsic hash of the object the branch points to
(hexadecimal)
- ``target_type`` (string): the type of object the branch points to (either
``release``, ``revision``, ``directory`` or ``content``).
- **origin**: the software origins from which the projects in the dataset were
archived.
- ``url`` (bytes): the URL of the origin
- **origin_visit**: the different visits of each origin. Since Software
Heritage archives software continuously, software origins are crawled more
than once. Each of these "visits" is an entry in this table.
- ``origin`` (string): the URL of the origin visited
- ``visit`` (integer): an integer identifier of the visit
- ``date`` (timestamp): the date at which the origin was visited
- ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``, ``hg``,
``svn``, ``ftp``, ``deb``, ...)
- **origin_visit_status**: the status of each visit.
- ``origin`` (string): the URL of the origin visited
- ``visit`` (integer): an integer identifier of the visit
- ``date`` (timestamp): the date at which the origin was visited
- ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``, ``hg``,
``svn``, ``ftp``, ``deb``, ...)
- ``snapshot_id`` (string): the intrinsic hash of the snapshot archived in
this visit (hexadecimal).
- ``status`` (string): the status of the visit, either ``partial`` for partial
visits or ``full`` for full visits.
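As an illustration of how these tables relate to each other, the following
query (a sketch written for the Athena setup described in the previous page)
counts merge revisions, i.e. revisions with two or more parents in
``revision_history``:

.. code-block:: sql

    SELECT COUNT(*) AS merge_revisions
    FROM (
        SELECT id
        FROM revision_history
        GROUP BY id
        HAVING COUNT(*) >= 2
    ) AS merges;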