diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
index 188334e..158e5a8 100644
--- a/docs/graph/athena.rst
+++ b/docs/graph/athena.rst
@@ -1,125 +1,110 @@
 Setup on Amazon Athena
 ======================

 The Software Heritage Graph Dataset is available as a public dataset in
 `Amazon Athena <https://aws.amazon.com/athena/>`_. Athena uses
 `Presto <https://prestodb.io/>`_, a distributed SQL query engine, to
 automatically scale queries on large datasets.

 The pricing of Athena depends on the amount of data scanned by each query,
 generally at a cost of $5 per TiB of data scanned: a query that scans, say,
 200 GiB of the dataset therefore costs about $1. Full pricing details are
 available `here <https://aws.amazon.com/athena/pricing/>`_.

 Note that because the Software Heritage Graph Dataset is available as a public
 dataset, you **do not have to pay for the storage, only for the queries**
 (except for the data you store on S3 yourself, like query results).

 Loading the tables
 ------------------

 .. highlight:: bash

 AWS account
 ~~~~~~~~~~~

 In order to use Amazon Athena, you will first need to `create an AWS account
 and set up billing
 <https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/>`_.

 You will also need to create an **output S3 bucket**: this is the place where
 Athena will store your query results, so that you can retrieve and analyze
 them afterwards. To do that, go to the `S3 console
 <https://s3.console.aws.amazon.com/s3/>`_ and create a new bucket.

 Setup
 ~~~~~

 Athena needs to be made aware of the location and the schema of the Parquet
 files available as a public dataset. Unfortunately, since Athena does not
 support queries that contain multiple commands, it is not as simple as pasting
-an installation script in the console. Instead, we provide a Python script that
-can be run locally on your machine, that will communicate with Athena to create
+an installation script in the console. Instead, you can use the ``swh dataset
+athena`` command on your local machine, which will query Athena to create
 the tables automatically with the appropriate schema.

-To run this script, you will need to install a few dependencies on your
-machine:
+First, install the ``swh.dataset`` Python module from PyPI::

-- For **Ubuntu** and **Debian**::
-
-      sudo apt install python3 python3-boto3 awscli
-
-- For **Archlinux**::
-
-      sudo pacman -S --needed python python-boto3 aws-cli
+    pip install swh.dataset

 Once the dependencies are installed, run::

-   aws configure
+    aws configure

 This will ask for an AWS Access Key ID and an AWS Secret Access Key in
-order to give Python access to your AWS account. These keys can be generated at
-`this address
+order to give the Boto3 library access to your AWS account. These keys can be
+generated at `this address
 <https://console.aws.amazon.com/iam/home#/security_credentials>`_.
 It will also ask for the region in which you want to run the queries. We
 recommend using ``us-east-1``, since that's where the public dataset is
 located.
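+
+As a quick sanity check, you can then ask the AWS CLI to print the identity it
+is now authenticated as (this assumes nothing beyond the credentials entered
+above)::
+
+    aws sts get-caller-identity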

 Creating the tables
 ~~~~~~~~~~~~~~~~~~~

-Download and run the Python script that will create the tables on your account:
-
-.. tabs::
-
-   .. group-tab:: full
-
-      ::
-
-         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
-         python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/'
-
-   .. group-tab:: teaser: popular-4k
-
-      ::
-
-         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
-         python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular4k -l 's3://softwareheritage/teasers/popular-4k'
-
-   .. group-tab:: teaser: popular-3k-python
-
-      ::
-
-         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py
-         python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular3kpython -l 's3://softwareheritage/teasers/popular-3k-python'
+The ``swh dataset athena create`` command can be used to create the tables on
+your Athena instance. For example, to create the tables of the 2021-03-23
+graph::
+
+    swh dataset athena create \
+        --database-name swh_graph_2021_03_23 \
+        --location-prefix s3://softwareheritage/graph/2021-03-23/orc \
+        --output-location s3://YOUR_OUTPUT_BUCKET/

 To check that the tables have been successfully created in your account, you
 can open your `Amazon Athena console
 <https://console.aws.amazon.com/athena/home>`_. You should be able to select
 the database corresponding to your dataset, and see the tables:

 .. image:: _images/athena_tables.png

 Running queries
 ---------------

-.. highlight:: sql
-
 From the console, once you have selected the database of your dataset, you can
 run SQL queries directly from the Query Editor.

 Try for instance this query that computes the most frequent file names in the
-archive::
+archive:
+
+.. code-block:: sql

-   SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
-   FROM directory_entry_file
-   GROUP BY name
-   ORDER BY cnt DESC
-   LIMIT 10;
+    SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
+    FROM directory_entry
+    GROUP BY name
+    ORDER BY cnt DESC
+    LIMIT 10;

 Other examples are available in the preprint of our article: `The Software
 Heritage Graph Dataset: Public software development under one roof.
 <https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf>`_
+
+It is also possible to query Athena directly from the command line, using the
+``swh dataset athena query`` command::
+
+    echo "select message from revision limit 10;" |
+    swh dataset athena query \
+        --database-name swh_graph_2021_03_23 \
+        --output-location s3://YOUR_OUTPUT_BUCKET/
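+
+For example, this lists the tables created earlier without opening the web
+console (a sketch reusing the exact flags shown above)::
+
+    echo "show tables;" |
+    swh dataset athena query \
+        --database-name swh_graph_2021_03_23 \
+        --output-location s3://YOUR_OUTPUT_BUCKET/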
diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst
index 3b3ede2..8905106 100644
--- a/docs/graph/schema.rst
+++ b/docs/graph/schema.rst
@@ -1,134 +1,136 @@
 Relational schema
 =================

 The Merkle DAG of the Software Heritage archive is encoded in the dataset as a
 set of relational tables.
+
+This page documents the relational schema of the **latest version** of the
+graph dataset.
+
 A simplified view of the corresponding database schema is shown here:

 .. image:: _images/dataset-schema.svg

-This page documents the details of the schema.
-
 **Note**: To limit abuse, some columns containing personal information are
 pseudonymized in the dataset using a hash algorithm. Individual authors may be
 retrieved by querying the Software Heritage API.
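+
+For instance, the ``author`` column of the **revision** table below contains
+such hashes, so aggregations remain possible without exposing identities. A
+sketch (assuming the tables were loaded in Athena as described in the Athena
+setup page) that ranks the most prolific anonymized authors:
+
+.. code-block:: sql
+
+    SELECT author, COUNT(*) AS revisions
+    FROM revision
+    GROUP BY author
+    ORDER BY revisions DESC
+    LIMIT 10;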

 - **content**: contains information on the contents stored in the archive.

   - ``sha1`` (string): the SHA-1 of the content (hexadecimal)
   - ``sha1_git`` (string): the Git SHA-1 of the content (hexadecimal)
   - ``sha256`` (string): the SHA-256 of the content (hexadecimal)
   - ``blake2s256`` (string): the BLAKE2s-256 of the content (hexadecimal)
   - ``length`` (integer): the length of the content
   - ``status`` (string): the visibility status of the content

 - **skipped_content**: contains information on the contents that were not
   archived for various reasons.

   - ``sha1`` (string): the SHA-1 of the skipped content (hexadecimal)
   - ``sha1_git`` (string): the Git SHA-1 of the skipped content (hexadecimal)
   - ``sha256`` (string): the SHA-256 of the skipped content (hexadecimal)
   - ``blake2s256`` (string): the BLAKE2s-256 of the skipped content
     (hexadecimal)
   - ``length`` (integer): the length of the skipped content
   - ``status`` (string): the visibility status of the skipped content
   - ``reason`` (string): the reason why the content was skipped

 - **directory**: contains the directories stored in the archive.

   - ``id`` (string): the intrinsic hash of the directory (hexadecimal),
     recursively computed with the Git SHA-1 algorithm

 - **directory_entry**: contains the entries in directories.

   - ``directory_id`` (string): the Git SHA-1 of the directory containing the
     entry (hexadecimal)
   - ``name`` (bytes): the name of the entry (basename of its path)
   - ``type`` (string): the type of object the entry points to (either
     ``revision``, ``directory`` or ``content``)
   - ``target`` (string): the Git SHA-1 of the object this entry points to
     (hexadecimal)
   - ``perms`` (integer): the permissions of the object

 - **revision**: contains the revisions stored in the archive.

   - ``id`` (string): the intrinsic hash of the revision (hexadecimal),
     recursively computed with the Git SHA-1 algorithm. For Git repositories,
     this corresponds to the commit hash.
   - ``message`` (bytes): the revision message
   - ``author`` (string): an anonymized hash of the author of the revision
   - ``date`` (timestamp): the date the revision was authored
   - ``date_offset`` (integer): the offset of the timezone of ``date``
   - ``committer`` (string): an anonymized hash of the committer of the
     revision
   - ``committer_date`` (timestamp): the date the revision was committed
   - ``committer_date_offset`` (integer): the offset of the timezone of
     ``committer_date``
   - ``directory`` (string): the Git SHA-1 of the directory the revision points
     to (hexadecimal). Every revision points to the root directory of the
     project source tree to which it corresponds.

 - **revision_history**: contains the ordered set of parents of each revision.
   Each revision has an ordered set of parents (0 for the initial commit of a
   repository, 1 for a regular commit, 2 for a regular merge commit and 3 or
   more for octopus-style merge commits).

   - ``id`` (string): the Git SHA-1 identifier of the revision (hexadecimal)
   - ``parent_id`` (string): the Git SHA-1 identifier of the parent
     (hexadecimal)
   - ``parent_rank`` (integer): the rank of the parent, which defines the
     ordering between the parents of the revision

 - **release**: contains the releases stored in the archive.

   - ``id`` (string): the intrinsic hash of the release (hexadecimal),
     recursively computed with the Git SHA-1 algorithm
   - ``target`` (string): the Git SHA-1 of the object the release points to
     (hexadecimal)
   - ``date`` (timestamp): the date the release was created
   - ``author`` (string): an anonymized hash of the author of the release
   - ``name`` (bytes): the release name
   - ``message`` (bytes): the release message

 - **snapshot**: contains the list of snapshots stored in the archive.

   - ``id`` (string): the intrinsic hash of the snapshot (hexadecimal),
     recursively computed with the Git SHA-1 algorithm

 - **snapshot_branch**: contains the list of branches associated with each
   snapshot.

   - ``snapshot_id`` (string): the intrinsic hash of the snapshot (hexadecimal)
   - ``name`` (bytes): the name of the branch
   - ``target`` (string): the intrinsic hash of the object the branch points to
     (hexadecimal)
   - ``target_type`` (string): the type of object the branch points to (either
     ``release``, ``revision``, ``directory`` or ``content``)

 - **origin**: the software origins from which the projects in the dataset were
   archived.

   - ``url`` (bytes): the URL of the origin

 - **origin_visit**: the different visits of each origin. Since Software
   Heritage archives software continuously, software origins are crawled more
   than once. Each of these "visits" is an entry in this table.

   - ``origin`` (string): the URL of the origin visited
   - ``visit`` (integer): an integer identifier of the visit
   - ``date`` (timestamp): the date at which the origin was visited
   - ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``,
     ``hg``, ``svn``, ``ftp``, ``deb``, ...)

 - **origin_visit_status**: the status of each visit.

   - ``origin`` (string): the URL of the origin visited
   - ``visit`` (integer): an integer identifier of the visit
   - ``date`` (timestamp): the date at which the origin was visited
   - ``type`` (string): the type of origin visited (e.g. ``git``, ``pypi``,
     ``hg``, ``svn``, ``ftp``, ``deb``, ...)
   - ``snapshot_id`` (string): the intrinsic hash of the snapshot archived in
     this visit (hexadecimal)
   - ``status`` (string): the status of the visit, either ``partial`` for
     partial visits or ``full`` for full visits
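+
+As an illustration of how these tables fit together, here is a sketch of an
+Athena query (again assuming the tables created in the Athena setup page)
+that computes how many revisions have one parent, two parents (merge
+commits), and so on; initial commits have no parents and thus no rows in
+``revision_history``:
+
+.. code-block:: sql
+
+    SELECT parents, COUNT(*) AS revisions
+    FROM (
+        SELECT id, COUNT(parent_id) AS parents
+        FROM revision_history
+        GROUP BY id
+    ) AS parent_counts
+    GROUP BY parents
+    ORDER BY parents;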
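+
+Similarly, this sketch combines ``origin_visit_status`` and
+``snapshot_branch`` to list the branches of the most recent full snapshot of
+a given origin (the origin URL below is an arbitrary example):
+
+.. code-block:: sql
+
+    SELECT from_utf8(name, '?') AS name, target_type
+    FROM snapshot_branch
+    WHERE snapshot_id = (
+        SELECT snapshot_id
+        FROM origin_visit_status
+        WHERE origin = 'https://github.com/torvalds/linux'
+          AND status = 'full'
+        ORDER BY date DESC
+        LIMIT 1
+    );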