diff --git a/docs/graph/_images/athena_tables.png b/docs/graph/_images/athena_tables.png
new file mode 100644
index 0000000..94f67de
Binary files /dev/null and b/docs/graph/_images/athena_tables.png differ
diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst
new file mode 100644
index 0000000..15e20f2
--- /dev/null
+++ b/docs/graph/athena.rst
@@ -0,0 +1,115 @@
+Setup on Amazon Athena
+======================
+
+The Software Heritage Graph Dataset is available as a public dataset in `Amazon
+Athena `_. Athena uses `Presto
+`_, a distributed SQL query engine, to
+automatically scale queries on large datasets.
+
+The pricing of Athena depends on the amount of data scanned by each query,
+generally at a cost of $5 per TiB of data scanned. Full pricing details are
+available `here `_.
+
+Note that because the Software Heritage Graph Dataset is available as a public
+dataset, you **do not have to pay for the storage, only for the queries**
+(except for the data you store on S3 yourself, like query results).
+
+
+Loading the tables
+------------------
+
+.. highlight:: bash
+
+AWS account
+~~~~~~~~~~~
+
+In order to use Amazon Athena, you will first need to `create an AWS account
+and set up billing
+`_.
+
+
+Setup
+~~~~~
+
+Athena needs to be made aware of the location and the schema of the Parquet
+files available as a public dataset. Unfortunately, since Athena does not
+support queries that contain multiple commands, it is not as simple as pasting
+an installation script in the console. Instead, we provide a Python script that
+can be run locally on your machine and that will communicate with Athena to
+create the tables automatically with the appropriate schema.
+
+To run this script, you will need to install a few dependencies on your
+machine:
+
+- For **Ubuntu** and **Debian**::
+
+    sudo apt install python3 python3-boto3 awscli
+
+- For **Archlinux**::
+
+    sudo pacman -S --needed python python-boto3 aws-cli
+
+Once the dependencies are installed, run::
+
+    aws configure
+
+This will ask for an AWS Access Key ID and an AWS Secret Access Key in
+order to give Python access to your AWS account. These keys can be generated at
+`this address
+`_.
+
+It will also ask for the region in which you want to run the queries. We
+recommend using ``us-east-1``, since that's where the public dataset is
+located.
+
+Creating the tables
+~~~~~~~~~~~~~~~~~~~
+
+Download and run the Python script that will create the tables on your account:
+
+.. tabs::
+
+   .. group-tab:: full
+
+      ::
+
+         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/tables.py
+         wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/gen_schema.py
+         ./gen_schema.py
+
+   .. group-tab:: teaser: popular-4k
+
+      This teaser is not available on Athena yet.
+
+   .. group-tab:: teaser: popular-3k-python
+
+      This teaser is not available on Athena yet.
+
+To check that the tables have been successfully created in your account, you
+can open your `Amazon Athena console
+`_. You should be able to select
+the database corresponding to your dataset, and see the tables:
+
+.. image:: _images/athena_tables.png
+
+
+Running queries
+---------------
+
+.. highlight:: sql
+
+From the console, once you have selected the database of your dataset, you can
+run SQL queries directly from the Query Editor.
+
+Try for instance this query that computes the most frequent file names in the
+archive::
+
+    SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
+    FROM directory_entry_file
+    GROUP BY name
+    ORDER BY cnt DESC
+    LIMIT 10;
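+
+If you prefer to submit queries programmatically rather than through the
+console, you can reuse the ``boto3`` library installed during setup. The
+following is only a minimal sketch, not part of the official tooling: the
+database name (``swh``) and the S3 output location are placeholders that you
+must adapt to the database created by ``gen_schema.py`` and to a bucket you
+own.
+
+.. code-block:: python
+
+   import time
+
+   import boto3
+
+   athena = boto3.client("athena", region_name="us-east-1")
+
+   # Placeholders: adapt to your own database name and result bucket.
+   DATABASE = "swh"
+   OUTPUT_LOCATION = "s3://your-bucket/athena-results/"
+
+   QUERY = """
+   SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
+   FROM directory_entry_file
+   GROUP BY name
+   ORDER BY cnt DESC
+   LIMIT 10
+   """
+
+   # Submit the query and keep its execution id.
+   qid = athena.start_query_execution(
+       QueryString=QUERY,
+       QueryExecutionContext={"Database": DATABASE},
+       ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
+   )["QueryExecutionId"]
+
+   # Poll until the query reaches a terminal state.
+   while True:
+       status = athena.get_query_execution(QueryExecutionId=qid)
+       state = status["QueryExecution"]["Status"]["State"]
+       if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
+           break
+       time.sleep(1)
+
+   # Print the rows; the first row holds the column headers.
+   results = athena.get_query_results(QueryExecutionId=qid)
+   for row in results["ResultSet"]["Rows"]:
+       print([col.get("VarCharValue") for col in row["Data"]])
+
+Athena also writes the result of each query as a CSV file under the output
+location you configured, so you can retrieve it from S3 later.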
+
+Other examples are available in the preprint of our article: `The Software
+Heritage Graph Dataset: Public software development under one roof.
+`_
diff --git a/docs/graph/datasets.rst b/docs/graph/datasets.rst
new file mode 100644
index 0000000..95cb56f
--- /dev/null
+++ b/docs/graph/datasets.rst
@@ -0,0 +1,83 @@
+Dataset
+=======
+
+We provide the full graph dataset along with two "teaser" datasets that can be
+used for trying out smaller-scale experiments before using the full graph.
+
+All the main URLs are relative to our dataset prefix:
+`https://annex.softwareheritage.org/public/dataset/ `__.
+
+The Software Heritage Graph Dataset contains a table representation of the full
+Software Heritage Graph. It is available in the following formats:
+
+- **PostgreSQL (compressed)**:
+
+  - **URL**: `/graph/latest/sql/
+    `_
+  - **Total size**: 1.2 TiB
+
+- **Apache Parquet**:
+
+  - **URL**: `/graph/latest/parquet/
+    `_
+  - **Total size**: 1.2 TiB
+
+Teaser datasets
+---------------
+
+popular-4k
+~~~~~~~~~~
+
+The ``popular-4k`` teaser contains a subset of 4000 popular repositories from
+GitHub, GitLab, PyPI and Debian. The selection criteria used to pick the
+software origins were the following:
+
+- the 1000 most popular GitHub projects (by number of stars),
+- the 1000 most popular GitLab projects (by number of stars),
+- the 1000 most popular PyPI projects (by usage statistics, according to the
+  `Top PyPI Packages `_ database),
+- the 1000 most popular Debian packages (by "votes" according to the `Debian
+  Popularity Contest `_ database).
+
+This teaser is available in the following formats:
+
+- **PostgreSQL (compressed)**:
+
+  - **URL**: `/graph/latest/popular-4k/sql/
+    `_
+  - **Total size**: TODO
+
+- **Apache Parquet**:
+
+  - **URL**: `/graph/latest/popular-4k/parquet/
+    `_
+  - **Total size**: TODO
+
+popular-3k-python
+~~~~~~~~~~~~~~~~~
+
+The ``popular-3k-python`` teaser contains a subset of 3052 popular
+repositories **tagged as being written in the Python language**, from GitHub,
+GitLab, PyPI and Debian. The selection criteria used to pick the software
+origins were the following, similar to ``popular-4k``:
+
+- the 1000 most popular GitHub projects written in Python (by number of stars),
+- the 131 GitLab projects written in Python that have 2 stars or more,
+- the 1000 most popular PyPI projects (by usage statistics, according to the
+  `Top PyPI Packages `_ database),
+- the 1000 most popular Debian packages with the
+  `debtag `_ ``implemented-in::python`` (by
+  "votes" according to the `Debian Popularity Contest
+  `_ database).
+
+This teaser is available in the following formats:
+
+- **PostgreSQL (compressed)**:
+
+  - **URL**: `/graph/latest/popular-3k-python/sql/
+    `_
+  - **Total size**: TODO
+
+- **Apache Parquet**:
+
+  - **URL**: `/graph/latest/popular-3k-python/parquet/
+    `_
+  - **Total size**: TODO
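+
+Once you have downloaded one of the Parquet datasets listed above, you can have
+a first look at a table locally before setting up a full query engine. The
+snippet below is only an illustrative sketch: it assumes that ``pyarrow`` (and
+``pandas`` for the last line) is installed and that the Parquet files of one
+table (here ``directory_entry_file``) sit together in a local directory; adjust
+the path to wherever you unpacked the dataset.
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+
+   # Path to the directory holding the Parquet files of a single table
+   # (an assumption about your local layout, not a fixed dataset path).
+   table = pq.read_table("parquet/directory_entry_file")
+
+   print(table.num_rows)   # number of rows in the table
+   print(table.schema)     # column names and types
+
+   # Convert a small slice to pandas for interactive exploration.
+   print(table.slice(0, 10).to_pandas())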
diff --git a/docs/graph/index.rst b/docs/graph/index.rst
new file mode 100644
index 0000000..c44e806
--- /dev/null
+++ b/docs/graph/index.rst
@@ -0,0 +1,53 @@
+.. _swh-graph-dataset:
+
+Software Heritage Graph Dataset
+===============================
+
+This is the Software Heritage graph dataset: a fully-deduplicated Merkle
+DAG representation of the Software Heritage archive. The dataset links
+together file content identifiers, source code directories, Version
+Control System (VCS) commits tracking evolution over time, up to the
+full states of VCS repositories as observed by Software Heritage during
+periodic crawls. The dataset's contents come from major development
+forges (including `GitHub `__ and
+`GitLab `__), FOSS distributions (e.g.,
+`Debian `__), and language-specific package managers (e.g.,
+`PyPI `__). Crawling information is also included,
+providing timestamps about when and where all archived source code
+artifacts have been observed in the wild.
+
+The Software Heritage graph dataset is available in multiple formats,
+including downloadable CSV dumps and Apache Parquet files for local use,
+as well as a public instance on the Amazon Athena interactive query service
+for ready-to-use, powerful analytical processing.
+
+By accessing the dataset, you agree to the Software Heritage `Ethical
+Charter for using the archive
+data `__,
+and the `terms of use for bulk
+access `__.
+
+
+If you use this dataset for research purposes, please cite the following paper:
+
+*
+  | Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
+  | *The Software Heritage Graph Dataset: Public software development under one roof.*
+  | In Proceedings of `MSR 2019 `_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 `_.
+  | `preprint `_, `bibtex `_
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   datasets
+   postgresql
+   athena
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/docs/graph/postgresql.rst b/docs/graph/postgresql.rst
new file mode 100644
index 0000000..02ab33f
--- /dev/null
+++ b/docs/graph/postgresql.rst
@@ -0,0 +1,98 @@
+Setup on a PostgreSQL instance
+==============================
+
+This tutorial will guide you through the steps required to set up the Software
+Heritage Graph Dataset in a PostgreSQL database.
+
+.. highlight:: bash
+
+PostgreSQL local setup
+----------------------
+
+You need to have access to a running PostgreSQL instance to load the dataset.
+This section contains information on how to set up PostgreSQL for the first
+time.
+
+*If you already have a PostgreSQL server running on your machine, you can skip
+to the next section.*
+
+- For **Ubuntu** and **Debian**::
+
+    sudo apt install postgresql
+
+- For **Archlinux**::
+
+    sudo pacman -S --needed postgresql
+    sudo -u postgres initdb -D '/var/lib/postgres/data'
+    sudo systemctl enable --now postgresql
+
+Once PostgreSQL is running, you also need a user that will be able to create
+databases and run queries. The easiest way to achieve that is simply to create
+an account that has the same name as your username and that can create
+databases::
+
+    sudo -u postgres createuser --createdb $USER
+
+
+Retrieving the dataset
+----------------------
+
+You need to download the dataset in SQL format. Use the following commands on
+your machine, after making sure that it has enough available space for the
+dataset you chose:
+
+.. tabs::
+
+   .. group-tab:: full
+
+      ::
+
+         mkdir full && cd full
+         wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/
+
+   .. group-tab:: teaser: popular-4k
+
+      ::
+
+         mkdir popular-4k && cd popular-4k
+         wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/
+
+   .. group-tab:: teaser: popular-3k-python
+
+      ::
+
+         mkdir popular-3k-python && cd popular-3k-python
+         wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/
+
+Loading the dataset
+-------------------
+
+Once you have retrieved the dataset of your choice, create a database that will
+contain it, and load the database:
+
+.. tabs::
+
+   .. group-tab:: full
+
+      ::
+
+         createdb swhgd
+         psql swhgd < swh_import.sql
+
+   .. group-tab:: teaser: popular-4k
+
+      ::
+
+         createdb swhgd-popular-4k
+         psql swhgd-popular-4k < swh_import.sql
+
+   .. group-tab:: teaser: popular-3k-python
+
+      ::
+
+         createdb swhgd-popular-3k-python
+         psql swhgd-popular-3k-python < swh_import.sql
+
+
+You can now run SQL queries on your database. Run ``psql`` with the name of
+your database to start an interactive PostgreSQL console.
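+
+To check that the import went well, you can also query the database from
+Python. The snippet below is only a minimal sketch: it assumes the
+``psycopg2`` driver is installed (it is not among the dependencies listed
+above; on Debian/Ubuntu it is packaged as ``python3-psycopg2``) and that you
+loaded the full dataset into the ``swhgd`` database created above; use your
+teaser database name instead if that is what you imported. It reproduces the
+example query from the Athena section.
+
+.. code-block:: python
+
+   import psycopg2
+
+   # Assumption: the full dataset was loaded into "swhgd" as described above.
+   conn = psycopg2.connect(dbname="swhgd")
+
+   with conn, conn.cursor() as cur:
+       # Most frequent file names in the archive, as in the Athena example.
+       cur.execute("""
+           SELECT name, COUNT(DISTINCT target) AS cnt
+           FROM directory_entry_file
+           GROUP BY name
+           ORDER BY cnt DESC
+           LIMIT 10
+       """)
+       for name, cnt in cur.fetchall():
+           # File names may come back as raw bytes; decode them defensively.
+           if isinstance(name, memoryview):
+               name = bytes(name)
+           if isinstance(name, (bytes, bytearray)):
+               name = name.decode("utf-8", errors="replace")
+           print(f"{cnt:12d}  {name}")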