diff --git a/docs/_images/athena_tables.png b/docs/_images/athena_tables.png
deleted file mode 100644
index 94f67de..0000000
Binary files a/docs/_images/athena_tables.png and /dev/null differ
diff --git a/docs/athena.rst b/docs/athena.rst
deleted file mode 100644
index cf80875..0000000
--- a/docs/athena.rst
+++ /dev/null
@@ -1,115 +0,0 @@
-Setup on Amazon Athena
-======================
-
-The Software Heritage Graph Dataset is available as a public dataset in `Amazon
-Athena `_. Athena uses `presto
-`_, a distributed SQL query engine, to
-automatically scale queries on large datasets.
-
-The pricing of Athena depends on the amount of data scanned by each query,
-generally at a cost of $5 per TiB of data scanned. Full pricing details are
-available `here `_.
-
-Note that because the Software Heritage Graph Dataset is available as a public
-dataset, you **do not have to pay for the storage, only for the queries**
-(except for the data you store on S3 yourself, like query results).
-
-
-Loading the tables
-------------------
-
-.. highlight:: bash
-
-AWS account
-~~~~~~~~~~~
-
-In order to use Amazon Athena, you will first need to `create an AWS account
-and set up billing
-`_.
-
-
-Setup
-~~~~~
-
-Athena needs to be made aware of the location and the schema of the Parquet
-files available as a public dataset. Unfortunately, since Athena does not
-support queries that contain multiple commands, it is not as simple as pasting
-an installation script in the console. Instead, we provide a Python script that
-can be run locally on your machine and that will communicate with Athena to
-create the tables automatically with the appropriate schema.
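For readers curious what such a script does under the hood, here is a minimal sketch: build a ``CREATE EXTERNAL TABLE`` DDL statement for Parquet files on S3, then submit it to Athena with boto3. The table name, columns, and S3 location below are illustrative placeholders, not the real dataset schema; the provided ``tables.py`` and ``gen_schema.py`` scripts remain the authoritative source.

```python
def make_create_table_ddl(name, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data on S3."""
    cols = ",\n  ".join("{} {}".format(col, typ) for col, typ in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {} (\n  {}\n)\n"
        "STORED AS PARQUET\nLOCATION '{}';".format(name, cols, location)
    )


def submit_ddl(ddl, output_location):
    """Submit the DDL to Athena (requires AWS credentials, see `aws configure`)."""
    import boto3  # imported here so the DDL builder stays usable offline
    client = boto3.client("athena")
    return client.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": output_location},
    )


if __name__ == "__main__":
    ddl = make_create_table_ddl(
        "directory_entry_file",                      # table from the dataset
        [("target", "binary"), ("name", "binary")],  # illustrative columns
        "s3://softwareheritage/graph/",              # placeholder S3 location
    )
    print(ddl)
```

The real scripts generate one such statement per table and submit them one at a time, which is exactly the "multiple commands" limitation the console cannot work around.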
-
-To run this script, you will need to install a few dependencies on your
-machine:
-
-- For **Ubuntu** and **Debian**::
-
-    sudo apt install python3 python3-boto3 awscli
-
-- For **Archlinux**::
-
-    sudo pacman -S --needed python python-boto3 aws-cli
-
-Once the dependencies are installed, run::
-
-    aws configure
-
-This will ask for an AWS Access Key ID and an AWS Secret Access Key in
-order to give Python access to your AWS account. These keys can be generated at
-`this address
-`_.
-
-It will also ask for the region in which you want to run the queries. We
-recommend using ``us-east-1``, since that is where the public dataset is
-located.
-
-Creating the tables
-~~~~~~~~~~~~~~~~~~~
-
-Download and run the Python script that will create the tables on your account:
-
-.. tabs::
-
-    .. group-tab:: full
-
-        ::
-
-            wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/tables.py
-            wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/gen_schema.py
-            ./gen_schema.py
-
-    .. group-tab:: popular-4k
-
-        This dataset is not available on Athena yet.
-
-    .. group-tab:: popular-3k-python
-
-        This dataset is not available on Athena yet.
-
-To check that the tables have been successfully created in your account, you
-can open your `Amazon Athena console
-`_. You should be able to select
-the database corresponding to your dataset, and see the tables:
-
-.. image:: _images/athena_tables.png
-
-
-Running queries
----------------
-
-.. highlight:: sql
-
-From the console, once you have selected the database of your dataset, you can
-run SQL queries directly from the Query Editor.
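Before moving on to heavier analytical queries, a cheap sanity-check query can confirm that the tables respond. A minimal sketch, assuming the ``revision`` table of the dataset (on Parquet, a plain ``COUNT(*)`` reads mostly column metadata, so it scans little data):

```sql
-- Count the number of revisions (commits) in the archive.
SELECT COUNT(*) AS nb_revisions
FROM revision;
```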
-
-Try for instance this query that computes the most frequent file names in the
-archive::
-
-    SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
-    FROM directory_entry_file
-    GROUP BY name
-    ORDER BY cnt DESC
-    LIMIT 10;
-
-Other examples are available in the preprint of our article: `The Software
-Heritage Graph Dataset: Public software development under one roof.
-`_
diff --git a/docs/datasets.rst b/docs/datasets.rst
deleted file mode 100644
index dcd5cd1..0000000
--- a/docs/datasets.rst
+++ /dev/null
@@ -1,87 +0,0 @@
-Dataset
-=======
-
-We provide the full graph dataset along with two "teaser" datasets that can be
-used for trying out smaller-scale experiments before using the full graph.
-
-The main URLs of the datasets are relative to our dataset prefix:
-`https://annex.softwareheritage.org/public/dataset/ `__
-
-
-Main dataset
-------------
-
-The main dataset contains the full Software Heritage Graph. It is available
-in the following formats:
-
-- **PostgreSQL (compressed)**:
-
-  - **URL**: `/graph/latest/sql/
-    `_
-  - **Total size**: 1.2 TiB
-
-- **Apache Parquet**:
-
-  - **URL**: `/graph/latest/parquet/
-    `_
-  - **Total size**: 1.2 TiB
-
-Teaser datasets
----------------
-
-popular-4k
-~~~~~~~~~~
-
-The ``popular-4k`` teaser contains a subset of 4000 popular
-repositories from GitHub, Gitlab, PyPI and Debian. The selection criteria to
-pick the software origins were the following:
-
-- The 1000 most popular GitHub projects (by number of stars)
-- The 1000 most popular Gitlab projects (by number of stars)
-- The 1000 most popular PyPI projects (by usage statistics, according to the
-  `Top PyPI Packages `_ database),
-- The 1000 most popular Debian packages (by "votes" according to the `Debian
-  Popularity Contest `_ database)
-
-This teaser is available in the following formats:
-
-- **PostgreSQL (compressed)**:
-
-  - **URL**: `/graph/latest/popular-4k/sql/
-    `_
-  - **Total size**: TODO
-
-- **Apache Parquet**:
-
-  - **URL**: `/graph/latest/popular-4k/parquet/
-    `_
-  - **Total size**: TODO
-
-popular-3k-python
-~~~~~~~~~~~~~~~~~
-
-The ``popular-3k-python`` teaser contains a subset of 3052 popular
-repositories **tagged as being written in the Python language**, from GitHub,
-Gitlab, PyPI and Debian. The selection criteria to pick the software origins
-were the following, similar to ``popular-4k``:
-
-- the 1000 most popular GitHub projects written in Python (by number of stars),
-- the 131 Gitlab projects written in Python that have 2 stars or more,
-- the 1000 most popular PyPI projects (by usage statistics, according to the
-  `Top PyPI Packages `_ database),
-- the 1000 most popular Debian packages with the
-  `debtag `_ ``implemented-in::python`` (by
-  "votes" according to the `Debian Popularity Contest
-  `_ database).
-
-This teaser is available in the following formats:
-
-- **PostgreSQL (compressed)**:
-
-  - **URL**: `/graph/latest/popular-3k-python/sql/
-    `_
-  - **Total size**: TODO
-
-- **Apache Parquet**:
-
-  - **URL**: `/graph/latest/popular-3k-python/parquet/
-    `_
-  - **Total size**: TODO
diff --git a/docs/index.rst b/docs/index.rst
index 0e99243..f251325 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,53 +1,11 @@
 .. _swh-dataset:
-Software Heritage Graph Dataset
-===============================
+Software Heritage Datasets
+==========================
-
-This is the Software Heritage graph dataset: a fully-deduplicated Merkle
-DAG representation of the Software Heritage archive. The dataset links
-together file content identifiers, source code directories, Version
-Control System (VCS) commits tracking evolution over time, up to the
-full states of VCS repositories as observed by Software Heritage during
-periodic crawls. The dataset’s contents come from major development
-forges (including `GitHub `__ and
-`GitLab `__), FOSS distributions (e.g.,
-`Debian `__), and language-specific package managers (e.g.,
-`PyPI `__). Crawling information is also included,
-providing timestamps about when and where all archived source code
-artifacts have been observed in the wild.
+This page lists the different public datasets and periodic data dumps of the
+archive published by Software Heritage.
-
-The Software Heritage graph dataset is available in multiple formats,
-including downloadable CSV dumps and Apache Parquet files for local use,
-as well as a public instance on Amazon Athena interactive query service
-for ready-to-use powerful analytical processing.
-
-By accessing the dataset, you agree with the Software Heritage `Ethical
-Charter for using the archive
-data `__,
-and the `terms of use for bulk
-access `__.
-
-
-If you use this dataset for research purposes, please cite the following paper:
-
-*
-  | Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
-  | *The Software Heritage Graph Dataset: Public software development under one roof.*
-  | In proceedings of `MSR 2019 `_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 `_.
-  | `preprint `_, `bibtex `_
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   datasets
-   postgresql
-   athena
-
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
+:ref:`The Software Heritage Graph Dataset `
+  the entire graph of Software Heritage in a fully-deduplicated Merkle DAG
+  representation.
diff --git a/docs/postgresql.rst b/docs/postgresql.rst
deleted file mode 100644
index b3e7556..0000000
--- a/docs/postgresql.rst
+++ /dev/null
@@ -1,98 +0,0 @@
-Setup on a PostgreSQL instance
-==============================
-
-This tutorial will guide you through the steps required to set up the Software
-Heritage Graph Dataset in a PostgreSQL database.
-
-.. highlight:: bash
-
-PostgreSQL local setup
-----------------------
-
-You need to have access to a running PostgreSQL instance to load the dataset.
-This section contains information on how to set up PostgreSQL for the first
-time.
-
-*If you already have a PostgreSQL server running on your machine, you can skip
-to the next section.*
-
-- For **Ubuntu** and **Debian**::
-
-    sudo apt install postgresql
-
-- For **Archlinux**::
-
-    sudo pacman -S --needed postgresql
-    sudo -u postgres initdb -D '/var/lib/postgres/data'
-    sudo systemctl enable --now postgresql
-
-Once PostgreSQL is running, you also need a user that will be able to create
-databases and run queries. The easiest way to achieve that is simply to create
-an account that has the same name as your username and that can create
-databases::
-
-    sudo -u postgres createuser --createdb $USER
-
-
-Retrieving the dataset
-----------------------
-
-You need to download the dataset in SQL format. Use the following command on
-your machine, after making sure that it has enough available space for the
-dataset you chose:
-
-.. tabs::
-
-    .. group-tab:: full
-
-        ::
-
-            mkdir full && cd full
-            wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/
-
-    .. group-tab:: popular-4k
-
-        ::
-
-            mkdir popular-4k && cd popular-4k
-            wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/
-
-    .. group-tab:: popular-3k-python
-
-        ::
-
-            mkdir popular-3k-python && cd popular-3k-python
-            wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/
-
-Loading the dataset
--------------------
-
-Once you have retrieved the dataset of your choice, create a database that will
-contain it, and load the database:
-
-.. tabs::
-
-    .. group-tab:: full
-
-        ::
-
-            createdb swhgd
-            psql swhgd < swh_import.sql
-
-    .. group-tab:: popular-4k
-
-        ::
-
-            createdb swhgd-popular-4k
-            psql swhgd-popular-4k < swh_import.sql
-
-    .. group-tab:: popular-3k-python
-
-        ::
-
-            createdb swhgd-popular-3k-python
-            psql swhgd-popular-3k-python < swh_import.sql
-
-
-You can now run SQL queries on your database. Run ``psql `` to
-start an interactive PostgreSQL console.
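Once connected, you can run the same kind of analytical queries as on Athena. As a hedged example, assuming the ``revision`` table with a ``date`` timestamp column (as in the Software Heritage schema), this counts archived revisions per year:

```sql
-- Number of archived revisions (commits) per year.
SELECT EXTRACT(YEAR FROM date) AS year,
       COUNT(*) AS nb_revisions
FROM revision
GROUP BY year
ORDER BY year;
```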