diff --git a/docs/datasets.rst b/docs/datasets.rst
index 47d78c4..dcd5cd1 100644
--- a/docs/datasets.rst
+++ b/docs/datasets.rst
@@ -1,83 +1,87 @@
-Datasets
-========
+Dataset
+=======
-We provide the full graph dataset along with two, smaller datasets that can be
-used for smaller-scale experiments.
+We provide the full graph dataset along with two "teaser" datasets that can be
+used for trying out smaller-scale experiments before using the full graph.
The main URLs of the datasets are relative to our dataset prefix:
`https://annex.softwareheritage.org/public/dataset/ `__
-full
-----
-The ``full`` dataset contains the full Software Heritage Graph. It is available
+Main dataset
+------------
+
+The main dataset contains the full Software Heritage Graph. It is available
in the following formats:
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/sql/
`_
- **Total size**: 1.2 TiB
- **Apache Parquet**:
- **URL**: `/graph/latest/parquet/
`_
- **Total size**: 1.2 TiB
+Teaser datasets
+---------------
+
popular-4k
-----------
+~~~~~~~~~~
-The ``popular-4k`` dataset contains a subset of 4000 popular
+The ``popular-4k`` teaser contains a subset of 4000 popular
repositories from GitHub, GitLab, PyPI and Debian. The selection criteria used
to pick the software origins were the following:
- The 1000 most popular GitHub projects (by number of stars)
- The 1000 most popular GitLab projects (by number of stars)
- The 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages `_ database),
- The 1000 most popular Debian packages (by "votes" according to the `Debian
Popularity Contest `_ database)
-This dataset is available in the following formats:
+This teaser is available in the following formats:
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/popular-4k/sql/
`_
- **Total size**: TODO
- **Apache Parquet**:
- **URL**: `/graph/latest/popular-4k/parquet/
`_
- **Total size**: TODO
popular-3k-python
------------------
+~~~~~~~~~~~~~~~~~
-The ``popular-3k-python`` dataset contains a subset of 3052 popular
+The ``popular-3k-python`` teaser contains a subset of 3052 popular
repositories **tagged as being written in the Python language**, from GitHub,
GitLab, PyPI and Debian. The selection criteria used to pick the software
origins were the following, similar to ``popular-4k``:
- the 1000 most popular GitHub projects written in Python (by number of stars),
- the 131 GitLab projects written in Python that have 2 stars or more,
- the 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages `_ database),
- the 1000 most popular Debian packages with the
`debtag `_ ``implemented-in::python`` (by
"votes" according to the `Debian Popularity Contest
`_ database).
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/popular-3k-python/sql/
`_
- **Total size**: TODO
- **Apache Parquet**:
- **URL**: `/graph/latest/popular-3k-python/parquet/
`_
- **Total size**: TODO
diff --git a/docs/postgresql.rst b/docs/postgresql.rst
index 143bec8..b3e7556 100644
--- a/docs/postgresql.rst
+++ b/docs/postgresql.rst
@@ -1,98 +1,98 @@
Setup on a PostgreSQL instance
==============================
This tutorial will guide you through the steps required to set up the Software
Heritage Graph Dataset in a PostgreSQL database.
.. highlight:: bash
PostgreSQL local setup
----------------------
You need to have access to a running PostgreSQL instance to load the dataset.
This section contains information on how to set up PostgreSQL for the first
time.
*If you already have a PostgreSQL server running on your machine, you can skip
to the next section.*
- For **Ubuntu** and **Debian**::
- sudo apt install postgresql
+ sudo apt install postgresql
- For **Archlinux**::
- sudo pacman -S --needed postgresql
- sudo -u postgres initdb -D '/var/lib/postgres/data'
- sudo systemctl enable --now postgresql
+ sudo pacman -S --needed postgresql
+ sudo -u postgres initdb -D '/var/lib/postgres/data'
+ sudo systemctl enable --now postgresql
Once PostgreSQL is running, you also need a user that can create databases and
run queries. The easiest way to achieve that is to create a PostgreSQL account
that has the same name as your system username and that can create
databases::
sudo -u postgres createuser --createdb $USER
Retrieving the dataset
----------------------
You need to download the dataset in SQL format. Use the following command on
your machine, after making sure that it has enough available space for the
dataset you chose:
.. tabs::
.. group-tab:: full
::
mkdir full && cd full
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/
.. group-tab:: popular-4k
::
mkdir popular-4k && cd popular-4k
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/
.. group-tab:: popular-3k-python
::
mkdir popular-3k-python && cd popular-3k-python
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/
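Before starting the download, it can help to check the free space in the
target directory. The following is a minimal sketch, not part of the official
instructions. The ``required_gib`` threshold is an assumption derived from the
1.2 TiB listed for the compressed full dataset; the teaser sizes are not yet
documented, and loading into PostgreSQL will likely need additional space.

```shell
# Assumed threshold in GiB (1.2 TiB is roughly 1229 GiB); adjust for the
# dataset you chose. This value is an illustration, not an official figure.
required_gib=1229

# df -P prints one POSIX-format line per filesystem; column 4 is the
# available space, reported in KiB because of -k.
available_kib=$(df -Pk . | awk 'NR==2 {print $4}')
available_gib=$((available_kib / 1024 / 1024))

if [ "$available_gib" -lt "$required_gib" ]; then
    echo "Not enough space: ${available_gib} GiB available, ${required_gib} GiB needed" >&2
else
    echo "OK: ${available_gib} GiB available"
fi
```

Run the check from the directory that will hold the download, since ``df .``
reports on the filesystem containing the current directory.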
Loading the dataset
-------------------
Once you have retrieved the dataset of your choice, create a database that will
contain it, and load the database:
.. tabs::
.. group-tab:: full
::
- createdb softwareheritage-full
- psql softwareheritage-full < swh_import.sql
+ createdb swhgd
+ psql swhgd < swh_import.sql
.. group-tab:: popular-4k
::
- createdb softwareheritage-popular-4k
- psql softwareheritage-popular-4k < swh_import.sql
+ createdb swhgd-popular-4k
+ psql swhgd-popular-4k < swh_import.sql
.. group-tab:: popular-3k-python
::
- createdb softwareheritage-popular-3k-python
- psql softwareheritage-popular-3k-python < swh_import.sql
+ createdb swhgd-popular-3k-python
+ psql swhgd-popular-3k-python < swh_import.sql
You can now run SQL queries on your database. Run ``psql`` with the name of
the database you created (for example ``psql swhgd``) to start an interactive
PostgreSQL console.
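As a quick sanity check after the import, you can list the tables that were
created without opening an interactive session. This is a sketch that assumes
the database is named ``swhgd`` as in the ``full`` tab above; substitute the
name you used. ``\dt`` is the standard ``psql`` meta-command for listing
tables, and ``-w`` prevents ``psql`` from prompting for a password.

```shell
# Database name is an assumption taken from the "full" tab; change as needed.
db_name="swhgd"

if command -v psql >/dev/null 2>&1; then
    # -c runs a single command and exits; \dt lists the tables in the database.
    psql -w "$db_name" -c '\dt' || echo "Could not query database ${db_name}" >&2
else
    echo "psql not found in PATH" >&2
fi
```

An empty table list suggests that the import did not complete.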