diff --git a/docs/datasets.rst b/docs/datasets.rst
index 47d78c4..dcd5cd1 100644
--- a/docs/datasets.rst
+++ b/docs/datasets.rst
@@ -1,83 +1,87 @@
-Datasets
-========
+Dataset
+=======

-We provide the full graph dataset along with two, smaller datasets that can be
-used for smaller-scale experiments.
+We provide the full graph dataset along with two "teaser" datasets that can be
+used for trying out smaller-scale experiments before using the full graph.

 The main URLs of the datasets are relative to our dataset prefix:
 `https://annex.softwareheritage.org/public/dataset/ `__

-full
-----
-The ``full`` dataset contains the full Software Heritage Graph. It is available
+Main dataset
+------------
+
+The main dataset contains the full Software Heritage Graph. It is available
 in the following formats:

 - **PostgreSQL (compressed)**:

   - **URL**: `/graph/latest/sql/ `_
   - **Total size**: 1.2 TiB

 - **Apache Parquet**:

   - **URL**: `/graph/latest/parquet/ `_
   - **Total size**: 1.2 TiB

+Teaser datasets
+---------------
+
 popular-4k
-----------
+~~~~~~~~~~

-The ``popular-4k`` dataset contains a subset of 4000 popular
+The ``popular-4k`` teaser contains a subset of 4000 popular
 repositories from GitHub, Gitlab, PyPI and Debian.
 The selection criteria to pick the software origins was the following:

 - The 1000 most popular GitHub projects (by number of stars)
 - The 1000 most popular Gitlab projects (by number of stars)
 - The 1000 most popular PyPI projects (by usage statistics, according to the
   `Top PyPI Packages `_ database),
 - The 1000 most popular Debian packages (by "votes" according to the
   `Debian Popularity Contest `_ database)

-This dataset is available in the following formats:
+This teaser is available in the following formats:

 - **PostgreSQL (compressed)**:

   - **URL**: `/graph/latest/popular-4k/sql/ `_
   - **Total size**: TODO

 - **Apache Parquet**:

   - **URL**: `/graph/latest/popular-4k/parquet/ `_
   - **Total size**: TODO

 popular-3k-python
------------------
+~~~~~~~~~~~~~~~~~

-The ``popular-3k-python`` dataset contains a subset of 3052 popular
+The ``popular-3k-python`` teaser contains a subset of 3052 popular
 repositories **tagged as being written in the Python language**, from GitHub,
 Gitlab, PyPI and Debian.

 The selection criteria to pick the software origins was the following, similar
 to ``popular-4k``:

 - the 1000 most popular GitHub projects written in Python (by number of stars),
 - the 131 Gitlab projects written in Python that have 2 stars or more,
 - the 1000 most popular PyPI projects (by usage statistics, according to the
   `Top PyPI Packages `_ database),
 - the 1000 most popular Debian packages with the `debtag `_
   ``implemented-in::python`` (by "votes" according to the
   `Debian Popularity Contest `_ database).
 - **PostgreSQL (compressed)**:

   - **URL**: `/graph/latest/popular-3k-python/sql/ `_
   - **Total size**: TODO

 - **Apache Parquet**:

   - **URL**: `/graph/latest/popular-3k-python/sql/ `_
   - **Total size**: TODO
diff --git a/docs/postgresql.rst b/docs/postgresql.rst
index 143bec8..b3e7556 100644
--- a/docs/postgresql.rst
+++ b/docs/postgresql.rst
@@ -1,98 +1,98 @@
 Setup on a PostgreSQL instance
 ==============================

 This tutorial will guide you through the steps required to setup the Software
 Heritage Graph Dataset in a PostgreSQL database.

 .. highlight:: bash

 PostgreSQL local setup
 ----------------------

 You need to have access to a running PostgreSQL instance to load the dataset.
 This section contains information on how to setup PostgreSQL for the first
 time.

 *If you already have a PostgreSQL server running on your machine, you can skip
 to the next section.*

 - For **Ubuntu** and **Debian**::

-  sudo apt install postgresql
+    sudo apt install postgresql

 - For **Archlinux**::

-  sudo pacman -S --needed postgresql
-  sudo -u postgres initdb -D '/var/lib/postgres/data'
-  sudo systemctl enable --now postgresql
+    sudo pacman -S --needed postgresql
+    sudo -u postgres initdb -D '/var/lib/postgres/data'
+    sudo systemctl enable --now postgresql

 Once PostgreSQL is running, you also need an user that will be able to create
 databases and run queries. The easiest way to achieve that is simply to create
 an account that has the same name as your username and that can create
 databases::

     sudo -u postgres createuser --createdb $USER

 Retrieving the dataset
 ----------------------

 You need to download the dataset in SQL format. Use the following command on
 your machine, after making sure that it has enough available space for the
 dataset you chose:

 .. tabs::

   .. group-tab:: full

     ::

       mkdir full && cd full
       wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/

   .. group-tab:: popular-4k

     ::

       mkdir full && cd full
       wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/

   .. group-tab:: popular-3k-python

     ::

       mkdir full && cd full
       wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/

 Loading the dataset
 -------------------

 Once you have retrieved the dataset of your choice, create a database that
 will contain it, and load the database:

 .. tabs::

   .. group-tab:: full

     ::

-      createdb softwareheritage-full
-      psql softwareheritage-full < swh_import.sql
+      createdb swhgd
+      psql swhgd < swh_import.sql

   .. group-tab:: popular-4k

     ::

-      createdb softwareheritage-popular-4k
-      psql softwareheritage-popular-4k < swh_import.sql
+      createdb swhgd-popular-4k
+      psql swhgd-popular-4k < swh_import.sql

   .. group-tab:: popular-3k-python

     ::

-      createdb softwareheritage-popular-3k-python
-      psql softwareheritage-popular-3k-python < swh_import.sql
+      createdb swhgd-popular-3k-python
+      psql swhgd-popular-3k-python < swh_import.sql

 You can now run SQL queries on your database. Run ``psql `` to start an
 interactive PostgreSQL console.
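
The retrieve-and-load steps in the patched tutorial differ between datasets only in the download URL and the database name. As a minimal sketch of that mapping, the shell helpers below turn a dataset name into both values. The function names ``swhgd_url`` and ``swhgd_dbname`` are hypothetical and not part of any Software Heritage tooling; the URL prefix (with the ``2019-01-28`` snapshot) and the ``swhgd`` database names are copied from the commands in the patch.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Software Heritage tooling.
# The URL prefix and database names mirror the commands shown in the patch.

PREFIX="https://annex.softwareheritage.org/public/dataset/graph/2019-01-28"

# Print the SQL download URL for a dataset name (full, popular-4k,
# popular-3k-python). Returns non-zero for any other name.
swhgd_url() {
    case "$1" in
        full) printf '%s/sql/\n' "$PREFIX" ;;
        popular-4k|popular-3k-python) printf '%s/%s/sql/\n' "$PREFIX" "$1" ;;
        *) echo "unknown dataset: $1" >&2; return 1 ;;
    esac
}

# Print the PostgreSQL database name used for a dataset.
swhgd_dbname() {
    case "$1" in
        full) echo "swhgd" ;;
        popular-4k|popular-3k-python) echo "swhgd-$1" ;;
        *) echo "unknown dataset: $1" >&2; return 1 ;;
    esac
}

# Example usage (left commented out so that sourcing this file has no
# side effects such as network access or database creation):
#   mkdir popular-4k && cd popular-4k
#   wget -c -A gz,sql -nd -r -np -nH "$(swhgd_url popular-4k)"
#   createdb "$(swhgd_dbname popular-4k)"
#   psql "$(swhgd_dbname popular-4k)" < swh_import.sql
```

Keeping the dataset name as the single input avoids copy-paste slips between the tabs, such as downloading a teaser into a directory named after another dataset.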