diff --git a/docs/graph/index.rst b/docs/graph/index.rst
index bba72f9..033b99d 100644
--- a/docs/graph/index.rst
+++ b/docs/graph/index.rst
@@ -1,56 +1,55 @@
 .. _swh-graph-dataset:
 
 Software Heritage Graph Dataset
 ===============================
 
 This is the Software Heritage graph dataset: a fully-deduplicated Merkle
 DAG representation of the Software Heritage archive. The dataset links
 together file content identifiers, source code directories, Version
 Control System (VCS) commits tracking evolution over time, up to the
 full states of VCS repositories as observed by Software Heritage during
 periodic crawls. The dataset’s contents come from major development
 forges (including `GitHub `__ and `GitLab `__), FOSS distributions
 (e.g., `Debian `__), and language-specific package managers (e.g.,
 `PyPI `__). Crawling information is also included, providing
 timestamps about when and where all archived source code artifacts have
 been observed in the wild.
 
 The Software Heritage graph dataset is available in multiple formats,
 including relational Apache ORC files for local use, as well as a public
 instance on Amazon Athena interactive query service for ready-to-use
 powerful analytical processing.
 
 By accessing the dataset, you agree with the Software Heritage `Ethical
 Charter for using the archive data `__, and the `terms of use for bulk
 access `__.
 
 If you use this dataset for research purposes, please cite the following paper:
 
 * | Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
   | *The Software Heritage Graph Dataset: Public software development under one roof.*
   | In proceedings of `MSR 2019 `_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 `_.
   | `preprint `_, `bibtex `_
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
    :titlesonly:
 
    dataset
    schema
-   postgresql
    athena
    databricks
 
 Indices and tables
 ------------------
 
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`
diff --git a/docs/graph/postgresql.rst b/docs/graph/postgresql.rst
deleted file mode 100644
index 5d8c17e..0000000
--- a/docs/graph/postgresql.rst
+++ /dev/null
@@ -1,98 +0,0 @@
-Setup on a PostgreSQL instance
-==============================
-
-This tutorial will guide you through the steps required to setup the Software
-Heritage Graph Dataset in a PostgreSQL database.
-
-.. highlight:: bash
-
-PostgreSQL local setup
-----------------------
-
-You need to have access to a running PostgreSQL instance to load the dataset.
-This section contains information on how to setup PostgreSQL for the first
-time.
-
-*If you already have a PostgreSQL server running on your machine, you can skip
-to the next section.*
-
-- For **Ubuntu** and **Debian**::
-
-    sudo apt install postgresql
-
-- For **Archlinux**::
-
-    sudo pacman -S --needed postgresql
-    sudo -u postgres initdb -D '/var/lib/postgres/data'
-    sudo systemctl enable --now postgresql
-
-Once PostgreSQL is running, you also need an user that will be able to create
-databases and run queries. The easiest way to achieve that is simply to create
-an account that has the same name as your username and that can create
-databases::
-
-    sudo -u postgres createuser --createdb $USER
-
-
-Retrieving the dataset
-----------------------
-
-You need to download the dataset in SQL format. Use the following command on
-your machine, after making sure that it has enough available space for the
-dataset you chose:
-
-.. tabs::
-
-    .. group-tab:: full
-
-        ::
-
-            mkdir swhgd && cd swhgd
-            wget -c -q --show-progress -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/latest/sql/
-
-    .. group-tab:: teaser: popular-4k
-
-        ::
-
-            mkdir popular-4k && cd popular-4k
-            wget -c -q --show-progress -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/sql/
-
-    .. group-tab:: teaser: popular-3k-python
-
-        ::
-
-            mkdir popular-3k-python && cd popular-3k-python
-            wget -c -q --show-progress -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/sql/
-
-Loading the dataset
--------------------
-
-Once you have retrieved the dataset of your choice, create a database that will
-contain it, and load the database:
-
-.. tabs::
-
-    .. group-tab:: full
-
-        ::
-
-            createdb swhgd
-            psql swhgd < load.sql
-
-    .. group-tab:: teaser: popular-4k
-
-        ::
-
-            createdb swhgd-popular-4k
-            psql swhgd-popular-4k < load.sql
-
-    .. group-tab:: teaser: popular-3k-python
-
-        ::
-
-            createdb swhgd-popular-3k-python
-            psql swhgd-popular-3k-python < load.sql
-
-
-You can now run SQL queries on your database. Run ``psql `` to
-start an interactive PostgreSQL console.