diff --git a/docs/graph/datasets.rst b/docs/graph/datasets.rst index 95cb56f..3fbe822 100644 --- a/docs/graph/datasets.rst +++ b/docs/graph/datasets.rst @@ -1,83 +1,86 @@ Dataset ======= We provide the full graph dataset along with two "teaser" datasets that can be used for trying out smaller-scale experiments before using the full graph. All the main URLs are relative to our dataset prefix: `https://annex.softwareheritage.org/public/dataset/ `__. The Software Heritage Graph Dataset contains a table representation of the full Software Heritage Graph. It is available in the following formats: - **PostgreSQL (compressed)**: - **URL**: `/graph/latest/sql/ `_ - **Total size**: 1.2 TiB - **Apache Parquet**: - **URL**: `/graph/latest/parquet/ `_ - **Total size**: 1.2 TiB Teaser datasets --------------- +If the above dataset is too big, we also provide the following "teaser" +datasets that can get you started and have a smaller size fingerprint. + popular-4k ~~~~~~~~~~ The ``popular-4k`` teaser contains a subset of 4000 popular repositories from GitHub, Gitlab, PyPI and Debian. The selection criteria to pick the software origins was the following: - The 1000 most popular GitHub projects (by number of stars) - The 1000 most popular Gitlab projects (by number of stars) - The 1000 most popular PyPI projects (by usage statistics, according to the `Top PyPI Packages `_ database), - The 1000 most popular Debian packages (by "votes" according to the `Debian Popularity Contest `_ database) This teaser is available in the following formats: - **PostgreSQL (compressed)**: - **URL**: `/graph/latest/popular-4k/sql/ `_ - **Total size**: TODO - **Apache Parquet**: - **URL**: `/graph/latest/popular-4k/parquet/ `_ - **Total size**: TODO popular-3k-python ~~~~~~~~~~~~~~~~~ The ``popular-3k-python`` teaser contains a subset of 3052 popular repositories **tagged as being written in the Python language**, from GitHub, Gitlab, PyPI and Debian. The selection criteria to pick the software origins was the following, similar to ``popular-4k``: - the 1000 most popular GitHub projects written in Python (by number of stars), - the 131 Gitlab projects written in Python that have 2 stars or more, - the 1000 most popular PyPI projects (by usage statistics, according to the `Top PyPI Packages `_ database), - the 1000 most popular Debian packages with the `debtag `_ ``implemented-in::python`` (by "votes" according to the `Debian Popularity Contest `_ database). - **PostgreSQL (compressed)**: - **URL**: `/graph/latest/popular-3k-python/sql/ `_ - **Total size**: TODO - **Apache Parquet**: - **URL**: `/graph/latest/popular-3k-python/sql/ `_ - **Total size**: TODO