diff --git a/docs/graph/athena.rst b/docs/graph/athena.rst index 25ee257..188334e 100644 --- a/docs/graph/athena.rst +++ b/docs/graph/athena.rst @@ -1,125 +1,125 @@ Setup on Amazon Athena ====================== The Software Heritage Graph Dataset is available as a public dataset in `Amazon Athena `_. Athena uses `Presto `_, a distributed SQL query engine, to automatically scale queries on large datasets. The pricing of Athena depends on the amount of data scanned by each query, generally at a cost of $5 per TiB of data scanned. Full pricing details are available `here `_. Note that because the Software Heritage Graph Dataset is available as a public dataset, you **do not have to pay for the storage, only for the queries** (except for the data you store on S3 yourself, like query results). Loading the tables ------------------ .. highlight:: bash AWS account ~~~~~~~~~~~ In order to use Amazon Athena, you will first need to `create an AWS account and set up billing `_. You will also need to create an **output S3 bucket**: this is the place where Athena will store your query results, so that you can retrieve them and analyze them afterwards. To do that, go to the `S3 console `_ and create a new bucket. Setup ~~~~~ Athena needs to be made aware of the location and the schema of the Parquet files available as a public dataset. Unfortunately, since Athena does not support queries that contain multiple commands, it is not as simple as pasting an installation script in the console. Instead, we provide a Python script that you can run locally on your machine and that communicates with Athena to create the tables automatically with the appropriate schema. 
To run this script, you will need to install a few dependencies on your machine: - For **Ubuntu** and **Debian**:: sudo apt install python3 python3-boto3 awscli - For **Archlinux**:: sudo pacman -S --needed python python-boto3 aws-cli Once the dependencies are installed, run:: aws configure This will ask for an AWS Access Key ID and an AWS Secret Access Key in order to give Python access to your AWS account. These keys can be generated at `this address `_. It will also ask for the region in which you want to run the queries. We -recommand to use ``us-east-1``, since that's where the public dataset is +recommend to use ``us-east-1``, since that's where the public dataset is located. Creating the tables ~~~~~~~~~~~~~~~~~~~ Download and run the Python script that will create the tables on your account: .. tabs:: .. group-tab:: full :: wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' .. group-tab:: teaser: popular-4k :: wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular4k -l 's3://softwareheritage/teasers/popular-4k' .. group-tab:: teaser: popular-3k-python :: wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular3kpython -l 's3://softwareheritage/teasers/popular-3k-python' To check that the tables have been successfully created in your account, you can open your `Amazon Athena console `_. You should be able to select the database corresponding to your dataset, and see the tables: .. image:: _images/athena_tables.png Running queries --------------- .. highlight:: sql From the console, once you have selected the database of your dataset, you can run SQL queries directly from the Query Editor. 
Try for instance this query that computes the most frequent file names in the archive:: SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt FROM directory_entry_file GROUP BY name ORDER BY cnt DESC LIMIT 10; Other examples are available in the preprint of our article: `The Software Heritage Graph Dataset: Public software development under one roof. `_ diff --git a/docs/graph/index.rst b/docs/graph/index.rst index 991414f..58b96ef 100644 --- a/docs/graph/index.rst +++ b/docs/graph/index.rst @@ -1,56 +1,56 @@ .. _swh-graph-dataset: Software Heritage Graph Dataset =============================== This is the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including `GitHub `__ and `GitLab `__), FOSS distributions (e.g., `Debian `__), and language-specific package managers (e.g., `PyPI `__). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. By accessing the dataset, you agree with the Software Heritage `Ethical Charter for using the archive data `__, and the `terms of use for bulk access `__. If you use this dataset for research purposes, please cite the following paper: -* +* | Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. 
| *The Software Heritage Graph Dataset: Public software development under one roof.* | In proceedings of `MSR 2019 `_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 `_. | `preprint `_, `bibtex `_ .. toctree:: :maxdepth: 2 :caption: Contents: :titlesonly: dataset schema postgresql athena databricks Indices and tables ------------------ * :ref:`genindex` * :ref:`modindex` * :ref:`search` diff --git a/docs/graph/schema.rst b/docs/graph/schema.rst index e2518f6..13409b7 100644 --- a/docs/graph/schema.rst +++ b/docs/graph/schema.rst @@ -1,142 +1,142 @@ Relational schema ================= The Merkle DAG of the Software Heritage archive is encoded in the dataset as a set of relational tables. A simplified view of the corresponding database schema is shown here: .. image:: _images/db-schema.svg This page documents the details of the schema. - **content**: contains information on the contents stored in the archive. - ``sha1`` (bytes): the SHA-1 of the content - ``sha1_git`` (bytes): the Git SHA-1 of the content - ``length`` (integer): the length of the content - **skipped_content**: contains information on the contents that were not archived for various reasons. - ``sha1`` (bytes): the SHA-1 of the missing content - ``sha1_git`` (bytes): the Git SHA-1 of the missing content - ``length`` (integer): the length of the missing content - **directory**: contains the directories stored in the archive. - ``id`` (bytes): the intrinsic identifier of the directory, recursively computed with the Git SHA-1 algorithm - ``dir_entries`` (array of integers): the list of directories contained in this directory, as references to an entry in the ``directory_entry_dir`` table. - ``file_entries`` (array of integers): the list of files contained in this directory, as references to an entry in the ``directory_entry_file`` table. 
- ``rev_entries`` (array of integers): the list of revisions contained in this directory, as references to an entry in the ``directory_entry_rev`` table. -- **directory_entry_file**: contains informations about file entries in +- **directory_entry_file**: contains information about file entries in directories. - ``id`` (integer): unique identifier for the entry - ``target`` (bytes): the Git SHA-1 of the content this entry points to - ``name`` (bytes): the name of the file (basename of its path) - ``perms`` (integer): the permissions of the file -- **directory_entry_dir**: contains informations about directory entries in +- **directory_entry_dir**: contains information about directory entries in directories. - ``id`` (integer): unique identifier for the entry - ``target`` (bytes): the Git SHA-1 of the directory this entry points to - ``name`` (bytes): the name of the directory - ``perms`` (integer): the permissions of the directory -- **directory_entry_rev**: contains informations about revision entries in +- **directory_entry_rev**: contains information about revision entries in directories. - ``id`` (integer): unique identifier for the entry - ``target`` (bytes): the Git SHA-1 of the revision this entry points to - ``name`` (bytes): the name of the directory that contains this revision - ``perms`` (integer): the permissions of the revision - **person**: deduplicates commit authors by their names and e-mail addresses. For pseudonymization purposes and in order to prevent abuse, these columns were removed from the dataset, and this table only contains the ID of the author. Individual authors may be retrieved using this ID from the Software Heritage api. - ``id`` (integer): the identifier of the person - **revision**: contains the revisions stored in the archive. - ``id`` (bytes): the intrinsic identifier of the revision, recursively computed with the Git SHA-1 algorithm. For Git repositories, this corresponds to the revision hash. 
- ``date`` (timestamp): the date the revision was authored - ``committer_date`` (timestamp): the date the revision was committed - ``author`` (integer): the author of the revision - ``committer`` (integer): the committer of the revision - ``message`` (bytes): the revision message - ``directory`` (bytes): the Git SHA-1 of the directory the revision points to. Every revision points to the root directory of the project source tree to which it corresponds. - **revision_history**: contains the ordered set of parents of each revision. Each revision has an ordered set of parents (0 for the initial commit of a repository, 1 for a regular commit, 2 for a regular merge commit and 3 or more for octopus-style merge commits). - ``id`` (bytes): the Git SHA-1 identifier of the revision - ``parent_id`` (bytes): the Git SHA-1 identifier of the parent - ``parent_rank`` (integer): the rank of the parent, which defines the total order of the parents of the revision - **release**: contains the releases stored in the archive. - ``id`` (bytes): the intrinsic identifier of the release, recursively computed with the Git SHA-1 algorithm. - ``target`` (bytes): the Git SHA-1 of the object the release points to. - ``date`` (timestamp): the date the release was created - ``author`` (integer): the author of the release - ``name`` (bytes): the release name - ``message`` (bytes): the release message - **snapshot**: contains the list of snapshots stored in the archive. - ``id`` (bytes): the intrinsic identifier of the snapshot, recursively computed with the Git SHA-1 algorithm. - ``object_id`` (integer): the primary key of the snapshot - **snapshot_branches**: contains the identifiers of branches associated with each snapshot. This is an intermediary table that represents the many-to-many relationship between snapshots and branches. 
- ``snapshot_id`` (integer): the integer identifier of the snapshot - ``branch_id`` (integer): the identifier of the branch - **snapshot_branch**: contains the list of branches. - ``object_id`` (integer): the identifier of the branch - ``name`` (bytes): the name of the branch - ``target`` (bytes): the Git SHA-1 of the object the branch points to - ``target_type`` (string): the type of object the branch points to (either ``release``, ``revision``, ``directory`` or ``content``). - **origin**: the software origins from which the projects in the dataset were archived. - ``id`` (integer): the identifier of the origin - ``url`` (bytes): the URL of the origin - ``type`` (string): the type of origin (e.g. ``git``, ``pypi``, ``hg``, ``svn``, ``ftp``, ``deb``, ...) - **origin_visit**: the different visits of each origin. Since Software Heritage archives software continuously, software origins are crawled more than once. Each of these "visits" is an entry in this table. - ``origin`` (integer): the identifier of the origin visited - ``date`` (timestamp): the date at which the origin was visited - ``snapshot_id`` (integer): the integer identifier of the snapshot archived in this visit.
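To make the relationships between ``origin``, ``origin_visit``, ``snapshot_branches`` and ``snapshot_branch`` more concrete, here is a minimal sketch that mimics these four tables in an in-memory SQLite database and lists the branches seen at each visit. It is only an illustration under several assumptions: the rows and the ``example.org`` URL are made up, byte columns are simplified to text, SQLite stands in for Athena or PostgreSQL, and ``snapshot_id`` in ``origin_visit`` is assumed to refer to the same integer snapshot identifier as ``snapshot_id`` in ``snapshot_branches``.

```python
import sqlite3

# Simplified copies of four tables from the schema above.
# Byte columns (url, name, target) are stored as TEXT for readability.
SCHEMA = """
CREATE TABLE origin (id INTEGER, url TEXT, type TEXT);
CREATE TABLE origin_visit (origin INTEGER, date TEXT, snapshot_id INTEGER);
CREATE TABLE snapshot_branches (snapshot_id INTEGER, branch_id INTEGER);
CREATE TABLE snapshot_branch (
    object_id INTEGER, name TEXT, target TEXT, target_type TEXT
);
"""

# Hypothetical rows: one origin visited twice, with a tag added
# between the first and the second visit.
ROWS = {
    "origin": [(1, "https://example.org/repo.git", "git")],
    "origin_visit": [(1, "2019-01-01", 10), (1, "2019-02-01", 11)],
    "snapshot_branches": [(10, 100), (11, 100), (11, 101)],
    "snapshot_branch": [
        (100, "refs/heads/master", "aaaa", "revision"),
        (101, "refs/tags/v1.0", "bbbb", "release"),
    ],
}

# origin -> origin_visit -> snapshot_branches -> snapshot_branch.
# snapshot_branches is the intermediary table of the many-to-many
# relationship between snapshots and branches.
QUERY = """
SELECT o.url, v.date, b.name, b.target_type
FROM origin AS o
JOIN origin_visit AS v ON v.origin = o.id
JOIN snapshot_branches AS sb ON sb.snapshot_id = v.snapshot_id
JOIN snapshot_branch AS b ON b.object_id = sb.branch_id
ORDER BY v.date, b.name
"""


def branches_per_visit():
    """Build the toy database and return one row per (visit, branch)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for table, rows in ROWS.items():
        placeholders = ", ".join("?" for _ in rows[0])
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in branches_per_visit():
        print(row)
```

On the real dataset the same join shape should apply, but the byte columns would likely need decoding (for instance with ``from_utf8`` in Athena), and a query over the full tables can scan a large amount of data.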