Using the Software Heritage Graph Dataset
=========================================
This README contains instructions on how to use the different distribution
formats of the *Software Heritage graph dataset*.

Schema
------
The detailed schema of the database dumps is available in
`sql_swh_import_scripts.zip/30-schema.sql`.
The different fields are documented in the comments of the schema itself.

PostgreSQL dumps
----------------
[PostgreSQL](https://www.postgresql.org/) dumps are the files distributed with
the `sql_` prefix. They can be imported into a local database using:
```
createdb softwareheritage
unzip sql_swh_import_scripts.zip
psql softwareheritage < sql_swh_import.sql
```
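Once the import completes, the tables described in `30-schema.sql` can be
queried directly. Here is a minimal sketch, assuming Python 3 with the
`psycopg2` package installed and default local connection settings (both
assumptions, not part of the dataset itself); it reuses the table and column
names from the Athena example further down in this README:
```
import psycopg2

# Connect to the locally imported database (connection parameters are
# assumptions; adjust them to your local PostgreSQL setup).
conn = psycopg2.connect(dbname="softwareheritage")
cur = conn.cursor()

# Most frequent file name in the archive (same query as in the Athena
# section below, minus the UTF-8 decoding).
cur.execute("""
    SELECT name, COUNT(DISTINCT target) AS cnt
    FROM directory_entry_file
    GROUP BY name
    ORDER BY cnt DESC
    LIMIT 1
""")
print(cur.fetchone())
conn.close()
```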
Parquet dumps
-------------
[Parquet](https://parquet.apache.org/) dumps are the files distributed with the
`parquet_` prefix. They can be imported into a Hadoop cluster and analyzed with
any data processing framework that supports Parquet files (Hive, Drill,
Spark, ...).
The Parquet dataset is stored in tarballs that can be unpacked using:
```
for f in parquet_*; do tar xvf "$f"; done
```
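The unpacked files can also be explored locally with Spark. The following is a
minimal PySpark sketch, assuming the tarballs unpack into one directory of
Parquet files per table (the exact layout may differ) and reusing the table and
column names from the Athena example below:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("swh-graph-dataset").getOrCreate()

# Load one table from its unpacked Parquet directory (the path is an
# assumption; point it at wherever the tarball was extracted).
entries = spark.read.parquet("directory_entry_file/")
entries.createOrReplaceTempView("directory_entry_file")

# Most frequent file name, as in the Athena example below.
spark.sql("""
    SELECT name, COUNT(DISTINCT target) AS cnt
    FROM directory_entry_file
    GROUP BY name
    ORDER BY cnt DESC
    LIMIT 1
""").show()
```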
Using Amazon Athena
-------------------
The *Software Heritage graph dataset* is available as a public dataset in
[Amazon Athena](https://aws.amazon.com/athena/).
### Setup
In order to query the dataset using Athena, you will first need to [create an
AWS account and set up billing](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/).
Once your AWS account is ready, you will need to install a few dependencies on
your machine:
- Python 3
- The [aws cli](https://docs.aws.amazon.com/cli/index.html)
- The [boto3 Python package](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
On Debian, the dependencies can be installed with the following command:
```
sudo apt install python3 python3-boto3 awscli
```
Once the dependencies are installed, run:
```
aws configure
```
and enter your AWS Access Key ID and AWS Secret Access Key, so that the
command-line tools and Python scripts can access your AWS account.
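To check that the credentials are picked up correctly, you can run a quick
sanity check from Python (a minimal sketch using the STS API; it only reports
which account and user the configured credentials belong to):
```
import boto3

# Prints the account and user ARN of the configured credentials; boto3
# raises an error if `aws configure` was not completed correctly.
print(boto3.client("sts").get_caller_identity())
```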
### Create the tables
To import the schema of the dataset into your account, extract `athena.zip`,
then run the following command from the [athena/](athena/) folder:
```
./gen_schema.py
```
This will create the required tables in your AWS account. You can check that
the tables were successfully created by going to the [Amazon Athena
console](https://console.aws.amazon.com/athena/home) and selecting the "swh"
database.
### Run queries
From the console, once you have selected the "swh" database, you can directly
run queries from the Query Editor.
Here is an example query that computes the most frequent file name in the
archive:
```
SELECT FROM_UTF8(name, '?') AS name,
       COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 1;
```
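Queries can also be submitted programmatically with boto3 instead of the
console. The following is a minimal sketch, not part of the official
instructions: it assumes the tables created above, and the S3 output location
is a placeholder bucket that you must create and own yourself:
```
import time
import boto3

athena = boto3.client("athena")

# Submit the query; Athena writes its results to the given S3 location
# (replace the bucket name with one you own).
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM directory_entry_file",
    QueryExecutionContext={"Database": "swh"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-BUCKET/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (the first row contains the column headers).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```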
More documentation on Amazon Athena is available
[here](https://docs.aws.amazon.com/athena/index.html).

Software Heritage Environment
=============================
Reproducing the dataset requires a local instance of the Software Heritage
stack. In general, it is possible to run this stack locally by following the
steps outlined in the [Getting Started
guide](https://docs.softwareheritage.org/devel/getting-started.html) of the
Software Heritage documentation.
For reproducibility purposes, an archive of the exact version of the software
stack that was used during the publication of the dataset is also available as
part of the dataset itself, in `swh-environment.tar.gz`. To extract it, you can
run:
```
tar xvf swh-environment.tar.gz
```