
diff --git a/docs/graph/databricks.rst b/docs/graph/databricks.rst
new file mode 100644
index 0000000..0699a7f
--- /dev/null
+++ b/docs/graph/databricks.rst
@@ -0,0 +1,90 @@
+Setup on Azure Databricks
+=========================
+
+.. highlight:: python
+
+This tutorial explains how to load the dataset in an Azure Spark cluster and
+interface with it using a Python notebook in Azure Databricks.
+
+
+Preliminaries
+-------------
+
+Make sure you have:
+
+- familiarized yourself with the `Azure Databricks Getting Started Guide
+  <https://docs.azuredatabricks.net/getting-started/index.html>`_
+
+- uploaded the dataset in the Parquet format on Azure (the most efficient place
+  to upload it is an `Azure Data Lake Storage Gen2
+  <https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction>`_
+  container).
+
+- created a Spark cluster in the Databricks interface and attached a Python
+  notebook to it.
+
+- set the OAuth credentials in the Notebook so that your parquet files are
+  accessible from the notebook, as described `here
+  <https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake-gen2.html#dataframe-or-dataset-api>`_
+  (a configuration sketch follows this list).
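+
+For example, with an Azure AD service principal, the OAuth configuration in the
+notebook typically looks like the following sketch. The storage account name,
+application (client) ID, secret scope/key and tenant ID below are placeholders
+to replace with your own values; refer to the guide linked above for the
+authoritative steps::
+
+    # Hypothetical placeholder values; replace with your own storage account
+    # and service principal credentials.
+    configs = {
+        "fs.azure.account.auth.type.YOUR_ACCOUNT.dfs.core.windows.net":
+            "OAuth",
+        "fs.azure.account.oauth.provider.type.YOUR_ACCOUNT.dfs.core.windows.net":
+            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
+        "fs.azure.account.oauth2.client.id.YOUR_ACCOUNT.dfs.core.windows.net":
+            "YOUR_APPLICATION_ID",
+        "fs.azure.account.oauth2.client.secret.YOUR_ACCOUNT.dfs.core.windows.net":
+            dbutils.secrets.get(scope="YOUR_SCOPE", key="YOUR_KEY"),
+        "fs.azure.account.oauth2.client.endpoint.YOUR_ACCOUNT.dfs.core.windows.net":
+            "https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token",
+    }
+    for key, value in configs.items():
+        spark.conf.set(key, value)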
+
+To ensure that you have completed all the preliminary steps, run the following
+command in your Notebook::
+
+    dataset_path = 'abfss://YOUR_CONTAINER@YOUR_ACCOUNT.dfs.core.windows.net/PARQUET_FILES_PATH'
+    dbutils.fs.ls(dataset_path)
+
+You should see an output like this::
+
+    [FileInfo(path='abfss://.../swh/content/', name='content/', size=0),
+     FileInfo(path='abfss://.../swh/directory/', name='directory/', size=0),
+     FileInfo(path='abfss://.../swh/directory_entry_dir/', name='directory_entry_dir/', size=0),
+     ...]
+
+Loading the tables
+------------------
+
+We need to load the Parquet tables as temporary views in Spark::
+
+    def register_table(table):
+        abfss_path = dataset_path + '/' + table
+        df = spark.read.parquet(abfss_path)
+        print("Register the DataFrame as a SQL temporary view: {} (path: {})"
+              .format(table, abfss_path))
+        df.createOrReplaceTempView(table)
+
+    tables = [
+        'content',
+        'directory',
+        'directory_entry_dir',
+        'directory_entry_file',
+        'directory_entry_rev',
+        'origin',
+        'origin_visit',
+        'person',
+        'release',
+        'revision',
+        'revision_history',
+        'skipped_content',
+        'snapshot',
+        'snapshot_branch',
+        'snapshot_branches'
+    ]
+
+    for table in tables:
+        register_table(table)
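+
+To check that all the views have been registered, you can for instance list
+them and inspect the schema of one of the tables (``origin`` is used here as
+an example)::
+
+    # List the temporary views created above and show one table's schema.
+    display(spark.sql("SHOW TABLES"))
+    spark.table("origin").printSchema()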
+
+Running queries
+---------------
+
+You can now run queries on the registered views using PySpark::
+
+    df = spark.sql("select id from origin limit 10")
+    display(df)
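+
+Query results are regular Spark DataFrames, so they can also be manipulated
+with the DataFrame API. As a small sketch (using only the tables registered
+above), the following counts the rows of a few of them::
+
+    # Count the rows of some of the registered views.
+    for table in ('origin', 'revision', 'release'):
+        print("{}: {} rows".format(table, spark.table(table).count()))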
+
+.. highlight:: sql
+
+It is also possible to use the ``%sql`` magic command in the Notebook to
+directly preview SQL results::
+
+    %sql
+    select id from origin limit 10
diff --git a/docs/graph/datasets.rst b/docs/graph/dataset.rst
similarity index 100%
rename from docs/graph/datasets.rst
rename to docs/graph/dataset.rst
diff --git a/docs/graph/index.rst b/docs/graph/index.rst
index c44e806..26fea8b 100644
--- a/docs/graph/index.rst
+++ b/docs/graph/index.rst
@@ -1,53 +1,54 @@
.. _swh-graph-dataset:
Software Heritage Graph Dataset
===============================
This is the Software Heritage graph dataset: a fully-deduplicated Merkle
DAG representation of the Software Heritage archive. The dataset links
together file content identifiers, source code directories, Version
Control System (VCS) commits tracking evolution over time, up to the
full states of VCS repositories as observed by Software Heritage during
periodic crawls. The dataset’s contents come from major development
forges (including `GitHub <https://github.com/>`__ and
`GitLab <https://gitlab.com>`__), FOSS distributions (e.g.,
`Debian <debian.org>`__), and language-specific package managers (e.g.,
`PyPI <https://pypi.org/>`__). Crawling information is also included,
providing timestamps about when and where all archived source code
artifacts have been observed in the wild.
The Software Heritage graph dataset is available in multiple formats,
including downloadable CSV dumps and Apache Parquet files for local use,
as well as a public instance on Amazon Athena interactive query service
for ready-to-use powerful analytical processing.
By accessing the dataset, you agree with the Software Heritage `Ethical
Charter for using the archive
data <https://www.softwareheritage.org/legal/users-ethical-charter/>`__,
and the `terms of use for bulk
access <https://www.softwareheritage.org/legal/bulk-access-terms-of-use/>`__.
If you use this dataset for research purposes, please cite the following paper:
*
| Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
| *The Software Heritage Graph Dataset: Public software development under one roof.*
| In proceedings of `MSR 2019 <http://2019.msrconf.org/>`_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 <https://2019.icse-conferences.org/>`_.
| `preprint <https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf>`_, `bibtex <https://upsilon.cc/~zack/research/publications/msr-2019-swh.bib>`_
.. toctree::
:maxdepth: 2
:caption: Contents:
- datasets
+ dataset
postgresql
athena
+ databricks
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
