Setup on Azure Databricks
=========================

.. highlight:: python

This tutorial explains how to load the dataset into an Azure Spark cluster and
how to interface with it using a Python notebook in Azure Databricks.

Preliminaries
-------------
Make sure you have:

- familiarized yourself with the `Azure Databricks Getting Started Guide
  <https://docs.azuredatabricks.net/getting-started/index.html>`_

- uploaded the dataset in the Parquet format on Azure (the most efficient place
  to upload it is an `Azure Data Lake Storage Gen2
  <https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction>`_
  container).

- created a Spark cluster in the Databricks interface and attached a Python
  notebook to it.

- set the OAuth credentials in the Notebook so that your parquet files are
  accessible from the notebook, as described `here
  <https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake-gen2.html#dataframe-or-dataset-api>`_
  (a minimal configuration sketch follows this list).

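
The exact configuration depends on how you authenticate; with a service
principal it usually boils down to setting the ``fs.azure.account.*`` options
on the Spark session, along the lines of the sketch below (all values between
angle brackets are placeholders to replace with your own credentials)::

    # Sketch of OAuth configuration for ADLS Gen2 -- replace the placeholders
    # with the credentials of your own service principal.
    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret", "<application-secret>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint",
                   "https://login.microsoftonline.com/<directory-id>/oauth2/token")
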
To ensure that you have completed all the preliminary steps, run the following
command in your Notebook

::

    dataset_path = 'abfss://YOUR_CONTAINER@YOUR_ACCOUNT.dfs.core.windows.net/PARQUET_FILES_PATH'
    dbutils.fs.ls(dataset_path)

You should see an output like this

::

    [FileInfo(path='abfss://.../swh/content/', name='content/', size=0),
     FileInfo(path='abfss://.../swh/directory/', name='directory/', size=0),
     FileInfo(path='abfss://.../swh/directory_entry_dir/', name='directory_entry_dir/', size=0),
     ...]

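
If some entries are missing, double-check the dataset path and the upload
step. A small sanity check along these lines can help (the ``expected`` set
below is only an illustrative subset of the table directories)::

    # Illustrative check: warn about table directories missing from the dataset path.
    expected = {'content', 'directory', 'origin', 'revision', 'snapshot'}
    found = {f.name.rstrip('/') for f in dbutils.fs.ls(dataset_path)}
    missing = expected - found
    if missing:
        print("Missing table directories:", sorted(missing))
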
Loading the tables
------------------
We need to load the Parquet tables as temporary views in Spark

::

    def register_table(table):
        # Read the Parquet files of the table and register them as a SQL
        # temporary view named after the table.
        abfss_path = dataset_path + '/' + table
        df = spark.read.parquet(abfss_path)
        print("Register the DataFrame as a SQL temporary view: {} (path: {})"
              .format(table, abfss_path))
        df.createOrReplaceTempView(table)

    tables = [
        'content',
        'directory',
        'directory_entry_dir',
        'directory_entry_file',
        'directory_entry_rev',
        'origin',
        'origin_visit',
        'person',
        'release',
        'revision',
        'revision_history',
        'skipped_content',
        'snapshot',
        'snapshot_branch',
        'snapshot_branches'
    ]

    for table in tables:
        register_table(table)

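
To verify that the views were registered, you can list them from the Spark
catalog (``spark.catalog.listTables()`` is standard PySpark, not specific to
this dataset)::

    # List the temporary views now visible to Spark SQL.
    for t in spark.catalog.listTables():
        print(t.name, t.isTemporary)
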
Running queries
---------------
You can now execute PySpark methods on the tables

::

    df = spark.sql("select id from origin limit 10")
    display(df)

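
The result of ``spark.sql()`` is a regular Spark DataFrame, so the usual
DataFrame methods work as well; for instance, the same query can be written as
follows (an illustrative snippet, not part of the original tutorial)::

    # Equivalent query through the DataFrame API, collected locally.
    origins = spark.table("origin").select("id").limit(10)
    print(origins.collect())
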

.. highlight:: sql

It is also possible to use the ``%sql`` magic command in the Notebook to
directly preview SQL results::

    %sql
    select id from origin limit 10
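
Any statement supported by Spark SQL works in such a cell; for example, a
simple aggregate over one of the registered tables (an illustrative query, not
from the original tutorial)::

    %sql
    select count(*) as origin_count from origin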