Setup on Azure Databricks
=========================

.. highlight:: python

This tutorial explains how to load the dataset in an Azure Spark cluster and
interface with it using a Python notebook in Azure Databricks.

Preliminaries
-------------

Make sure you have:

- familiarized yourself with the `Azure Databricks Getting Started Guide
  <https://docs.azuredatabricks.net/getting-started/index.html>`_
- uploaded the dataset in the Parquet format on Azure (the most efficient place
  to upload it is an `Azure Data Lake Storage Gen2
  <https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction>`_
  container).
- created a Spark cluster in the Databricks interface and attached a Python
  notebook to it.
- set the OAuth credentials in the Notebook so that your parquet files are
  accessible from the notebook, as described `here
  <https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake-gen2.html#dataframe-or-dataset-api>`_
  (a minimal sketch is shown below).
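
For reference, the OAuth setup usually boils down to a few Spark configuration
options set in a notebook cell. The sketch below assumes a service principal
whose secret is stored in a Databricks secret scope; the scope and key names
(``swh-scope``, ``client-secret``) and the placeholder IDs are hypothetical and
must be replaced with your own values::

    # Hypothetical values: replace with your own application (client) ID,
    # directory (tenant) ID, and secret scope/key names.
    client_id = 'YOUR_APPLICATION_ID'
    tenant_id = 'YOUR_TENANT_ID'
    client_secret = dbutils.secrets.get(scope='swh-scope', key='client-secret')

    # Standard ABFS OAuth settings for Azure Data Lake Storage Gen2.
    spark.conf.set('fs.azure.account.auth.type', 'OAuth')
    spark.conf.set('fs.azure.account.oauth.provider.type',
                   'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider')
    spark.conf.set('fs.azure.account.oauth2.client.id', client_id)
    spark.conf.set('fs.azure.account.oauth2.client.secret', client_secret)
    spark.conf.set('fs.azure.account.oauth2.client.endpoint',
                   'https://login.microsoftonline.com/{}/oauth2/token'.format(tenant_id))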

To ensure that you have completed all the preliminary steps, run the following
command in your Notebook::

    dataset_path = 'abfss://YOUR_CONTAINER@YOUR_ACCOUNT.dfs.core.windows.net/PARQUET_FILES_PATH'
    dbutils.fs.ls(dataset_path)

You should see an output like this::

    [FileInfo(path='abfss://.../swh/content/', name='content/', size=0),
     FileInfo(path='abfss://.../swh/directory/', name='directory/', size=0),
     FileInfo(path='abfss://.../swh/directory_entry_dir/', name='directory_entry_dir/', size=0),
     ...]
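
Optionally, you can also peek at the schema of any table directly from its
Parquet files before registering it, using the standard ``printSchema`` method
(the ``content`` table is used here only as an example)::

    spark.read.parquet(dataset_path + '/content').printSchema()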

Loading the tables
------------------

We need to load the Parquet tables as temporary views in Spark::

    def register_table(table):
        abfss_path = dataset_path + '/' + table
        df = spark.read.parquet(abfss_path)
        print("Register the DataFrame as a SQL temporary view: {} (path: {})"
              .format(table, abfss_path))
        df.createOrReplaceTempView(table)

    tables = [
        'content',
        'directory',
        'directory_entry_dir',
        'directory_entry_file',
        'directory_entry_rev',
        'origin',
        'origin_visit',
        'person',
        'release',
        'revision',
        'revision_history',
        'skipped_content',
        'snapshot',
        'snapshot_branch',
        'snapshot_branches'
    ]

    for table in tables:
        register_table(table)
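
To confirm that the views were registered, you can list them with a standard
Spark SQL statement (this quick sanity check is not required, just convenient)::

    display(spark.sql("show tables"))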

Running queries
---------------

You can now execute PySpark methods on the tables::

    df = spark.sql("select id from origin limit 10")
    display(df)
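
The same views can also be queried through the DataFrame API instead of SQL
strings; for instance, the equivalent of the query above on the ``origin``
view::

    df = spark.table("origin").select("id").limit(10)
    display(df)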

.. highlight:: sql

It is also possible to use the ``%sql`` magic command in the Notebook to
directly preview SQL results::

    %sql
    select id from origin limit 10
