Setup on Azure Databricks
=========================

.. highlight:: python

This tutorial explains how to load the dataset into an Azure Spark cluster and
how to interface with it using a Python notebook in Azure Databricks.

Preliminaries
-------------
Make sure you have:

- familiarized yourself with the `Azure Databricks Getting Started Guide
  <https://docs.azuredatabricks.net/getting-started/index.html>`_

- uploaded the dataset in the Parquet format on Azure (the most efficient place
  to upload it is an `Azure Data Lake Storage Gen2
  <https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction>`_
  container).

- created a Spark cluster in the Databricks interface and attached a Python
  notebook to it.

- set the OAuth credentials in the Notebook so that your parquet files are
  accessible from the notebook, as described `here
  <https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake-gen2.html#dataframe-or-dataset-api>`_
  (a minimal configuration sketch follows this list).

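
The exact configuration depends on how you authenticate; with a service
principal it usually boils down to setting the ``fs.azure.account.*`` options
on the Spark session, along the lines of the sketch below (all values between
angle brackets are placeholders to replace with your own credentials)::

    # Sketch of OAuth configuration for ADLS Gen2 -- replace the placeholders
    # with the credentials of your own service principal.
    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret", "<application-secret>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint",
                   "https://login.microsoftonline.com/<directory-id>/oauth2/token")
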
To ensure that you have completed all the preliminary steps, run the following
command in your Notebook

::

    dataset_path = 'abfss://YOUR_CONTAINER@YOUR_ACCOUNT.dfs.core.windows.net/PARQUET_FILES_PATH'
    dbutils.fs.ls(dataset_path)

You should see an output like this

::

    [FileInfo(path='abfss://.../swh/content/', name='content/', size=0),
     FileInfo(path='abfss://.../swh/directory/', name='directory/', size=0),
     FileInfo(path='abfss://.../swh/directory_entry_dir/', name='directory_entry_dir/', size=0),
     ...]

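
If some entries are missing, double-check the dataset path and the upload
step. A small sanity check along these lines can help (the ``expected`` set
below is only an illustrative subset of the table directories)::

    # Illustrative check: warn about table directories missing from the dataset path.
    expected = {'content', 'directory', 'origin', 'revision', 'snapshot'}
    found = {f.name.rstrip('/') for f in dbutils.fs.ls(dataset_path)}
    missing = expected - found
    if missing:
        print("Missing table directories:", sorted(missing))
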
Loading the tables
------------------
We need to load the Parquet tables as temporary views in Spark

::

    def register_table(table):
        # Read the Parquet files of the table and register them as a SQL
        # temporary view named after the table.
        abfss_path = dataset_path + '/' + table
        df = spark.read.parquet(abfss_path)
        print("Register the DataFrame as a SQL temporary view: {} (path: {})"
              .format(table, abfss_path))
        df.createOrReplaceTempView(table)

    tables = [
        'content',
        'directory',
        'directory_entry_dir',
        'directory_entry_file',
        'directory_entry_rev',
        'origin',
        'origin_visit',
        'person',
        'release',
        'revision',
        'revision_history',
        'skipped_content',
        'snapshot',
        'snapshot_branch',
        'snapshot_branches'
    ]

    for table in tables:
        register_table(table)

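
To verify that the views were registered, you can list them from the Spark
catalog (``spark.catalog.listTables()`` is standard PySpark, not specific to
this dataset)::

    # List the temporary views now visible to Spark SQL.
    for t in spark.catalog.listTables():
        print(t.name, t.isTemporary)
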
Running queries
---------------
You can now execute PySpark methods on the tables

::

    df = spark.sql("select id from origin limit 10")
    display(df)

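
The result of ``spark.sql()`` is a regular Spark DataFrame, so the usual
DataFrame methods work as well; for instance, the same query can be written as
follows (an illustrative snippet, not part of the original tutorial)::

    # Equivalent query through the DataFrame API, collected locally.
    origins = spark.table("origin").select("id").limit(10)
    print(origins.collect())
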

.. highlight:: sql

It is also possible to use the ``%sql`` magic command in the Notebook to
directly preview SQL results::

    %sql
    select id from origin limit 10
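
Any statement supported by Spark SQL works in such a cell; for example, a
simple aggregate over one of the registered tables (an illustrative query, not
from the original tutorial)::

    %sql
    select count(*) as origin_count from origin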