
Data Engineering with Databricks
Databricks Academy
2023
Meet your instructor
Add instructor name, Add instructor title

• Team: <add>
• Time at Databricks: <add>
• Fun fact: <add>

Replace with instructor photograph
2
Meet your classmates
• Where is everyone joining us from today (city, country)?

3
Meet your classmates
• How long have you been working with Databricks?

4
Meet your classmates
• What has your experience working with Databricks for data
engineering been so far?

5
Meet your classmates
• What are you hoping to get out of this class?

6
Getting Started with the Course

Course Objectives

1. Perform common code development tasks in a data engineering workflow using the Databricks Data Science & Engineering Workspace.
2. Use Spark to extract data from a variety of sources, apply common cleaning transformations, and manipulate complex data to load into Delta Lake.
3. Define and schedule data pipelines that incrementally ingest and process data through multiple tables in the lakehouse using Delta Live Tables.
4. Orchestrate data pipelines with Databricks Workflow Jobs and schedule dashboard updates to keep analytics up-to-date.
5. Configure permissions in Unity Catalog to ensure that users have proper access to databases for analytics and dashboarding.
Agenda
Module Name Duration

Get Started with Databricks Data Science and Engineering Workspace 1 hour, 20 min

Transform Data with Spark (SQL/PySpark) 2 hours, 50 min

Manage Data with Delta Lake 1 hour, 30 min

Build Data Pipelines with Delta Live Tables (SQL/PySpark) 3 hours

Deploy Workloads with Databricks Workflows 1 hour, 10 min

Manage Data Access for Analytics with Unity Catalog 2 hours

● We will take 10 minute breaks about every hour

9
Module 01
Get Started with Databricks Data Science & Engineering Workspace
Module Objectives
Get Started with Databricks Data Science and Engineering Workspace

1. Describe the core components of the Databricks Lakehouse platform.


2. Navigate the Databricks Data Science & Engineering Workspace UI.
3. Create and manage clusters using the Databricks Clusters UI.
4. Develop and run code in multi-cell Databricks notebooks using basic
operations.
5. Integrate git support using Databricks Repos.

11
Module Overview
Get Started with Databricks Data Science and Engineering Workspace

Databricks Workspace and Services


Navigate the Workspace UI
Compute Resources
DE 1.1 - Create and Manage Interactive Clusters
Develop Code with Notebooks & Databricks Repos
DE 1.2 - Databricks Notebook Operations
DE 1.3L - Get Started with the Databricks Platform Lab

12
Databricks Workspace and Services

Control Plane: Web App; Unity Catalog (Metastore, Access Control, Data Lineage/Explorer); Cluster Manager; Workflow Manager; Notebooks, Repos, DBSQL
Data Plane: Workspace clusters; SQL Warehouses; Jobs; Cloud Storage
Demo:
Navigate the Workspace UI
Compute Resources
Clusters
Overview

• A cluster is a collection of VM instances: a driver VM instance plus worker VM instances
• The driver distributes workloads across the workers
• Workloads (notebooks, jobs, pipelines) run on clusters
• Two main types:
  1. All-purpose clusters for interactive development
  2. Job clusters for automating workloads

17
Cluster Types

All-purpose Clusters
• Analyze data collaboratively using interactive notebooks
• Create clusters from the Workspace or API
• Configuration information retained for up to 70 clusters for up to 30 days

Job Clusters
• Run automated jobs
• The Databricks job scheduler creates job clusters when running jobs
• Configuration information retained for up to 30 most recently terminated clusters

18
Cluster Configuration
Cluster Mode
Standard (Multi Node)
Default mode for workloads developed in any supported language (requires
at least two VM instances)
Single node
Low-cost single-instance cluster catering to single-node machine learning
workloads and lightweight exploratory analysis
Databricks Runtime Version
Standard
Apache Spark and many other components and updates that provide an optimized big data analytics experience
Machine learning
Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and
XGBoost.
Photon
An optional add-on to optimize SQL workloads

21
Access Mode

• Single user — Visible to user: Always. Unity Catalog support: Yes. Supported languages: Python, SQL, Scala, R.
• Shared — Visible to user: Always (Premium plan required). Unity Catalog support: Yes. Supported languages: Python (DBR 11.1+), SQL.
• No isolation shared — Visible to user: Can be hidden by enforcing user isolation in the admin console or configuring account-level settings. Unity Catalog support: No. Supported languages: Python, SQL, Scala, R.
• Custom — Visible to user: Only shown for existing clusters without access modes (i.e. legacy cluster modes, Standard or High Concurrency); not an option for creating new clusters. Unity Catalog support: No. Supported languages: Python, SQL, Scala, R.
Cluster Policies
Cluster policies can help to achieve the following:
• Standardize cluster configurations
• Provide predefined configurations targeting specific use cases
• Simplify the user experience
• Prevent excessive use and control cost
• Enforce correct tagging
Cluster Access Control

• No Permissions — no cluster abilities
• Can Attach To — attach notebook; view Spark UI, cluster metrics, and driver logs
• Can Restart — everything in Can Attach To, plus start, restart, and terminate
• Can Manage — everything in Can Restart, plus edit, attach library, resize, and change permissions

24
Demo:
DE 1.1: Create and Manage
Interactive Clusters
Develop Code with Notebooks
& Databricks Repos
Databricks Notebooks
Collaborative, reproducible, and enterprise ready

• Multi-language — use Python, SQL, Scala, and R, all in one notebook
• Collaborative — real-time co-presence, co-editing, and commenting
• Ideal for exploration — explore, visualize, and summarize data with built-in charts and data profiles
• Adaptable — install standard libraries and use local modules
• Reproducible — automatically track version history, and use git version control with Repos
• Get to production faster — quickly schedule notebooks as jobs or create dashboards from their results, all in the notebook
• Enterprise-ready — enterprise-grade access controls, identity management, and auditability

27
Notebook magic commands
Use to override default languages, run utilities/auxiliary commands, etc.

%python, %r, %scala, %sql Switch languages in a command cell


%sh Run shell code (only runs on driver node, not worker nodes)
%fs Shortcut for dbutils filesystem commands
%md Markdown for styling the display
%run Run another notebook from within the current notebook
%pip Install new Python libraries

28
dbutils (Databricks Utilities)
Perform various tasks with Databricks using notebooks

• fs — manipulates the Databricks filesystem (DBFS) from the console. Example: dbutils.fs.ls()
• secrets — provides utilities for leveraging secrets within notebooks. Example: dbutils.secrets.get()
• notebook — utilities for the control flow of a notebook. Example: dbutils.notebook.run()
• widgets — methods to create and get bound values of input widgets inside notebooks. Example: dbutils.widgets.text()
• jobs — utilities for leveraging jobs features. Example: dbutils.jobs.taskValues.set()

Available within Python, R, or Scala notebooks
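For reference, a minimal sketch of how these utilities are typically called from a Python notebook cell; the paths, secret scope, and widget names below are hypothetical examples.

# List files under a DBFS path (equivalent to the %fs ls magic)
display(dbutils.fs.ls("/databricks-datasets"))

# Read a secret (assumes a scope named "demo-scope" with key "db-password" already exists)
password = dbutils.secrets.get(scope="demo-scope", key="db-password")

# Create a text widget and read its current value
dbutils.widgets.text("input_path", "/databricks-datasets", "Input path")
input_path = dbutils.widgets.get("input_path")

# Run another notebook and capture its exit value (60-second timeout)
result = dbutils.notebook.run("./child_notebook", 60, {"input_path": input_path})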

29
Git Versioning with Databricks Repos

Databricks Repos

• Git Versioning — native integration with GitHub, GitLab, Bitbucket, and Azure DevOps; UI-based workflows
• CI/CD Integration — API surface to integrate with automation; simplifies the dev/staging/prod multi-workspace story
• Enterprise ready — allow lists to avoid exfiltration; secret detection to avoid leaking keys

31
Databricks Repos
CI/CD Integration

Control Plane in Databricks (manages customer accounts, datasets, and clusters): Databricks Web Application, Repos / Notebooks, Jobs, Cluster Management, Repos Service
Git and CI/CD Systems: Version, Review, Test
32
CI/CD workflows with Git and Repos
Documentation

Admin workflow (in Databricks):
• Set up top-level Repos folders (example: Production)
• Set up Git automation to update Repos on merge

User workflow (in Databricks):
• Clone remote repository to user folder
• Create new branch based on main branch
• Create and edit code
• Commit and push to feature branch

Merge workflow (in Git provider):
• Pull request and review process
• Merge into main branch
• Git automation calls Databricks Repos API

Production job workflow (in Databricks):
• API call brings Repo in Production folder to latest version
• Run Databricks job based on Repo in Production folder
Demo:
DE 1.2: Databricks Notebook
Operations
Lab:
DE 1.3L: Get Started with the
Databricks Platform
Module 02
Transform Data with Spark
Module Objectives
Transform Data with Spark

1. Extract data from a variety of file formats and data sources using Spark
2. Apply a number of common transformations to clean data using Spark
3. Reshape and manipulate complex data using advanced built-in functions
in Spark
4. Leverage UDFs for reusable code and apply best practices for
performance in Spark

37
Module Agenda
Transform Data with Spark

Data Objects in the Lakehouse


DE 2.1 - Querying Files Directly
DE 2.2 - Options for External Sources
DE 2.3L - Extract Data Lab
DE 2.4 - Cleaning Data
DE 2.5 - Complex Transformations
DE 2.6L - Reshape Data Lab
DE 2.7A – SQL UDFs and Control Flow
DE 2.7B - Python UDFs
38
Data Objects in the Lakehouse
Data objects in the Lakehouse

Metastore
  Catalog
    Schema (Database)
      Table, View, Function
Data objects in the Lakehouse

Tables come in two flavors within a schema: managed tables and external tables.
43
Managed Tables

Managed tables store their data in the metastore-managed storage location:
Metastore > Catalog > Schema > Managed table > Metastore storage
44
External Tables

External tables store their data in an external storage location, accessed through a storage credential and an external location:
Metastore > Catalog > Schema > External table > External storage (via Storage credential + External location)
45
Data objects in the Lakehouse

Views: in addition to views registered in a schema, Databricks supports Temporary Views and Global Temporary Views.
Data objects in the Lakehouse

Functions: user-defined functions are registered in a schema alongside tables and views.

48
Extracting Data
Query files directly
SELECT * FROM file_format.`path/to/file`

Files can be queried directly using SQL


• SELECT * FROM json.`path/to/files/`
• SELECT * FROM text.`path/to/files/`

Process based on specified file format


• json pulls schema from underlying data
• binaryFile and text file formats have fixed data schemas
• text → string value column (row for each line)
• binaryFile → path, modificationTime, length, content columns (row for each file)
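As a quick illustration, here is a minimal sketch of querying files directly from a Python notebook with spark.sql, assuming a directory of JSON files at the sample path shown.

# Query JSON files directly by path; Spark pulls the schema from the underlying data
events_df = spark.sql("SELECT * FROM json.`/databricks-datasets/structured-streaming/events/`")

# Query the same files as raw text: one `value` column, one row per line
raw_df = spark.sql("SELECT * FROM text.`/databricks-datasets/structured-streaming/events/`")

display(events_df.limit(5))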

50
Configure external tables with read options
CREATE TABLE USING data_source OPTIONS (...)

Many data sources require schema declaration and other options to


correctly read data
• CSV options for delimiter, header, etc
• JDBC options for url, user, password, etc
• Note: using the JDBC driver pulls RDBMS tables dynamically for Spark processing

51
Demo:
DE 2.1: Querying Files Directly
Demo:
DE 2.2: Providing Options for
External Sources
Lab:
DE 2.3L: Extract Data Lab
Lab:
DE 2.4: Cleaning Data
Complex Transformations
Interact with Nested Data
Use built-in syntax to traverse nested data with Spark SQL

Use “:” (colon) syntax in queries to access subfields in JSON strings

SELECT value:device, value:geo ...

Use “.” (dot) syntax in queries to access subfields in STRUCT types

SELECT value.device, value.geo ...

57
Complex Types
Nested data types storing multiple values

• Array: arbitrary number of elements of same data type

• Map: set of key-value pairs

• Struct: ordered (fixed) collection of column(s) and any data type

Example table with complex types

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street: STRING, city: STRING, state: STRING, zip: INT>)

58
Demo:
DE 2.5: Complex
Transformations
explode lab

SELECT
  user_id, event_timestamp, event_name,
  explode(items) AS item
FROM events

explode outputs the elements of an array field into a separate row for each element: each item in the items array is exploded into its own row, so a single input row with three items produces three output rows.
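The same transformation in PySpark, as a minimal sketch; the column names follow the example above and the sample data is hypothetical.

from pyspark.sql import Row, functions as F

# Hypothetical sample data shaped like the events table above
events_df = spark.createDataFrame(
    [("U1", "2020-06-01 10:00:00", "cart",
      [Row(item_id="A"), Row(item_id="B"), Row(item_id="C")])],
    "user_id STRING, event_timestamp STRING, event_name STRING, items ARRAY<STRUCT<item_id: STRING>>",
)

# explode() produces one output row per element of the items array
exploded_df = events_df.select(
    "user_id", "event_timestamp", "event_name", F.explode("items").alias("item")
)
display(exploded_df)  # one input row with three items -> three output rows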

60
flatten lab
collect_set returns an array of unique values from a field for each group of rows
flatten returns an array that flattens multiple arrays into one

SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id

61
Collection example

• collect_set returns an array with duplicate elements eliminated
• collect_list returns an array with duplicate elements intact

df.agg(collect_set('age'))    df.agg(collect_list('age'))

62
Parse JSON strings into structs
Create the schema used to parse the JSON strings by providing an example JSON string from a row that has no nulls.

from_json uses the JSON schema returned by schema_of_json to convert a column of JSON strings into structs. The example JSON string is taken from the value field of a single row of data, and the result is a STRUCT column containing an ARRAY of nested STRUCTs.
63
Lab:
DE 2.5L: Reshape Data Lab
(Optional)
Demo:
DE 2.7A: SQL UDFs and
Control Flow (Optional)
Demo:
DE 2.7B: Python UDFs
(Optional)
Module 03
Manage Data with Delta Lake
Module Agenda
Manage Data with Delta Lake

What is Delta Lake


DE 3.1 - Schemas and Tables
DE 3.2 - Version and Optimize Delta Tables
DE 3.3L - Manipulate Delta Tables Lab
DE 3.4 - Set Up Delta Tables
DE 3.5 - Load Data into Delta Lake
DE 3.6L - Load Data Lab

68
What is Delta Lake?
Delta Lake is an open-source
project that enables building a
data lakehouse on top of
existing cloud storage

70
Delta Lake Is Not…
• Proprietary technology
• Storage format
• Storage medium
• Database service or data warehouse
Delta Lake Is…
• Open source
• Builds upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake brings ACID to object storage
Atomicity means all transactions either succeed or fail completely

Consistency guarantees relate to how a given state of the data is observed by


simultaneous operations

Isolation refers to how simultaneous operations conflict with one another. The
isolation guarantees that Delta Lake provides do differ from other systems

Durability means that committed changes are permanent


Problems solved by ACID
• Hard to append data
• Modification of existing data difficult
• Jobs failing mid way
• Real-time operations hard
• Costly to keep historical data versions
Delta Lake is the default format for tables created in Databricks

SQL:    CREATE TABLE foo USING DELTA
Python: df.write.format("delta")
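A minimal sketch expanding the Python side; the DataFrame and table name are hypothetical, and because Delta is already the default format on Databricks, .format("delta") is shown only for clarity.

# Hypothetical DataFrame
df = spark.range(10).withColumnRenamed("id", "order_id")

# Write it out as a Delta table
df.write.format("delta").mode("overwrite").saveAsTable("demo_orders")

# Read it back
display(spark.table("demo_orders"))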

75
Demo:
DE 3.1: Schemas and Tables
Demo:
DE 3.2: Version and Optimize
Delta Tables
Lab:
DE 3.3L - Manipulate Delta
Tables Lab
Demo:
DE 3.4: Set up Delta Tables
Demo:
DE 3.5: Load Data into Delta
Tables
Lab:
DE 3.6L: Load Data
Module 04
Build Data Pipelines with Delta Live Tables
Agenda
Build Data Pipelines with Delta Live Tables

The Medallion Architecture


Introduction to Delta Live Tables
DE 4.1 - DLT UI Walkthrough
DE 4.1A - SQL Pipelines
DE 4.1B - Python Pipelines
DE 4.2 - Python vs SQL
DE 4.3 - Pipeline Results
DE 4.4 - Pipeline Event Logs
83
The Medallion Architecture

Medallion Architecture in the Lakehouse

Sources (Kinesis, CSV/JSON/TXT files, the data lake) flow into:
• Bronze — raw ingestion and history
• Silver — filtered, cleaned, augmented
• Gold — business-level aggregates
which feed streaming analytics, BI & reporting, and data science & ML, with Data Quality & Governance and Data Sharing applied across the layers.

85
Multi-Hop in the Lakehouse
Bronze Layer

• Typically just a raw copy of ingested data
• Replaces traditional data lake
• Provides efficient storage and querying of full, unprocessed history of data

86
Multi-Hop in the Lakehouse
Silver Layer

• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves grain of original data (without aggregations)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks, corrupt data quarantined

87
Multi-Hop in the Lakehouse
Gold Layer

• Powers ML applications, reporting, dashboards, ad hoc analytics
• Refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data

88
Introduction to Delta Live
Tables
Multi-Hop in the Lakehouse

CSV, JSON, and TXT sources are ingested with Databricks Auto Loader into Bronze (raw ingestion and history), refined into Silver (filtered, cleaned, augmented), and aggregated into Gold (business-level aggregates), which powers streaming analytics, AI, and reporting, with data quality enforced along the way.
The Reality is Not so Simple

Large scale ETL is complex and brittle:
• Complex pipeline development — hard to build and maintain table dependencies; difficult to switch between batch and stream processing
• Data quality and governance — difficult to monitor and enforce data quality; impossible to trace data lineage
• Difficult pipeline operations — poor observability at granular, data level; error handling and recovery is laborious
92
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake

• Operate with agility — declarative tools to build batch and streaming data pipelines
• Trust your data — DLT has built-in declarative quality controls; declare quality expectations and actions to take
• Scale with reliability — easily scale infrastructure alongside your data

93
What is a LIVE TABLE?
What is a Live Table?
Live Tables are materialized views for the lakehouse.

A live table is:
• Defined by a SQL query
• Created and kept up-to-date by a pipeline

Live tables provide tools to:
• Manage dependencies
• Control quality
• Automate operations
• Simplify collaboration
• Save costs
• Reduce latency

CREATE OR REFRESH LIVE TABLE report
AS SELECT sum(profit)
FROM prod.sales
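For comparison, a minimal sketch of the equivalent table definition in a Python DLT notebook; the source table name follows the SQL example and is hypothetical.

import dlt
from pyspark.sql import functions as F

@dlt.table(name="report", comment="Aggregated profit, kept up to date by the pipeline")
def report():
    # Same logic as the SQL example: sum of profit over prod.sales
    return spark.read.table("prod.sales").agg(F.sum("profit").alias("total_profit"))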

95
What is a Streaming Live Table?
Based on Spark Structured Streaming

A streaming live table is "stateful":
• Ensures exactly-once processing of input rows
• Inputs are only read once

• Streaming live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)
• Streaming live tables allow you to reduce costs and latency by avoiding reprocessing of old data

CREATE STREAMING LIVE TABLE report
AS SELECT sum(profit)
FROM cloud_files(prod.sales)

96
When should I use streaming?

Using Spark Structured Streaming for ingestion
Easily ingest files from cloud storage as they are uploaded

This example creates a table with all the JSON data stored in "/data":

CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/data", "json")

• cloud_files keeps track of which files have been read to avoid duplication and wasted work
• Supports both listing and notifications for arbitrary scale
• Configurable schema inference and schema evolution
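A minimal sketch of the same ingestion pattern in Python; note that the Python API uses the "cloudFiles" source name, and the /data path is a hypothetical example.

import dlt

@dlt.table(name="raw_data", comment="All JSON data ingested incrementally from /data")
def raw_data():
    return (
        spark.readStream.format("cloudFiles")     # Auto Loader source
        .option("cloudFiles.format", "json")      # format of the incoming files
        .load("/data")                            # hypothetical input path
    )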

98
Using the SQL STREAM() function
Stream data from any Delta table

CREATE STREAMING LIVE TABLE mystream
AS SELECT *
FROM STREAM(my_table)

• STREAM(my_table) reads a stream of new records, instead of a snapshot
• The streamed table must be an append-only table
• Any append-only Delta table can be read as a stream (i.e. from the LIVE schema, from the catalog, or just from a path)

Pitfall: my_table must be an append-only source. For example, it may not:
• be the target of APPLY CHANGES INTO
• define an aggregate function
• be a table on which you've executed DML to delete/update a row (see GDPR section)
99
How do I use DLT?

Creating Your First Live Table Pipeline
SQL to DLT in three easy steps…

1. Write CREATE LIVE TABLE — table definitions are written (but not run) in notebooks; Databricks Repos allow you to version control your table definitions.
2. Create a pipeline — a pipeline picks one or more notebooks of table definitions, as well as any configuration required.
3. Click start — DLT will create or update all the tables in the pipeline.

101
BEST PRACTICE

Development vs Production
Fast iteration or enterprise grade reliability

Development Mode
• Reuses a long-running cluster for fast iteration
• No retries on errors, enabling faster debugging

Production Mode
• Cuts costs by turning off clusters as soon as they are done (within 5 minutes)
• Escalating retries, including cluster restarts, ensure reliability in the face of transient issues

The mode is selected in the Pipelines UI.

102
What if I have dependent tables?

Declare LIVE Dependencies
Using the LIVE virtual schema

CREATE LIVE TABLE events
AS SELECT … FROM prod.raw_data

CREATE LIVE TABLE report
AS SELECT … FROM LIVE.events

• Dependencies owned by other producers are just read from the catalog or Spark data source as normal
• LIVE dependencies, from the same pipeline, are read from the LIVE schema
• DLT detects LIVE dependencies and executes all operations in correct order
• DLT handles parallelism and captures the lineage of the data (events → report)
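A minimal sketch of the same dependency in Python; within a pipeline, dlt.read() plays the role of the LIVE schema, and prod.raw_data and the event_name column are hypothetical.

import dlt

@dlt.table
def events():
    # External dependency: read from the catalog as normal
    return spark.read.table("prod.raw_data")

@dlt.table
def report():
    # LIVE dependency within the same pipeline, read with dlt.read()
    return dlt.read("events").groupBy("event_name").count()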

104
How do I ensure Data Quality?

BEST PRACTICE

Ensure correctness with Expectations

Expectations are tests that ensure data quality in production. They are true/false expressions used to validate each row during processing.

SQL:
CONSTRAINT valid_timestamp
EXPECT (timestamp > '2012-01-01')
ON VIOLATION DROP

Python:
@dlt.expect_or_drop(
  "valid_timestamp",
  col("timestamp") > '2012-01-01')

DLT offers flexible policies on how to handle records that violate expectations:
• Track number of bad records
• Drop bad records
• Abort processing for a single bad record

106
What about operations?
Pipelines UI
A one stop shop for ETL debugging and operations

• Visualize data flows between tables
• Discover metadata and quality of each table
• Access to historical updates
• Control operations
• Dive deep into events
112
The Event Log
The event log automatically records all pipeline operations.

• Operational Statistics — time and current status for all operations, pipeline and cluster configurations, row counts
• Provenance — table schemas, definitions, and declared properties; table-level lineage; query plans used to update tables
• Data Quality — expectation pass / failure / drop statistics; input/output rows that caused expectation failures
113
How can I use parameters?

Modularize your code with configuration
Avoid hard coding paths, topic names, and other constants in your code.

A pipeline's configuration is a map of key-value pairs that can be used to parameterize your code:
• Improve code readability/maintainability
• Reuse code in multiple pipelines for different data

SQL:
CREATE STREAMING LIVE TABLE data AS
SELECT * FROM cloud_files("${my_etl.input_path}", "json")

Python:
@dlt.table
def data():
    input_path = spark.conf.get("my_etl.input_path")
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(input_path))

115
How can I do change data capture (CDC)?

APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

• APPLY CHANGES INTO LIVE.cities — a target for the changes to be applied to
• FROM STREAM(LIVE.city_updates) — a source of changes; currently this has to be a stream
• KEYS (id) — a unique key that can be used to identify a given row
• SEQUENCE BY ts — a sequence that can be used to order changes, such as a log sequence number (LSN), a timestamp, or ingestion time

The statement applies inserts, updates, and deletes to maintain an up-to-date snapshot. For example, given city_updates records {"id": 1, "ts": 100, "city": "Bekerly, CA"} and {"id": 1, "ts": 200, "city": "Berkeley, CA"}, the cities row for id 1 ends up as "Berkeley, CA".
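A minimal sketch of the equivalent Python API; the table, key, and sequence names follow the example above. Depending on the DLT release, the target may be declared with dlt.create_streaming_table or the older dlt.create_streaming_live_table.

import dlt

# Target table that APPLY CHANGES will maintain (up-to-date replica)
dlt.create_streaming_table("cities")

dlt.apply_changes(
    target="cities",          # replica to keep up to date
    source="city_updates",    # streaming source of change records
    keys=["id"],              # unique key identifying a row
    sequence_by="ts",         # ordering column (e.g. timestamp or LSN)
)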

122
REFERENCE ARCHITECTURE

Change Data Capture (CDC) from RDBMS
A variety of 3rd party tools can provide a streaming change feed

• Amazon RDS → Amazon DMS to S3 → cloud_files → APPLY CHANGES INTO replicated_table
• MySQL or Postgres → Debezium → APPLY CHANGES INTO replicated_table
• Oracle → Golden Gate → APPLY CHANGES INTO replicated_table
What do I no longer need to manage with DLT?

Automated Data Management
DLT automatically optimizes data for performance & ease-of-use

Best Practices
• What: DLT encodes Delta best practices automatically when creating DLT tables.
• How: DLT sets the following properties: optimizeWrite, autoCompact, tuneFileSizesForRewrites.

Physical Data
• What: DLT automatically manages your physical data to minimize cost and optimize performance.
• How: runs vacuum daily; runs optimize daily. You can still tell DLT how you want the data organized (i.e. ZORDER).

Schema Evolution
• What: schema evolution is handled for you.
• How: modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.

125
Demo:
DE 4.1 - Using the Delta Live
Tables UI
Demo:
DE 4.1.1 - Fundamentals of DLT
Syntax
Demo:
DE 4.1.2 - More DLT SQL
Syntax
Demo:
DE 4.2 - Delta Live Tables:
Python vs SQL
Demo:
DE 4.3 - Exploring the Results
of a DLT Pipeline
Demo:
DE 4.4 - Exploring the Pipeline
Events Logs
Lab:
DE 4.1.3 - Troubleshooting DLT
Syntax Lab
Module 05
Deploy Workloads with Databricks Workflows
Module Agenda
Deploy Workloads with Databricks Workflows

Introduction to Workflows
Building and Monitoring Workflow Jobs
DE 5.1 - Scheduling Tasks with the Jobs UI
DE 5.2L - Jobs Lab

134
Introduction to Workflows
Lesson Objectives

1 Describe the main features and use cases of Databricks Workflows

2 Create a task orchestration workflow composed of various task types

3 Utilize monitoring and debugging features of Databricks Workflows

4 Describe workflow best practices


Databricks Workflows

Databricks Workflows
Workflows is a fully-managed, cloud-based, general-purpose task orchestration service for the entire Lakehouse. It is a service for data engineers, data scientists, and analysts to build reliable data, analytics, and AI workflows on any cloud.

Lakehouse Platform: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML — built on Unity Catalog (fine-grained governance for data and AI), Delta Lake (data reliability and performance), and the cloud data lake (all structured and unstructured data).

137
Databricks Workflows

Databricks has two main task orchestration


services:
• Workflow Jobs (Workflows)
• Workflows for every job
• Delta Live Tables (DLT)
• Automated data pipelines for Delta Lake

Note: DLT pipeline can be a task in a workflow

138
DLT versus Jobs
Considerations

• Source — Delta Live Tables: notebooks only. Workflow Jobs: JARs, notebooks, DLT, applications written in Scala, Java, Python.
• Dependencies — Delta Live Tables: automatically determined. Workflow Jobs: manually set.
• Cluster — Delta Live Tables: self-provisioned. Workflow Jobs: self-provisioned or existing.
• Timeouts and Retries — Delta Live Tables: not supported. Workflow Jobs: supported.
• Import Libraries — Delta Live Tables: not supported. Workflow Jobs: supported.

139
DLT versus Jobs
Use Cases

• Orchestration of Dependent Jobs — jobs running on a schedule, containing dependent tasks/steps → Jobs Workflows
• Machine Learning Tasks — run an MLflow notebook task in a job → Jobs Workflows
• Arbitrary Code, External API Calls, Custom Tasks — run tasks in a job which can contain a JAR file, Spark Submit, Python script, SQL task, dbt → Jobs Workflows
• Data Ingestion and Transformation — ETL jobs, support for batch and streaming, built-in data quality constraints, monitoring & logging → Delta Live Tables

140
Workflows Features
Part 1 of 2

• Orchestrate Anything Anywhere — run diverse workloads for the full data and AI lifecycle, on any cloud. Orchestrate notebooks, Delta Live Tables, jobs for SQL, ML models, and more.
• Fully Managed — remove operational overhead with a fully managed orchestration service, enabling you to focus on your workflows, not on managing your infrastructure.
• Simple Workflow Authoring — an easy point-and-click authoring experience for all your data teams, not just those with specialized skills.

141
Workflows Features
Part 2 of 2

• Deep Platform Integration — designed and built into your lakehouse platform, giving you deep monitoring capabilities and centralized observability across all your workflows.
• Proven Reliability — have full confidence in your workflows, leveraging our proven experience running tens of millions of production workloads daily across AWS, Azure, and GCP.

142
How to Leverage Workflows
• Allows you to build simple ETL/ML task orchestration
• Reduces infrastructure overhead
• Easily integrate with external tools
• Enables non-engineers to build their own workflows using simple UI
• Cloud-provider independent
• Enables re-using clusters to reduce cost and startup time

143
Common Workflow Patterns

• Sequence — data transformation/processing/cleaning; bronze/silver/gold tables
• Funnel — multiple data sources; data collection
• Fan-out — single data source; data ingestion and distribution (fan-out, star pattern)

144
Example Workflow

Data ingestion funnel


E.g. Auto Loader, DLT

Data filtering, quality assurance, transformation


E.g. DLT, SQL, Python

ML feature extraction
E.g. MLflow

Persisting features and training prediction model

145
Building and Monitoring
Workflow Jobs
Workflows Job Components

TASKS SCHEDULE CLUSTER

What? When? How?

147
Creating a Workflow
Task Definition

While creating a task:
• Define the task type
• Choose the cluster type
  • Job clusters and all-purpose clusters can be used
  • A cluster can be used by multiple tasks; this reduces cost and startup time
  • If you want to create a new cluster, you must have the required permissions
• Define task dependencies if the task depends on another task
Monitoring and Debugging
Scheduling and Alerts

You can run your jobs immediately or periodically through an easy-to-use scheduling system.

You can specify alerts to be notified when runs of a job begin, complete, or fail. Notifications can be sent via email, Slack, or AWS SNS.
Monitoring and Debugging
Access Control

Workflows integrates with existing resource access controls, enabling you to easily manage access across different teams.
Monitoring and Debugging
Job Run History

Workflows keeps track of job runs and saves information about the success or failure, and run duration, of each task in the job run.

Navigate to the Runs tab to view completed or active runs for a job.
151
Monitoring and Debugging
Repair a Failed Job Run

The repair feature allows you to re-run only the failed task and its sub-tasks, which reduces the time and resources required to recover from unsuccessful job runs.
Navigating the Jobs UI
Use breadcrumbs to navigate back to your job from a specific run page

153
Navigating the Jobs UI
Runs vs Tasks tabs on the job page

Use Runs tab to view completed or Use Tasks tab to modify or add
active runs for the job tasks to the job

154
Demo:
DE 5.1.1: Task Orchestration
Demo: Task Orchestration
DE 5.1.1 - Task Orchestration

• Schedule a notebook task in a Databricks Workflow Job

• Describe job scheduling options and differences between cluster types

• Review Job Runs to track progress and see results

• Schedule a DLT pipeline task in a Databricks Workflow Job

• Configure dependency between tasks via Databricks Workflows UI

156
Lab:
DE 5.2.1.L: Task Orchestration
Lab: Task Orchestration
DE 5.2.1.L - Task Orchestration

158
Module 06
Manage Data Access for Analytics with Unity Catalog
Lesson Objectives
By the end of this course, you will be able to:

1. Describe Unity Catalog key concepts and how it integrates with the
Databricks platform
2. Access Unity Catalog through clusters and SQL warehouses
3. Create and govern data assets in Unity Catalog
4. Adopt Databricks recommendations into your organization’s Unity
Catalog-based solutions

160
Module Agenda
Manage Data Access for Analytics with Unity Catalog

Introduction to Unity Catalog
DE 6.1 - Introduction to Unity Catalog
DE 6.2 - Overview of Data Governance
DE 6.3 - Unity Catalog Key Concepts
DE 6.4 - Unity Catalog Architecture
DE 6.5 - Unity Catalog Identities
DE 6.6 - Managing Principals in Unity Catalog
DE 6.7 - Managing Catalog Metastores

Compute Resources in Unity Catalog
DE 6.8 - Compute Resources
DE 6.9 - Creating Compute Resources

Data Access Control in Unity Catalog
DE 6.10 - Data Access Control in Databricks
DE 6.11 - Security Model
DE 6.12 - External Storage
DE 6.13 - Creating and Governing Data
DE 6.14 - Create and Share Tables
DE 6.15 - Create External Tables

Unity Catalog Best Practices
DE 6.16 - Best Practices
DE 6.17 - Data Segregation
DE 6.18 - Identity Management
DE 6.19 - External Storage
DE 6.20 - Upgrade a Table to Unity Catalog
DE 6.21 - Create Views and Limiting Table Access

161
Introduction to Unity Catalog
Overview of
Data Governance
80% of organizations seeking to scale
digital business will fail because they do not
take a modern approach to data and analytics
governance

Source: Gartner

164
Data Governance
Four key functional areas

• Data Access Control — control who has access to which data
• Data Access Audit — capture and record all access to data
• Data Lineage — capture upstream sources and downstream consumers
• Data Discovery — ability to search for and discover authorized assets

165
Governance for data, analytics and AI is complex

• Data Lake — permissions on files; no row- and column-level permissions; inflexible when policies change
• Metadata — permissions on tables and views; can be out of sync with data
• Data Warehouse — permissions on tables, columns, rows; a different governance model
• ML and AI — permissions on ML models, dashboards, features, …; yet another governance model

Data analysts, data engineers, and data scientists each have to work across these inconsistent models.
Databricks Unity Catalog
Unified governance for data, analytics and AI

Unity Catalog provides a single governance layer across the data lake, metadata, the data warehouse, and ML and AI assets, for data analysts, data engineers, and data scientists alike.

167
Unity Catalog
Overview

1. Unified governance across clouds — fine-grained governance for data lakes across clouds, based on the open standard ANSI SQL.
2. Unified data and AI assets — centrally share, audit, secure and manage all data types with one simple interface.
3. Unified existing catalogs — works in concert with existing data, storage, and catalogs; no hard migration required.
168
Unity Catalog
Key Capabilities

● Centralized metadata and user management
● Centralized data access controls (GRANT … ON … TO …, REVOKE … ON … FROM …)
● Data access auditing
● Data lineage
● Data search and discovery
● Secure data sharing with Delta Sharing

Unity Catalog sits above Databricks workspaces and governs catalogs, databases (schemas), tables, views, storage credentials, and external locations.

169
Unity Catalog
Key Concepts

Metastore
Unity Catalog metastore elements

A metastore (in the control plane) contains:
• Catalogs, which contain schemas (databases), which contain tables, views, and functions
• Storage credentials and external locations, which reference cloud storage
• Shares and recipients (Delta Sharing)

171
Metastore
Accessing legacy Hive metastore

Each workspace's legacy Hive metastore appears as a catalog named hive_metastore alongside Unity Catalog catalogs, with the same schema (database) > table / view / function hierarchy inside.

172
Catalog
Top-level container for data objects

Within the metastore, a catalog is the top-level container: it holds schemas (databases), which in turn hold tables, views, and functions.

173
Catalog
Three-level namespace

Traditional SQL two-level Unity Catalog three-level


namespace namespace

SELECT * FROM schema.table SELECT * FROM catalog.schema.table

174
Data Objects
Schema (database), tables, views, functions

Schemas (databases) contain tables (managed or external), views, and functions.

175
External Storage
Storage credentials and external locations

Storage credentials and external locations are metastore-level objects that govern access to cloud storage paths outside the metastore's managed storage.

176
Delta Sharing
Shares and recipients

Shares and recipients are metastore-level objects used to share data securely via Delta Sharing.

177
Unity Catalog Architecture

Architecture
Before Unity Catalog: each workspace had its own user/group management, metastore, and access controls, alongside its compute resources.
With Unity Catalog: user/group management, the metastore, and access controls are centralized in Unity Catalog at the account level, while each workspace keeps its own compute resources.

179
Query Lifecycle
Unity Catalog Security Model

1. A principal (user or service principal) sends a query from a cluster or SQL warehouse.
2. Unity Catalog checks the namespace, metadata, and grants (recording to the audit log).
3. Unity Catalog assumes the IAM role or service principal for the underlying storage.
4. Unity Catalog returns a short-lived token and signed URL to the compute resource.
5. The compute resource requests data from the URL with the short-lived token.
6. Cloud storage returns the data.
7. Policies are enforced on the compute resource.
8. The result is sent back to the principal.

180
Compute Resources and Unity Catalog

Compute Resources for Unity Catalog

Cluster access modes supporting Unity Catalog:
• Single user — multiple language support, not shareable
• Shared — shareable, Python and SQL, legacy table ACLs

Modes not supporting Unity Catalog:
• No isolation shared

182
Cluster Access Mode
Feature matrix

The feature matrix compares access modes (No Isolation Shared, Single user, Shared) across: supported languages, shareable, legacy table ACL, credential passthrough, DBFS FUSE mounts, RDD API, dynamic views, machine learning, and init scripts and libraries. No Isolation Shared and Single user support all languages; Shared supports SQL and Python.

183
Roles and Identities in Unity Catalog

Unity Catalog
Roles

Cloud Admin
• Manages underlying cloud resources: storage accounts/buckets, IAM roles/service principals/managed identities

Identity Admin
• Manages users and groups in the identity provider (IdP)
• Provisions them into the account (with the account admin)

185
Unity Catalog
Roles

Account Admin
• Creates or deletes metastores, assigns metastores to workspaces
• Manages users and groups, integrates with the IdP
• Full access to all data objects

Metastore Admin
• Creates or drops, grants privileges on, and changes ownership of catalogs and other data objects

Data Owner (owns data objects they created)
• Creates nested objects, grants privileges on, and changes ownership of owned objects

186
Unity Catalog
Roles

Workspace Admin
• Manages permissions on workspace assets
• Restricts access to cluster creation
• Adds or removes users
• Elevates users' permissions
• Grants privileges to others
• Changes job ownership

187
Unity Catalog
Identities

• User / Account Administrator — identified by email (for example [email protected]), with first name, last name, password, and optionally an admin role
• Service Principal / Service Principal with administrative privileges — identified by an application ID (GUID), with a name (for example terraform) and optionally an admin role

188
Unity Catalog
Identities

• Groups — for example, an allusers group containing analysts and developers subgroups, whose members are users (e.g. [email protected]) and service principals (e.g. terraform)

189
Unity Catalog
Identity Federation

With identity federation, an identity is defined once at the account level (e.g. [email protected]) and assigned to workspaces, instead of being managed separately as a workspace-level identity in each workspace.


Data Access Control in Unity Catalog

Security model

• Principals: users, service principals, groups, account admins, metastore admins, data owners
• Privileges: CREATE, USAGE, SELECT, MODIFY, CREATE TABLE, READ FILES, WRITE FILES, EXECUTE
• Securables: catalogs, schemas, tables, views, functions, storage credentials, external locations, shares, recipients

192
Privilege Recap
Tables

• Querying tables (SELECT)
• Modifying tables (MODIFY) — data (INSERT, DELETE) and metadata (ALTER)
• Traversing containers (USAGE)

To query or modify a table you need USAGE on its catalog, USAGE on its schema, and SELECT/MODIFY on the table itself.
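A minimal sketch of granting that chain of privileges from a Python notebook; the catalog, schema, table, and group names are hypothetical examples.

# Grant the chain of privileges needed for a group to query one table
spark.sql("GRANT USAGE ON CATALOG main TO `analysts`")
spark.sql("GRANT USAGE ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Review what has been granted on the table
display(spark.sql("SHOW GRANTS ON TABLE main.sales.orders"))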

201
Privilege Recap
Views

• Views abstract complex queries (aggregations, transformations, joins, filters) and enable enhanced table access control
• Querying views (SELECT)
• Traversing containers (USAGE)

To query a view you need USAGE on its catalog and schema plus SELECT on the view itself.

202
Privilege Recap
Functions

• Functions provide custom code via user-defined functions
• Using functions (EXECUTE)
• Traversing containers (USAGE)

To call a function you need USAGE on its catalog and schema plus EXECUTE on the function itself.

203
Dynamic Views

• Limit access to columns — omit column values from output
• Limit access to rows — omit rows from output
• Data masking — obscure data (for example, ●●●●●●@databricks.com)

Behavior can be conditional on a specific user/service principal or on group membership, through Databricks-provided functions.
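A minimal sketch of a dynamic view with column masking and row filtering; the catalog, schema, table, view, column, and group names are hypothetical, and is_account_group_member() is one of the Databricks-provided functions referred to above.

# Mask the email column for non-admins and restrict rows for everyone else
spark.sql("""
  CREATE OR REPLACE VIEW main.sales.customers_redacted AS
  SELECT
    customer_id,
    CASE WHEN is_account_group_member('admins') THEN email
         ELSE '●●●●●●' END AS email,
    region
  FROM main.sales.customers
  WHERE is_account_group_member('admins')
     OR region = 'US'
""")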

204
Creating New Objects

• Creating new objects (CREATE)
• Traversing containers (USAGE)

To create a new table, view, or function you need USAGE on the catalog plus USAGE and CREATE on the schema.

205
Deleting Objects

• Dropping objects (DROP) removes a table, view, or function from its schema in the catalog hierarchy.

206
Unity Catalog External Storage
Storage Credentials and External Locations

Storage Credential — enables Unity Catalog to connect to an external cloud storage location. Examples include an IAM role for AWS S3 and a service principal for Azure Storage.

External Location — a cloud storage path plus a storage credential: a self-contained object for accessing specific locations in cloud storage, with fine-grained control over external storage.
Storage Credentials and External Locations
Access Control

Storage Credential privileges:
• CREATE TABLE — create an external table directly using this storage credential
• READ FILES — read files directly using this storage credential
• WRITE FILES — write files directly using this storage credential

External Location privileges:
• CREATE TABLE — create an external table from files governed by this external location
• READ FILES — read files governed by this external location
• WRITE FILES — write files governed by this external location
209
Managed Tables

Metastore > Catalog > Schema > Managed table, with data stored in metastore-managed storage.
210
External Tables

Metastore > Catalog > Schema > External table, with data stored in external storage accessed via a storage credential and external location.
211
Unity Catalog Patterns and
Best Practices
UC Patterns & Best Practices
1 metastore per region

Each region gets its own metastore, shared by that region's dev, staging, and prod workspaces (Region A workspaces attach to the Region A metastore, Region B workspaces to the Region B metastore).

213
UC Patterns & Best Practices
Share data with Delta Sharing

To make data available across regions, share tables from the Region A metastore with the Region B metastore using Delta Sharing.


214
UC Patterns & Best Practices
Data Segregation

• Use catalogs (not metastores) to segregate data
• Apply permissions appropriately. For example, grant to group B:
  • USAGE on catalog B
  • USAGE on all applicable schemas in catalog B
  • SELECT/MODIFY on applicable tables

215
UC Patterns & Best Practices
Data Segregation — Catalogs

• Environment scope: dev, staging, prod
• Business unit + environment scope: bu1_dev, bu1_staging, bu1_prod
• Sandboxes: team1_sandbox, team2_sandbox

216
UC Patterns & Best Practices
Identity Management

• Account-level identities — manage all identities at the account level; enable UC for workspaces to enable identity federation
• Groups — use groups (e.g. analysts, developers) rather than individual users to assign access and ownership to securable objects
• Service principals — use service principals (e.g. terraform, identified by an application ID) to run production jobs

217
UC Patterns & Best Practices
Storage Credentials and External Locations

• Storage Credential — enables Unity Catalog to connect to an external cloud store; examples include an IAM role for AWS S3 and a service principal for Azure Storage
• External Location — a cloud storage path plus a storage credential: a self-contained object for accessing specific locations in cloud stores, with fine-grained control over external storage
UC Patterns & Best Practices
Storage Credentials and External Locations

A single storage credential can govern the root of a container/bucket, with separate external locations scoping specific paths beneath it (for example users/user1/, users/user2/, tables/, shared/, tmp/).

219
UC Patterns & Best Practices
Managed versus External Tables

Managed Tables
• Metadata lives in the control plane
• Data lives in the metastore-managed storage location
• DROP discards the data
• Delta format only

External Tables
• Metadata lives in the control plane
• Data lives in a user-provided storage location (external to UC)
• DROP leaves the data intact
• Several formats supported (delta, csv, json, avro, parquet, orc, text)
UC Patterns & Best Practices
When to use external tables?

Quick and easy upgrade from external table in Hive metastore


External readers or writers
Requirement for specific storage naming or hierarchy
Infrastructure-level isolation requirements
Non-Delta support requirement

221
Unity Catalog Key Capabilities
Centralized metadata and user management

Unity Catalog Architecture
Before Unity Catalog: each workspace had its own user/group management, metastore, and access controls, alongside its compute resources.
With Unity Catalog: user/group management, the metastore, and access controls are centralized in Unity Catalog at the account level, while each workspace keeps its own compute resources.

223
Centralized Access Controls
Centrally grant and manage access permissions across workloads

Using ANSI SQL DCL:
GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`
GRANT SELECT ON iot.events TO engineers

Using the UI: choose the permission level, select the securable ('Table' = collection of files in S3/ADLS), and sync groups from your identity provider.

224
Three level namespace
Seamless access to your existing metastores

Unity Catalog exposes your existing Hive metastore as the hive_metastore catalog (legacy) alongside Unity Catalog catalogs, each containing databases with managed tables, external tables, and views.

SELECT * FROM main.student.example;          -- <catalog>.<database>.<table>
SELECT * FROM hive_metastore.default.customers;
225
Managed Data Sources & External Locations
Simplify data access management across clouds

Unity Catalog mediates access to cloud storage (S3, ADLS, GCS): managed tables live in the managed container/bucket (managed data sources), while external tables and files live in external containers/buckets governed by external locations and credentials. Access from a user's cluster or SQL warehouse goes through Unity Catalog's access control and is captured in the audit log.
226
Automated lineage for all workloads
End-to-end visibility into how data flows and is consumed in your organization

● Auto-capture runtime data lineage on a Databricks cluster or SQL warehouse
● Track lineage down to the table and column level
● Leverage common permission model from Unity Catalog
● Lineage across tables, dashboards, workflows, notebooks

227
Lineage flow - How it works

● Code (any language) is submitted to a cluster or SQL warehouse, or DLT executes a data flow
● The lineage service analyzes logs emitted from the cluster and pulls metadata from DLT, then assembles column- and table-level lineage
● Lineage is presented to the end user graphically in Databricks, and can be exported via API and imported into other tools (external catalogs such as Alation, Collibra, and Microsoft Purview, FY23Q4)
228
Built-in search and discovery
Accelerate time to value with low latency data discovery

● UI to search for data assets stored in


Unity Catalog
● Unified UI across DSML + DBSQL
● Leverage common permission model
from Unity Catalog

229
An open standard for secure sharing of data assets

Unity Catalog - Architecture

Unity Catalog sits between Databricks workspaces and cloud storage (S3, ADLS, GCS). It combines account-level user management, the metastore, storage credentials, the ACL store, access control, the lineage explorer, data explorer, and the audit log; when a user queries data from a workspace, Unity Catalog authorizes the request before data is read from the container/bucket.

* Unity Catalog will support any data format (table or raw files)
