
Data Engineering with Databricks
Databricks Academy
2023
Meet your instructor
Add instructor name, Add instructor title

• Team: <add>
• Time at Databricks: <add>
• Fun fact: <add>

Replace with instructor photograph
2
Meet your classmates
• Where is everyone joining us from today (city, country)?

3
Meet your classmates
• How long have you been working with Databricks?

4
Meet your classmates
• What has your experience working with Databricks for data
engineering been so far?

5
Meet your classmates
• What are you hoping to get out of this class?

6
Getting Started with the Course

Course Objectives

1. Perform common code development tasks in a data engineering workflow using the Databricks Data Science & Engineering Workspace.
2. Use Spark to extract data from a variety of sources, apply common cleaning transformations, and manipulate complex data to load into Delta Lake.
3. Define and schedule data pipelines that incrementally ingest and process data through multiple tables in the lakehouse using Delta Live Tables.
4. Orchestrate data pipelines with Databricks Workflow Jobs and schedule dashboard updates to keep analytics up-to-date.
5. Configure permissions in Unity Catalog to ensure that users have proper access to databases for analytics and dashboarding.
Agenda
Module Name Duration

Get Started with Databricks Data Science and Engineering Workspace 1 hour, 20 min

Transform Data with Spark (SQL/PySpark) 2 hours, 50 min

Manage Data with Delta Lake 1 hour, 30 min

Build Data Pipelines with Delta Live Tables (SQL/PySpark) 3 hours

Deploy Workloads with Databricks Workflows 1 hour, 10 min

Manage Data Access for Analytics with Unity Catalog 2 hours

● We will take 10 minute breaks about every hour

9
Module 01
Get Started with Databricks Data Science & Engineering Workspace
Module Objectives
Get Started with Databricks Data Science and Engineering Workspace

1. Describe the core components of the Databricks Lakehouse platform.


2. Navigate the Databricks Data Science & Engineering Workspace UI.
3. Create and manage clusters using the Databricks Clusters UI.
4. Develop and run code in multi-cell Databricks notebooks using basic
operations.
5. Integrate git support using Databricks Repos.

11
Module Overview
Get Started with Databricks Data Science and Engineering Workspace

Databricks Workspace and Services


Navigate the Workspace UI
Compute Resources
DE 1.1 - Create and Manage Interactive Clusters
Develop Code with Notebooks & Databricks Repos
DE 1.2 - Databricks Notebook Operations
DE 1.3L - Get Started with the Databricks Platform Lab

12
Databricks Workspace and Services

Control Plane: Web App; Unity Catalog (Metastore, Access Control, Data Lineage/Explorer); Cluster Manager; Workflow Manager; Notebooks, Repos, DBSQL
Data Plane: Workspace clusters; SQL Warehouses; Jobs; Cloud Storage
Demo:
Navigate the Workspace UI
Compute Resources
Clusters
Overview

• A cluster is a collection of VM instances: a driver VM instance plus worker VM instances
• The driver distributes workloads across the workers
• Workloads (notebooks, jobs, pipelines) run on clusters
• Two main types:
  1. All-purpose clusters for interactive development
  2. Job clusters for automating workloads

17
Cluster Types

All-purpose Clusters
• Analyze data collaboratively using interactive notebooks
• Create clusters from the Workspace or API
• Configuration information retained for up to 70 clusters for up to 30 days

Job Clusters
• Run automated jobs
• The Databricks job scheduler creates job clusters when running jobs
• Configuration information retained for up to 30 most recently terminated clusters

18
Cluster Configuration
Cluster Mode
Standard (Multi Node)
Default mode for workloads developed in any supported language (requires
at least two VM instances)
Single node
Low-cost single-instance cluster catering to single-node machine learning
workloads and lightweight exploratory analysis
Databricks Runtime Version
Standard
Apache Spark and many other components and updates that provide an optimized big data analytics experience
Machine learning
Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and
XGBoost.
Photon
An optional add-on to optimize SQL workloads

21
Access Mode

• Single user — Visible to user: Always. Unity Catalog support: Yes. Supported languages: Python, SQL, Scala, R.
• Shared — Visible to user: Always (Premium plan required). Unity Catalog support: Yes. Supported languages: Python (DBR 11.1+), SQL.
• No isolation shared — Visible to user: Can be hidden by enforcing user isolation in the admin console or configuring account-level settings. Unity Catalog support: No. Supported languages: Python, SQL, Scala, R.
• Custom — Visible to user: Only shown for existing clusters without access modes (i.e. legacy cluster modes, Standard or High Concurrency); not an option for creating new clusters. Unity Catalog support: No. Supported languages: Python, SQL, Scala, R.
Cluster Policies
Cluster policies can help to achieve the following:
• Standardize cluster configurations
• Provide predefined configurations targeting specific use cases
• Simplify the user experience
• Prevent excessive use and control cost
• Enforce correct tagging
Cluster Access Control

• No Permissions — no cluster abilities
• Can Attach To — attach notebook; view Spark UI, cluster metrics, and driver logs
• Can Restart — everything in Can Attach To, plus start, restart, and terminate
• Can Manage — everything in Can Restart, plus edit, attach library, resize, and change permissions

24
Demo:
DE 1.1: Create and Manage
Interactive Clusters
Develop Code with Notebooks
& Databricks Repos
Databricks Notebooks
Collaborative, reproducible, and enterprise ready

• Multi-language — use Python, SQL, Scala, and R, all in one notebook
• Collaborative — real-time co-presence, co-editing, and commenting
• Ideal for exploration — explore, visualize, and summarize data with built-in charts and data profiles
• Adaptable — install standard libraries and use local modules
• Reproducible — automatically track version history, and use git version control with Repos
• Get to production faster — quickly schedule notebooks as jobs or create dashboards from their results, all in the notebook
• Enterprise-ready — enterprise-grade access controls, identity management, and auditability

27
Notebook magic commands
Use to override default languages, run utilities/auxiliary commands, etc.

%python, %r, %scala, %sql Switch languages in a command cell


%sh Run shell code (only runs on driver node, not worker nodes)
%fs Shortcut for dbutils filesystem commands
%md Markdown for styling the display
%run Run another notebook from within the current notebook
%pip Install new Python libraries

28
dbutils (Databricks Utilities)
Perform various tasks with Databricks using notebooks

• fs — manipulates the Databricks filesystem (DBFS) from the console. Example: dbutils.fs.ls()
• secrets — provides utilities for leveraging secrets within notebooks. Example: dbutils.secrets.get()
• notebook — utilities for the control flow of a notebook. Example: dbutils.notebook.run()
• widgets — methods to create and get bound values of input widgets inside notebooks. Example: dbutils.widgets.text()
• jobs — utilities for leveraging jobs features. Example: dbutils.jobs.taskValues.set()

Available within Python, R, or Scala notebooks
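For reference, a minimal sketch of how these utilities are typically called from a Python notebook cell; the paths, secret scope, and widget names below are hypothetical examples.

# List files under a DBFS path (equivalent to the %fs ls magic)
display(dbutils.fs.ls("/databricks-datasets"))

# Read a secret (assumes a scope named "demo-scope" with key "db-password" already exists)
password = dbutils.secrets.get(scope="demo-scope", key="db-password")

# Create a text widget and read its current value
dbutils.widgets.text("input_path", "/databricks-datasets", "Input path")
input_path = dbutils.widgets.get("input_path")

# Run another notebook and capture its exit value (60-second timeout)
result = dbutils.notebook.run("./child_notebook", 60, {"input_path": input_path})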

29
Git Versioning with Databricks Repos

Databricks Repos

• Git Versioning — native integration with GitHub, GitLab, Bitbucket, and Azure DevOps; UI-based workflows
• CI/CD Integration — API surface to integrate with automation; simplifies the dev/staging/prod multi-workspace story
• Enterprise ready — allow lists to avoid exfiltration; secret detection to avoid leaking keys

31
Databricks Repos
CI/CD Integration

Control Plane in Databricks (manages customer accounts, datasets, and clusters): Databricks Web Application, Repos / Notebooks, Jobs, Cluster Management, Repos Service
Git and CI/CD Systems: Version, Review, Test
32
CI/CD workflows with Git and Repos
Documentation

Admin workflow (in Databricks):
• Set up top-level Repos folders (example: Production)
• Set up Git automation to update Repos on merge

User workflow (in Databricks):
• Clone remote repository to user folder
• Create new branch based on main branch
• Create and edit code
• Commit and push to feature branch

Merge workflow (in Git provider):
• Pull request and review process
• Merge into main branch
• Git automation calls Databricks Repos API

Production job workflow (in Databricks):
• API call brings Repo in Production folder to latest version
• Run Databricks job based on Repo in Production folder
Demo:
DE 1.2: Databricks Notebook
Operations
Lab:
DE 1.3L: Get Started with the
Databricks Platform
Module 02
Transform Data with Spark
Module Objectives
Transform Data with Spark

1. Extract data from a variety of file formats and data sources using Spark
2. Apply a number of common transformations to clean data using Spark
3. Reshape and manipulate complex data using advanced built-in functions
in Spark
4. Leverage UDFs for reusable code and apply best practices for
performance in Spark

37
Module Agenda
Transform Data with Spark

Data Objects in the Lakehouse


DE 2.1 - Querying Files Directly
DE 2.2 - Options for External Sources
DE 2.3L - Extract Data Lab
DE 2.4 - Cleaning Data
DE 2.5 - Complex Transformations
DE 2.6L - Reshape Data Lab
DE 2.7A – SQL UDFs and Control Flow
DE 2.7B - Python UDFs
38
Data Objects in the Lakehouse
Data objects in the Lakehouse

Metastore
  Catalog
    Schema (Database)
      Table, View, Function
Data objects in the Lakehouse

Tables come in two flavors within a schema: managed tables and external tables.
43
Managed Tables

Managed tables store their data in the metastore-managed storage location:
Metastore > Catalog > Schema > Managed table > Metastore storage
44
External Tables

External tables store their data in an external storage location, accessed through a storage credential and an external location:
Metastore > Catalog > Schema > External table > External storage (via Storage credential + External location)
45
Data objects in the Lakehouse

Views: in addition to views registered in a schema, Databricks supports Temporary Views and Global Temporary Views.
Data objects in the Lakehouse

Functions: user-defined functions are registered in a schema alongside tables and views.

48
Extracting Data
Query files directly
SELECT * FROM file_format.`path/to/file`

Files can be queried directly using SQL


• SELECT * FROM json.`path/to/files/`
• SELECT * FROM text.`path/to/files/`

Process based on specified file format


• json pulls schema from underlying data
• binaryFile and text file formats have fixed data schemas
• text → string value column (row for each line)
• binaryFile → path, modificationTime, length, content columns (row for each file)
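As a quick illustration, here is a minimal sketch of querying files directly from a Python notebook with spark.sql, assuming a directory of JSON files at the sample path shown.

# Query JSON files directly by path; Spark pulls the schema from the underlying data
events_df = spark.sql("SELECT * FROM json.`/databricks-datasets/structured-streaming/events/`")

# Query the same files as raw text: one `value` column, one row per line
raw_df = spark.sql("SELECT * FROM text.`/databricks-datasets/structured-streaming/events/`")

display(events_df.limit(5))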

50
Configure external tables with read options
CREATE TABLE USING data_source OPTIONS (...)

Many data sources require schema declaration and other options to


correctly read data
• CSV options for delimiter, header, etc
• JDBC options for url, user, password, etc
• Note: using the JDBC driver pulls RDBMS tables dynamically for Spark processing

51
Demo:
DE 2.1: Querying Files Directly
Demo:
DE 2.2: Providing Options for
External Sources
Lab:
DE 2.3L: Extract Data Lab
Lab:
DE 2.4: Cleaning Data
Complex Transformations
Interact with Nested Data
Use built-in syntax to traverse nested data with Spark SQL

Use “:” (colon) syntax in queries to access subfields in JSON strings

SELECT value:device, value:geo ...

Use “.” (dot) syntax in queries to access subfields in STRUCT types

SELECT value.device, value.geo ...

57
Complex Types
Nested data types storing multiple values

• Array: arbitrary number of elements of same data type

• Map: set of key-value pairs

• Struct: ordered (fixed) collection of column(s) and any data type

Example table with complex types

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street: STRING, city: STRING, state: STRING, zip: INT>)

58
Demo:
DE 2.5: Complex
Transformations
explode lab

SELECT
  user_id, event_timestamp, event_name,
  explode(items) AS item
FROM events

explode outputs the elements of an array field into a separate row for each element: each item in the items array is exploded into its own row, so a single input row with three items produces three output rows.
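The same transformation in PySpark, as a minimal sketch; the column names follow the example above and the sample data is hypothetical.

from pyspark.sql import Row, functions as F

# Hypothetical sample data shaped like the events table above
events_df = spark.createDataFrame(
    [("U1", "2020-06-01 10:00:00", "cart",
      [Row(item_id="A"), Row(item_id="B"), Row(item_id="C")])],
    "user_id STRING, event_timestamp STRING, event_name STRING, items ARRAY<STRUCT<item_id: STRING>>",
)

# explode() produces one output row per element of the items array
exploded_df = events_df.select(
    "user_id", "event_timestamp", "event_name", F.explode("items").alias("item")
)
display(exploded_df)  # one input row with three items -> three output rows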

60
flatten lab
collect_set returns an array of unique values from a field for each group of rows
flatten returns an array that flattens multiple arrays into one

SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id

61
Collection example

• collect_set returns an array with duplicate elements eliminated
• collect_list returns an array with duplicate elements intact

df.agg(collect_set('age'))    df.agg(collect_list('age'))

62
Parse JSON strings into structs
Create the schema used to parse the JSON strings by providing an example JSON string from a row that has no nulls.

from_json uses the JSON schema returned by schema_of_json to convert a column of JSON strings into structs. The example JSON string is taken from the value field of a single row of data, and the result is a STRUCT column containing an ARRAY of nested STRUCTs.
63
Lab:
DE 2.5L: Reshape Data Lab
(Optional)
Demo:
DE 2.7A: SQL UDFs and
Control Flow (Optional)
Demo:
DE 2.7B: Python UDFs
(Optional)
Module 03
Manage Data with Delta Lake
Module Agenda
Manage Data with Delta Lake

What is Delta Lake


DE 3.1 - Schemas and Tables
DE 3.2 - Version and Optimize Delta Tables
DE 3.3L - Manipulate Delta Tables Lab
DE 3.4 - Set Up Delta Tables
DE 3.5 - Load Data into Delta Lake
DE 3.6L - Load Data Lab

68
What is Delta Lake?
Delta Lake is an open-source
project that enables building a
data lakehouse on top of
existing cloud storage

70
Delta Lake Is Not…
• Proprietary technology
• Storage format
• Storage medium
• Database service or data warehouse
Delta Lake Is…
• Open source
• Builds upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake brings ACID to object storage
Atomicity means all transactions either succeed or fail completely

Consistency guarantees relate to how a given state of the data is observed by


simultaneous operations

Isolation refers to how simultaneous operations conflict with one another. The
isolation guarantees that Delta Lake provides do differ from other systems

Durability means that committed changes are permanent


Problems solved by ACID
• Hard to append data
• Modification of existing data difficult
• Jobs failing mid way
• Real-time operations hard
• Costly to keep historical data versions
Delta Lake is the default format for tables created in Databricks

SQL:    CREATE TABLE foo USING DELTA
Python: df.write.format("delta")
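A minimal sketch expanding the Python side; the DataFrame and table name are hypothetical, and because Delta is already the default format on Databricks, .format("delta") is shown only for clarity.

# Hypothetical DataFrame
df = spark.range(10).withColumnRenamed("id", "order_id")

# Write it out as a Delta table
df.write.format("delta").mode("overwrite").saveAsTable("demo_orders")

# Read it back
display(spark.table("demo_orders"))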

75
Demo:
DE 3.1: Schemas and Tables
Demo:
DE 3.2: Version and Optimize
Delta Tables
Lab:
DE 3.3L - Manipulate Delta
Tables Lab
Demo:
DE 3.4: Set up Delta Tables
Demo:
DE 3.5: Load Data into Delta
Tables
Lab:
DE 3.6L: Load Data
Module 04
Build Data Pipelines with Delta Live Tables
Agenda
Build Data Pipelines with Delta Live Tables

The Medallion Architecture


Introduction to Delta Live Tables
DE 4.1 - DLT UI Walkthrough
DE 4.1A - SQL Pipelines
DE 4.1B - Python Pipelines
DE 4.2 - Python vs SQL
DE 4.3 - Pipeline Results
DE 4.4 - Pipeline Event Logs
83
The Medallion Architecture

Medallion Architecture in the Lakehouse

Sources (Kinesis, CSV/JSON/TXT files, the data lake) flow into:
• Bronze — raw ingestion and history
• Silver — filtered, cleaned, augmented
• Gold — business-level aggregates
which feed streaming analytics, BI & reporting, and data science & ML, with Data Quality & Governance and Data Sharing applied across the layers.

85
Multi-Hop in the Lakehouse
Bronze Layer

• Typically just a raw copy of ingested data
• Replaces traditional data lake
• Provides efficient storage and querying of full, unprocessed history of data

86
Multi-Hop in the Lakehouse
Silver Layer

• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves grain of original data (without aggregations)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks, corrupt data quarantined

87
Multi-Hop in the Lakehouse
Gold Layer

• Powers ML applications, reporting, dashboards, ad hoc analytics
• Refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data

88
Introduction to Delta Live
Tables
Multi-Hop in the Lakehouse

CSV, JSON, and TXT sources are ingested with Databricks Auto Loader into Bronze (raw ingestion and history), refined into Silver (filtered, cleaned, augmented), and aggregated into Gold (business-level aggregates), which powers streaming analytics, AI, and reporting, with data quality enforced along the way.
The Reality is Not so Simple

Large scale ETL is complex and brittle:
• Complex pipeline development — hard to build and maintain table dependencies; difficult to switch between batch and stream processing
• Data quality and governance — difficult to monitor and enforce data quality; impossible to trace data lineage
• Difficult pipeline operations — poor observability at granular, data level; error handling and recovery is laborious
92
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake

• Operate with agility — declarative tools to build batch and streaming data pipelines
• Trust your data — DLT has built-in declarative quality controls; declare quality expectations and actions to take
• Scale with reliability — easily scale infrastructure alongside your data

93
What is a LIVE TABLE?
What is a Live Table?
Live Tables are materialized views for the lakehouse.

A live table is:
• Defined by a SQL query
• Created and kept up-to-date by a pipeline

Live tables provide tools to:
• Manage dependencies
• Control quality
• Automate operations
• Simplify collaboration
• Save costs
• Reduce latency

CREATE OR REFRESH LIVE TABLE report
AS SELECT sum(profit)
FROM prod.sales
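For comparison, a minimal sketch of the equivalent table definition in a Python DLT notebook; the source table name follows the SQL example and is hypothetical.

import dlt
from pyspark.sql import functions as F

@dlt.table(name="report", comment="Aggregated profit, kept up to date by the pipeline")
def report():
    # Same logic as the SQL example: sum of profit over prod.sales
    return spark.read.table("prod.sales").agg(F.sum("profit").alias("total_profit"))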

95
What is a Streaming Live Table?
Based on Spark Structured Streaming

A streaming live table is "stateful":
• Ensures exactly-once processing of input rows
• Inputs are only read once

• Streaming live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)
• Streaming live tables allow you to reduce costs and latency by avoiding reprocessing of old data

CREATE STREAMING LIVE TABLE report
AS SELECT sum(profit)
FROM cloud_files(prod.sales)

96
When should I use streaming?

Using Spark Structured Streaming for ingestion
Easily ingest files from cloud storage as they are uploaded

This example creates a table with all the JSON data stored in "/data":

CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/data", "json")

• cloud_files keeps track of which files have been read to avoid duplication and wasted work
• Supports both listing and notifications for arbitrary scale
• Configurable schema inference and schema evolution
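A minimal sketch of the same ingestion pattern in Python; note that the Python API uses the "cloudFiles" source name, and the /data path is a hypothetical example.

import dlt

@dlt.table(name="raw_data", comment="All JSON data ingested incrementally from /data")
def raw_data():
    return (
        spark.readStream.format("cloudFiles")     # Auto Loader source
        .option("cloudFiles.format", "json")      # format of the incoming files
        .load("/data")                            # hypothetical input path
    )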

98
Using the SQL STREAM() function
Stream data from any Delta table

CREATE STREAMING LIVE TABLE mystream
AS SELECT *
FROM STREAM(my_table)

• STREAM(my_table) reads a stream of new records, instead of a snapshot
• The streamed table must be an append-only table
• Any append-only Delta table can be read as a stream (i.e. from the LIVE schema, from the catalog, or just from a path)

Pitfall: my_table must be an append-only source. For example, it may not:
• be the target of APPLY CHANGES INTO
• define an aggregate function
• be a table on which you've executed DML to delete/update a row (see GDPR section)
99
How do I use DLT?

Creating Your First Live Table Pipeline
SQL to DLT in three easy steps…

1. Write CREATE LIVE TABLE — table definitions are written (but not run) in notebooks; Databricks Repos allow you to version control your table definitions.
2. Create a pipeline — a pipeline picks one or more notebooks of table definitions, as well as any configuration required.
3. Click start — DLT will create or update all the tables in the pipeline.

101
BEST PRACTICE

Development vs Production
Fast iteration or enterprise grade reliability

Development Mode
• Reuses a long-running cluster for fast iteration
• No retries on errors, enabling faster debugging

Production Mode
• Cuts costs by turning off clusters as soon as they are done (within 5 minutes)
• Escalating retries, including cluster restarts, ensure reliability in the face of transient issues

The mode is selected in the Pipelines UI.

102
What if I have dependent tables?

Declare LIVE Dependencies
Using the LIVE virtual schema

CREATE LIVE TABLE events
AS SELECT … FROM prod.raw_data

CREATE LIVE TABLE report
AS SELECT … FROM LIVE.events

• Dependencies owned by other producers are just read from the catalog or Spark data source as normal
• LIVE dependencies, from the same pipeline, are read from the LIVE schema
• DLT detects LIVE dependencies and executes all operations in correct order
• DLT handles parallelism and captures the lineage of the data (events → report)
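A minimal sketch of the same dependency in Python; within a pipeline, dlt.read() plays the role of the LIVE schema, and prod.raw_data and the event_name column are hypothetical.

import dlt

@dlt.table
def events():
    # External dependency: read from the catalog as normal
    return spark.read.table("prod.raw_data")

@dlt.table
def report():
    # LIVE dependency within the same pipeline, read with dlt.read()
    return dlt.read("events").groupBy("event_name").count()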

104
How do I ensure Data Quality?

BEST PRACTICE

Ensure correctness with Expectations

Expectations are tests that ensure data quality in production. They are true/false expressions used to validate each row during processing.

SQL:
CONSTRAINT valid_timestamp
EXPECT (timestamp > '2012-01-01')
ON VIOLATION DROP

Python:
@dlt.expect_or_drop(
  "valid_timestamp",
  col("timestamp") > '2012-01-01')

DLT offers flexible policies on how to handle records that violate expectations:
• Track number of bad records
• Drop bad records
• Abort processing for a single bad record

106
What about operations?
Pipelines UI
A one stop shop for ETL debugging and operations

• Visualize data flows between tables
• Discover metadata and quality of each table
• Access to historical updates
• Control operations
• Dive deep into events
112
The Event Log
The event log automatically records all pipeline operations.

• Operational Statistics — time and current status for all operations, pipeline and cluster configurations, row counts
• Provenance — table schemas, definitions, and declared properties; table-level lineage; query plans used to update tables
• Data Quality — expectation pass / failure / drop statistics; input/output rows that caused expectation failures
113
How can I use parameters?

Modularize your code with configuration
Avoid hard coding paths, topic names, and other constants in your code.

A pipeline's configuration is a map of key-value pairs that can be used to parameterize your code:
• Improve code readability/maintainability
• Reuse code in multiple pipelines for different data

SQL:
CREATE STREAMING LIVE TABLE data AS
SELECT * FROM cloud_files("${my_etl.input_path}", "json")

Python:
@dlt.table
def data():
    input_path = spark.conf.get("my_etl.input_path")
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(input_path))

115
How can I do change data capture (CDC)?

APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

• APPLY CHANGES INTO LIVE.cities — a target for the changes to be applied to
• FROM STREAM(LIVE.city_updates) — a source of changes; currently this has to be a stream
• KEYS (id) — a unique key that can be used to identify a given row
• SEQUENCE BY ts — a sequence that can be used to order changes, such as a log sequence number (LSN), a timestamp, or ingestion time

The statement applies inserts, updates, and deletes to maintain an up-to-date snapshot. For example, given city_updates records {"id": 1, "ts": 100, "city": "Bekerly, CA"} and {"id": 1, "ts": 200, "city": "Berkeley, CA"}, the cities row for id 1 ends up as "Berkeley, CA".
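A minimal sketch of the equivalent Python API; the table, key, and sequence names follow the example above. Depending on the DLT release, the target may be declared with dlt.create_streaming_table or the older dlt.create_streaming_live_table.

import dlt

# Target table that APPLY CHANGES will maintain (up-to-date replica)
dlt.create_streaming_table("cities")

dlt.apply_changes(
    target="cities",          # replica to keep up to date
    source="city_updates",    # streaming source of change records
    keys=["id"],              # unique key identifying a row
    sequence_by="ts",         # ordering column (e.g. timestamp or LSN)
)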

122
REFERENCE ARCHITECTURE

Change Data Capture (CDC) from RDBMS
A variety of 3rd party tools can provide a streaming change feed

• Amazon RDS → Amazon DMS to S3 → cloud_files → APPLY CHANGES INTO replicated_table
• MySQL or Postgres → Debezium → APPLY CHANGES INTO replicated_table
• Oracle → Golden Gate → APPLY CHANGES INTO replicated_table
What do I no longer need to manage with DLT?

Automated Data Management
DLT automatically optimizes data for performance & ease-of-use

Best Practices
• What: DLT encodes Delta best practices automatically when creating DLT tables.
• How: DLT sets the following properties: optimizeWrite, autoCompact, tuneFileSizesForRewrites.

Physical Data
• What: DLT automatically manages your physical data to minimize cost and optimize performance.
• How: runs vacuum daily; runs optimize daily. You can still tell DLT how you want the data organized (i.e. ZORDER).

Schema Evolution
• What: schema evolution is handled for you.
• How: modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.

125
Demo:
DE 4.1 - Using the Delta Live
Tables UI
Demo:
DE 4.1.1 - Fundamentals of DLT
Syntax
Demo:
DE 4.1.2 - More DLT SQL
Syntax
Demo:
DE 4.2 - Delta Live Tables:
Python vs SQL
Demo:
DE 4.3 - Exploring the Results
of a DLT Pipeline
Demo:
DE 4.4 - Exploring the Pipeline
Events Logs
Lab:
DE 4.1.3 - Troubleshooting DLT
Syntax Lab
Module 05
Deploy Workloads with Databricks Workflows
Module Agenda
Deploy Workloads with Databricks Workflows

Introduction to Workflows
Building and Monitoring Workflow Jobs
DE 5.1 - Scheduling Tasks with the Jobs UI
DE 5.2L - Jobs Lab

134
Introduction to Workflows
Lesson Objectives

1 Describe the main features and use cases of Databricks Workflows

2 Create a task orchestration workflow composed of various task types

3 Utilize monitoring and debugging features of Databricks Workflows

4 Describe workflow best practices


Databricks Workflows

Databricks Workflows
Workflows is a fully-managed, cloud-based, general-purpose task orchestration service for the entire Lakehouse. It is a service for data engineers, data scientists, and analysts to build reliable data, analytics, and AI workflows on any cloud.

Lakehouse Platform: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML — built on Unity Catalog (fine-grained governance for data and AI), Delta Lake (data reliability and performance), and the cloud data lake (all structured and unstructured data).

137
Databricks Workflows

Databricks has two main task orchestration


services:
• Workflow Jobs (Workflows)
• Workflows for every job
• Delta Live Tables (DLT)
• Automated data pipelines for Delta Lake

Note: DLT pipeline can be a task in a workflow

138
DLT versus Jobs
Considerations

• Source — Delta Live Tables: notebooks only. Workflow Jobs: JARs, notebooks, DLT, applications written in Scala, Java, Python.
• Dependencies — Delta Live Tables: automatically determined. Workflow Jobs: manually set.
• Cluster — Delta Live Tables: self-provisioned. Workflow Jobs: self-provisioned or existing.
• Timeouts and Retries — Delta Live Tables: not supported. Workflow Jobs: supported.
• Import Libraries — Delta Live Tables: not supported. Workflow Jobs: supported.

139
DLT versus Jobs
Use Cases

• Orchestration of Dependent Jobs — jobs running on a schedule, containing dependent tasks/steps → Jobs Workflows
• Machine Learning Tasks — run an MLflow notebook task in a job → Jobs Workflows
• Arbitrary Code, External API Calls, Custom Tasks — run tasks in a job which can contain a JAR file, Spark Submit, Python script, SQL task, dbt → Jobs Workflows
• Data Ingestion and Transformation — ETL jobs, support for batch and streaming, built-in data quality constraints, monitoring & logging → Delta Live Tables

140
Workflows Features
Part 1 of 2

• Orchestrate Anything Anywhere — run diverse workloads for the full data and AI lifecycle, on any cloud. Orchestrate notebooks, Delta Live Tables, jobs for SQL, ML models, and more.
• Fully Managed — remove operational overhead with a fully managed orchestration service, enabling you to focus on your workflows, not on managing your infrastructure.
• Simple Workflow Authoring — an easy point-and-click authoring experience for all your data teams, not just those with specialized skills.

141
Workflows Features
Part 2 of 2

• Deep Platform Integration — designed and built into your lakehouse platform, giving you deep monitoring capabilities and centralized observability across all your workflows.
• Proven Reliability — have full confidence in your workflows, leveraging our proven experience running tens of millions of production workloads daily across AWS, Azure, and GCP.

142
How to Leverage Workflows
• Allows you to build simple ETL/ML task orchestration
• Reduces infrastructure overhead
• Easily integrate with external tools
• Enables non-engineers to build their own workflows using simple UI
• Cloud-provider independent
• Enables re-using clusters to reduce cost and startup time

143
Common Workflow Patterns

• Sequence — data transformation/processing/cleaning; bronze/silver/gold tables
• Funnel — multiple data sources; data collection
• Fan-out — single data source; data ingestion and distribution (fan-out, star pattern)

144
Example Workflow

Data ingestion funnel


E.g. Auto Loader, DLT

Data filtering, quality assurance, transformation


E.g. DLT, SQL, Python

ML feature extraction
E.g. MLflow

Persisting features and training prediction model

145
Building and Monitoring
Workflow Jobs
Workflows Job Components

TASKS SCHEDULE CLUSTER

What? When? How?

147
Creating a Workflow
Task Definition

While creating a task:
• Define the task type
• Choose the cluster type
  • Job clusters and all-purpose clusters can be used
  • A cluster can be used by multiple tasks; this reduces cost and startup time
  • If you want to create a new cluster, you must have the required permissions
• Define task dependencies if the task depends on another task
Monitoring and Debugging
Scheduling and Alerts

You can run your jobs immediately or periodically through an easy-to-use scheduling system.

You can specify alerts to be notified when runs of a job begin, complete, or fail. Notifications can be sent via email, Slack, or AWS SNS.
Monitoring and Debugging
Access Control

Workflows integrates with existing resource access controls, enabling you to easily manage access across different teams.
Monitoring and Debugging
Job Run History

Workflows keeps track of job runs and saves information about the success or failure, and run duration, of each task in the job run.

Navigate to the Runs tab to view completed or active runs for a job.
151
Monitoring and Debugging
Repair a Failed Job Run

The repair feature allows you to re-run only the failed task and its sub-tasks, which reduces the time and resources required to recover from unsuccessful job runs.
Navigating the Jobs UI
Use breadcrumbs to navigate back to your job from a specific run page

153
Navigating the Jobs UI
Runs vs Tasks tabs on the job page

Use Runs tab to view completed or Use Tasks tab to modify or add
active runs for the job tasks to the job

154
Demo:
DE 5.1.1: Task Orchestration
Demo: Task Orchestration
DE 5.1.1 - Task Orchestration

• Schedule a notebook task in a Databricks Workflow Job

• Describe job scheduling options and differences between cluster types

• Review Job Runs to track progress and see results

• Schedule a DLT pipeline task in a Databricks Workflow Job

• Configure dependency between tasks via Databricks Workflows UI

156
Lab:
DE 5.2.1.L: Task Orchestration
Lab: Task Orchestration
DE 5.2.1.L - Task Orchestration

158
Module 06
Manage Data Access for Analytics with Unity Catalog
Lesson Objectives
By the end of this course, you will be able to:

1. Describe Unity Catalog key concepts and how it integrates with the
Databricks platform
2. Access Unity Catalog through clusters and SQL warehouses
3. Create and govern data assets in Unity Catalog
4. Adopt Databricks recommendations into your organization’s Unity
Catalog-based solutions

160
Module Agenda
Manage Data Access for Analytics with Unity Catalog

Introduction to Unity Catalog
DE 6.1 - Introduction to Unity Catalog
DE 6.2 - Overview of Data Governance
DE 6.3 - Unity Catalog Key Concepts
DE 6.4 - Unity Catalog Architecture
DE 6.5 - Unity Catalog Identities
DE 6.6 - Managing Principals in Unity Catalog
DE 6.7 - Managing Catalog Metastores

Compute Resources in Unity Catalog
DE 6.8 - Compute Resources
DE 6.9 - Creating Compute Resources

Data Access Control in Unity Catalog
DE 6.10 - Data Access Control in Databricks
DE 6.11 - Security Model
DE 6.12 - External Storage
DE 6.13 - Creating and Governing Data
DE 6.14 - Create and Share Tables
DE 6.15 - Create External Tables

Unity Catalog Best Practices
DE 6.16 - Best Practices
DE 6.17 - Data Segregation
DE 6.18 - Identity Management
DE 6.19 - External Storage
DE 6.20 - Upgrade a Table to Unity Catalog
DE 6.21 - Create Views and Limiting Table Access

161
Introduction to Unity Catalog
Overview of
Data Governance
80% of organizations seeking to scale
digital business will fail because they do not
take a modern approach to data and analytics
governance

Source: Gartner

164
Data Governance
Four key functional areas

• Data Access Control — control who has access to which data
• Data Access Audit — capture and record all access to data
• Data Lineage — capture upstream sources and downstream consumers
• Data Discovery — ability to search for and discover authorized assets

165
Governance for data, analytics and AI is complex

• Data Lake — permissions on files; no row- and column-level permissions; inflexible when policies change
• Metadata — permissions on tables and views; can be out of sync with data
• Data Warehouse — permissions on tables, columns, rows; a different governance model
• ML and AI — permissions on ML models, dashboards, features, …; yet another governance model

Data analysts, data engineers, and data scientists each have to work across these inconsistent models.
Databricks Unity Catalog
Unified governance for data, analytics and AI

Unity Catalog provides a single governance layer across the data lake, metadata, the data warehouse, and ML and AI assets, for data analysts, data engineers, and data scientists alike.

167
Unity Catalog
Overview

1. Unified governance across clouds — fine-grained governance for data lakes across clouds, based on the open standard ANSI SQL.
2. Unified data and AI assets — centrally share, audit, secure and manage all data types with one simple interface.
3. Unified existing catalogs — works in concert with existing data, storage, and catalogs; no hard migration required.
168
Unity Catalog
Key Capabilities

● Centralized metadata and user management
● Centralized data access controls (GRANT … ON … TO …, REVOKE … ON … FROM …)
● Data access auditing
● Data lineage
● Data search and discovery
● Secure data sharing with Delta Sharing

Unity Catalog sits above Databricks workspaces and governs catalogs, databases (schemas), tables, views, storage credentials, and external locations.

169
Unity Catalog
Key Concepts

Metastore
Unity Catalog metastore elements

A metastore (in the control plane) contains:
• Catalogs, which contain schemas (databases), which contain tables, views, and functions
• Storage credentials and external locations, which reference cloud storage
• Shares and recipients (Delta Sharing)

171
Metastore
Accessing legacy Hive metastore

Each workspace's legacy Hive metastore appears as a catalog named hive_metastore alongside Unity Catalog catalogs, with the same schema (database) > table / view / function hierarchy inside.

172
Catalog
Top-level container for data objects

Within the metastore, a catalog is the top-level container: it holds schemas (databases), which in turn hold tables, views, and functions.

173
Catalog
Three-level namespace

Traditional SQL two-level Unity Catalog three-level


namespace namespace

SELECT * FROM schema.table SELECT * FROM catalog.schema.table

174
Data Objects
Schema (database), tables, views, functions

Schemas (databases) contain tables (managed or external), views, and functions.

175
External Storage
Storage credentials and external locations

Storage credentials and external locations are metastore-level objects that govern access to cloud storage paths outside the metastore's managed storage.

176
Delta Sharing
Shares and recipients

Shares and recipients are metastore-level objects used to share data securely via Delta Sharing.

177
Unity Catalog Architecture

Architecture
Before Unity Catalog: each workspace had its own user/group management, metastore, and access controls, alongside its compute resources.
With Unity Catalog: user/group management, the metastore, and access controls are centralized in Unity Catalog at the account level, while each workspace keeps its own compute resources.

179
Query Lifecycle
Unity Catalog Security Model

1. A principal (user or service principal) sends a query from a cluster or SQL warehouse.
2. Unity Catalog checks the namespace, metadata, and grants (recording to the audit log).
3. Unity Catalog assumes the IAM role or service principal for the underlying storage.
4. Unity Catalog returns a short-lived token and signed URL to the compute resource.
5. The compute resource requests data from the URL with the short-lived token.
6. Cloud storage returns the data.
7. Policies are enforced on the compute resource.
8. The result is sent back to the principal.

180
Compute Resources and Unity Catalog

Compute Resources for Unity Catalog

Cluster access modes supporting Unity Catalog:
• Single user — multiple language support, not shareable
• Shared — shareable, Python and SQL, legacy table ACLs

Modes not supporting Unity Catalog:
• No isolation shared

182
Cluster Access Mode
Feature matrix

The feature matrix compares access modes (No Isolation Shared, Single user, Shared) across: supported languages, shareable, legacy table ACL, credential passthrough, DBFS FUSE mounts, RDD API, dynamic views, machine learning, and init scripts and libraries. No Isolation Shared and Single user support all languages; Shared supports SQL and Python.

183
Roles and Identities in Unity Catalog

Unity Catalog
Roles

Cloud Admin
• Manages underlying cloud resources: storage accounts/buckets, IAM roles/service principals/managed identities

Identity Admin
• Manages users and groups in the identity provider (IdP)
• Provisions them into the account (with the account admin)

185
Unity Catalog
Roles

Account Admin
• Creates or deletes metastores, assigns metastores to workspaces
• Manages users and groups, integrates with the IdP
• Full access to all data objects

Metastore Admin
• Creates or drops, grants privileges on, and changes ownership of catalogs and other data objects

Data Owner (owns data objects they created)
• Creates nested objects, grants privileges on, and changes ownership of owned objects

186
Unity Catalog
Roles

Workspace Admin
• Manages permissions on workspace assets
• Restricts access to cluster creation
• Adds or removes users
• Elevates users' permissions
• Grants privileges to others
• Changes job ownership

187
Unity Catalog
Identities

• User / Account Administrator — identified by email (for example [email protected]), with first name, last name, password, and optionally an admin role
• Service Principal / Service Principal with administrative privileges — identified by an application ID (GUID), with a name (for example terraform) and optionally an admin role

188
Unity Catalog
Identities

• Groups — for example, an allusers group containing analysts and developers subgroups, whose members are users (e.g. [email protected]) and service principals (e.g. terraform)

189
Unity Catalog
Identity Federation

With identity federation, an identity is defined once at the account level (e.g. [email protected]) and assigned to workspaces, instead of being managed separately as a workspace-level identity in each workspace.


Data Access Control in Unity Catalog

Security model

• Principals: users, service principals, groups, account admins, metastore admins, data owners
• Privileges: CREATE, USAGE, SELECT, MODIFY, CREATE TABLE, READ FILES, WRITE FILES, EXECUTE
• Securables: catalogs, schemas, tables, views, functions, storage credentials, external locations, shares, recipients

192
Privilege Recap
Tables

• Querying tables (SELECT)
• Modifying tables (MODIFY) — data (INSERT, DELETE) and metadata (ALTER)
• Traversing containers (USAGE)

To query or modify a table you need USAGE on its catalog, USAGE on its schema, and SELECT/MODIFY on the table itself.
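A minimal sketch of granting that chain of privileges from a Python notebook; the catalog, schema, table, and group names are hypothetical examples.

# Grant the chain of privileges needed for a group to query one table
spark.sql("GRANT USAGE ON CATALOG main TO `analysts`")
spark.sql("GRANT USAGE ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Review what has been granted on the table
display(spark.sql("SHOW GRANTS ON TABLE main.sales.orders"))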

201
Privilege Recap
Views

• Views abstract complex queries (aggregations, transformations, joins, filters) and enable enhanced table access control
• Querying views (SELECT)
• Traversing containers (USAGE)

To query a view you need USAGE on its catalog and schema plus SELECT on the view itself.

202
Privilege Recap
Functions

• Functions provide custom code via user-defined functions
• Using functions (EXECUTE)
• Traversing containers (USAGE)

To call a function you need USAGE on its catalog and schema plus EXECUTE on the function itself.

203
Dynamic Views

• Limit access to columns — omit column values from output
• Limit access to rows — omit rows from output
• Data masking — obscure data (for example, ●●●●●●@databricks.com)

Behavior can be conditional on a specific user/service principal or on group membership, through Databricks-provided functions.
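A minimal sketch of a dynamic view with column masking and row filtering; the catalog, schema, table, view, column, and group names are hypothetical, and is_account_group_member() is one of the Databricks-provided functions referred to above.

# Mask the email column for non-admins and restrict rows for everyone else
spark.sql("""
  CREATE OR REPLACE VIEW main.sales.customers_redacted AS
  SELECT
    customer_id,
    CASE WHEN is_account_group_member('admins') THEN email
         ELSE '●●●●●●' END AS email,
    region
  FROM main.sales.customers
  WHERE is_account_group_member('admins')
     OR region = 'US'
""")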

204
Creating New Objects

• Creating new objects (CREATE)
• Traversing containers (USAGE)

To create a new table, view, or function you need USAGE on the catalog plus USAGE and CREATE on the schema.

205
Deleting Objects

• Dropping objects (DROP) removes a table, view, or function from its schema in the catalog hierarchy.

206
Unity Catalog External Storage
Storage Credentials and External Locations

Storage Credential — enables Unity Catalog to connect to an external cloud storage location. Examples include an IAM role for AWS S3 and a service principal for Azure Storage.

External Location — a cloud storage path plus a storage credential: a self-contained object for accessing specific locations in cloud storage, with fine-grained control over external storage.
Storage Credentials and External Locations
Access Control

Storage Credential privileges:
• CREATE TABLE — create an external table directly using this storage credential
• READ FILES — read files directly using this storage credential
• WRITE FILES — write files directly using this storage credential

External Location privileges:
• CREATE TABLE — create an external table from files governed by this external location
• READ FILES — read files governed by this external location
• WRITE FILES — write files governed by this external location
209
Managed Tables

Metastore > Catalog > Schema > Managed table, with data stored in metastore-managed storage.
210
External Tables

Metastore > Catalog > Schema > External table, with data stored in external storage accessed via a storage credential and external location.
211
Unity Catalog Patterns and
Best Practices
UC Patterns & Best Practices
1 metastore per region

Each region gets its own metastore, shared by that region's dev, staging, and prod workspaces (Region A workspaces attach to the Region A metastore, Region B workspaces to the Region B metastore).

213
UC Patterns & Best Practices
Share data with Delta Sharing

To make data available across regions, share tables from the Region A metastore with the Region B metastore using Delta Sharing.


214
UC Patterns & Best Practices
Data Segregation

• Use catalogs (not metastores) to segregate data
• Apply permissions appropriately. For example, grant to group B:
  • USAGE on catalog B
  • USAGE on all applicable schemas in catalog B
  • SELECT/MODIFY on applicable tables

215
UC Patterns & Best Practices
Data Segregation — Catalogs

• Environment scope: dev, staging, prod
• Business unit + environment scope: bu1_dev, bu1_staging, bu1_prod
• Sandboxes: team1_sandbox, team2_sandbox

216
UC Patterns & Best Practices
Identity Management

• Account-level identities — manage all identities at the account level; enable UC for workspaces to enable identity federation
• Groups — use groups (e.g. analysts, developers) rather than individual users to assign access and ownership to securable objects
• Service principals — use service principals (e.g. terraform, identified by an application ID) to run production jobs

217
UC Patterns & Best Practices
Storage Credentials and External Locations

• Storage Credential — enables Unity Catalog to connect to an external cloud store; examples include an IAM role for AWS S3 and a service principal for Azure Storage
• External Location — a cloud storage path plus a storage credential: a self-contained object for accessing specific locations in cloud stores, with fine-grained control over external storage
UC Patterns & Best Practices
Storage Credentials and External Locations

A single storage credential can govern the root of a container/bucket, with separate external locations scoping specific paths beneath it (for example users/user1/, users/user2/, tables/, shared/, tmp/).

219
UC Patterns & Best Practices
Managed versus External Tables

Managed Tables
• Metadata lives in the control plane
• Data lives in the metastore-managed storage location
• DROP discards the data
• Delta format only

External Tables
• Metadata lives in the control plane
• Data lives in a user-provided storage location (external to UC)
• DROP leaves the data intact
• Several formats supported (delta, csv, json, avro, parquet, orc, text)
UC Patterns & Best Practices
When to use external tables?

Quick and easy upgrade from external table in Hive metastore


External readers or writers
Requirement for specific storage naming or hierarchy
Infrastructure-level isolation requirements
Non-Delta support requirement

221
Unity Catalog Key Capabilities
Centralized metadata and user management

Unity Catalog Architecture
Before Unity Catalog: each workspace had its own user/group management, metastore, and access controls, alongside its compute resources.
With Unity Catalog: user/group management, the metastore, and access controls are centralized in Unity Catalog at the account level, while each workspace keeps its own compute resources.

223
Centralized Access Controls
Centrally grant and manage access permissions across workloads

Using ANSI SQL DCL:
GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`
GRANT SELECT ON iot.events TO engineers

Using the UI: choose the permission level, select the securable ('Table' = collection of files in S3/ADLS), and sync groups from your identity provider.

224
Three level namespace
Seamless access to your existing metastores

Unity Catalog exposes your existing Hive metastore as the hive_metastore catalog (legacy) alongside Unity Catalog catalogs, each containing databases with managed tables, external tables, and views.

SELECT * FROM main.student.example;          -- <catalog>.<database>.<table>
SELECT * FROM hive_metastore.default.customers;
225
Managed Data Sources & External Locations
Simplify data access management across clouds

Unity Catalog mediates access to cloud storage (S3, ADLS, GCS): managed tables live in the managed container/bucket (managed data sources), while external tables and files live in external containers/buckets governed by external locations and credentials. Access from a user's cluster or SQL warehouse goes through Unity Catalog's access control and is captured in the audit log.
226
Automated lineage for all workloads
End-to-end visibility into how data flows and is consumed in your organization

● Auto-capture runtime data lineage on a Databricks cluster or SQL warehouse
● Track lineage down to the table and column level
● Leverage common permission model from Unity Catalog
● Lineage across tables, dashboards, workflows, notebooks

227
Lineage flow - How it works

● Code (any language) is submitted to a cluster or SQL warehouse, or DLT executes a data flow
● The lineage service analyzes logs emitted from the cluster and pulls metadata from DLT, then assembles column- and table-level lineage
● Lineage is presented to the end user graphically in Databricks, and can be exported via API and imported into other tools (external catalogs such as Alation, Collibra, and Microsoft Purview, FY23Q4)
228
Built-in search and discovery
Accelerate time to value with low latency data discovery

● UI to search for data assets stored in


Unity Catalog
● Unified UI across DSML + DBSQL
● Leverage common permission model
from Unity Catalog

229
An open standard for secure sharing of data assets

Unity Catalog - Architecture

Unity Catalog sits between Databricks workspaces and cloud storage (S3, ADLS, GCS). It combines account-level user management, the metastore, storage credentials, the ACL store, access control, the lineage explorer, data explorer, and the audit log; when a user queries data from a workspace, Unity Catalog authorizes the request before data is read from the container/bucket.

* Unity Catalog will support any data format (table or raw files)
