Data Engineering with Databricks
Databricks Academy
2023
Meet your instructor
Add instructor name, Add instructor title
• Team: <add>
• Time at Databricks: <add>
• Fun fact: <add>
Instructor photograph
2
Meet your classmates
• Where is everyone joining us from today (city, country)?
3
Meet your classmates
• How long have you been working with Databricks?
4
Meet your classmates
• What has your experience working with Databricks for data
engineering been so far?
5
Meet your classmates
• What are you hoping to get out of this class?
6
Getting Started
1_DAIS_Title_Slide
2. Use Spark to extract data from a variety of sources, apply common cleaning transformations, and manipulate complex data to load into Delta Lake.
3. Define and schedule data pipelines that incrementally ingest and process data through multiple tables in the lakehouse using Delta Live Tables.
5. Configure permissions in Unity Catalog to ensure that users have proper access to databases for analytics and dashboarding.
Agenda
Module Name Duration
Get Started with Databricks Data Science and Engineering Workspace 1 hour, 20 min
9
Module 01
Get Started with Databricks Data Science & Engineering Workspace
Module Objectives
Get Started with Databricks Data Science and Engineering Workspace
11
Module Overview
Get Started with Databricks Data Science and Engineering Workspace
12
Databricks Workspace and Services
Control Plane: Web App, Unity Catalog, Workflow Manager, Access Control, Notebooks, Repos, DBSQL
Data Plane: cluster compute resources
Demo:
Navigate the Workspace UI
Compute Resources
Clusters
Overview
A cluster's driver distributes workloads across workers; each worker is a VM instance, and notebooks attach to the cluster.
17
Cluster Types
18
Cluster Configuration
Cluster Mode
Standard (Multi Node)
Default mode for workloads developed in any supported language (requires
at least two VM instances)
Single node
Low-cost single-instance cluster catering to single-node machine learning
workloads and lightweight exploratory analysis
Databricks Runtime Version
Standard
Apache Spark and many other components and updates that provide an optimized big data analytics experience
Machine learning
Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and
XGBoost.
Photon
An optional add-on to optimize SQL workloads
21
Access Mode
Access mode | Visible to user (dropdown) | Unity Catalog support | Supported languages
Single user | Always | Yes | Python, SQL, Scala, R
Shared | Always (Premium plan required) | Yes | Python (DBR 11.1+), SQL
No isolation shared | Can be hidden by enforcing user isolation in the admin console or configuring account-level settings | No | Python, SQL, Scala, R
Ability | Can Attach To | Can Restart | Can Manage
Attach notebook | ✓ | ✓ | ✓
View Spark UI, cluster metrics, driver logs | ✓ | ✓ | ✓
Start, restart, terminate | | ✓ | ✓
Edit | | | ✓
Attach library | | | ✓
Resize | | | ✓
Change permissions | | | ✓
24
Demo:
DE 1.1: Create and Manage
Interactive Clusters
Develop Code with Notebooks
& Databricks Repos
Databricks Notebooks
Collaborative, reproducible, and enterprise ready
Multi-language: Use Python, SQL, Scala, and R, all in one notebook
Collaborative: Real-time co-presence, co-editing, and commenting
Ideal for exploration: Explore, visualize, and summarize data with built-in charts and data profiles
Adaptable: Install standard libraries and use local modules
Reproducible: Automatically track version history, and use git version control with Repos
Get to production faster: Quickly schedule notebooks as jobs or create dashboards from their results, all in the notebook
Enterprise-ready: Enterprise-grade access controls, identity management, and auditability
27
Notebook magic commands
Use to override default languages, run utilities/auxiliary commands, etc.
28
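For example, a minimal sketch (the query is illustrative): a %sql cell runs SQL even when the notebook's default language is Python.
%sql
-- Override the notebook's default language for this cell
SELECT current_timestamp() AS now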
dbutils (Databricks Utilities)
Perform various tasks with Databricks using notebooks
Utility | Description | Example
29
Git Versioning with Databricks
Repos
Databricks Repos
CI/CD
31
Databricks Repos
CI/CD Integration
Repos Service
32
CI/CD workflows with Git and Repos
Documentation
Four workflows: an admin workflow (steps in Databricks), a user workflow in Databricks (for example, commit and push to a feature branch), a merge workflow in the Git provider (steps in your Git provider), and a production job workflow in Databricks.
Demo:
DE 1.2: Databricks Notebook
Operations
Lab:
DE 1.3L: Get Started with the
Databricks Platform
Transform Data with Spark
Module Objectives
Transform Data with Spark
1. Extract data from a variety of file formats and data sources using Spark
2. Apply a number of common transformations to clean data using Spark
3. Reshape and manipulate complex data using advanced built-in functions
in Spark
4. Leverage UDFs for reusable code and apply best practices for
performance in Spark
37
Module Agenda
Transform Data with Spark
Data objects in the Lakehouse
A metastore contains catalogs; each catalog contains schemas (databases); a schema contains tables (managed or external), views (including temporary and global temporary views), and functions.
Managed Tables
A managed table's data is stored in the metastore's managed storage location.
External Tables
An external table's data is stored in cloud storage outside the metastore's managed location.
Extracting Data
Query files directly
SELECT * FROM file_format.`path/to/file`
50
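For example, a minimal sketch (the path and format are illustrative):
-- Query JSON files in place, without first registering a table
SELECT * FROM json.`/mnt/raw/events/`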
Configure external tables with read options
CREATE TABLE USING data_source OPTIONS (...)
51
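A minimal sketch of this pattern (table name, columns, options, and path are illustrative):
-- External CSV table whose files use a pipe delimiter and a header row
CREATE TABLE sales_csv (order_id INT, email STRING, transaction_timestamp STRING)
USING CSV
OPTIONS (header = "true", delimiter = "|")
LOCATION '/mnt/raw/sales-csv/'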
Demo:
DE 2.1: Querying Files Directly
Demo:
DE 2.2: Providing Options for
External Sources
Lab:
DE 2.3L: Extract Data Lab
Lab:
DE 2.4: Cleaning Data
Complex Transformations
Interact with Nested Data
Use built-in syntax to traverse nested data with Spark SQL
57
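For example, a minimal sketch (column and field names are illustrative): use : to extract fields from a JSON string column and . to access fields of a struct column.
-- ':' pulls fields out of a column that stores JSON as a string
SELECT value:device, value:geo.city FROM events_raw;
-- '.' accesses fields of a struct column
SELECT geo.city, geo.state FROM events;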
Complex Types
Nested data types storing multiple values
58
Demo:
DE 2.5: Complex
Transformations
explode lab
SELECT
  user_id, event_timestamp, event_name,
  explode(items) AS item
FROM events
explode outputs the elements of an array field into a separate row for each element.
Each item in the items array is exploded into its own row, so an array of three items produces three rows.
60
flatten lab
collect_set returns an array of unique values from a field for each group of rows
flatten returns an array that flattens multiple arrays into one
SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
61
Collection example
df.agg(collect_set('age')) returns an array of the distinct age values, while df.agg(collect_list('age')) keeps duplicates.
62
Parse JSON strings into structs
Create the schema to parse the JSON strings by providing an example JSON string from a row
that has no nulls
from_json uses JSON schema returned by schema_of_json to convert a column of JSON strings into structs
The example JSON string passed to schema_of_json is taken from the value field of a single row of data.
63
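A minimal sketch of this pattern (the sample JSON string and column names are illustrative):
-- schema_of_json infers a schema from a sample string; from_json applies it to every row
SELECT from_json(value, schema_of_json('{"device":"Linux","geo":{"city":"Oakland","state":"CA"},"user_id":"UA000000102"}')) AS parsed
FROM events_raw;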
Lab:
DE 2.5L: Reshape Data Lab
(Optional)
Demo:
DE 2.7A: SQL UDFs and
Control Flow (Optional)
Demo:
DE 2.7B: Python UDFs
(Optional)
Module 03
Manage Data with Delta Lake
Module Agenda
Manage Data with Delta Lake
68
What is Delta Lake?
Delta Lake is an open-source project that enables building a data lakehouse on top of existing cloud storage.
70
Delta Lake Is Not…
• Proprietary technology
• Storage format
• Storage medium
• Database service or data warehouse
Delta Lake Is…
• Open source
• Builds upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake brings ACID to object storage
Atomicity means all transactions either succeed or fail completely.
Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides differ from those of other systems.
75
Demo:
DE 3.1: Schemas and Tables
Demo:
DE 3.2: Version and Optimize
Delta Tables
Lab:
DE 3.3L - Manipulate Delta
Tables Lab
Demo:
DE 3.4: Set up Delta Tables
Demo:
DE 3.5: Load Data into Delta
Tables
Lab:
DE 3.6L: Load Data
Build Data Pipelines with Delta Live Tables
Agenda
Build Data Pipelines with Delta Live Tables
Sources such as Kinesis streams and CSV, JSON, and TXT files in the data lake flow through BRONZE (raw ingestion), SILVER (filtered, cleaned), and GOLD (business-level) tables, which feed streaming analytics, BI & reporting, and data science & ML.
85
Multi-Hop in the Lakehouse
Bronze Layer
86
Multi-Hop in the Lakehouse
Silver Layer
87
Multi-Hop in the Lakehouse
Gold Layer
88
Introduction to Delta Live
Tables
Multi-Hop in the Lakehouse
CSV, JSON, and TXT sources flow through the lakehouse with data quality enforcement, feeding streaming analytics and AI and reporting.
The Reality is Not so Simple
• Difficult to switch between batch and stream processing
• Impossible to trace data lineage
• Error handling and recovery is laborious
92
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake
93
What is a LIVE TABLE?
What is a Live Table?
Live Tables are materialized views for the lakehouse.
95
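A minimal sketch of a live table in DLT SQL (table and column names are illustrative):
CREATE OR REFRESH LIVE TABLE sales_report
AS SELECT city, sum(amount) AS total_sales
FROM LIVE.sales
GROUP BY city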
What is a Streaming Live Table?
Based on Spark Structured Streaming
96
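A minimal sketch in DLT SQL (table names are illustrative): a streaming live table reads its source incrementally with STREAM().
CREATE OR REFRESH STREAMING LIVE TABLE orders_silver
AS SELECT * FROM STREAM(LIVE.orders_bronze)
WHERE order_id IS NOT NULL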
When should I use streaming?
Using Spark Structured Streaming for ingestion
Easily ingest files from cloud storage as they are uploaded
98
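For example, a minimal sketch using Auto Loader from DLT SQL (path and format are illustrative):
CREATE OR REFRESH STREAMING LIVE TABLE raw_orders
AS SELECT * FROM cloud_files("/mnt/landing/orders", "json")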
Using the SQL STREAM() function
Stream data from any Delta table
• Table definitions are written (but not run) in notebooks.
• Databricks Repos allow you to version control your table definitions.
• A Pipeline picks one or more notebooks of table definitions, as well as any configuration required.
• DLT will create or update all the tables in the pipelines.
101
BEST PRACTICE
Development vs Production
Fast iteration or enterprise-grade reliability
In the Pipelines UI:
102
What if I have
dependent tables?
Declare LIVE Dependencies
Using the LIVE virtual schema.
104
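A minimal sketch (table names are illustrative): referencing another table through the LIVE schema declares the dependency, so DLT runs the tables in the right order.
CREATE OR REFRESH LIVE TABLE events_cleaned
AS SELECT * FROM LIVE.events_raw
WHERE event_name IS NOT NULL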
How do I ensure
Data Quality?
BEST PRACTICE
106
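Data quality in DLT is declared with expectations; a minimal sketch (constraint name, condition, and table names are illustrative):
CREATE OR REFRESH LIVE TABLE valid_orders (
  CONSTRAINT valid_timestamp EXPECT (order_timestamp IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM LIVE.orders_cleaned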
What about operations?
Pipelines UI (1-5)
A one-stop shop for ETL debugging and operations
108
The Event Log
The event log automatically records all pipeline operations.
Operational Statistics: time and current status for all operations, pipeline and cluster configurations, row counts
Provenance: table schemas, definitions, and declared properties; table-level lineage; query plans used to update tables
Data Quality: expectation pass/failure/drop statistics; input/output rows that caused expectation failures
113
How can I use parameters?
Modularize your code with configuration
Avoid hard coding paths, topic names, and other constants in your code.
import dlt

@dlt.table
def data():
    input_path = spark.conf.get("my_etl.input_path")
    return spark.readStream.format("cloudFiles").load(input_path)
115
How can I do
change data capture (CDC)?
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere.
• A target (for example, a cities table with id and city columns) for the changes to be applied to.
• A source of changes; currently this has to be a stream.
• KEYS: a unique key that can be used to identify a given row.
• SEQUENCE BY ts: orders the changes so that the latest change wins (for example, correcting a row from "Bekerly, CA" to "Berkeley, CA").
122
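A minimal sketch in DLT SQL (table and column names are illustrative):
-- Declare the target, then apply the change feed to it
CREATE OR REFRESH STREAMING LIVE TABLE cities;

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts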
REFERENCE ARCHITECTURE
Changes captured from MySQL or Postgres with Debezium are applied to a replicated_table using APPLY CHANGES INTO.
DLT encodes Delta best practices automatically when creating DLT tables.
How: DLT sets the following properties:
• optimizeWrite
• autoCompact
• tuneFileSizesForRewrites
DLT automatically manages your physical data to minimize cost and optimize performance.
How:
• runs vacuum daily
• runs optimize daily
• You can still tell us how you want it organized (i.e., ZORDER)
Schema evolution is handled for you.
How: Modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.
125
Demo:
DE 4.1 - Using the Delta Live
Tables UI
Demo:
DE 4.1.1 - Fundamentals of DLT
Syntax
Demo:
DE 4.1.2 - More DLT SQL
Syntax
Demo:
DE 4.2 - Delta Live Tables:
Python vs SQL
Demo:
DE 4.3 - Exploring the Results
of a DLT Pipeline
Demo:
DE 4.4 - Exploring the Pipeline
Events Logs
Lab:
DE 4.1.3 - Troubleshooting DLT
Syntax Lab
Deploy Workloads with Databricks Workflows
Module Agenda
Deploy Workloads with Databricks Workflows
Introduction to Workflows
Building and Monitoring Workflow Jobs
DE 5.1 - Scheduling Tasks with the Jobs UI
DE 5.2L - Jobs Lab
134
Introduction to Workflows
Lesson Objectives
Workflows is a service for data engineers, data scientists, and analysts to build reliable data, analytics, and AI workflows on any cloud.
Lakehouse Platform: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML, built on Unity Catalog (fine-grained governance for data and AI) and Delta Lake (data reliability and performance).
137
Databricks Workflows
138
DLT versus Jobs
Considerations
139
DLT versus Jobs
Use Cases
• Orchestration of Dependent Jobs: jobs running on schedule, containing dependent tasks/steps
• Machine Learning Tasks: run an MLflow notebook task in a job
• Arbitrary Code, External API Calls, Custom Tasks: run tasks in a job, which can contain a Jar file, Spark Submit, Python script, SQL task, or dbt
• Data Ingestion and Transformation: ETL jobs, support for batch and streaming, built-in data quality constraints, monitoring & logging
140
Workflows Features
Part 1 of 2
141
Workflows Features
Part 2 of 2
142
How to Leverage Workflows
• Allows you to build simple ETL/ML task orchestration
• Reduces infrastructure overhead
• Integrates easily with external tools
• Enables non-engineers to build their own workflows using simple UI
• Cloud-provider independent
• Enables re-using clusters to reduce cost and startup time
143
Common Workflow Patterns
Sequence
● Data transformation/processing/cleaning
● Bronze/silver/gold tables
Funnel
● Multiple data sources
● Data collection
Fan-out, star pattern
● Single data source
● Data ingestion and distribution
144
Example Workflow
ML feature extraction
E.g. MLflow
145
Building and Monitoring
Workflow Jobs
Workflows Job Components
147
Creating a Workflow
Task Definition
Workflows keeps track of job runs and saves information about the success or failure of each task in the job run.
Navigate to the Runs tab to view completed or active runs for a job.
151
Monitoring and Debugging
Repair a Failed Job Run
153
Navigating the Jobs UI
Runs vs Tasks tabs on the job page
Use the Runs tab to view completed or active runs for the job.
Use the Tasks tab to modify or add tasks to the job.
154
Demo:
DE 5.1.1 - Task Orchestration
156
Lab:
DE 5.2.1.L - Task Orchestration
158
Manage Data Access for Analytics
Lesson Objectives
By the end of this course, you will be able to:
1. Describe Unity Catalog key concepts and how it integrates with the
Databricks platform
2. Access Unity Catalog through clusters and SQL warehouses
3. Create and govern data assets in Unity Catalog
4. Adopt Databricks recommendations into your organization’s Unity
Catalog-based solutions
160
Module Agenda
Manage Data Access for Analytics with Unity Catalog
Introduction to Unity Catalog
Compute Resources in Unity Catalog
DE 6.1 - Introduction to Unity Catalog
DE 6.2 - Overview of Data Governance
DE 6.3 - Unity Catalog Key Concepts
DE 6.8 - Compute Resources
DE 6.9 - Creating Compute Resources
DE 6.14 - Create and Share Tables
DE 6.15 - Create External Tables
DE 6.20 - Upgrade a Table to Unity Catalog
DE 6.21 - Create Views and Limiting Table Access
161
Introduction to Unity Catalog
Overview of Data Governance
80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance.
Source: Gartner
164
Data Governance
Four key functional areas
• Control who has access to which data
• Capture and record all access to data
• Capture upstream sources and downstream consumers
• Ability to search for and discover authorized assets
165
Governance for data, analytics and AI is complex
Each platform (data lake, data warehouse, ML and AI) used by data analysts, data engineers, and data scientists brings yet another governance model: permissions on tables, on ML models, dashboards, features, …
Databricks Unity Catalog
Unified governance for data, analytics and AI: a single metadata and permission layer spanning the data lake, data warehouse, and ML and AI.
167
Unity Catalog
Overview
1. Unified governance across clouds: fine-grained governance for data lakes across clouds, based on the open standard ANSI SQL.
2. Unified data and AI assets: centrally share, audit, secure, and manage all data types with one simple interface.
3. Unified existing catalogs: works in concert with existing data, storage, and catalogs; no hard migration required.
168
Unity Catalog
Key Capabilities
169
Unity Catalog
Key Concepts
Metastore
Unity Catalog metastore elements
A metastore (managed in the control plane) contains storage credentials, external locations, catalogs with their schemas (databases), shares, and recipients; the data itself resides in cloud storage.
171
Metastore
Accessing legacy Hive metastore
The workspace-local Hive metastore remains accessible from the workspace alongside the Unity Catalog metastore, with its schemas (databases) exposed through the hive_metastore catalog.
172
Catalog
Top-level container for data objects
Within the metastore, a catalog is the top-level container for data objects; it holds schemas (databases).
173
Catalog
Three-level namespace
174
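With the three-level namespace, tables are addressed as catalog.schema.table; for example (names are illustrative):
SELECT * FROM main.default.customers;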
Data Objects
Schema (database), tables, views, functions
A schema (database) within a catalog contains tables (managed or external), views, and functions.
175
External Storage
Storage credentials and external locations
Storage credentials and external locations are metastore-level objects that govern access to cloud storage outside the metastore's managed location.
176
Delta Sharing
Shares and recipients
Shares and recipients are metastore-level objects used to share data securely with consumers outside the metastore.
177
Unity Catalog Architecture
Before Unity Catalog, each workspace had its own metastore; with Unity Catalog, a single account-level metastore is shared across workspaces.
179
Query Lifecycle
Unity Catalog Security Model
1. Send query (compute)
2. Check namespace, metadata, and grants (audit log)
4. Return short-lived token and signed URL
6. Return data (cloud storage)
7. Enforce policies
180
Compute Resources and Unity Catalog
Compute Resources for Unity Catalog
Cluster access modes supporting Unity Catalog:
• Single user: multiple language support, not shareable
• Shared: shareable, Python and SQL, legacy table ACLs
Cluster access modes not supporting Unity Catalog:
• No isolation shared
182
Cluster Access Mode
Feature matrix
Features compared per access mode: supported languages, shareable, legacy table ACL, credential passthrough, DBFS FUSE mounts, RDD API, dynamic views, machine learning, init scripts and libraries.
• No Isolation Shared: all languages
• Shared: SQL and Python
183
Roles and Identities in Unity Catalog
Unity Catalog
Roles
Workspace Admin
185
Unity Catalog
Roles
Account Admin
• Create or delete metastores, assign metastores to workspaces
• Manage users and groups, integrate with IdP
• Full access to all data objects
Metastore Admin
• Create or drop, grant privileges on, and change ownership of catalogs and other data objects
186
Unity Catalog
Roles
Workspace Admin
• Manages permissions on workspace assets
• Restricts access to cluster creation
• Adds or removes users
• Elevates users' permissions
• Grants privileges to others
• Changes job ownership
187
Unity Catalog
Identities
• User (e.g., [email protected]), including account administrators
• Service Principal (e.g., terraform), including service principals with administrative privileges
• Groups (e.g., allusers, analysts, developers)
189
Unity Catalog
Identity Federation
Identities such as [email protected] are defined once as account identities and made available across workspaces, rather than being managed separately in each workspace.
190
Security model
Principals (users, service principals, and groups) are granted privileges on securables:
• CREATE and USAGE on catalogs and schemas
• SELECT and MODIFY on tables and views
• CREATE TABLE, READ FILES, and WRITE FILES on storage credentials and external locations
• EXECUTE on functions
• Shares and recipients are the securables for Delta Sharing
200
Privilege Recap
Tables
✓ USAGE on the schema
✓ SELECT/MODIFY on the table
201
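For example, a minimal sketch following the privilege names above (schema, table, and group names are illustrative):
GRANT USAGE ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;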
Privilege Recap
Views
✓ USAGE on the schema
✓ SELECT on the view
202
Privilege Recap
Functions
✓ USAGE on the schema
✓ EXECUTE on the function
203
Dynamic Views
• Omit column values from output
• Omit rows from output
• Obscure data (e.g., ●●●●●●@databricks.com)
204
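A minimal sketch of a dynamic view that obscures a column for non-privileged users (group, column, and table names are illustrative):
CREATE OR REPLACE VIEW customers_redacted AS
SELECT
  user_id,
  CASE WHEN is_account_group_member('auditors') THEN email ELSE 'REDACTED' END AS email
FROM main.default.customers;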
Creating New Objects
✓ USAGE on the catalog
✓ USAGE and CREATE on the schema
205
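For example, a minimal sketch (catalog, schema, and group names are illustrative):
GRANT USAGE ON CATALOG main TO `developers`;
GRANT USAGE, CREATE ON SCHEMA main.sandbox TO `developers`;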
Deleting Objects
DROP objects: a table, view, or function is dropped from its schema within the catalog and metastore.
206
Unity Catalog External Storage
Storage Credentials and External Locations
Storage Credential: enables Unity Catalog to connect to external cloud storage. Examples include an IAM role for AWS S3 and a service principal for Azure Storage.
External Location: a cloud storage path plus a storage credential; a self-contained object for accessing specific locations in cloud storage, giving fine-grained control over external storage.
Storage Credentials and External Locations
Access Control
Storage Credential: create an external table, read files, or write files directly using this storage credential.
External Location: create an external table from files, read files, or write files governed by this external location.
209
Managed Tables
A managed table's data is stored in the metastore's managed storage location.
210
External Tables
An external table's data is stored in an external location outside the metastore's managed storage.
211
Unity Catalog Patterns and Best Practices
UC Patterns & Best Practices
One metastore per region: create a separate metastore for each region (Region A, Region B) in which you operate.
213
UC Patterns & Best Practices
Share data with Delta Sharing: tables in the Region A metastore can be shared to the Region B metastore, where recipients are granted SELECT on them.
215
UC Patterns & Best Practices
Data Segregation: use catalogs to segregate data, for example environment catalogs (dev), business-unit catalogs (bu1_dev), and sandbox catalogs (team1_sandbox, team2_sandbox).
216
UC Patterns & Best Practices
Identity Management: manage access through groups (e.g., developers) and service principals (e.g., terraform).
217
UC Patterns & Best Practices
Storage Credentials and External Locations
Storage Credential: enables Unity Catalog to connect to external cloud stores (for example, an IAM role for AWS S3 or a service principal for Azure Storage).
External Location: a cloud storage path plus a storage credential; a self-contained object for accessing specific locations in cloud stores, giving fine-grained control over external storage.
218
UC Patterns & Best Practices
Storage Credentials and External Locations
A single storage credential on the root path (/) can back multiple external locations scoped to sub-paths such as users/user1/, users/user2/, tables/, and shared/tmp/.
219
UC Patterns & Best Practices
Managed versus External Tables
Managed Tables External Tables
221
Unity Catalog Key Capabilities
Centralized metadata and user management
Unity Catalog Architecture
Before Unity Catalog, each workspace had its own metastore; with Unity Catalog, one account-level metastore serves multiple workspaces.
223
Centralized Access Controls
Centrally grant and manage access permissions across workloads
224
Three-level namespace
Seamless access to your existing metastores
Unity Catalog presents its catalogs alongside the legacy hive_metastore catalog and its databases (e.g., default), so existing tables remain queryable.
Unity Catalog also provides the audit log, access control, and external locations & credentials. Data lives in cloud storage (S3, ADLS, GCS): managed tables in a managed container/bucket, and external tables or files in external containers/buckets, accessed by users through clusters or SQL warehouses.
226
Automated lineage for all workloads
End-to-end visibility into how data flows and is consumed in your organization
● Code (in any language) is submitted to a cluster or SQL warehouse, or DLT executes a data flow.
● The lineage service analyzes logs emitted from the cluster and pulls metadata from DLT, assembling column- and table-level lineage.
● Lineage is presented to the end user graphically in Databricks and can be exported via API and imported into external catalogs (e.g., Alation, Collibra, Microsoft Purview).
228
Built-in search and discovery
Accelerate time to value with low latency data discovery
229
An open standard for secure sharing of data assets
Unity Catalog - Architecture
The Unity Catalog service (catalog, lineage, data explorer, storage credentials) sits between the Databricks workspace and cloud storage (S3, ADLS, GCS): users access containers and buckets through the workspace, with Unity Catalog governing access.
* Unity Catalog will support any data format (table or raw files)
231