Create BigLake external tables for Delta Lake
BigLake lets you access Delta Lake tables with more granular access control. Delta Lake is an open source, tabular data storage format developed by Databricks that supports petabyte scale data tables.
BigQuery supports the following features with Delta Lake tables:
- Access delegation: Query structured data in external data stores with access delegation. Access delegation decouples access to the Delta Lake table from access to the underlying datastore.
- Fine-grained access control: Enforce fine-grained security at the table level, including row-level and column-level security. For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.
- Schema evolution: Schema changes in the Delta Lake tables are autodetected. Changes to the schema are reflected in the BigQuery table.
Delta Lake tables also support all BigLake features when you configure them as BigLake tables.
Before you begin
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the BigQuery Connection and BigQuery Reservation APIs.
-
In the Google Cloud console, activate Cloud Shell.
Ensure that you have a BigQuery dataset.
Ensure that your version of the Google Cloud SDK is 366.0.0 or later:
gcloud version
If needed, update the Google Cloud SDK.
Create a Cloud resource connection based on your external data source, and grant that connection access to Cloud Storage. If you don't have the appropriate permissions to create a connection, ask your BigQuery administrator to create a connection and share it with you.
Required roles
The following permissions are required to create a Delta Lake table:
bigquery.tables.create
bigquery.connections.delegate
The BigQuery Admin (roles/bigquery.admin
) predefined
Identity and Access Management role includes these permissions.
If you are not a principal in this role, ask your administrator to grant you these permissions or to create the Delta Lake table for you.
Additionally, to allow BigQuery users to query the table, the service account associated with the connection must have the following permission and access:
- BigQuery Viewer (
roles/bigquery.viewer
) role - BigQuery Connection User (
roles/bigquery.connectionUser
) role - Access to the Cloud Storage bucket that contains that data
For more information on Identity and Access Management roles and permissions in BigQuery, see Predefined roles and permissions.
Create tables with Delta Lake
To create Delta Lake tables, follow these steps.
SQL
Use the
CREATE EXTERNAL TABLE
statement
to create the Delta Lake table:
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
format ="DELTA_LAKE",
uris=['DELTA_TABLE_GCS_BASE_PATH']);
Replace the following values:
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- DATASET: the BigQuery dataset to contain the Delta Lake table
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
- REGION: the region that contains the connection to create the Delta Lake table—for example,
us
CONNECTION_ID: the connection ID—for example,
myconnection
When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example
projects/myproject/locations/connection_location/connections/myconnection
.DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
bq
In a command-line environment, use the
bq mk
command
to create the Delta Lake table:
bq mk --table --external_table_definition=DEFINITION_FILE PROJECT_ID:DATASET.DELTALAKE_TABLE_NAME
Replace the following values:
- DEFINITION_FILE: the path to a table definition file
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- DATASET: the BigQuery dataset to contain the Delta Lake table
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
REST
Use the BigQuery API to
create a Delta Lake table by calling the tables.insert
API method:
REQUEST='{
"autodetect": true,
"externalDataConfiguration": {
"sourceFormat": "DELTA_LAKE",
"connectionId": "PROJECT_ID.REGION.CONNECTION_ID",
"sourceUris": [
"DELTA_TABLE_GCS_BASE_PATH"
],
},
"tableReference": {
"tableId": "DELTALAKE_TABLE_NAME"
}
}'
echo $REQUEST | curl -X POST -d @- -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables?autodetect_schema=true
Replace the following values:
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- REGION: the region that contains the connection to create the Delta Lake table—for example,
us
CONNECTION_ID: the connection ID—for example,
myconnection
When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example
projects/myproject/locations/connection_location/connections/myconnection
.DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
DELTALAKE_TABLE_NAME: the name of your Delta Lake table
DATASET: the BigQuery dataset to contain the Delta Lake table
When you create Delta Lake tables, the Delta Lake prefix is
used as the URI for the table. For example, for a table that has logs in the
bucket gs://bucket/warehouse/basictable/_delta_log
, the table URI is
gs://bucket/warehouse/basictable
. When you run queries on the
Delta Lake table, BigQuery reads data under the prefix
to identify the current version of the table and then computes the metadata and
the files for the table.
Although you can create Delta Lake external tables without a connection, it is not recommended for the following reasons:
- Users might experience
ACCESS_DENIED
errors when trying to access files on Cloud Storage. - Features such as fine-grained access control are only available in Delta Lake BigLake tables.
Update Delta Lake tables
To update (refresh) the schema of Delta Lake tables, follow these steps.
bq
In a command-line environment, use the
bq update
command
to update (refresh) the schema of the Delta Lake table:
bq update --autodetect_schema PROJECT_ID:DATASET.DELTALAKE_TABLE_NAME
Replace the following values:
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- DATASET: the BigQuery dataset to contain the Delta Lake table
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
REST
Use the BigQuery API to
update a Delta Lake table by calling the tables.patch
API method:
REQUEST='{
"externalDataConfiguration": {
"sourceFormat": "DELTA_LAKE",
"sourceUris": [
"DELTA_TABLE_GCS_BASE_PATH"
],
"connectionId": "PROJECT_ID.REGION.CONNECTION_ID",
"autodetect": true
},
"tableReference": {
"tableId": "DELTALAKE_TABLE_NAME"
}
}'
echo $REQUEST |curl -X POST -d @- -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables
Replace the following values:
- DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- REGION: the region that contains the connection to create the Delta Lake table—for example,
us
CONNECTION_ID: the connection ID—for example,
myconnection
When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example
projects/myproject/locations/connection_location/connections/myconnection
.DELTALAKE_TABLE_NAME: the name of your Delta Lake table
DATASET: the BigQuery dataset to contain the Delta Lake table
Query Delta Lake tables
After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table. For example:
SELECT field1, field2 FROM mydataset.my_cloud_storage_table;
For more information, see Query Cloud Storage data in BigLake tables.
An external connection associated with a service account is used to connect to the datastore. Because the service account retrieves data from the datastore, users only need access to the Delta Lake table.
Data mapping
BigQuery converts Delta Lake data types to BigQuery data types as shown in the following table:
Delta Lake Type | BigQuery Type |
---|---|
boolean |
BOOL |
byte |
INT64 |
int |
INT64 |
long |
INT64 |
float |
FLOAT64 |
double |
FLOAT64 |
Decimal(P/S) |
NUMERIC or BIG_NUMERIC depending on precision |
date |
DATE |
time |
TIME |
timestamp (not partition column) |
TIMESTAMP |
timestamp (partition column) |
DATETIME |
string |
STRING |
binary |
BYTES |
array<Type> |
ARRAY<Type> |
struct |
STRUCT |
map<KeyType, ValueType> |
ARRAY<Struct<key KeyType, value ValueType>> |
Limitations
Delta Lake tables have BigLake table limitations and also the following limitations:
- Supports Delta Lake reader version 3 with deletion vectors and column mapping.
- Doesn't support Delta Lake V2 checkpoints.
- You must list the reader version in the last log entry file. For example,
new tables must include
00000..0.json
. - Change data capture (CDC) operations aren't supported. Any existing CDC operations are ignored.
- The schema is autodetected. Modifying the schema by using BigQuery isn't supported.
- Table column names must adhere to BigQuery column name restrictions.
- Materialized views aren't supported.
- The Read API isn't supported for Delta Lake.
Troubleshooting
This section provides help with Delta Lake BigLake tables. For more general help with troubleshooting BigQuery queries, see Troubleshoot query issues.
Query timeout and resource errors
Check the logs directory (gs://bucket/warehouse/basictable/_delta_log
) of the
Delta Lake table and look for JSON files with a version number greater than the previous checkpoint. You can obtain the version number
by listing the directory or inspecting the
_delta_log/_last_checkpoint file.
JSON files larger than 10 MiB can slow down table expansion which can lead to
timeout and resource issues. To resolve this issue, use the following command
to create a new checkpoint so that queries skip reading the JSON files:
spark.sql("ALTER TABLE delta.`gs://bucket/mydeltatabledir` SET TBLPROPERTIES ('delta.checkpointInterval' = '1')");
Users can then use the same command to reset the checkpoint interval to either the default value of 10 or to a value that avoids having more than 50 MB of JSON files between checkpoints.
Invalid column name
Ensure that column mapping is turned on for the Delta Lake table. Column mapping is supported with Reader version 2 or greater. For Reader version 1, set 'delta.columnMapping.mode' to 'name' by using the following command:
spark.sql("ALTER TABLE delta.`gs://bucket/mydeltatabledir` SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '3', 'delta.minWriterVersion' = '7')");
If the invalid column name adheres to flexible column name restrictions, contact Cloud Customer Care or [email protected].
Access denied errors
To diagnose issues with Delta Lake BigLake tables, check the following:
Ensure that you are using Delta Lake BigLake tables (with a connection).
Users have required permission outlined.
Performance
To improve query performance, try the following steps:
Use Delta Lake utilities to compact the underlying data files and remove redundant files, such as data and metadata.
Ensure that
delta.checkpoint.writeStatsAsStruct
is set totrue
.Ensure that variables that are frequently used in predicate clauses are in partition columns.
Large datasets (greater than 100 TB) might benefit from additional configurations and features. If the preceding steps do not resolve your issues, consider contacting Customer Care or [email protected], especially for datasets larger than 100 TB.