Query open table formats with manifests
This document describes how to use manifest files to query data stored in open table formats such as Apache Hudi and Delta Lake.
Some open table formats, such as Hudi and Delta Lake, export their current state as one or more manifest files. A manifest file contains a list of the data files that make up a table. With the manifest support in BigQuery, you can query and load data stored in open table formats.
Before you begin
- Enable the BigQuery Connection, BigQuery Reservation, and BigLake APIs.
- To create BigLake tables, run the Spark commands by using one of the following methods:
  - Create a Dataproc cluster. For querying Hudi tables, set the --optional-components field to HUDI. For querying Delta tables, set --optional-components to Presto.
  - Use a stored procedure for Spark in BigQuery.
- To store the manifest file in Cloud Storage, create a Cloud Storage bucket. You need a connection to your Cloud Storage bucket so that BigQuery can access the manifest file.
Required roles
To query BigLake tables based on Hudi and Delta Lake data, ensure you have the following roles:
- BigQuery Connection User (roles/bigquery.connectionUser)
- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery User (roles/bigquery.user)
You can also query Hudi external tables. However, we recommend that you upgrade the external table to BigLake. To query Hudi external tables, ensure that you have the following roles:
- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery User (roles/bigquery.user)
- Storage Object Viewer (roles/storage.objectViewer)
Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
To see the exact permissions that are required to query BigLake tables, expand the Required permissions section:
Required permissions
- bigquery.connections.use
- bigquery.jobs.create
- bigquery.readsessions.create (only required if you are reading data with the BigQuery Storage Read API)
- bigquery.tables.get
- bigquery.tables.getData
You might also be able to get these permissions with custom roles or other predefined roles.
Query Hudi workloads
To query Hudi data, follow these steps:
- Create an external table based on Hudi data.
- Upgrade the external table to BigLake.
Create Hudi external tables
When you sync tables by using the sync tool for Hudi and BigQuery, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.
To create a Hudi external table, follow these steps:
To create a Hudi external table, submit a job to an existing Dataproc cluster. When you build the Hudi-BigQuery connector, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.

spark-submit \
  --master yarn \
  --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
  --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
  JAR \
  --project-id PROJECT_ID \
  --dataset-name DATASET \
  --dataset-location LOCATION \
  --table TABLE \
  --source-uri URI \
  --source-uri-prefix URI_PREFIX \
  --base-path BASE_PATH \
  --partitioned-by PARTITION_BY \
  --use-bq-manifest-file
Replace the following:
- JAR: if you are using the Hudi-BigQuery connector, specify hudi-gcp-bundle-0.14.0.jar. If you are using the Hudi component in Dataproc 2.1, specify /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar
- PROJECT_ID: the project ID in which you want to create the Hudi BigLake table
- DATASET: the dataset in which you want to create the Hudi BigLake table
- LOCATION: the location in which you want to create the Hudi BigLake table
- TABLE: the name of the table that you want to create
  If you are transitioning from an earlier version of the Hudi-BigQuery connector (0.13.0 and earlier) that created views on the manifest files, use the same table name so that you can keep your existing downstream pipeline code.
- URI: the Cloud Storage URI that you created to store the Hudi manifest file
  This URI points to the first-level partition; make sure to include the partition key. For example, gs://mybucket/hudi/mydataset/EventDate=*
- URI_PREFIX: the prefix for the Cloud Storage URI path, usually the path to the Hudi tables
- BASE_PATH: the base path for the Hudi tables
  For example, gs://mybucket/hudi/mydataset/
- PARTITION_BY: the partition value
  For example, EventDate
For more information about the connector's configuration, see Hudi-BigQuery connector.
To set appropriate fine-grained controls or to accelerate performance by enabling metadata caching, see Upgrade BigLake tables.
Query Delta workloads
Delta tables are now natively supported. We recommend creating Delta BigLake tables for Delta workloads. Delta BigLake tables support more advanced Delta Lake features, including tables with column mapping and deletion vectors. Additionally, Delta BigLake tables directly read the latest snapshot, so updates are instantly available.
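For comparison, a native Delta BigLake table is created directly from the Delta table's location and does not need a manifest file. The following is a minimal sketch; the project (myproject), dataset (mydataset), connection (myconnection), and bucket path are placeholders that you would replace with your own values:

CREATE EXTERNAL TABLE `myproject.mydataset.delta_table`
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  format = "DELTA_LAKE",
  uris = ["gs://mybucket/path/to/delta-table"]
);

The rest of this section describes the manifest-based approach.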
To query Delta workloads, follow these steps:
- Generate a manifest file.
- Create a BigLake table based on the manifest file.
- Set appropriate fine-grained controls or accelerate performance by enabling metadata caching. To do this, see Upgrade BigLake tables.
Generate a manifest file
BigQuery supports manifest files in SymLinkTextInputFormat, which is a newline-delimited list of URIs. For more information about generating a manifest file, see Set up Presto to Delta Lake integration and query Delta tables.
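To make the format concrete, each line of a generated manifest file is the full Cloud Storage path of one data file; the bucket, folder, and file names below are illustrative:

gs://mybucket/delta/mydataset/part-00000-1a2b3c4d.snappy.parquet
gs://mybucket/delta/mydataset/part-00001-5e6f7a8b.snappy.parquet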
To generate a manifest file, submit a job to an existing Dataproc cluster:
SQL
Using Spark, run the following command on a Delta table at the location <path-to-delta-table>:

GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
Scala
Using Spark, run the following command on a Delta table at the location <path-to-delta-table>:

import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
Java
Using Spark, run the following command on a Delta table at the location <path-to-delta-table>:

import io.delta.tables.DeltaTable;

DeltaTable deltaTable = DeltaTable.forPath("<path-to-delta-table>");
deltaTable.generate("symlink_format_manifest");
Python
Using Spark, run the following command on a Delta table at the location <path-to-delta-table>:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
Create Delta BigLake tables
To create a Delta BigLake table, use the CREATE EXTERNAL TABLE statement with the file_set_spec_type field set to NEW_LINE_DELIMITED_MANIFEST:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE PROJECT_ID.DATASET_NAME.TABLE_NAME
WITH PARTITION COLUMNS(
  `PARTITION_COLUMN PARTITION_COLUMN_TYPE`,
)
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_NAME`
OPTIONS (
  format = "DATA_FORMAT",
  uris = ["URI"],
  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
  hive_partition_uri_prefix = "PATH_TO_DELTA_TABLE",
  max_staleness = STALENESS_INTERVAL,
  metadata_cache_mode = 'CACHE_MODE'
);
Replace the following:
- DATASET_NAME: the name of the dataset you created
- TABLE_NAME: the name you want to give to this table
- REGION: the location where the connection is located (for example, us-east1)
- CONNECTION_NAME: the name of the connection you created
- DATA_FORMAT: any of the supported formats (such as PARQUET)
- URI: the path to the manifest file (for example, gs://mybucket/path)
- PATH_TO_DELTA_TABLE: a common prefix for all source URIs before the partition key encoding begins
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Delta Lake instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
Example:
CREATE EXTERNAL TABLE mydataset.mytable
WITH CONNECTION `us-east1.myconnection`
OPTIONS (
  format = "PARQUET",
  uris = ["gs://mybucket/path/partitionpath=*"],
  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
  hive_partition_uri_prefix = "gs://mybucket/path/",
  max_staleness = INTERVAL 1 DAY,
  metadata_cache_mode = 'AUTOMATIC'
);
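If you instead set metadata_cache_mode to 'MANUAL', you can refresh the metadata cache on demand by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure. A minimal sketch, assuming a hypothetical table mydataset.mytable in the project myproject:

CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.mytable');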
Upgrade BigLake tables
You can also accelerate the performance of your workloads by taking advantage of metadata caching and materialized views. If you want to use metadata caching, you can specify its settings at the same time that you upgrade the table. To get table details such as source format and source URI, see Get table information.
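As an illustration of the materialized-view option, one pattern is to define a materialized view over a metadata cache-enabled BigLake table so that repeated aggregations are served from the view. The following is a minimal sketch; the table and column names (mydataset.mytable, field1) are placeholders:

CREATE MATERIALIZED VIEW mydataset.mytable_mv AS
SELECT
  field1,
  COUNT(*) AS row_count
FROM mydataset.mytable
GROUP BY field1;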
To update an external table to a BigLake table or to update an existing BigLake table, select one of the following options:
SQL
Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH CONNECTION `REGION.CONNECTION_ID`
OPTIONS(
  format = "TABLE_FORMAT",
  uris = ['BUCKET_PATH'],
  max_staleness = STALENESS_INTERVAL,
  metadata_cache_mode = 'CACHE_MODE'
);
Replace the following:
- PROJECT_ID: the name of the project that contains the table
- DATASET: the name of the dataset that contains the table
- EXTERNAL_TABLE_NAME: the name of the table
- REGION: the region that contains the connection
- CONNECTION_ID: the name of the connection to use
- TABLE_FORMAT: the format used by the table
  You can't change this when updating the table.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].
  You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.
  You can specify multiple buckets for the uris option by providing multiple paths.
  The following examples show valid uris values:
  - ['gs://bucket/path1/myfile.csv']
  - ['gs://bucket/path1/*.csv']
  - ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']
  When you specify uris values that target multiple files, all of those files must share a compatible schema.
  For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it
  For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually
  For more information on metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq mkdef and bq update commands to update a table:
Generate an external table definition that describes the aspects of the table to change:
bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
  --source_format=TABLE_FORMAT \
  --metadata_cache_mode=CACHE_MODE \
  "BUCKET_PATH" > /tmp/DEFINITION_FILE
Replace the following:
- PROJECT_ID: the name of the project that contains the connection
- REGION: the region that contains the connection
- CONNECTION_ID: the name of the connection to use
- TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
- CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.
  Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
  Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.
  You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name.
  You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.
  You can specify multiple buckets for the uris option by providing multiple paths.
  The following examples show valid uris values:
  - gs://bucket/path1/myfile.csv
  - gs://bucket/path1/*.csv
  - gs://bucket/path1/*,gs://bucket/path2/file00*
  When you specify uris values that target multiple files, all of those files must share a compatible schema.
  For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.
- DEFINITION_FILE: the name of the table definition file that you are creating.
Update the table using the new external table definition:
bq update --max_staleness=STALENESS_INTERVAL \
  --external_table_definition=/tmp/DEFINITION_FILE \
  PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME
Replace the following:
- STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
  To disable metadata caching, specify 0. This is the default.
  To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.
- DEFINITION_FILE: the name of the table definition file that you created or updated.
- PROJECT_ID: the name of the project that contains the table
- DATASET: the name of the dataset that contains the table
- EXTERNAL_TABLE_NAME: the name of the table
Query BigLake and external tables
After creating a BigLake table, you can query it by using GoogleSQL syntax, the same as if it were a standard BigQuery table. For example, SELECT field1, field2 FROM mydataset.my_cloud_storage_table;
Limitations
- BigQuery only supports querying Delta Lake reader v1 tables.
- The Hudi and BigQuery integration only works for hive-style partitioned copy-on-write tables.
What's next
- Learn about using SQL in BigQuery.
- Learn about BigLake tables.
- Learn about BigQuery quotas.