BigQuery tables for Apache Iceberg

For support during the preview, email [email protected].

BigQuery tables for Apache Iceberg, (hereafter, Iceberg tables) provide the foundation for building open-format lakehouses on Google Cloud. Iceberg tables offer the same fully managed experience as BigQuery tables, but store data in customer-owned storage buckets using Parquet to be interoperable with Iceberg open table formats.

BigQuery tables for Apache Iceberg are distinct from BigLake external tables for Apache Iceberg because only BigQuery tables for Apache Iceberg are modifiable directly within BigQuery. BigLake external tables for Apache Iceberg are read-only tables generated from another query engine, such as Apache Spark, and can only be queried using BigQuery.

Iceberg tables support the following features:

Table mutations using GoogleSQL data manipulation language (DML).
Unified batch and high throughput streaming using the Storage Write API through BigLake connectors like Spark, Dataflow, and other engines.
Schema evolution, which lets you add, drop, and rename columns to suit your needs. This feature also lets you change an existing column's data type and column mode. For more information, see type conversion rules.
Automatic storage optimization, including adaptive file sizing, automatic clustering, garbage collection, and metadata optimization.
Column-level security and data masking.

Architecture

Iceberg tables bring the convenience of BigQuery resource management to tables that reside in your own cloud buckets. Iceberg tables let you use BigQuery on these tables without moving the data out of buckets that you control.

The following diagram shows the managed table architecture at a high level: BigQuery for Iceberg tables architecture diagram.

This table management has the following implications on your bucket:

BigQuery creates new data files in the bucket in response to write requests and background storage optimizations, such as DML statements and streaming.
When you delete a managed table in BigQuery, BigQuery doesn't delete associated data files. You must confirm the deletion by deleting the files and any exported table metadata from the bucket manually.
Iceberg tables incur no BigQuery storage costs. For more information, see Billing.

Creating an Iceberg table is similar to creating BigQuery tables. Because it stores data in open formats on Cloud Storage, it has more options for the following:

Specify the Cloud resource connection with WITH CONNECTION to configure the connection credentials for BigLake to access Cloud Storage.
Specifying the file format of data storage with file_format. PARQUET is supported in Preview.
Specifying the open-source metadata table format with table_format. ICEBERG is supported in Preview.

Best practices

Directly changing or adding files to the bucket outside of BigQuery can lead to data loss or unrecoverable errors. The following table describes possible scenarios:

Operation	Consequences	Prevention
Add new files to the bucket outside BigQuery.	Data loss: New files or objects added outside of BigQuery are not tracked by BigQuery. Untracked files are deleted by background garbage collection processes.	Add data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg tables.
Create a new Iceberg table in a non-empty prefix.	Data loss: Extant data isn't tracked by BigQuery, so these files are considered untracked, and deleted by background garbage collection processes.	Only create new Iceberg tables in empty prefixes.
Modify or replace Iceberg table data files.	Data loss: On external modification or replacement, the table fails a consistency check and becomes unreadable. Queries against the table fail. There is no self-serve way to recover from this point. Contact support for data recovery assistance.	Modify data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg tables.
Create two BigQuery tables for Apache Iceberg on the same or overlapping URIs.	Data loss: BigQuery doesn't bridge identical URI instances of Iceberg tables. Background garbage collection processes for each table will consider the opposite table's files as untracked, and delete them, causing data loss.	Use unique URIs for each Iceberg table.

Billing

The following features are charged using the existing published pricing:

Cloud Storage pricing for all data stored in Cloud Storage buckets, data processing done by Cloud Storage, and network usage for the amount of data read from your bucket.
BigQuery compute pricing for queries, DML, and background storage optimization (including clustering, coalescence, and garbage collection).
- Charges using reservations (slots) follow existing slot pricing.
- Charges using the on-demand stock-keeping units (SKUs) follow existing on-demand pricing. For more information, see BigLake costs.
Batch load and Extract compute are charged using either on-demand SKUs or reservations (slots).
Storage Read API pricing for reading from Spark through the Read API.
Storage Write API pricing for streaming.

Iceberg table workflows

The following sections describe how to create, load, manage, and query managed tables.

Before you begin

Before creating and using Iceberg tables, ensure that you have set up a cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section.

Required roles

To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:

To create Iceberg tables:
- BigQuery Data Owner (roles/bigquery.dataOwner) on your project
- BigQuery Connection Admin (roles/bigquery.connectionAdmin) on your project
To query Iceberg tables:
- BigQuery Data Viewer (roles/bigquery.dataViewer) on your project
- BigQuery User (roles/bigquery.user) on your project
For the connection service account to read and write data in Cloud Storage:
- Storage Object Admin (roles/storage.objectAdmin) on the bucket
- Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) on the bucket

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to let BigQuery manage tables in your project:

bigquery.connections.delegate on your project
bigquery.jobs.create on your project
bigquery.readsessions.create on your project
bigquery.tables.create on your project
bigquery.tables.get on your project
bigquery.tables.getData on your project
storage.buckets.get on your project
storage.objects.create on your project
storage.objects.delete on your project
storage.objects.get on your project
storage.objects.list on your project

You might also be able to get these permissions with custom roles or other predefined roles.

Create Iceberg tables

To create a Iceberg table, select one of the following methods:

SQL

CREATE TABLE [PROJECT_NAME.]DATASET_NAME.TABLE_NAME (
COLUMN DATA_TYPE[, ...]
)
CLUSTER BY CLUSTER_COLUMN_LIST
WITH CONNECTION CONNECTION_NAME
OPTIONS (
file_format = 'PARQUET',
table_format = 'ICEBERG',
storage_uri = 'STORAGE_URI');

Replace the following:

PROJECT_NAME: the project containing the dataset. If undefined, the command assumes the default project.
DATASET_NAME: an existing dataset.
TABLE_NAME: the name of the table you're creating.
DATA_TYPE: the data type of the information that is contained in the column.
CLUSTER_COLUMN_LIST: a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.

bq

bq --project_id=PROJECT_NAME mk \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=CONNECTION_NAME \
    --storage_uri=STORAGE_URI \
    --schema=COLUMN_NAME:DATA_TYPE[, ...] \
    --clustering_fields=CLUSTER_COLUMN_LIST \
    MANAGED_TABLE_NAME

Replace the following:

PROJECT_NAME: the project containing the dataset. If undefined, the command assumes the default project.
CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
COLUMN_NAME: the column name.
DATA_TYPE: the data type of the information contained in the column.
CLUSTER_COLUMN_LIST: a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
MANAGED_TABLE_NAME: the name of the table you're creating.

API

Call the tables.insert' method with a defined table resource, similar to the following:

{
"tableReference": {
  "tableId": "TABLE_NAME"
},
"biglakeConfiguration": {
  "connectionId": "CONNECTION_NAME",
  "fileFormat": "PARQUET",
  "tableFormat": "ICEBERG",
  "storageUri": "STORAGE_URI"
},
"schema": {
  "fields": [
    {
      "name": "COLUMN_NAME",
      "type": "DATA_TYPE"
    }
    [, ...]
  ]
}
}

Replace the following:

TABLE_NAME: the name of the table that you're creating.
CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
STORAGE_URI: a fully qualified Cloud Storage URI. Wildcards are also supported. For example, gs://mybucket/table.
COLUMN_NAME: the column name.
DATA_TYPE: the data type of the information contained in the column.

Import data into Iceberg table

The following sections describe how to import data from various table formats into Iceberg tables.

Fast load from Parquet files

The copy_files_only option lets you load data faster by copying your existing Parquet files, instead of reading the content and rewriting the content as new files. Fast load uses less compute capacity compared to a regular file load. Parquet files must be compatible with the Apache Iceberg specification and have complete column statistics. Fast load doesn't detect invalid values (such as out-of-range timestamps) in files because the files are not read and reprocessed. For more information on loading Parquet files, see Loading Parquet data into a new table.

To fast load flat Parquet files into an existing Iceberg table, use the bq load command:

bq load \
    --copy_files_only \
    --source_format=PARQUET \
    DATASET_NAME.TABLE_NAME \
    PATH_TO_SOURCE

Replace the following:

DATASET_NAME: the dataset containing your Iceberg table.
TABLE_NAME: the name of the Iceberg table into which you're loading data.
PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/mydata*.parquet.

Standard load data from flat files

Iceberg tables use BigQuery load jobs to load external files into Iceberg tables. If you have an existing Iceberg table, follow the bq load CLI guide or the LOAD SQL guide to load external data. After loading the data, new Parquet files are written into the STORAGE_URI/data folder.

If the prior instructions are used without an existing Iceberg table, a BigQuery table is created instead.

See the following for tool-specific examples of batch loads into managed tables:

SQL

LOAD DATA INTO MANAGED_TABLE_NAME
FROM FILES (
uris=['STORAGE_URI'],
format='FILE_FORMAT');

Replace the following:

MANAGED_TABLE_NAME: the name of an existing Iceberg table.
STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.

bq

bq load \
  --source_format=FILE_FORMAT \
  MANAGED_TABLE \
  STORAGE_URI

Replace the following:

FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
MANAGED_TABLE_NAME: the name of an existing Iceberg table.
STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.

Standard load from Hive-partitioned files

You can load Hive-partitioned files into Iceberg tables using standard BigQuery load jobs. For more information, see Loading externally partitioned data.

Load streaming data from Pub/Sub

You can load streaming data into Iceberg tables by using a Pub/Sub BigQuery subscription.

Export data from Iceberg tables

The following sections describe how to export data from Iceberg tables into various table formats.

Export data into flat formats

To export a Iceberg table into a flat format, use the EXPORT DATA statement and select a destination format. For more information, see Exporting data.

Read Iceberg tables with Iceberg metadata

To read data from Iceberg tables in Iceberg format, follow these steps:

Export the metadata into the Iceberg format with the EXPORT TABLE METADATA SQL statement.
Setup and read table data in the Apache Spark with HadoopCatalog.

The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified Iceberg table.
```
spark-sql \
 --packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER \
 --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
 --conf spark.sql.catalog.CATALOG_NAME.type=hadoop \
 --conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH' \

# Queries the table
spark-sql> SELECT * FROM CATALOG_NAME.FOLDER_NAME;
```
Replace the following:
- ICEBERG_VERSION_NUMBER: the current version of Apache Spark Iceberg runtime. Download the latest version from Spark Releases.
- CATALOG_NAME: the catalog to reference your Iceberg table.
- BUCKET_PATH: the path to the bucket containing the table files. For example, gs://mybucket/.
- FOLDER_NAME: the folder containing the table files. For example, myfolder.

Modify Iceberg tables

To modify a Iceberg table, follow the steps shown in Modifying table schemas.

Pricing

Iceberg table pricing consists of three separate components:

Storage

Iceberg table store all data in Cloud Storage. You are charged for all data stored, including historical table data. Cloud Storage data processing and transfer charges may also apply, as applicable. There are no BigQuery-specific storage fees. For more information, see Cloud Storage Pricing.

Storage optimization

Iceberg tables require storage optimization operations, such as file coalescing and re-clustering. These optimization operations use Enterprise edition pay as you go slots, and don't use existing BACKGROUND reservations.

Data export operations taking place while streaming through the BigQuery Storage Write API are included in Storage Write API pricing and are not charged as background maintenance. For more information, see Data ingestion pricing.

Queries and jobs

Similar to BigQuery tables, you are charged for queries and bytes read (per TiB) if you are using BigQuery on-demand pricing, or slot consumption (per slot hour) if you are using BigQuery capacity compute pricing.

BigQuery pricing also applies to the BigQuery Storage Read API and the BigQuery Storage Write API.

Load and export operations (such as EXPORT METADATA) use Enterprise edition pay as you go slots. This differs from BigQuery tables, which are not charged for these operations. If PIPELINE reservations with Enterprise or Enterprise Plus slots are available, load and export operations preferentially use these reservation slots instead.

Limitations

Iceberg tables have the following limitations:

Iceberg tables don't support renaming operations or ALTER TABLE RENAME TO statements.
Iceberg tables don't support table copies or CREATE TABLE COPY statements.
Iceberg tables don't support table clones or CREATE TABLE CLONE statements.
Iceberg tables don't support table snapshots or CREATE SNAPSHOT TABLE statements.
Iceberg tables don't support the following table schema:
- Empty schema
- Schema with INTERVAL, JSON, RANGE, or GEOGRAPHY data types.
- Schema with field collations.
- Schema with default value expressions.
EXPORT METADATA doesn't support tables containing BIGNUMERIC or NUMERIC data types with precision greater than 38 places.
Iceberg tables don't support the following schema evolution cases:
- NUMERIC to FLOAT type coercions
- INT to FLOAT type coercions
- Adding new nested fields to an existing RECORD columns using SQL DDL statements
Iceberg tables display a 0-byte storage size when queried by the console or APIs.
Iceberg tables don't support materialized views.
Iceberg tables don't support multi-statement transactions.
Iceberg tables don't support change data capture (CDC) updates.
When streaming data to Iceberg tables using the BigQuery Storage Write API, you must first disable the query cache.
Iceberg tables don't support managed disaster recovery
Iceberg tables don't support partitioning. Consider clustering as an alternative.
Iceberg tables don't support row-level security.
Iceberg tables don't support time travel.
Iceberg tables don't support fail-safe windows.
Iceberg tables don't support extract jobs.
The INFORMATION_SCHEMA.TABLE_STORAGE view doesn't include Iceberg tables.
Iceberg tables are not supported as query result destinations.
CREATE OR REPLACE doesn't support replacing standard tables with Iceberg tables, or Iceberg tables with standard tables.
Batch loading and LOAD DATA statements only support appending data to existing Iceberg tables.
Batch loading and LOAD DATA statements don't support schema updates.
TRUNCATE TABLE doesn't support Iceberg tables. There are two alternatives:
- CREATE OR REPLACE TABLE, using the same table creation options.
- DELETE FROM table WHERE true
The APPENDS table-valued function (TVF) doesn't support Iceberg tables.
Iceberg exports in Apache Spark don't contain recently streamed data in write-optimized storage.
Fast load doesn't support files with flexible column names.