Create a Bigtable external table
This page describes how to create a BigQuery permanent external table that can be used to query data stored in Bigtable. Querying data in Bigtable is available in all Bigtable locations.
Before you begin
Before you create an external table, gather some information and make sure you have permission to create the table.
Required roles
To create an external table that queries your Bigtable data, you must be a principal in the Bigtable Admin (roles/bigtable.admin) role for the instance that contains the source table.
You also need the bigquery.tables.create BigQuery Identity and Access Management (IAM) permission.
Each of the following predefined Identity and Access Management roles includes this permission:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Data Owner (roles/bigquery.dataOwner)
- BigQuery Admin (roles/bigquery.admin)
If you are not a principal in any of these roles, ask your administrator to grant you access or to create the external table for you.
For more information on Identity and Access Management roles and permissions in BigQuery, see Predefined roles and permissions. To view information on Bigtable permissions, see Access control with Identity and Access Management.
Create or identify a dataset
Before you create an external table, you must create a dataset to contain the external table. You can also use an existing dataset.
Optional: Designate or create a cluster
If you plan to frequently query the same data that serves your production application, we recommend that you designate a cluster in your Bigtable instance to be used solely for BigQuery analysis. This isolates the traffic from the cluster or clusters that you use for your application's reads and writes. To learn more about replication and creating instances that have more than one cluster, see About replication.
Identify or create an app profile
Before you create an external table, decide which Bigtable app profile BigQuery should use to read the data. We recommend that you use an app profile that you designate for use only with BigQuery.
If you have a cluster in your Bigtable instance that is dedicated to BigQuery access, configure the app profile to use single-cluster routing to that cluster.
To learn how Bigtable app profiles work, see About app profiles. To see how to create a new app profile, see Create and configure app profiles.
Retrieve the Bigtable URI
To create an external table for a Bigtable data source, you must provide the Bigtable URI. To retrieve the Bigtable URI, do the following:
Open the Bigtable page in the console.
Retrieve the following details about your Bigtable data source:
- Your project ID
- Your Bigtable instance ID
- The ID of the Bigtable app profile that you plan to use
- The name of your Bigtable table
Compose the Bigtable URI using the following format, where:
- project_id is the project containing your Bigtable instance
- instance_id is the Bigtable instance ID
- (Optional) app_profile is the app profile ID that you want to use
- table_name is the name of the table you're querying
https://googleapis.com/bigtable/projects/project_id/instances/instance_id[/appProfiles/app_profile]/tables/table_name
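The URI format above can be assembled programmatically. The following is a minimal Python sketch; the function name bigtable_uri and its parameters are hypothetical helpers introduced here for illustration, not part of any Google Cloud library.

```python
from typing import Optional


def bigtable_uri(project_id: str, instance_id: str, table_name: str,
                 app_profile: Optional[str] = None) -> str:
    """Compose a Bigtable URI in the format described above."""
    parts = [
        "https://googleapis.com/bigtable",
        "projects", project_id,
        "instances", instance_id,
    ]
    # The appProfiles segment is optional and only added when an ID is given.
    if app_profile:
        parts += ["appProfiles", app_profile]
    parts += ["tables", table_name]
    return "/".join(parts)


if __name__ == "__main__":
    print(bigtable_uri("myproject", "myBigtableInstance", "table1"))
    print(bigtable_uri("myproject", "myBigtableInstance", "table1",
                       app_profile="myAppProfile"))
```

The sketch does no validation of the IDs; it only shows where the optional app profile segment sits in the path.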
Create permanent external tables
When you create a permanent external table in BigQuery that is linked to a Bigtable data source, there are two options for specifying the format of the external table:
- If you are using the API or the bq command-line tool, you create a table definition file that defines the schema and metadata for the external table.
- If you are using SQL, you use the uris option of the CREATE EXTERNAL TABLE statement to specify the Bigtable table to pull data from, and the bigtable_options option to specify the table schema.
The external table data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying Bigtable data source.
To create a permanent table, choose one of the following methods.
SQL
You can create a permanent external table by running the CREATE EXTERNAL TABLE DDL statement. You must specify the table schema explicitly as part of the statement options.
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE EXTERNAL TABLE DATASET.NEW_TABLE
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['URI'],
  bigtable_options = BIGTABLE_OPTIONS
);
Replace the following:
- DATASET: the dataset in which to create the Bigtable external table.
- NEW_TABLE: the name for the Bigtable external table.
- URI: the URI for the Bigtable table you want to use as a data source. This URI must follow the format described in Retrieve the Bigtable URI.
- BIGTABLE_OPTIONS: the schema for the Bigtable table in JSON format. For a list of Bigtable table definition options, see BigtableOptions in the REST API reference.
Click Run.
For more information about how to run queries, see Run an interactive query.
A statement to create an external Bigtable table might look similar to the following:
CREATE EXTERNAL TABLE mydataset.BigtableTable
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/tables/table1'],
  bigtable_options =
    """
    {
      "columnFamilies": [
        {
          "familyId": "familyId1",
          "type": "INTEGER",
          "encoding": "BINARY"
        }
      ],
      "readRowkeyAsString": true
    }
    """
);
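If you generate such statements from code, serializing the schema with a JSON library avoids hand-written quoting mistakes in bigtable_options. The following Python sketch builds the statement text only; create_external_table_ddl is a hypothetical helper, it does not escape or validate identifiers, and it does not submit the query.

```python
import json


def create_external_table_ddl(dataset: str, table: str, uri: str,
                              bigtable_options: dict) -> str:
    """Return CREATE EXTERNAL TABLE text for a Bigtable source.

    Assumption: dataset, table, and uri are trusted values; nothing is
    escaped here.
    """
    # json.dumps emits strict JSON (quoted keys, lowercase true/false).
    options_json = json.dumps(bigtable_options, indent=2)
    return (
        f"CREATE EXTERNAL TABLE {dataset}.{table}\n"
        "OPTIONS (\n"
        "  format = 'CLOUD_BIGTABLE',\n"
        f"  uris = ['{uri}'],\n"
        f'  bigtable_options = """\n{options_json}\n"""\n'
        ");"
    )


if __name__ == "__main__":
    options = {
        "columnFamilies": [
            {"familyId": "familyId1", "type": "INTEGER", "encoding": "BINARY"}
        ],
        "readRowkeyAsString": True,
    }
    print(create_external_table_ddl(
        "mydataset", "BigtableTable",
        "https://googleapis.com/bigtable/projects/myproject/instances/"
        "myBigtableInstance/tables/table1",
        options))
```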
bq
You create a table in the bq command-line tool using the bq mk command. When you use the bq command-line tool to create a table linked to an external data source, you identify the table's schema using a table definition file.
Use the bq mk command to create a permanent table.
bq mk \
  --external_table_definition=DEFINITION_FILE \
  DATASET.TABLE
Replace the following:
- DEFINITION_FILE: the path to the table definition file on your local machine.
- DATASET: the name of the dataset that contains the table.
- TABLE: the name of the table you're creating.
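A table definition file for a Bigtable source is a JSON document. The following Python sketch writes one, assuming the file uses the same sourceFormat, sourceUris, and bigtableOptions fields as the API's external data configuration. The file name bigtable_def.json, the helper write_table_definition, and the sample column family are illustrative assumptions.

```python
import json

# Illustrative definition; replace the URI and column families with your own.
table_definition = {
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/myproject/instances/"
        "myBigtableInstance/tables/table1"
    ],
    "bigtableOptions": {
        "readRowkeyAsString": True,
        "columnFamilies": [
            {"familyId": "familyId1", "type": "INTEGER", "encoding": "BINARY"}
        ],
    },
}


def write_table_definition(definition: dict, path: str) -> str:
    """Write the definition as JSON and return the path that was written."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(definition, f, indent=2)
    return path


if __name__ == "__main__":
    print(write_table_definition(table_definition, "bigtable_def.json"))
```

You could then pass the resulting path as DEFINITION_FILE in the bq mk command shown earlier.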
API
Use the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.
For the sourceUris property in the Table resource, specify only one Bigtable URI. It must be a valid HTTPS URL.
For the sourceFormat property, specify "BIGTABLE".
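The two constraints above (exactly one source URI, and that URI being an HTTPS URL) can be checked before a request is sent. The following Python sketch builds a Table resource body as a plain dictionary; build_table_resource is a hypothetical helper, and the sketch does not call the API.

```python
from urllib.parse import urlparse


def build_table_resource(project_id: str, dataset_id: str, table_id: str,
                         source_uris: list, bigtable_options: dict) -> dict:
    """Build a Table resource body for tables.insert (sketch only)."""
    # Only one Bigtable URI is allowed.
    if len(source_uris) != 1:
        raise ValueError("specify exactly one Bigtable URI")
    parsed = urlparse(source_uris[0])
    # The URI must be a valid HTTPS URL.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("the Bigtable URI must be a valid HTTPS URL")
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "externalDataConfiguration": {
            "sourceFormat": "BIGTABLE",
            "sourceUris": list(source_uris),
            "bigtableOptions": bigtable_options,
        },
    }
```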
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Query external tables
For more information, see Query Bigtable data.
Generated schema
By default, BigQuery exposes the values in a column family as an array of columns and within that, an array of values written at different timestamps. This schema preserves the natural layout of data in Bigtable, but SQL queries can be challenging. It is possible to promote columns to subfields within the parent column family and to read only the latest value from each cell. This represents both of the arrays in the default schema as scalar values.
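To make the two shapes concrete, the following Python sketch models a row in the default schema as nested dictionaries and lists, then promotes selected columns to subfields that hold only the latest cell. It is a simplified illustration, not BigQuery's implementation: promote_latest, the sample row, and the integer timestamps are assumptions, and columns that are not listed are dropped here for brevity.

```python
def promote_latest(row: dict, family: str, column_types: dict) -> dict:
    """Return a row where listed columns become subfields with one cell."""
    promoted = {}
    for column in row[family]["column"]:
        cast = column_types.get(column["name"])
        if cast is None or not column["cell"]:
            continue  # unlisted or empty columns are omitted in this model
        # Keep only the most recently written value.
        latest = max(column["cell"], key=lambda cell: cell["timestamp"])
        promoted[column["name"]] = {
            "cell": {
                "timestamp": latest["timestamp"],
                "value": cast(latest["value"]),
            }
        }
    return {"rowkey": row["rowkey"], family: promoted}


# Default shape: an array of columns, each with an array of timestamped cells.
default_row = {
    "rowkey": "alice",
    "profile": {"column": [
        {"name": "gender", "cell": [{"timestamp": 1, "value": "female"}]},
        {"name": "age", "cell": [{"timestamp": 1, "value": "29"},
                                 {"timestamp": 2, "value": "30"}]},
    ]},
}

flat_row = promote_latest(default_row, "profile", {"gender": str, "age": int})
print(flat_row["profile"]["age"]["cell"]["value"])
```

In the promoted shape, a value is reached by name (profile, then age, then cell, then value) instead of by searching two nested arrays.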
Example
You are storing user profiles for a fictional social network. One data model for this might be a profile column family with individual columns for gender, age, and email:
rowkey | profile:gender| profile:age| profile:email
-------| --------------| -----------| -------------
alice | female | 30 | [email protected]
Using the default schema, a query to count the number of male users over 30 must search both nested arrays. The following query uses legacy SQL, which provides the OMIT RECORD IF and SOME constructs:
SELECT COUNT(1)
FROM [dataset.table]
OMIT RECORD IF
  NOT SOME(profile.column.name = "gender" AND profile.column.cell.value = "male")
  OR NOT SOME(profile.column.name = "age" AND INTEGER(profile.column.cell.value) > 30)
Querying the data is less challenging if gender and age are exposed as subfields. To expose them as subfields, list gender and age as named columns in the profile column family when defining the table. You can also instruct BigQuery to expose the latest values from this column family because typically, only the latest value (and possibly the only value) is of interest.
After exposing the columns as sub-fields, the GoogleSQL query to count the number of male users over 30 is:
SELECT COUNT(1)
FROM `dataset.table`
WHERE profile.gender.cell.value = "male"
  AND profile.age.cell.value > 30
Notice how gender and age are referenced directly as fields. The JSON configuration for this setup is:
"bigtableOptions": {
  "readRowkeyAsString": "true",
  "columnFamilies": [
    {
      "familyId": "profile",
      "onlyReadLatest": "true",
      "columns": [
        {
          "qualifierString": "gender",
          "type": "STRING"
        },
        {
          "qualifierString": "age",
          "type": "INTEGER"
        }
      ]
    }
  ]
}
Value encoding
Bigtable stores data as raw bytes, independent of data encoding. However, byte values are of limited use in SQL query analysis. Bigtable provides two basic types of scalar decoding: text and HBase-binary.
The text format assumes that all values are stored as alphanumeric text strings. For example, the integer 768 is stored as the string "768". The binary encoding assumes that HBase's Bytes.toBytes class of methods was used to encode the data and applies an appropriate decoding method.
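The difference between the two encodings can be seen by encoding the same integer both ways. The following Python sketch assumes the value was written as a Java long, which HBase's Bytes.toBytes(long) serializes as 8 big-endian bytes; the function names are hypothetical.

```python
import struct


def decode_text_int(raw: bytes) -> int:
    """Text encoding: the bytes are the decimal digits of the number."""
    return int(raw.decode("utf-8"))


def decode_hbase_long(raw: bytes) -> int:
    """Binary encoding of a long: 8 bytes, big-endian, signed."""
    (value,) = struct.unpack(">q", raw)
    return value


text_bytes = b"768"                    # 3 bytes: the characters '7', '6', '8'
binary_bytes = struct.pack(">q", 768)  # 8 bytes: 00 00 00 00 00 00 03 00

print(decode_text_int(text_bytes), decode_hbase_long(binary_bytes))
```

Reading the binary bytes as text, or the text bytes as a long, fails or yields a wrong number, which is why the encoding declared for a column has to match how the data was written.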
Supported regions and zones
Querying data in Bigtable is available in all supported Bigtable zones. For the list of zones, see Bigtable locations. For multi-cluster instances, BigQuery routes traffic based on Bigtable app profile settings.
Limitations
For information about limitations that apply to external tables, see External table limitations.
Scopes for Compute Engine instances
When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud products, including Bigtable. Applications running on the VM use the service account to call Google Cloud APIs.
If you set up a Compute Engine instance to run as a service account, and that service account accesses an external table linked to a Bigtable data source, you must add the Bigtable read-only data access scope (https://www.googleapis.com/auth/bigtable.data.readonly) to the instance. For more information, see Creating a Compute Engine instance for Bigtable.
For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.