Introduction to BigQuery data preparation
This document describes AI-augmented data preparation in BigQuery. Data preparations are BigQuery Studio resources, which use Gemini in BigQuery to analyze your data and provide intelligent suggestions for cleaning, transforming, and enriching it. You can significantly reduce the time and effort required for manual data preparation tasks. Orchestration of data preparations is powered by Dataform.
Benefits
- You can reduce the time spent on data pipeline development with context-aware, Gemini-generated transformation suggestions.
- You can validate the generated results in a preview and receive data quality cleanup and enrichment suggestions with automated schema mapping.
- Dataform lets you use a continuous integration, continuous development (CI/CD) process, supporting cross-team collaboration for code reviews and source control.
Required roles
Users who are preparing the data and the Dataform service accounts that are running the jobs require Identity and Access Management (IAM) roles. For more information, see Required roles and Set up Gemini for BigQuery.
Data preparation entry points
You can create and manage data preparations in the BigQuery Studio page (see Open the data preparation editor in BigQuery).
When you open a table in BigQuery data preparation, a BigQuery job runs using your credentials. The run creates sample rows from the chosen table and writes the results into a temporary table in the same project. Gemini uses the sample data and schema to generate data preparation suggestions shown in the data preparation editor.
Views in the data preparation editor
Data preparations appear as tabs on the BigQuery Studio page. Each tab has a series of sub-tabs, or data preparation views, where you design and manage your data preparations.
Data view
When you create a new data preparation, a data preparation editor tab opens, displaying the data view, which contains a representative sample of the table. For existing data preparations, you can navigate to the data view by clicking a node in the graph view of your data preparation pipeline.
The data view lets you do the following:
- Interact with your data to form data preparation steps.
- Apply suggestions from Gemini.
- Improve the quality of the Gemini suggestions by entering example values in the cells.
Over each column in your table, a statistical profile (a histogram) shows the count for each column's top values in the preview rows.
Graph view
The graph view is a visual overview of your data preparation. It appears as a tab on the BigQuery Studio page in the console, when you open a data preparation. The graph displays nodes for all steps in your data preparation pipeline. You can select a node on the graph to configure the data preparation steps it represents.
Schema view
The data preparation schema view displays the current schema of the active data preparation step. The schema shown matches the columns in the data view.
In the schema view, you can perform dedicated schema operations, such as removing columns, which also creates steps in the Applied steps list.
Suggestions by Gemini
Gemini provides context-aware suggestions to assist with the following data preparation tasks:
- Applying transformations and data quality rules
- Standardizing and enriching data
- Automating schema mapping
Each suggestion appears in a card in the suggestions list of the data preparation editor. The card contains the following information:
- The high-level category of the step, such as Keep rows or Transformation
- A description of the step, such as Keep rows if
COLUMN_NAME
is notNULL
- The corresponding SQL expression used to execute the step
You can preview or apply the suggestion card, or fine-tune the suggestion. You can also add steps manually. For more information, see Prepare data with Gemini.
To fine-tune the suggestions from Gemini, give it an example of what to change in a column.
Data sampling
BigQuery uses data sampling to give you a preview of your data preparation. You can view the sample in the data view for each node. Data in the sample isn't automatically refreshed. For more information, see Refresh data preparation samples.
Write mode
To optimize costs and processing time, you can change the write mode settings to incrementally process new data from the source. For example, if you have a table in BigQuery where records are inserted daily, and a Looker dashboard that must reflect the changed data, you can schedule the BigQuery data preparation to incrementally read the new records from the source table and propagate them to the destination table.
To configure the way your prepared data is written into a destination table, see Optimize data preparation by incrementally processing data.
The following write modes are supported:
Write mode option | Description |
---|---|
Full refresh | Inserts the prepared data to replace all data in the destination table. The table is recreated, not truncated. Full refresh is the default mode when writing to a destination table. |
Append | Inserts the prepared data into new rows in the destination table. |
Incremental | Inserts only the new or, depending on your incremental column choice, changed data in the destination table. |
Supported data preparation steps
BigQuery supports the following types of data preparation steps:
Step type | Description |
---|---|
Source | Adds a source when you select a BigQuery table to read from or when you add a join step. |
Transformation | Cleans and transforms data using a SQL expression. You receive
suggestion cards for the following expressions:
You can also use any valid BigQuery SQL expressions in manual transformation steps. For example:
For more information, see Add a transformation. |
Filter | Removes rows through the WHERE clause syntax. When you
add a filter step, you can choose to make it into a validation step.
For more information, see Filter rows. |
Validation | Sends rows that meet the validation rule criteria to an error table.
If data fails the validation rule and no error table is configured, the
data preparation fails during execution.
For more information, see Configure the error table and add a validation rule. |
Join | Joins values from two sources. Tables must be in the same location.
Join key columns must be of the same data type. Data preparations
support the following join operations:
For more information, see Add a join operation. |
Destination | Defines a destination for outputting data preparation steps. If you
enter a destination table that doesn't exist, the data preparation
creates a new table using the current schema information. For more information, see Add or change a destination table. |
Delete columns | Deletes columns from the schema. You perform
this step from the schema view.
For more information, see Delete a column. |
Scheduling data preparation runs
To execute the data preparation steps and load the prepared data into the destination table, schedule a one-time or a recurring data preparation run. You can schedule data preparations from the data preparation editor, and manage them from the BigQuery Orchestration page. For more information, see Schedule data preparations.
APIs
BigQuery data preparation doesn't have its own API. To learn more about using BigQuery data preparation with Dataform, contact [email protected].
Limitations
Data preparation is available with the following limitations:
- All BigQuery data preparation source and destination datasets of a given data preparation must be in the same location. For more information, see Supported locations.
- During pipeline editing, data and interactions are sent to a US data center for processing. For more information, see Supported locations.
- Data preparation doesn't support natural language SQL query generation.
- BigQuery data preparations don't support viewing, comparing, or restoring data preparation versions.
- Responses from Gemini are based on a sample of the dataset you provide when you design your data preparation pipeline. For more information, see how Gemini for Google Cloud uses your data and the terms in the Gemini for Google Cloud Trusted Tester Program.
Locations
Data preparations support data processing in all BigQuery locations. The source and destination datasets of a given data preparation must be in the same location.
Pricing
Running data preparations and creating data preview samples use BigQuery resources, which are charged at the rates shown in BigQuery pricing.
Data preparation is included in the Gemini in BigQuery pricing. You can use BigQuery data preparation during Preview at no additional cost. For more information, see Set up Gemini in BigQuery.
Quotas
For more information, see Gemini's quotas in Google Cloud.
What's next
- Learn how to prepare data with Gemini in BigQuery.
- Learn how to manage the deployment and execution of data preparations.