Introduction to data transformation

This document describes the different ways you can transform data in your BigQuery tables.

For more information about data integrations, see Introduction to loading, transforming, and exporting data.

Methods of transforming data

You can transform data in BigQuery in the following ways:

  • Use data manipulation language (DML) to transform data in your BigQuery tables.
  • Use materialized views to automatically cache the results of a query for increased performance and efficiency.
  • Use continuous queries to analyze incoming data in real time and continuously insert the output rows into a BigQuery table or export to Pub/Sub or Bigtable.
  • Use Dataform to develop, test, control versions, and schedule SQL workflows in BigQuery.
  • Use data preparation with context-aware, AI-generated transformation recommendations to cleanse data for analysis.

The following table shows the different characteristics of each transformation method.

Transform method Transformation target Definition method Transformation frequency
Data manipulation language (DML) Table (in place) SQL DML User-initiated or scheduled
Materialized views Materialized view SQL query Automatic or manual refresh
Continuous queries Table, Pub/Sub topic, Bigtable table SQL query with EXPORT DATA Continuous
Dataform Table Dataform core (SQLX) Scheduled (workflows)
Data preparation Table Visual editor Scheduled

You can also review the change history of a BigQuery table to examine the transformations made to a table in a specified time range.

Transform data with DML

You can use data manipulation language (DML) to transform data in your BigQuery tables. DML statements are GoogleSQL queries that manipulate existing table data to add or delete rows, modify data in existing rows, or merge data with values from another table. DML transformations are also supported in partitioned tables.

You can run multiple DML statements concurrently, where BigQuery queues several DML statements that transform your data one after the other. BigQuery manages how concurrent DML statements are run, based upon the transformation type.

Transform data with materialized views

Materialized views views are precomputed views that periodically cache the results of a SQL query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only changes from the base tables to compute up-to-date results.

Materialized views are precomputed in the background when the base tables change. Any incremental data changes from the base tables are automatically added to the materialized views, with no user action required.

Transform data with continuous queries

Continuous queries are SQL statements that run continuously. Continuous queries let you analyze incoming data in BigQuery in real time. You can insert the output rows produced by a continuous query into a BigQuery table or export them to Pub/Sub or Bigtable.

Transform data with Dataform

Dataform lets you manage data transformation in the extract, load, and transform (ELT) process for data integration. After extracting raw data from source systems and loading it into BigQuery, you can use Dataform to transform it into an organized, tested, and documented suite of tables. While in DML you take an imperative approach by telling BigQuery how exactly to transform your data, in Dataform you write declarative statements where Dataform then determines the transformation needed to achieve that state.

In Dataform, you can develop, test, and version control SQL workflows for data transformation from data source declarations to output tables, views, or materialized views. You can develop SQL workflows with Dataform core or pure JavaScript. Dataform core is an open source meta-language that extends SQL with SQLX and JavaScript. You can use Dataform core to manage dependencies, set up automated data quality testing, and document table or column descriptions within the code.

Dataform stores your SQL workflow code in repositories and uses Git to track file changes. Development workspaces in Dataform let you work on the contents of the repository without affecting the work of others who are working in the same repository. You can connect Dataform repositories to third-party Git providers, including Azure DevOps Services, Bitbucket, GitHub, and GitLab.

You can run or schedule SQL workflows with Dataform release configurations and workflow configurations. Alternatively, you can schedule executions either with Cloud Composer, or with Workflows and Cloud Scheduler. During execution, Dataform executes SQL queries in BigQuery in order of object dependencies in your SQL workflow. After execution, you can use your defined tables and views for analysis in BigQuery.

To learn more about creating data transformation SQL workflows in Dataform, see Dataform overview and Overview of Dataform features.

Prepare data in BigQuery

To reduce the toil of data preparation, BigQuery lets you clean data with Gemini-generated transformation suggestions. Data preparation in BigQuery offers the following assistance:

  • Applying transformations and data quality rules
  • Standardizing and enriching data
  • Automating schema mapping

You can validate the results in a preview of your data before executing the changes on all your data.

For more information, see Introduction to BigQuery data preparation.

What's next