Monitor data quality with scans
This document explains how to use BigQuery and Dataplex together to ensure that data meets your quality expectations. BigQuery uses Dataplex to define continuous data checks, monitor results, and troubleshoot issues with data quality.
For more information about automatic data quality, see About auto data quality.
Before you begin
-
For creating and modifying scans in your project: Enable the Dataplex API.
For cross-project scans: Create a Dataplex service identifier with the
gcloud beta services identity create
command. If a Dataplex service identifier does not exist, this command returns a new one. If a service identifier already exists, the command returns the existing one. This command might prompt you to install the gcloud CLI beta commands component.gcloud beta services identity create --service=dataplex.googleapis.com
Required roles
Ask your administrator to grant the following roles to the appropriate account principals based on the use cases they need access to. For more information about granting roles, see Manage access.
BigQuery roles
- BigQuery Data Viewer on a table to create a scan on that table without publishing the results.
- BigQuery Data Editor on a table to create a scan on that table with publishing.
- If the BigQuery table and the data quality scan are in
different projects, then you need to give the Dataplex
service account read permission
bigquery.tables.getData
(or the role BigQuery Data Viewer) on the corresponding BigQuery table. To get the service identity for a service account, see Before you begin. - If you are scanning a BigQuery external table from
Cloud Storage, then assign the Cloud Storage role
(
roles/storage.objectViewer
) to the Dataplex service account.
Dataplex roles
- Dataplex DataScan Administrator on the project level - to create scans.
- Dataplex DataScan Editor on a scan - to edit any properties of a scan (except permissions), execute the scan, and delete the scan.
- Dataplex DataScan DataViewer on a scan to view results of a scan.
These roles contain the permissions needed for the prior use cases. To see the exact permissions that are required, expand the Required permissions section.
Required permissions
The following permissions are required to use various aspects of data quality scans:
- To change the configuration of a data scan:
dataplex.datascans.update
- on the datascan resource - To change the policy of a data scan:
dataplex.datascans.setIamPolicy
- on the datascan resource - To create a data scan on a BigQuery table:
bigquery.tables.getData
- the table to scan - To create data scans in a project:
dataplex.datascans.create
- on the project - To delete a data scan:
dataplex.datascans.delete
- on the datascan resource - To export the data scan results to a BigQuery dataset:
bigquery.datasets.get
,bigquery.tables.create
,bigquery.tables.get
,bigquery.tables.update
,bigquery.tables.updateData
- the destination dataset - To publish the results of a data scan to a table:
bigquery.tables.update
- the destination table - To run a data scan:
dataplex.datascans.run
- on the datascan resource - To scan an external table from Cloud Storage:
storage.buckets.get
,storage.objects.get
- the bucket containing the tables to scan - To view the results of a data scan:
dataplex.datascans.getData
- on the data scan resource - To view the results of a data scan:
dataplex.datascans.get
- on the datascan resource - To view the results of a data scan:
dataplex.datascans.list
- on the datascan resource
You might also be able to get these permissions with custom roles or other predefined roles.
Create a data quality scan
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click a table for the data quality scan.
Click the Data quality tab.
Click Data quality scan > Create new scan.
Optional: Edit the following values:
- Display name: the mutable name of the resource in the console.
- ID: a unique identifier for the scan. It cannot be changed after the scan is created.
- Description: a description of the scan.
- Region: defines the region that the data scan is processed in.
- Scope: the range of data available for scanning. Select
either Incremental or Entire data.
If you choose Incremental, we recommend
including a
DATE
orTIMESTAMP
column that increases linearly. This column can be used to identify new records. For tables partitioned on columns of typeDATE
orTIMESTAMP
, it's recommended to use the partition column as the timestamp field. - Filters: filters to apply to the data before the scan runs. To
filter rows, select the Filter rows checkbox and enter a valid SQL
expression in the input text field. The expression must be in
BigQuery standard SQL syntax and can be used in a
WHERE
clause. - Sampling size: the percentage of data that you want to sample. For incremental data scans, only the latest increment is sampled.
- Publish results to the BigQuery and Dataplex Catalog UI: this option makes the latest results of the data profiling scan available in the BigQuery UI, under the Data quality tab for the source table. If a scan is running and set to publish, this option might be unavailable.
- Schedule: either On demand (default) or
Repeat. If you select Repeat, specify the frequency of the
scheduled scan with Daily, Weekly, Monthly, or Custom.
Custom uses the
cron
time format to specify the schedule. For example, a scan set to run on
the second Tuesday of the month at 1:00 AM would look like this:
0 1 8-14 * 2
.
To advance the pane to show data quality rules settings, click Continue.
Click Add rules and add one or more of the following rules as appropriate. Rules can also be deleted with
Remove.- Profile based recommendations
- Built-in rule types
- SQL row check rule
- SQL aggregate check rule
Optional: To advance the pane to show additional optional settings, click Continue and edit the following values:
- Export scan results to BigQuery table: select a BigQuery dataset and a table to save the results of the quality scan. If a dataset is defined but no table is defined, Dataplex creates a table for you. Tables created in this manner might incur storage costs.
- Labels: add a label to the scan.
Click one of the following buttons based on your needs:
- To save the scan settings, click Create.
- To save and run the scan, click Run.
Manage data quality scan permissions
To change the access permissions of existing quality scans, do the following:
Go to the BigQuery page.
In the Explorer pane, select a table for the data quality scan.
Click the Data quality tab.
Click Data quality scan > Manage scan permissions. This opens Dataplex in a new tab.
Click the Permissions tab.
- To grant access to a principal, click Grant access and grant Dataplex DataScan DataViewer to the associated principal.
- To remove access from a principal, click Remove access and remove Dataplex DataScan DataViewer from the associated principal.
Edit an existing data quality scan
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, select a table for the data quality scan.
Click Data quality scan > Edit scan configuration.
This opens the data quality scan settings, which can be modified and saved for future scans.
View data quality scan results
There are multiple ways to view data quality scan results. Select the option that best suits your needs.
View published results
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, select a table for the data quality scan.
Click the Data quality tab.
The latest published results are shown in this view.
View historical scan results
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, select a table for the data quality scan.
Click the Data quality tab.
Click Data quality scan > View historical results.
View all data quality scans on a table
To open Dataplex with a scan history for a specific table, do the following:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, select a table for the data quality scan.
Click Data quality scan > View all scans.