Introduction to data masking

BigQuery supports data masking at the column level. You can use data masking to selectively obscure column data for users groups, while still allowing them access to the column. Data masking functionality is built on top of column-level access control, so you should familiarize yourself with that feature before you proceed.

When you use data masking in combination with column-level access control, you can configure a range of access to column data, from full access to no access, based on the needs of different user groups. For example, for tax ID data, you might want to grant your accounting group full access, your analyst group masked access, and your sales group no access.

Benefits

Data masking provides the following benefits:

  • It streamlines the data sharing process. You can mask sensitive columns to make it possible to share tables with larger groups.
  • Unlike with column-level access control, you don't need to modify existing queries by excluding the columns that the user cannot access. When you configure data masking, existing queries automatically mask column data based on the roles the user has been granted.
  • You can apply data access policies at scale. You can write a data policy, associate it with a policy tag, and apply the policy tag to any number of columns.
  • It enables attribute-based access control. A policy tag attached to a column provides contextual data access, which is determined by the data policy and the principals that are associated with that policy tag.

Data masking workflow

Figure 1 shows the workflow for configuring data masking:

To enable data masking, you must create a taxonomy, create data policies for the policy tags in the taxonomy, and then associate the policy tags with table columns. Figure 1. Data masking components.

You configure data masking with the following steps:

  1. Set up a taxonomy and one or more policy tags.
  2. Configure data policies for the policy tags. A data policy maps a data masking rule and one or more principals, which represent users or groups, to the policy tag.

    When creating a data policy by using the Google Cloud console, you create the data masking rule and specify the principals in one step. When creating a data policy by using the BigQuery Data Policy API, you create the data policy and data masking rule in one step, and specify the principals for the data policy in a second step.

  3. Assign the policy tags to columns in BigQuery tables to apply the data policies.

  4. Assign users who should have access to masked data to the BigQuery Masked Reader role. As a best practice, assign the BigQuery Masked Reader role at the data policy level. Assigning the role at the project level or higher grants users permissions to all data policies under the project, which can lead to issues caused by excess permissions.

    The policy tag that is associated with a data policy can also be used for column-level access control. In that case, the policy tag is also associated with one or more principals who are granted the Data Catalog Fine-Grained Reader role. This enables these principals to access the original, unmasked column data.

Figure 2 shows how column-level access control and data masking work together:

Policy tags are associated with data policies to configure data masking, and then associated with table columns to enable masking. Figure 2. Data masking components.

For more information about role interaction, see How Masked Reader and Fine-Grained Reader roles interact. For more information about policy tag inheritance, see Roles and policy tag hierarchy.

Data masking rules

When you use data masking, a data masking rule is applied to a column at query runtime, based on the role of the user running the query. Masking takes precedence to any other operations involved in the query. The data masking rule determines the type of data masking applied to the column data.

You can use the following data masking rules:

  • Custom masking routine. Returns the column's value after applying a user-defined function (UDF) to the column. Routine permissions are required to manage the masking rule. This rule, by design, supports all BigQuery data types except for the STRUCT data type. However, support for data types other than STRING and BYTES is limited. The output depends on the defined function.

    For more information about creating UDFs for custom masking routines, see Create custom masking routines.

  • Date year mask. Returns the column's value after truncating the value to its year, setting all non-year parts of the value to the beginning of the year. You can only use this rule with columns that use the DATE, DATETIME, and TIMESTAMP data types. For example:

    Type Original Masked
    DATE 2030-07-17 2030-01-01
    DATETIME 2030-07-17T01:45:06 2030-01-01T00:00:00
    TIMESTAMP 2030-07-17 01:45:06 2030-01-01 00:00:00
  • Default masking value. Returns a default masking value for the column based on the column's data type. Use this when you want to hide the value of the column but reveal the data type. When this data masking rule is applied to a column, it makes it less useful in query JOIN operations for users with Masked Reader access. This is because a default value isn't sufficiently unique to be useful when joining tables.

    The following table shows the default masking value for each data type:

    Data type Default masking value
    STRING ""
    BYTES b''
    INTEGER 0
    FLOAT 0.0
    NUMERIC 0
    BOOLEAN FALSE
    TIMESTAMP 1970-01-01 00:00:00 UTC
    DATE 1970-01-01
    TIME 00:00:00
    DATETIME 1970-01-01T00:00:00
    GEOGRAPHY POINT(0 0)
    BIGNUMERIC 0
    ARRAY []
    STRUCT

    NOT_APPLICABLE

    Policy tags can't be applied to columns that use the STRUCT data type, but they can be associated with the leaf fields of such columns.

    JSON null
  • Email mask. Returns the column's value after replacing the username of a valid email with XXXXX. If the column's value is not a valid email address, then it returns the column's value after it has been run through the SHA-256 hash function. You can only use this rule with columns that use the STRING data type. For example:

    Original Masked
    [email protected] [email protected]
    randomtext jQHDyQuj7vJcveEe59ygb3Zcvj0B5FJINBzgM6Bypgw=
    test@[email protected] Qdje6MO+GLwI0u+KyRyAICDjHbLF1ImxRqaW08tY52k=
  • First four characters. Returns the first 4 characters of the column's value, replacing the rest of the string with XXXXX. If the column's value is equal to or less than 4 characters in length, then it returns the column's value after it has been run through the SHA-256 hash function. You can only use this rule with columns that use the STRING data type.

  • Hash (SHA-256). Returns the column's value after it has been run through the SHA-256 hash function. Use this when you want the end user to be able to use this column in a JOIN operation for a query. You can only use this rule with columns that use the STRING or BYTES data types.

    The SHA-256 function used in data masking is type preserving, so the hash value it returns has the same data type as the column value. For example, the hash value for a STRING column value also has a STRING data type.

  • Last four characters. Returns the last 4 characters of the column's value, replacing the rest of the string with XXXXX. If the column's value is equal to or less than 4 characters in length, then it returns the column's value after it has been run through the SHA-256 hash function. You can only use this rule with columns that use the STRING data type.

  • Nullify. Returns NULL instead of the column value. Use this when you want to hide both the value and the data type of the column. When this data masking rule is applied to a column, it makes it less useful in query JOIN operations for users with Masked Reader access. This is because a NULL value isn't sufficiently unique to be useful when joining tables.

Data masking rule hierarchy

You can configure up to nine data policies for a policy tag, each with a different data masking rule associated with it. One of these policies is reserved for column-level access control settings. This makes it possible for several data policies to be applied to a column in a user's query, based on the groups that the user is a member of. When this happens, BigQuery chooses which data masking rule to apply based on the following hierarchy:

  1. Custom masking routine
  2. Hash (SHA-256)
  3. Email mask
  4. Last four characters
  5. First four characters
  6. Date year mask
  7. Default masking value
  8. Nullify

For example, user A is a member of both the employees and the accounting groups. User A runs a query that includes the sales_total field, which has the confidential policy tag applied. The confidential policy tag has two data policies associated with it: one that has the employees role as the principal and applies the nullify data masking rule, and one that has the accounting role as the principal and applies the hash (SHA-256) data masking rule. In this case, the hash (SHA-256) data masking rule is prioritized over the nullify data masking rule, so the hash (SHA-256) rule is applied to the sales_total field value in user A's query.

Figure 3 shows this scenario:

When there is a conflict between applying the nullify and the hash (SHA-256)
data masking rules due to the groups a user is in, the hash (SHA-256) data
masking rule is
prioritized.

Figure 3. Data masking rule prioritization.

Roles and permissions

Roles for managing taxonomies and policy tags

You need the Data Catalog Policy Tag Admin role to create and manage taxonomies and policy tags.

Role/ID Permissions Description
Data Catalog Policy Tag Admin/datacatalog.categoryAdmin datacatalog.categories.getIamPolicy
datacatalog.categories.setIamPolicy
datacatalog.taxonomies.create
datacatalog.taxonomies.delete
datacatalog.taxonomies.get
datacatalog.taxonomies.getIamPolicy
datacatalog.taxonomies.list
datacatalog.taxonomies.setIamPolicy
datacatalog.taxonomies.update
resourcemanager.projects.get
resourcemanager.projects.list

Applies at the project level.

This role grants the ability to do the following:

  • Create, read, update, and delete taxonomies and policy tags.
  • Get and set IAM policies on policy tags.

Roles for creating and managing data policies

You need one of the following BigQuery roles to create and manage data policies:

Role/ID Permissions Description
BigQuery Data Policy Admin/bigquerydatapolicy.admin

BigQuery Admin/bigquery.admin

BigQuery Data Owner/bigquery.dataOwner
bigquery.dataPolicies.create
bigquery.dataPolicies.delete
bigquery.dataPolicies.get
bigquery.dataPolicies.getIamPolicy
bigquery.dataPolicies.list
bigquery.dataPolicies.setIamPolicy
bigquery.dataPolicies.update

The bigquery.dataPolicies.create and bigquery.dataPolicies.list permissions apply at the project level. The other permissions apply at the data policy level.

This role grants the ability to do the following:

  • Create, read, update, and delete data policies.
  • Get and set IAM policies on data policies.
You also need the datacatalog.taxonomies.get permission, which you can get from several of the Data Catalog predefined roles.

Roles for attaching policy tags to columns

You need the datacatalog.taxonomies.get and bigquery.tables.setCategory permissions to attach policy tags to columns. datacatalog.taxonomies.get is included in the Data Catalog Policy Tags Admin and Viewer roles. bigquery.tables.setCategory is included in the BigQuery Admin (roles/bigquery.admin) and BigQuery Data Owner (roles/bigquery.dataOwner) roles.

Roles for querying masked data

You need the BigQuery Masked Reader role to query the data from a column that has data masking applied.

Role/ID Permissions Description
Masked Reader/bigquerydatapolicy.maskedReader bigquery.dataPolicies.maskedGet

Applies at the data policy level.

This role grants the ability to view the masked data of a column that is associated with a data policy.

Additionally, a user must have appropriate permissions to query the table. For more information, see Required permissions.

How Masked Reader and Fine-Grained Reader roles interact

Data masking builds on top of column-level access control. For a given column, it is possible to have some users with the BigQuery Masked Reader role that allows them to read masked data, some users with the Data Catalog Fine-Grained Reader role that allows them to read unmasked data, some users with both, and some users with neither. These roles interact as follows:

  • User with both Fine-Grained Reader and Masked Reader roles: what the user sees depends on where in the policy tag hierarchy each role is granted. For more information, see Authorization inheritance in a policy tag hierarchy.
  • User with Fine-Grained Reader role: can see unmasked (unobscured) column data.
  • User with Masked Reader role: can see masked (obscured) column data.
  • User with neither role: permission denied.

In the case where a table has columns that are secured or secured and masked, in order to run a SELECT * FROM statement on that table, a user must be a member of appropriate groups such that they are granted Masked Reader or Fine-Grained Reader roles on all of these columns.

A user who is not granted these roles must instead specify only columns that they have access to in the SELECT statement, or use SELECT * EXCEPT (restricted_columns) FROM to exclude the secured or masked columns.

Authorization inheritance in a policy tag hierarchy

Roles are evaluated starting at the policy tag associated with a column, and then checked at each ascending level of the taxonomy, until the user either is determined to have appropriate permissions or the top of the policy tag hierarchy is reached.

For example, take the policy tag and data policy configuration shown in Figure 4:

Evaluating user access when Masked Reader is granted at a higher level of the taxonomy and Fine-Grained Reader is granted at a lower level of the taxonomy.

Figure 4. Policy tag and data policy configuration.

You have a table column that is annotated with the Financial policy tag, and a user who is a member of both the [email protected] and [email protected] groups. When this user runs a query that includes the annotated column, their access is determined by the hierarchy defined in the policy tag taxonomy. Because the user is granted the Data Catalog Fine-Grained Reader role by the Financial policy tag, the query returns unmasked column data.

If another user who is only a member of the [email protected] role runs a query that includes the annotated column, the query returns column data that has been hashed using the SHA-256 algorithm, because the user is granted the BigQuery Masked Reader role by the Confidential policy tag, which is the parent of the Financial policy tag.

A user who is not a member of either of those roles gets an access denied error if they try to query the annotated column.

In contrast with the preceding scenario, take the policy tag and data policy configuration shown in Figure 5:

Evaluating user access when Fine-Grained Reader is granted at a higher level of the taxonomy and Masked Reader is granted at a lower level of the taxonomy.

Figure 5. Policy tag and data policy configuration.

You have the same situation as shown in Figure 4, but the user is granted the Fine-Grained Reader role at a higher level of the policy tag hierarchy, and the Masked Reader role at a lower level of the policy tag hierarchy. Because of this, the query returns masked column data for this user. This happens even though the user is granted the Fine-Grained Reader role further up the tag hierarchy, because the service uses the first assigned role it encounters as it ascends the policy tag hierarchy to check for user access.

If you want to create a single data policy and have it apply to several levels of a policy tag hierarchy, you can set the data policy on the policy tag that represents the topmost hierarchy level to which it should apply. For example, take a taxonomy with the following structure:

  • Policy tag 1
    • Policy tag 1a
      • Policy tag 1ai
    • Policy tag 1b
      • Policy tag 1bi
      • Policy tag 1bii

If you want a data policy to apply to all of these policy tags, set the data policy on policy tag 1. If you want a data policy to apply to policy tag 1b and its children, set the data policy on policy tag 1b.

Data masking with incompatible features

When you use BigQuery features that aren't compatible with data masking, the service treats the masked column as a secured column, and only grants access to users who have the Data Catalog Fine-Grained Reader role.

For example, take the policy tag and data policy configuration shown in Figure 6:

The policy tag associated with the column is evaluated to determine if the user has permission to access unmasked data.

Figure 6. Policy tag and data policy configuration.

You have a table column that is annotated with the Financial policy tag, and a user who is a member of the [email protected] group. When this user tries to access the annotated column through one of the incompatible features, they get an access denied error. This is because they are granted the BigQuery Masked Reader by Financial policy tag, but in this case, they must have the Data Catalog Fine-Grained Reader role. Because the service has already determined an applicable role for the user, it does not continue to check farther up the policy tag hierarchy for additional permissions.

Data masking example with output

To see how tags, principals, and roles work together, consider this example.

At example.com, basic access is granted through the [email protected] group. All employees who need regular access to BigQuery data are members of this group, which is assigned all the necessary permissions to read from tables as well as the BigQuery Masked Reader role.

Employees are assigned to additional groups that provide access to secured or masked columns where that is required for their work. All members of these additional groups are also members of [email protected]. You can see how these groups are associated with appropriate roles in Figure 7:

Policy tags and data policies for example.com.

Figure 7. Policy tags and data policies for example.com.

The policy tags are then associated with table columns, as shown in Figure 8:

Example.com policy tags associated with table columns.

Figure 8. Example.com policy tags associated with table columns.

Given the tags that are associated with the columns, running SELECT * FROM Accounts; leads to the following results for the different groups:

  • [email protected]: This group has been granted the BigQuery Masked Reader role on both the PII and Confidential policy tags. The following results are returned:

    SSN Priority Lifetime value Creation date Email
    NULL "" 0 March 8, 1983 NULL
    NULL "" 0 December 29, 2009 NULL
    NULL "" 0 July 14, 2021 NULL
    NULL "" 0 May 5, 1997 NULL
  • [email protected]: This group has been granted the Data Catalog Fine-Grained Reader role on the SSN policy tag. The following results are returned:

    SSN Priority Lifetime value Creation date NULL
    123-45-6789 "" 0 March 8, 1983 NULL
    234-56-7891 "" 0 December 29, 2009 NULL
    345-67-8912 "" 0 July 14, 2021 NULL
    456-78-9123 "" 0 May 5, 1997 NULL
  • [email protected]: This group has been granted the Data Catalog Fine-Grained Reader role on the Confidential policy tag. The following results are returned:

    SSN Priority Lifetime value Creation date Email
    NULL High 90,000 March 8, 1983 NULL
    NULL High 84,875 December 29, 2009 NULL
    NULL Medium 38,000 July 14, 2021 NULL
    NULL Low 245 May 5, 1997 NULL
  • [email protected]: This group has been granted the BigQuery Masked Reader role on the Financial policy tag. The following results are returned:

    SSN Priority Lifetime value Creation date Email
    NULL "" Zmy9vydG5q= March 8, 1983 NULL
    NULL "" GhwTwq6Ynm= December 29, 2009 NULL
    NULL "" B6y7dsgaT9= July 14, 2021 NULL
    NULL "" Uh02hnR1sg= May 5, 1997 NULL
  • All other users: Any user who does not belong to one of the listed groups gets an access denied error, because they haven't been granted the Data Catalog Fine-Grained Reader or BigQuery Masked Reader roles. To query the Accounts table, they must instead specify only columns that they have access to in the SELECT * EXCEPT (restricted_columns) FROM Accounts to exclude the secured or masked columns.

Cost considerations

Data masking might indirectly affect the number of bytes processed, and therefore affect the cost of the query. If a user queries a column that is masked for them with the Nullify or Default Masking Value rules, then that column isn't scanned at all, resulting in fewer bytes processed.

Restrictions and limitations

The following sections describe the categories of restrictions and limitations that data masking is subject to.

Data policy management

  • This feature may not be available when using reservations that are created with certain BigQuery editions. For more information about which features are enabled in each edition, see Introduction to BigQuery editions.
  • You can create up to nine data policies for each policy tag. One of these policies is reserved for column-level access control settings.
  • Data policies, their associated policy tags, and any routines that use them must be in the same project.

Policy tags

  • The project containing the policy tag taxonomy must belong to an organization.
  • A policy tag hierarchy can be no more than five levels deep from the root node to the lowest-level subtag, as shown in the following screenshot:

    Policy tag depth.

Set access control

After a taxonomy has a data policy associated with at least one of its policy tags, access control is automatically enforced. If you want to turn off access control, you must first delete all of the data policies associated with the taxonomy.

Materialized views and repeated record masking queries

If you have existing materialized views, repeated record masking queries on the associated base table fail. To resolve this issue, delete the materialized view. If the materialized view is needed for other reasons, you can create it in another dataset.

Query masked columns in partitioned tables

Queries that include data masking on the partitioned or clustered columns are not supported.

SQL dialects

Legacy SQL is not supported.

Custom masking routines

Custom masking routines are subject to the following limitations:

  • Custom data masking supports all BigQuery data types except STRUCT, because data masking can only apply to leaf fields of the STRUCT data type.
  • Deleting a custom masking routine doesn't delete all data policies that use it. However, the data policies that use the deleted masking routine are left with an empty masking rule. Users with the Masked Reader role on other data policies with the same tag can see masked data. Others see the message Permission denied. Dangling references to empty masking rules might be cleaned by automated processes after seven days.

Compatibility with other BigQuery features

BigQuery API

Not compatible with the tabledata.list method. To call tabledata.list, you need full access to all of the columns returned by this method. The Data Catalog Fine-Grained Reader role grants appropriate access.

BigLake tables

Compatible. Data masking policies are enforced on BigLake tables.

BigQuery Storage Read API

Compatible. Data masking policies are enforced in the BigQuery Storage Read API.

BigQuery BI Engine

Compatible. Data masking policies are enforced in the BI Engine. Queries that have data masking in effect are not accelerated by BI Engine. Use of such queries in Looker Studio might cause related reports or dashboards to become slower and more expensive.

BigQuery Omni

Compatible. Data masking policies are enforced on the BigQuery Omni tables.

Collation

Not compatible. Collation is not supported on masked columns.

Copy jobs

Not compatible. To copy a table from source to the destination, you need to to have full access to all of the columns on the source table. The Data Catalog Fine-Grained Reader role grants appropriate access.

Data export

Compatible. If you have the BigQuery Masked Reader role, then the exported data is masked. If you have the Data Catalog Fine-Grained Reader role, then the exported data is not masked.

Row-level security

Compatible. Data masking is applied on top of row-level security. For example, if there is a row access policy applied on location = "US" and location is masked, then users are able to see rows where location = "US" but the location field is masked.

Search in BigQuery

Partially compatible. You can call the SEARCH function on indexed or unindexed columns that have data masking applied.

When you call the SEARCH function on columns that have data masking applied, you must use search criteria compatible with your level of access. For example, if you have Masked Reader access with a Hash (SHA-256) data masking rule, you would use the hash value in your SEARCH clause, similar to the following:

SELECT * FROM myDataset.Customers WHERE SEARCH(Email, "sg172y34shw94fujaweu");

If you have Fine-Grained Reader access, you would use the actual column value in your SEARCH clause, similar to the following:

SELECT * FROM myDataset.Customers WHERE SEARCH(Email, "[email protected]");

Searching is less likely to be useful if you have Masked Reader access to a column where the data masking rule used is Nullify or Default Masking Value. This is because the masked results you would use as search criteria, such as NULL or "", aren't sufficiently unique to be useful.

When searching on an indexed column that has data masking applied, the search index is only used if you have Fine-Grained Reader access to the column.

Snapshots

Not compatible. To create a snapshot of a table, you need full access to all of the columns on the source table. The Data Catalog Fine-Grained Reader role grants appropriate access.

Table renaming

Compatible. Table renaming is not affected by data masking.

Time travel

Compatible with both time decorators and the FOR SYSTEM_TIME AS OF option in SELECT statements. The policy tags for the current dataset schema are applied to the retrieved data.

Query caching

Partly compatible. BigQuery caches query results for approximately 24 hours, although the cache is invalidated if changes are made to the table data or schema before that. In the following circumstance, it is possible that a user who does not have the Data Catalog Fine-Grained Reader role granted on a column can still see the column data when they run a query:

  1. A user has been granted the Data Catalog Fine-Grained Reader role on a column.
  2. The user runs a query that includes the restricted column and the data is cached.
  3. Within 24 hours of Step 2, the user is granted the BigQuery Masked Reader role, and has the Data Catalog Fine-Grained Reader role revoked.
  4. Within 24 hours of Step 2, the user runs that same query, and the cached data is returned.

Wildcard table queries

Not compatible. You need full access to all of the referenced columns on all of the tables matching the wildcard query. The Data Catalog Fine-Grained Reader role grants appropriate access.

What's next