This guide explains how to use Sensitive Data Protection with Cloud Data Fusion.
Cloud Data Fusion provides a Sensitive Data Protection plugin with three transforms that can filter, redact, or decrypt your sensitive data:
- The PII Filter transform lets you filter sensitive records from an input stream of data.
- The Redact transform lets you transform sensitive data, for example by masking or encrypting it.
- The Decrypt transform lets you decrypt sensitive data that was previously encrypted using the Redact transform.
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
In the Google Cloud console, go to the project selector page and select or create a project.
Enable the Cloud Data Fusion API for your project.
Enable the DLP API (part of Sensitive Data Protection) for your project.
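As an alternative to the console steps above, both APIs can be enabled from the command line. The following is a minimal sketch that only builds the `gcloud services enable` command; it assumes the gcloud CLI is installed and authenticated, and the helper name and `my-project-id` are placeholders introduced for illustration.

```python
import shlex
from typing import List

# Service names for the Cloud Data Fusion API and the DLP API.
REQUIRED_SERVICES = ["datafusion.googleapis.com", "dlp.googleapis.com"]


def enable_services_command(project_id: str) -> List[str]:
    """Build (but do not run) the gcloud command that enables both APIs."""
    return ["gcloud", "services", "enable", *REQUIRED_SERVICES, "--project", project_id]


if __name__ == "__main__":
    # "my-project-id" is a placeholder; substitute your own project ID.
    print(shlex.join(enable_services_command("my-project-id")))
```

The sketch prints the command rather than running it, so you can review it before executing it in your own shell.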
Grant Sensitive Data Protection permissions
In the Google Cloud console, go to the IAM page.
In the permissions table, select one of the following service accounts in the Principal column:
- For permission to resources at runtime, select the service account that your Dataproc cluster uses. The default is the Compute Engine service account, which is not recommended for security reasons.
- For permission to resources when using Wrangler or Preview in Cloud Data Fusion (not at runtime), select the service account that matches the format service-project-number@gcp-sa-datafusion.iam.gserviceaccount.com, where project-number is your project's number.
Click the pencil icon to the right of the service account.
Click Add Another Role.
Click the dropdown that appears.
Use the search bar to find DLP Administrator, and then select it.
Click Save. Check that DLP Administrator appears in the Role column.
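The same role grant can be expressed as a gcloud command. The sketch below builds the Cloud Data Fusion service account address from a project number and assembles a `gcloud projects add-iam-policy-binding` command for the DLP Administrator role (`roles/dlp.admin`). The function names and the sample project values are hypothetical, and the command is only constructed, not executed.

```python
from typing import List


def data_fusion_service_account(project_number: str) -> str:
    """Return the Cloud Data Fusion service account used by Wrangler and Preview."""
    return f"service-{project_number}@gcp-sa-datafusion.iam.gserviceaccount.com"


def grant_dlp_admin_command(project_id: str, service_account_email: str) -> List[str]:
    """Build (but do not run) the command that grants DLP Administrator."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account_email}",
        "--role=roles/dlp.admin",  # role ID for DLP Administrator
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own project ID and project number.
    email = data_fusion_service_account("123456789012")
    print(" ".join(grant_dlp_admin_command("my-project-id", email)))
```

For runtime access, pass the service account of your Dataproc cluster to `grant_dlp_admin_command` instead of the generated address.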
Deploy the Sensitive Data Protection plugin
Go to your instance:
In the Google Cloud console, go to the Cloud Data Fusion page.
To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
In the Cloud Data Fusion web UI, click Hub in the upper right.
Click the Data Loss Prevention plugin.
Click Deploy.
Click Finish.
Click Create a pipeline.
Use the PII Filter transform
This transform separates sensitive records from non-sensitive records. A record is considered sensitive if it matches criteria that you define in a Sensitive Data Protection template. For example, when you create your template, you can define sensitive data to be credit card information or Social Security numbers.
Open your pipeline in Cloud Data Fusion and click Studio > Transform.
Click the PII Filter transform.
Hold the pointer over the PII Filter node and click Properties.
Under Filter on, choose whether you want to filter records or fields.
Because of Sensitive Data Protection limits, your Cloud Data Fusion pipeline fails if a record exceeds 0.5 MB. To avoid this failure, filter by field instead of by record.
Under Template ID, enter the template ID of the Sensitive Data Protection template you created.
Under Error Handling, define how to proceed when your pipeline encounters sensitive data. Choose one of the following error handling options:
- Stop pipeline: Stops the pipeline as soon as an error is encountered.
- Skip record: Skips the record that caused the error. The pipeline continues to run, and no error is reported.
- Send to error: Sends errors to the error port. The pipeline continues to run.
Click the X button.
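To make the size limit and the three error handling options more concrete, here is a small simulation in plain Python. It is not the plugin's implementation: the function names, the option strings, and the `is_sensitive` callable (standing in for inspection against your template) are all hypothetical, and the sketch assumes 0.5 MB means 512 * 1024 bytes measured on a JSON encoding of the record.

```python
import json
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]

# Assumption: 0.5 MB is read as 512 * 1024 bytes; the exact byte count and the
# serialization that is measured are not stated in this guide.
MAX_RECORD_BYTES = 512 * 1024

ERROR_MODES = {"stop-pipeline", "skip-record", "send-to-error"}


def exceeds_record_limit(record: Record) -> bool:
    """Estimate whether a record is over the limit from its UTF-8 JSON size."""
    return len(json.dumps(record).encode("utf-8")) > MAX_RECORD_BYTES


def run_pii_filter(
    records: List[Record],
    is_sensitive: Callable[[Record], bool],
    on_error: str = "stop-pipeline",
) -> Tuple[List[Record], List[Record], List[Tuple[Record, str]]]:
    """Split records into non-sensitive and sensitive, applying an error mode."""
    if on_error not in ERROR_MODES:
        raise ValueError(f"unknown error mode: {on_error}")
    non_sensitive: List[Record] = []
    sensitive: List[Record] = []
    errors: List[Tuple[Record, str]] = []
    for record in records:
        try:
            flagged = is_sensitive(record)
        except Exception as exc:
            if on_error == "stop-pipeline":
                raise  # stop as soon as an error is encountered
            if on_error == "send-to-error":
                errors.append((record, str(exc)))  # route to the error port
            # "skip-record": drop the record and report nothing
            continue
        (sensitive if flagged else non_sensitive).append(record)
    return non_sensitive, sensitive, errors
```

A check such as `exceeds_record_limit` can help you decide in advance whether to filter by field rather than by record.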
Use the Redact transform
This transform identifies sensitive records in the input stream and applies transformations that you define to those records. A record is considered sensitive if it matches predefined Sensitive Data Protection filters you chose or a custom template you defined.
In the Studio page of the Cloud Data Fusion web UI, click to expand the Transform menu.
Click the Redact transform.
Hold the pointer over the Redact node and click Properties.
Choose whether to apply transformations to predefined filters or to a custom template that you create.
You cannot combine these two options: use either predefined filters or a custom template.
Predefined filters
To apply transformations to predefined filters, leave the Custom Template set to No, and under Matching, define a rule:
Following Apply, click the dropdown and choose a transformation. Learn more about the available transformations in the Description section of the plugin's Documentation tab.
Following on, click the dropdown and choose a category, which is a set of predefined Sensitive Data Protection filters grouped together by type. For the full list of provided categories and what filters they contain, see the DLP Filter Mapping section in the plugin's Documentation tab.
To set multiple matching rules, click the + button.
Custom template
To apply transformations according to a custom template, set the Custom Template to Yes.
In the Redact properties menu of the Cloud Data Fusion web UI, under Template ID, enter the template ID of the custom template you created.
Click the X button.
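The rule that predefined filters and a custom template cannot be combined can be pictured as a simple configuration check. The sketch below is illustrative only: `RedactConfig`, `MatchingRule`, and their field names are hypothetical and do not reflect the plugin's actual property names, and the transformation and category strings in the usage example are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatchingRule:
    """One 'Apply <transformation> on <category>' rule."""
    transformation: str
    categories: List[str]


@dataclass
class RedactConfig:
    """Hypothetical model of the Redact properties described above."""
    custom_template: bool = False
    template_id: Optional[str] = None
    matching_rules: List[MatchingRule] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if both modes are mixed or the chosen mode is empty."""
        if self.custom_template:
            if not self.template_id:
                raise ValueError("Custom Template is Yes, so a Template ID is required")
            if self.matching_rules:
                raise ValueError("matching rules cannot be combined with a custom template")
        else:
            if self.template_id:
                raise ValueError("a Template ID requires Custom Template set to Yes")
            if not self.matching_rules:
                raise ValueError("define at least one matching rule")


if __name__ == "__main__":
    # Placeholder transformation and category names, for illustration only.
    config = RedactConfig(matching_rules=[MatchingRule("Masking", ["Financial"])])
    config.validate()
    print("configuration is consistent")
```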
Use the Decrypt transform
This transform identifies records that were encrypted using Sensitive Data Protection in the input stream and applies decryption. Only records that were encrypted using a reversible algorithm such as Format Preserving Encryption or Deterministic Encryption can be decrypted.
In the Studio page of the Cloud Data Fusion web UI, click to expand the Transform menu.
Click the Decrypt transform.
Hold the pointer over the Decrypt node and click Properties.
Enter the same values that were used to configure the Redact plugin that encrypted this data. The properties for this plugin are identical to those of the Redact plugin.
Click the X button.
What's next
- Follow a tutorial to redact sensitive user data.
- Read more about Sensitive Data Protection.