The Dataproc provisioner in Cloud Data Fusion calls the Dataproc API to create and delete clusters in your Google Cloud projects. You can configure the clusters in the provisioner's settings.
For more information about compatibility between Cloud Data Fusion versions and Dataproc versions, see Version compatibility.
Properties
Property | Description |
---|---|
Project ID | The Google Cloud project where the Dataproc cluster gets created. The project must have the Dataproc API enabled. |
Creator service account key | The service account key provided to the provisioner must have permission to access the Dataproc and Compute Engine APIs. Because the account key is sensitive, we recommend that you provide it using Secure Storage. After you create the secure key, you can add it to a namespace or a system compute profile. For a namespace compute profile, click the shield icon and select the secure key. For a system compute profile, enter the name of the key in the Secure Account Key field. |
Region | A geographical location where you can host your resources, such as the compute nodes for the Dataproc cluster. |
Zone | An isolated deployment area within a region. |
Network | The VPC network in your Google Cloud project that's used when creating a Dataproc cluster. |
Network host project ID | If the network resides in another Google Cloud project, enter the ID of that project. For a Shared VPC, enter the host project ID where the network resides. |
Subnet | The subnet to use when creating clusters. It must be within the given network and in the region that the zone is in. If left blank, a subnet is selected based on the network and zone. |
Runner service account | The service account name of the Dataproc virtual machines (VMs) that are used to run programs. If left blank, the default Compute Engine service account is used. |
Number of masters | The number of master nodes in the cluster. These nodes contain the YARN Resource Manager, HDFS NameNode, and all drivers. Must be set to 1 or 3. Default is 1. |
Master machine type | The type of machine to use for master nodes, such as e2, n2, or n1. In Cloud Data Fusion version 6.7.2 and later, the default is e2. In version 6.7.1, the default is n2. In version 6.7.0 and earlier, the default is n1. |
Master cores | Number of virtual cores allocated to a master node. Default is 2. |
Master memory (GB) | The amount of memory, in gigabytes, allocated to a master node. Default is 8 GB. |
Master disk size (GB) | Disk size, in gigabytes, allocated to a master node. Default is 1000 GB. |
Master disk type | Type of boot disk for a master node. Default is Standard Persistent Disk. |
Worker machine type | The type of machine to use for worker nodes, such as e2, n2, or n1. In Cloud Data Fusion version 6.7.2 and later, the default is e2. In version 6.7.1, the default is n2. In version 6.7.0 and earlier, the default is n1. |
Worker cores | Number of virtual cores allocated to a worker node. Default is 2. |
Worker memory (GB) | The amount of memory, in gigabytes, allocated to a worker node. Default is 8 GB. |
Worker disk size (GB) | Disk size, in gigabytes, allocated to a worker node. Default is 1000 GB. |
Worker disk type | Type of boot disk for a worker node. Default is Standard Persistent Disk. |
Use predefined Autoscaling | Enables predefined Dataproc autoscaling. |
Number of primary workers | The number of primary worker nodes in the cluster. Worker nodes contain a YARN NodeManager and an HDFS DataNode. Default is 2. |
Number of secondary workers | The number of secondary worker nodes in the cluster. Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally zero, unless an autoscaling policy requires a higher value. |
Autoscaling policy | The autoscaling policy ID or resource URI. For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see When to use autoscaling and Autoscale Dataproc clusters. |
Metadata | Additional metadata for instances running in your cluster. You can typically use it for tracking billing and chargebacks. For more information, see Cluster metadata. |
Network tags | Network tags that apply firewall rules to specific nodes of a cluster. Tags must start with a lowercase letter; can contain lowercase letters, numbers, and hyphens; and must end with a lowercase letter or number. |
Enable Secure Boot | Enables Secure Boot on the Dataproc VMs. Default is False. |
Enable vTPM | Enables virtual Trusted Platform Module (vTPM) on the Dataproc VMs. Default is False. |
Enable Integrity Monitoring | Enables Integrity Monitoring on the Dataproc VMs. Default is False. |
Image version | The Dataproc image version. If left blank, one is automatically selected. If the Custom image URI property is set, this property is ignored. |
Custom image URI | The Dataproc image URI. If left blank, it's inferred from the Image version property. |
Staging bucket | Cloud Storage bucket used to stage job dependencies and config files for running pipelines in Dataproc. |
Temp bucket | Cloud Storage bucket used to store ephemeral cluster and job data, such as Spark history files in Dataproc. This property was introduced in Cloud Data Fusion version 6.9.2. |
Encryption key name | The customer-managed encryption key (CMEK) that's used by Dataproc. |
OAuth scopes | The OAuth 2.0 scopes to request for access to Google APIs, depending on the level of access you require. The Google Cloud Platform scope is always included. This property was introduced in Cloud Data Fusion version 6.9.2. |
Initialization actions | A list of scripts to run during initialization of the cluster. Initialization action scripts must be stored in Cloud Storage. |
Cluster properties | Cluster properties overriding the default configuration properties of the Hadoop services. For more information on applicable key-value pairs, see Cluster properties. |
Common labels | Labels to organize the Dataproc clusters and jobs being created. You can label each resource and then filter the resources by label. Information about labels is forwarded to the billing system, so you can break down your billing charges by label. |
Max idle time | Configure Dataproc to delete a cluster if it's idle longer than the specified number of minutes. Clusters are normally deleted directly after a run ends, but deletion can fail in rare situations. For more information, see Troubleshoot deleting clusters. Default is 30 minutes. |
Skip cluster delete | Whether to skip cluster deletion at the end of a run. If enabled, you must manually delete the clusters. Use this setting only when debugging a failed run. Default is False. |
Enable Stackdriver Logging Integration | Enable the Stackdriver logging integration. Default is True. |
Enable Stackdriver Monitoring Integration | Enable the Stackdriver monitoring integration. Default is True. |
Enable Component Gateway | Enable the component gateway to access the cluster's interfaces, such as the YARN ResourceManager and Spark HistoryServer. Default is False. |
Prefer external IP | When the system is running on Google Cloud in the same network as the cluster, it normally uses the internal IP address when communicating with the cluster. To always use the external IP address, set this value to True. Default is False. |
Create poll delay | The number of seconds to wait after creating a cluster to begin polling to see if the cluster has been created. Default is 60 seconds. Polling settings control how often cluster status is polled when creating and deleting clusters. If you have many pipelines scheduled to run at the same time, you may want to change these settings. |
Create poll jitter | Maximum amount of random jitter, in seconds, to add to the delay when creating a cluster. You can use this property to prevent many simultaneous API calls in Google Cloud when you have a lot of pipelines that are scheduled to run at the exact same time. Default is 20 seconds. |
Delete poll delay | The number of seconds to wait after deleting a cluster to begin polling to see if the cluster has been deleted. Default is 30 seconds. |
Poll interval | The number of seconds to wait between polls for cluster status. Default is 2 seconds. For an example of tuning the polling settings, see the sketch after this table. |
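If many pipelines are scheduled to start at the same moment, you can increase the create poll delay and jitter to spread out the Dataproc API calls. The following sketch shows how these settings might look as provisioner properties, using the JSON property names listed in the next section and written here as simple key-value pairs for readability. The values are illustrative assumptions, not recommendations:

```json
{
  "pollCreateDelay": "90",
  "pollCreateJitter": "45",
  "pollDeleteDelay": "30",
  "pollInterval": "5"
}
```

A full profile sketch that shows where provisioner properties such as these live appears at the end of the next section.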
Dataproc profile web interface properties mapped to JSON properties
Dataproc profile UI property name | Dataproc profile JSON property name |
---|---|
Profile label | label |
Profile name | name |
Description | description |
Project ID | projectId |
Creator service account key | accountKey |
Region | region |
Zone | zone |
Network | network |
Network host project ID | networkHostProjectId |
Subnet | subnet |
Runner service account | serviceAccount |
Number of masters | masterNumNodes |
Master machine type | masterMachineType |
Master cores | masterCPUs |
Master memory (GB) | masterMemoryMB |
Master disk size (GB) | masterDiskGB |
Master disk type | masterDiskType |
Number of primary workers | workerNumNodes |
Number of secondary workers | secondaryWorkerNumNodes |
Worker machine type | workerMachineType |
Worker cores | workerCPUs |
Worker memory (GB) | workerMemoryMB |
Worker disk size (GB) | workerDiskGB |
Worker disk type | workerDiskType |
Metadata | clusterMetaData |
Network tags | networkTags |
Enable Secure Boot | secureBootEnabled |
Enable vTPM | vTpmEnabled |
Enable Integrity Monitoring | integrityMonitoringEnabled |
Image version | imageVersion |
Custom image URI | customImageUri |
Staging bucket | gcsBucket |
Encryption key name | encryptionKeyName |
Autoscaling policy | autoScalingPolicy |
Initialization actions | initActions |
Cluster properties | clusterProperties |
Common labels | clusterLabels |
Max idle time | idleTTL |
Skip cluster delete | skipDelete |
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled |
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled |
Enable Component Gateway | componentGatewayEnabled |
Prefer external IP | preferExternalIP |
Create poll delay | pollCreateDelay |
Create poll jitter | pollCreateJitter |
Delete poll delay | pollDeleteDelay |
Poll interval | pollInterval |
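To see how the two columns fit together, here is a minimal sketch of a compute profile in JSON. It assumes the CDAP profile format, in which the provisioner is identified as gcp-dataproc and properties are stored as name/value pairs; the provisioner name, the envelope structure, and every value below are illustrative assumptions, so compare this against a profile exported from your own instance:

```json
{
  "name": "dataproc-batch",
  "label": "Dataproc batch",
  "description": "Profile for scheduled batch pipelines",
  "provisioner": {
    "name": "gcp-dataproc",
    "properties": [
      { "name": "projectId", "value": "my-project-id" },
      { "name": "region", "value": "us-central1" },
      { "name": "zone", "value": "us-central1-a" },
      { "name": "masterNumNodes", "value": "1" },
      { "name": "workerNumNodes", "value": "2" },
      { "name": "workerMachineType", "value": "e2" },
      { "name": "workerCPUs", "value": "2" },
      { "name": "workerMemoryMB", "value": "8192" },
      { "name": "idleTTL", "value": "30" },
      { "name": "stackdriverLoggingEnabled", "value": "true" }
    ]
  }
}
```

Note that the JSON names encode their own units: the UI fields Master memory (GB) and Worker memory (GB) map to masterMemoryMB and workerMemoryMB, which suggests the stored value is in megabytes (8192 for the 8 GB default).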
Best practices
When you create a static cluster for your pipelines, refer to the cluster configuration best practices.
What's next
- Learn more about managing compute profiles.