This initialization action installs the latest version of Apache Tez on a Google Cloud Dataproc cluster. In this initialization action:
- Required libraries such as `protobuf` are installed
- Tez is compiled
- Tez is put into HDFS
- Hadoop is configured to work with Tez on the Dataproc cluster
You can use this initialization action to create a new Dataproc cluster with Apache Tez installed by:
- Uploading a copy of this initialization action (`tez.sh`) to Google Cloud Storage.
- Using the `gcloud` command to create a new cluster with this initialization action. Note: you will need to increase the `initialization-action-timeout` to at least 15 minutes to allow Tez and its dependencies to compile and install. The following command creates a new cluster named `<CLUSTER_NAME>`, specifies the initialization action stored in `<GCS_BUCKET>`, and increases the timeout to 15 minutes.

        gcloud dataproc clusters create <CLUSTER_NAME> \
            --initialization-actions gs://<GCS_BUCKET>/tez.sh \
            --initialization-action-timeout 15m

- Once the cluster is created, Tez should work with your Cloud Dataproc jobs.
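The upload and cluster-creation steps can be combined into one small script. The following is a minimal sketch, not part of this initialization action: the default cluster and bucket names, the `DRY_RUN` switch, and the `run` and `create_tez_cluster` helpers are hypothetical, and it assumes `gsutil` and `gcloud` are installed and authenticated. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: upload tez.sh to Cloud Storage, then create a Dataproc cluster that uses it.
set -euo pipefail

# Hypothetical placeholder values; override them in the environment.
CLUSTER_NAME="${CLUSTER_NAME:-my-tez-cluster}"
GCS_BUCKET="${GCS_BUCKET:-my-bucket}"
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_tez_cluster() {
  # Step 1: copy the initialization action into the bucket.
  run gsutil cp tez.sh "gs://${GCS_BUCKET}/tez.sh"
  # Step 2: create the cluster with a 15 minute initialization timeout.
  run gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --initialization-actions "gs://${GCS_BUCKET}/tez.sh" \
    --initialization-action-timeout 15m
}

create_tez_cluster
```

Running it with `DRY_RUN=0 CLUSTER_NAME=<CLUSTER_NAME> GCS_BUCKET=<GCS_BUCKET>` executes the commands instead of printing them.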
You can find more information about using initialization actions with Dataproc in the Dataproc documentation.
- This script has a number of user-defined variables at the top, such as the Tez version and install location.
- This script builds Tez from source, which may cause issues if you use an unstable release.
- Be careful with the `protobuf` version: Tez often requires a very specific version to function properly.
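
One way to catch a `protobuf` mismatch early is to compare the installed compiler's version with the one your Tez release expects before building. The following is a rough sketch under assumptions: the `EXPECTED_PROTOBUF_VERSION` variable, its `2.5.0` default, and the `protoc_version_matches` helper are illustrative and are not taken from `tez.sh`. Check the build instructions of the Tez release you are using for the actual requirement.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed protoc version with the version a Tez release expects.
set -euo pipefail

# Hypothetical default; confirm the real requirement for your Tez release.
EXPECTED_PROTOBUF_VERSION="${EXPECTED_PROTOBUF_VERSION:-2.5.0}"

# Takes the raw output of `protoc --version` (for example "libprotoc 2.5.0")
# and an expected version string; succeeds only on an exact match.
protoc_version_matches() {
  local version_output="$1" expected="$2"
  local actual="${version_output##* }"  # keep the text after the last space
  [ "$actual" = "$expected" ]
}

if command -v protoc >/dev/null 2>&1; then
  if protoc_version_matches "$(protoc --version)" "$EXPECTED_PROTOBUF_VERSION"; then
    echo "protoc matches ${EXPECTED_PROTOBUF_VERSION}"
  else
    echo "protoc version mismatch: got '$(protoc --version)', expected ${EXPECTED_PROTOBUF_VERSION}" >&2
  fi
else
  echo "protoc not found on PATH" >&2
fi
```

The check only reports a mismatch; it does not stop the build or install anything.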