This initialization action installs H2O Sparkling Water on all nodes of a Google Cloud Dataproc cluster. It works with Dataproc image version 1.3 and newer, except the 1.5 image.
You can use this initialization action to create a new Dataproc cluster with H2O Sparkling Water installed:
- To create a Dataproc 1.3 cluster, use the `conda` initialization action:

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --image-version 1.3 \
      --scopes "cloud-platform" \
      --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/h2o/h2o.sh"
  ```
- To create a Dataproc 1.4 cluster, use the `ANACONDA` optional component:

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --image-version 1.4 \
      --optional-components ANACONDA \
      --scopes "cloud-platform" \
      --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/h2o/h2o.sh"
  ```
- To create a Dataproc cluster with image version 2.0 or newer, you don't need any additional initialization actions or optional components:

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --image-version 2.0 \
      --scopes "cloud-platform" \
      --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/h2o/h2o.sh"
  ```
Submit a sample job:

```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} \
    "gs://goog-dataproc-initialization-actions-${REGION}/h2o/sample-script.py"
```
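The contents of `sample-script.py` are not reproduced here; as a rough sketch, a minimal Sparkling Water PySpark job might look like the following (hypothetical code, assuming the `pysparkling` package installed on the cluster by `h2o.sh`):

```python
# Minimal Sparkling Water job sketch (hypothetical; assumes pysparkling and
# h2o are available on the cluster, as installed by the h2o.sh init action).
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("h2o-sample").getOrCreate()

# Start an H2O cluster inside this Spark application.
hc = H2OContext.getOrCreate()

# Convert a small Spark DataFrame to an H2OFrame.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
h2o_frame = hc.asH2OFrame(df)
print(h2o_frame.dim)  # [rows, cols]

hc.stop()
spark.stop()
```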
The initialization action supports the following metadata parameter:

- `H2O_SPARKLING_WATER_VERSION`: Sparkling Water version number. You can find available versions on the Sparkling Water releases page on GitHub. The default is `3.30.1.2-1`.
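For example, to pin the Sparkling Water version at cluster creation, the metadata parameter can be passed with the standard `--metadata` flag (shown here with the default version, as a sketch):

```shell
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --image-version 2.0 \
    --scopes "cloud-platform" \
    --metadata "H2O_SPARKLING_WATER_VERSION=3.30.1.2-1" \
    --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/h2o/h2o.sh"
```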