This initialization action installs Apache HBase libraries and the Google Cloud Bigtable HBase Client.
You can use this initialization action to create a Dataproc cluster configured to connect to Cloud Bigtable:
-   Create a Bigtable instance by following [these directions](https://cloud.google.com/bigtable/docs/creating-instance) (a minimal `gcloud` sketch follows this list).

-   Use the `gcloud` command to create a new cluster with this initialization action:

    ```bash
    REGION=<region>
    CLUSTER_NAME=<cluster_name>
    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --region ${REGION} \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigtable/bigtable.sh \
        --metadata bigtable-instance=<BIGTABLE INSTANCE>
    ```
-   The cluster will have HBase libraries, the Bigtable client, and the Apache Spark - Apache HBase Connector installed.

-   In addition to running Hadoop and Spark jobs, you can SSH to the master (`gcloud compute ssh ${CLUSTER_NAME}-m`) and use `hbase shell` to connect to your Bigtable instance (see the example session after this list).
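If you prefer the command line for the first step, the following is a minimal sketch of creating an instance with `gcloud`; the instance ID, display name, zone, and node count are placeholders, and exact flags can vary across `gcloud` releases:

```bash
# Sketch only: all values below are placeholders.
gcloud bigtable instances create my-instance \
    --display-name="My Instance" \
    --cluster-config=id=my-instance-c1,zone=us-central1-b,nodes=3
```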
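Once connected over SSH, a quick way to smoke-test the `hbase shell` connection is to create, write, and scan a small table. Note that `test-table` is a placeholder and these commands create real resources in your Bigtable instance:

```bash
# On the master node, start the shell:
hbase shell
```

Then, inside the shell:

```
create 'test-table', 'cf'                        # creates the table in Bigtable
put 'test-table', 'r1', 'cf:greeting', 'hello'   # writes one cell
scan 'test-table'                                # should print the row written above
list                                             # lists all tables in the instance
```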
You can run an example MapReduce wordcount job against your Bigtable instance:

-   Get the code:

    ```bash
    git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/
    ```
-   Compile the example. This creates two jars: one with and one without dependencies included.

    ```bash
    cd cloud-bigtable-examples/java/dataproc-wordcount/
    mvn clean package -Dbigtable.projectID=<BIGTABLE PROJECT> -Dbigtable.instanceID=<BIGTABLE INSTANCE>
    ```
-   Submit the jar with dependencies as a Dataproc job. Note that `OUTPUT_TABLE` should not already exist; the job creates the table with the correct column family. (A spot-check sketch follows this list.)

    ```bash
    REGION=<region>
    CLUSTER_NAME=<cluster_name>
    gcloud dataproc jobs submit hadoop --cluster ${CLUSTER_NAME} \
        --class com.example.bigtable.sample.WordCountDriver \
        --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar \
        -- \
        wordcount-hbase gs://goog-dataproc-initialization-actions-${REGION}/README.md <OUTPUT_TABLE>
    ```
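To verify the results, one option (a sketch; substitute the same `<OUTPUT_TABLE>` you passed to the job) is to scan a few rows of the output table from the master:

```bash
gcloud compute ssh ${CLUSTER_NAME}-m \
    --command "echo \"scan '<OUTPUT_TABLE>', {LIMIT => 10}\" | hbase shell"
```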
See the [Apache Spark - Apache HBase Connector](https://github.com/hortonworks-spark/shc) documentation for more information on using this connector in your own Spark jobs.

Submit the example jar as a Dataproc job:

```bash
CLUSTER_NAME=<cluster_name>
gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} \
    --class org.apache.spark.sql.execution.datasources.hbase.examples.HBaseSource \
    --jars file:///usr/lib/spark/examples/jars/shc-examples.jar
```
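Submitting your own connector-based job looks similar. The following is a sketch only: `com.example.MyApp` and the application jar path are placeholders for your own build, and it assumes the initialization action has already placed the connector jar on the cluster's Spark classpath:

```bash
# Sketch: class name and jar path are placeholders for your own application.
gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} \
    --class com.example.MyApp \
    --jars gs://<your-bucket>/my-spark-hbase-app.jar
```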
Important notes:

-   You can edit and upload your own copy of `bigtable.sh` to Google Cloud Storage and use that instead.
-   If you wish to use an instance in another project, you can specify `--metadata bigtable-project=<PROJECT>` (this sets `google.bigtable.project.id`). Make sure your cluster's service account is authorized to access the instance; by default, the service account that created the cluster is used.
-   If you specify custom service account scopes, make sure to add appropriate Bigtable scopes or `cloud-platform`. Clusters have `bigtable.admin.table` and `bigtable.data` by default. (A sketch combining these flags follows this list.)
-   The Apache Spark - Apache HBase Connector version is `1.1.1-2.1-s_2.11`.
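For example, the following sketch creates a cluster that connects to a Bigtable instance in another project and grants the broad `cloud-platform` scope; all bracketed values are placeholders:

```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigtable/bigtable.sh \
    --metadata bigtable-instance=<BIGTABLE INSTANCE>,bigtable-project=<PROJECT> \
    --scopes cloud-platform
```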