This document describes how to use the Cloud Shuffle Plugin with cloud storage services other than the natively supported Amazon S3.

To use the plugin with Google Cloud Storage:
- Get the Google Cloud Storage Connector for Spark and Hadoop library. You can get the library from the Maven repository:
  - Choose the appropriate version of the GCS connector.
  - Choose View All under Files.
  - Download the file ending with `-shaded.jar` (e.g. `gcs-connector-hadoop3-2.2.3-shaded.jar`).
  - Alternatively, use spark-submit `--packages` to import the jars directly from the Maven repository.
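For example, a submit-time import could look like the sketch below. The coordinate `com.google.cloud.bigdataoss:gcs-connector` with version `hadoop3-2.2.3` matches the shaded jar named above, and the plugin jar path is a placeholder; verify both against the artifacts you actually use. Note that `--packages` resolves the unshaded artifact and its dependencies, so if you run into dependency conflicts, fall back to downloading the shaded jar as described above.

```
# Sketch: pull the GCS connector from the Maven repository at submit time.
# The version must match your Hadoop build; the plugin jar path is a placeholder.
spark-submit \
  --jars /path/to/cloud-shuffle-plugin.jar \
  --packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.3 \
  ...
```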
- Create a Google Cloud Storage bucket `<GCS Shuffle Bucket>` for shuffle file storage.
- Add the Cloud Shuffle Plugin jar and the shaded jar of the Google Cloud Storage Connector for Spark and Hadoop to the Spark driver and executor classpath.

```
spark.driver.extraClassPath=<path to jars>
spark.executor.extraClassPath=<path to jars>
```
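For example, if both jars are copied to the same location on every node, the settings could look like the following sketch; the directory and the plugin jar name are hypothetical, and the jars must exist at these paths on every node.

```
# Hypothetical paths; the jars must exist at these locations on all nodes.
spark.driver.extraClassPath=/opt/spark/jars/cloud-shuffle-plugin.jar:/opt/spark/jars/gcs-connector-hadoop3-2.2.3-shaded.jar
spark.executor.extraClassPath=/opt/spark/jars/cloud-shuffle-plugin.jar:/opt/spark/jars/gcs-connector-hadoop3-2.2.3-shaded.jar
```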
- Set the following Spark configurations for working with Google Cloud Storage.

```
--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
--conf spark.shuffle.storage.path=gs://<GCS Shuffle Bucket>/<shuffle dir> \
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
```
- Set up the permission to access the shuffle bucket, choose an appropriate authentication mechanism, and set the credentials for Google Cloud Storage. Here are two examples for authentication:
  - JSON keyfile service account authentication

```
--conf spark.hadoop.fs.gs.auth.type=SERVICE_ACCOUNT_JSON_KEYFILE \
--conf spark.hadoop.fs.gs.auth.service.account.json.keyfile=<Your JSON key file path. The file must exist at the same path on all nodes>
```
  - Google Compute Engine service account authentication (be careful not to set the key in plaintext; instead, load the private key and set it programmatically via `SparkConf`, as in the sketch after this example)

```
--conf spark.hadoop.fs.gs.auth.service.account.enable=true \
--conf spark.hadoop.fs.gs.auth.service.account.email=<Your Service Account email> \
--conf spark.hadoop.fs.gs.auth.service.account.private.key.id=<Your private key ID extracted from the credential's JSON> \
--conf spark.hadoop.fs.gs.auth.service.account.private.key=<Your private key extracted from the credential's JSON>
```
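A minimal sketch of the programmatic approach, assuming the private key and key ID are made available to the driver through environment variables (a secrets manager would work just as well); the variable names are illustrative and not part of the plugin or the connector:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative only: read the service-account secrets from environment variables
// so the private key never appears in plaintext in scripts or job definitions.
val privateKeyId = sys.env("GCS_SA_PRIVATE_KEY_ID")
val privateKey   = sys.env("GCS_SA_PRIVATE_KEY")

val conf = new SparkConf()
  .set("spark.shuffle.sort.io.plugin.class", "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
  .set("spark.shuffle.storage.path", "gs://<GCS Shuffle Bucket>/<shuffle dir>")
  .set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .set("spark.hadoop.fs.gs.auth.service.account.enable", "true")
  .set("spark.hadoop.fs.gs.auth.service.account.email", "<Your Service Account email>")
  .set("spark.hadoop.fs.gs.auth.service.account.private.key.id", privateKeyId)
  .set("spark.hadoop.fs.gs.auth.service.account.private.key", privateKey)

val spark = SparkSession.builder().config(conf).getOrCreate()
```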
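Put together, a complete spark-submit invocation (here using the JSON keyfile mechanism from the first example above) might look like the following sketch; the bucket, shuffle directory, jar names, keyfile path, and application name are placeholders to substitute with your own values.

```
# Sketch: full submit command for GCS-backed shuffle with JSON keyfile authentication.
# All paths, the bucket name, and the jar names are placeholders.
spark-submit \
  --conf spark.driver.extraClassPath=/opt/spark/jars/cloud-shuffle-plugin.jar:/opt/spark/jars/gcs-connector-hadoop3-2.2.3-shaded.jar \
  --conf spark.executor.extraClassPath=/opt/spark/jars/cloud-shuffle-plugin.jar:/opt/spark/jars/gcs-connector-hadoop3-2.2.3-shaded.jar \
  --conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
  --conf spark.shuffle.storage.path=gs://my-shuffle-bucket/shuffle-data \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.gs.auth.type=SERVICE_ACCOUNT_JSON_KEYFILE \
  --conf spark.hadoop.fs.gs.auth.service.account.json.keyfile=/etc/gcp/shuffle-sa-key.json \
  your-application.jar
```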
For details about configurations and authentication, see https://github.com/GoogleCloudDataproc/hadoop-connectors/ and https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md.
To use the plugin with Microsoft Azure Blob Storage:

- Get the hadoop-azure library. You can get the library from the Maven repository.
- Get the azure-storage library. You can get the library from the Maven repository.
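As with the GCS connector, both libraries can alternatively be imported at submit time with `--packages`. The coordinates below are the standard Maven ones, but the versions are placeholders that should be matched to your Hadoop distribution, and the plugin jar path is hypothetical.

```
# Sketch: pull hadoop-azure and azure-storage from the Maven repository at submit time.
# Versions are illustrative; align hadoop-azure with your Hadoop version.
spark-submit \
  --jars /path/to/cloud-shuffle-plugin.jar \
  --packages org.apache.hadoop:hadoop-azure:3.3.4,com.microsoft.azure:azure-storage:8.6.6 \
  ...
```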
- Create a storage account in Microsoft Azure.
- Create a Blob storage container `<Azure Blob container>` for shuffle file storage.
- Add the Cloud Shuffle Plugin jar, the hadoop-azure jar, and the azure-storage jar to the Spark driver and executor classpath.

```
spark.driver.extraClassPath=<path to jars>
spark.executor.extraClassPath=<path to jars>
```
- Set up the permission to access the shuffle container, and use an appropriate authentication mechanism to pass the credentials to the filesystem.
- Set the following Spark configurations for working with Microsoft Azure Blob Storage.

```
--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
--conf spark.shuffle.storage.path=wasbs://<Azure Blob container>@<Storage account>.blob.core.windows.net/<shuffle dir> \
--conf spark.hadoop.fs.azure.account.key.<Storage account>.blob.core.windows.net=<Your Azure Key>
```
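Put together, a complete spark-submit invocation could look like the sketch below; the container, storage account, paths, jar names, and application name are placeholders, and the account key should be injected from a secure source rather than hard-coded in scripts.

```
# Sketch: full submit command for Azure Blob Storage-backed shuffle.
# All names and paths are placeholders; do not hard-code the account key in version-controlled scripts.
spark-submit \
  --conf spark.driver.extraClassPath=/opt/spark/jars/cloud-shuffle-plugin.jar:/opt/spark/jars/hadoop-azure.jar:/opt/spark/jars/azure-storage.jar \
  --conf spark.executor.extraClassPath=/opt/spark/jars/cloud-shuffle-plugin.jar:/opt/spark/jars/hadoop-azure.jar:/opt/spark/jars/azure-storage.jar \
  --conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
  --conf spark.shuffle.storage.path=wasbs://my-shuffle-container@mystorageaccount.blob.core.windows.net/shuffle-data \
  --conf spark.hadoop.fs.azure.account.key.mystorageaccount.blob.core.windows.net=<Your Azure Key> \
  your-application.jar
```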
For details about configurations, see https://hadoop.apache.org/docs/current/hadoop-azure/.