The Anaconda Component is the best way to use Anaconda with Cloud Dataproc. To learn more about Dataproc optional components, see the Dataproc Components documentation.
This folder contains initialization actions for using Miniconda / conda, which can be introduced as:

- Miniconda, a barebones version of Anaconda, an open source (and totally legit) Python distribution from Continuum Analytics, alongside
- conda, an open source (and amazing) package and environment management system.
This allows Dataproc users to quickly and easily provision a Dataproc cluster that leverages conda's powerful management capabilities, by specifying a list of conda and/or pip packages or a conda environment definition. All configuration is exposed via environment variables set to sane point-and-shoot defaults.
Starting with Dataproc image version 1.3, this initialization action may not be necessary:

- Starting with image version 1.3, Anaconda can be installed via the Anaconda Optional Component.
- Starting with image version 1.4, Miniconda is the default Python interpreter.
- On image version 1.3, the Python environment is based on Python 2.7. On image version 1.4 and later, the Python environment is Python 3.6.

Please see the following tutorial for full details: https://cloud.google.com/dataproc/docs/tutorials/python-configuration.
To create a Dataproc cluster with Miniconda / conda installed, pass both scripts as initialization actions:

```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions \
        gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
```
You can add extra packages by using the metadata entries `CONDA_PACKAGES` and `PIP_PACKAGES`. These variables provide a space-separated list of additional packages to install:
```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'CONDA_PACKAGES="numpy pandas",PIP_PACKAGES=pandas-gbq' \
    --initialization-actions \
        gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
```
Alternatively, you can set these environment variables from your own wrapper initialization action, e.g.:
```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://<your_bucket>/path/to/create-my-cluster.sh
```
where `create-my-cluster.sh` specifies a list of conda and/or pip packages to install:
```bash
#!/usr/bin/env bash

gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh .
gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh .
chmod 755 ./*conda*.sh

# Install Miniconda / conda
./bootstrap-conda.sh

# Update conda root environment with specific packages in pip and conda
CONDA_PACKAGES='pandas scikit-learn'
PIP_PACKAGES='plotly cufflinks'
CONDA_PACKAGES=$CONDA_PACKAGES PIP_PACKAGES=$PIP_PACKAGES ./install-conda-env.sh
```
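Because `create-my-cluster.sh` is your own wrapper initialization action, it must be staged in a Cloud Storage bucket you control before creating the cluster. A minimal sketch, assuming the placeholder bucket path used in the cluster create command above:

```bash
# Stage the wrapper initialization action in a bucket you own.
# <your_bucket> is a placeholder; use the same path you pass to
# --initialization-actions when creating the cluster.
gsutil cp create-my-cluster.sh gs://<your_bucket>/path/to/create-my-cluster.sh
```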
Similarly, one can also specify a conda environment YAML file:
```bash
#!/usr/bin/env bash

CONDA_ENV_YAML_GSC_LOC="gs://my-bucket/path/to/conda-environment.yml"
CONDA_ENV_YAML_PATH="/root/conda-environment.yml"
echo "Downloading conda environment at $CONDA_ENV_YAML_GSC_LOC to $CONDA_ENV_YAML_PATH ... "
gsutil -m cp -r $CONDA_ENV_YAML_GSC_LOC $CONDA_ENV_YAML_PATH

gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh .
gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh .
chmod 755 ./*conda*.sh

# Install Miniconda / conda
./bootstrap-conda.sh

# Create / update conda environment via the conda YAML file
CONDA_ENV_YAML=$CONDA_ENV_YAML_PATH ./install-conda-env.sh
```
`bootstrap-conda.sh` contains logic for quickly configuring and installing Miniconda across the Dataproc cluster. It defaults to the `Miniconda3-4.5.4-Linux-x86_64.sh` installer (i.e., Python 3); however, users can easily target specific versions via the following instance metadata keys:
- `MINICONDA_VARIANT`: the Python version, either `2` or `3`
- `MINICONDA_VERSION`: the Miniconda version (e.g., `4.0.5` or `latest`)
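For example, a cluster pinned to a specific Miniconda build could be created as follows (a sketch based on the metadata keys above; the version values are illustrative):

```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'MINICONDA_VARIANT=3,MINICONDA_VERSION=4.5.4' \
    --initialization-actions \
        gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
```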
In addition, the script:

- downloads and installs Miniconda to the `$HOME` directory
- updates `$PATH`, exposing conda across all shell processes (for both interactive and batch sessions)
- installs some useful extensions
See the script source for more options on configuration. :)
`install-conda-env.sh` contains logic for creating a conda environment and installing conda and/or pip packages. Defaults include:

- if no conda environment name is specified, uses `root` (recommended)
- detects if the conda environment has already been created
- updates `/etc/profile` to activate the created environment at login (if needed)
Note: When creating a conda environment using an `environment.yml` file (by setting the path to the .yml file in `CONDA_ENV_YAML`), the `install-conda-env.sh` script simply updates the root environment with the dependencies specified in the file (i.e., ignoring the `name:` key). This sidesteps some conda issues with `source activate`, while still providing all dependencies across the Dataproc cluster.
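For reference, the environment file consumed this way is a standard conda `environment.yml`. A minimal sketch, assuming the bucket path used in the wrapper script above (the package list is illustrative, and the `name:` key is ignored as noted):

```bash
# Write an illustrative conda environment file and stage it in GCS so that
# CONDA_ENV_YAML_GSC_LOC in the wrapper script above can download it.
cat > conda-environment.yml <<'EOF'
name: my-env          # ignored: install-conda-env.sh updates the root environment
channels:
  - conda-forge
dependencies:
  - pandas
  - scikit-learn
  - pip
  - pip:
      - plotly
EOF

gsutil cp conda-environment.yml gs://my-bucket/path/to/conda-environment.yml
```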
As a quick test to ensure a correct installation of conda, we can submit jobs that collect the distinct paths to the Python executable across all Spark executors. For both local jobs (e.g., run from the Dataproc cluster master node) and remote jobs (e.g., submitted via the Dataproc API), the result should be a list with a single path: `['/opt/conda/bin/python']`. For example:
After SSHing to the master node (e.g., `gcloud compute ssh $DATAPROC_CLUSTER_NAME-m`), run the `get-sys-exec.py` script contained in this directory:
```
> spark-submit get-sys-exec.py
... # Lots of output
['/opt/conda/bin/python']
...
```
From the command line of your local / host machine, one can submit a remote job:
```
> gcloud dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py
... # Lots of output
['/opt/conda/bin/python']
...
```
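If `get-sys-exec.py` is not at hand, a minimal equivalent could look like the sketch below (the file name and partition count are illustrative); it simply collects the distinct `sys.executable` paths reported by the executors:

```bash
# Hypothetical stand-in for get-sys-exec.py: a tiny PySpark job that prints
# the distinct Python executable paths seen across the Spark executors.
cat > /tmp/check-python-paths.py <<'EOF'
import sys
from pyspark import SparkContext

sc = SparkContext()
paths = (sc.parallelize(range(1000), 20)
           .map(lambda _: sys.executable)
           .distinct()
           .collect())
print(sorted(paths))   # expect ['/opt/conda/bin/python'] on every executor
sc.stop()
EOF

spark-submit /tmp/check-python-paths.py
```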