Connect to Apache Spark
As a BigQuery administrator, you can create a connection to enable data analysts to run stored procedures for Apache Spark.
Before you begin
- Enable the BigQuery Connection API.
- To get the permissions that you need to create a Spark connection, ask your administrator to grant you the BigQuery Connection Admin (roles/bigquery.connectionAdmin) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations. You might also be able to get the required permissions through custom roles or other predefined roles. A gcloud sketch of these first two steps follows this list.
- Optional: To manage your metadata using Dataproc Metastore, ensure that you have created a Dataproc Metastore service.
- Optional: To view job history using Spark History Server web interfaces, ensure that you have created a Dataproc Persistent History Server (PHS).
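If you prefer the command line, the following is a minimal sketch of the first two setup steps. It assumes the gcloud CLI is installed and authenticated; my-project-id and [email protected] are placeholder values.

# Enable the BigQuery Connection API in your project.
gcloud services enable bigqueryconnection.googleapis.com --project=my-project-id

# Grant the BigQuery Connection Admin role to the user who will create the connection.
gcloud projects add-iam-policy-binding my-project-id \
    --member="user:[email protected]" \
    --role="roles/bigquery.connectionAdmin"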
Location considerations
When you choose a location for your data, consider the following:
Multi-regions
You must specify Google Cloud resources located in the same large geographic area:
- A connection in the BigQuery US multi-region can reference a Spark History Server or a Dataproc Metastore in any single region in the US geographic area, such as us-central1, us-east4, or us-west2.
- A connection in the BigQuery EU multi-region can reference a Spark History Server or a Dataproc Metastore in member states of the European Union, such as europe-north1 or europe-west3.
Single regions
A connection in a single region can only reference Google Cloud resources in the same region. For example, a connection in the single region us-east4 can only reference a Spark History Server or a Dataproc Metastore in us-east4.
Create connections
Select one of the following options:
Console
Go to the BigQuery page.
To create a connection, click Add data, and then click Connections to external data sources.
In the Connection type list, select Apache Spark.
In the Connection ID field, enter a name for your connection—for example, spark_connection.
In the Data location list, select a region.
You can create a connection in regions and multi-regions that support BigQuery. For more information, see Location considerations.
Optional: From the Metastore service list, select a Dataproc Metastore.
Optional: In the History server cluster field, enter a Dataproc Persistent History Server.
Click Create connection.
Click Go to connection.
In the Connection info pane, copy the service account ID for use in a later step.
bq
In a command-line environment, use the bq mk command to create a connection:

bq mk --connection --connection_type='SPARK' \
    --properties=PROPERTIES \
    --project_id=PROJECT_ID \
    --location=LOCATION CONNECTION_ID
Replace the following:
- PROPERTIES: a key-value pair to provide connection-specific parameters in JSON format. For example:

  --properties='{
      "metastoreServiceConfig": {"metastoreService": "METASTORE_SERVICE_NAME"},
      "sparkHistoryServerConfig": {"dataprocCluster": "DATAPROC_CLUSTER_NAME"}
  }'
Replace the following:
  - METASTORE_SERVICE_NAME: the Dataproc Metastore with a gRPC network configuration—for example, projects/my-project-id/locations/us-central1/services/my-service. For more information, see how to access the stored Hive metastore metadata using an endpoint protocol.
  - DATAPROC_CLUSTER_NAME: the Spark History Server configuration—for example, projects/my-project-id/regions/us-central1/clusters/my-cluster. For more information, see Create a Persistent History Server cluster.
- PROJECT_ID: your Google Cloud project ID
- LOCATION: the location where you want to store the connection—for example, US
- CONNECTION_ID: the connection ID—for example, myconnection. When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example, projects/.../locations/.../connections/myconnection
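As an illustration, here is what the full command might look like with the placeholders filled in. The project, metastore, and cluster names are made up for this example; substitute your own resources.

bq mk --connection --connection_type='SPARK' \
    --properties='{"metastoreServiceConfig": {"metastoreService": "projects/my-project-id/locations/us-central1/services/my-service"}, "sparkHistoryServerConfig": {"dataprocCluster": "projects/my-project-id/regions/us-central1/clusters/my-cluster"}}' \
    --project_id=my-project-id \
    --location=US \
    spark_connection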
Retrieve and copy the service account ID because you need it in another step:
bq show --location=LOCATION --connection PROJECT_ID.LOCATION.CONNECTION_ID
The output is similar to the following:
Connection myproject.us.myconnection

            name             type                         properties
 --------------------------- ------- ---------------------------------------------------
  myproject.us.myconnection   SPARK   {"serviceAccountId": "[email protected]"}
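If you want to script this step, one option is to print the connection as JSON and filter for the service account ID. This is only a sketch, reusing the example connection name from above; the exact JSON layout can vary by connection type.

# Print the connection metadata as JSON and pull out the service account ID.
bq show --format=prettyjson --connection myproject.us.myconnection | grep serviceAccountId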
For information about how to manage connections, see Manage connections.
Grant access to the service account
To let a stored procedure for Apache Spark access your Google Cloud resources, you need to grant the service account that's associated with the stored procedure's connection the necessary IAM permissions. Alternatively, you can use your custom service account for data access.
To read and write data from and to BigQuery, you need to give the service account the following IAM permissions:
- bigquery.tables.* on your BigQuery tables
- bigquery.readsessions.* on your project
The roles/bigquery.admin IAM role includes the permissions that the service account needs in order to read and write data from and to BigQuery.

To read and write data from and to Cloud Storage, you need to give the service account the storage.objects.* permission on your Cloud Storage objects. The roles/storage.objectAdmin IAM role includes the permissions that the service account needs in order to read and write data from and to Cloud Storage. A gcloud sketch of these grants appears at the end of this section.

If you specify Dataproc Metastore when you create a connection, then for BigQuery to retrieve details about the metastore configuration, you need to give the service account the metastore.services.get permission on your Dataproc Metastore. The predefined roles/metastore.metadataViewer role includes the permission that the service account needs in order to retrieve details about the metastore configuration. You also need to grant the service account the roles/storage.objectAdmin role on the Cloud Storage bucket so that your stored procedure can access the Hive warehouse directory of your Dataproc Metastore (hive.metastore.warehouse.dir). If your stored procedure performs operations on the metastore, you might need to give additional permissions. For more information about IAM roles and permissions in Dataproc Metastore, see Dataproc Metastore predefined roles and permissions.

If you specify a Dataproc Persistent History Server when you create a connection, then you need to grant the service account the following roles:
- The roles/dataproc.viewer role on your Dataproc Persistent History Server, which contains the dataproc.clusters.get permission.
- The roles/storage.objectAdmin role on the Cloud Storage bucket that you specify for the property spark:spark.history.fs.logDirectory when you create the Dataproc Persistent History Server.
For more information, see Dataproc Persistent History Server and Dataproc roles and permissions.
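As a sketch, the following gcloud commands grant the broad roles described above to the connection's service account. The project, bucket, and service account values are placeholders taken from the earlier example output; in production you might prefer narrower custom roles.

# Allow the connection's service account to read and write BigQuery data.
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:[email protected]" \
    --role="roles/bigquery.admin"

# Allow the same service account to read and write objects in your Cloud Storage bucket.
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
    --member="serviceAccount:[email protected]" \
    --role="roles/storage.objectAdmin"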
Share connections with users
You can grant the following roles to let users query data and manage connections:
- roles/bigquery.connectionUser: enables users to use connections to connect with external data sources and run queries on them.
- roles/bigquery.connectionAdmin: enables users to manage connections.
For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
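For example, assuming project-level access is acceptable, a minimal gcloud sketch that lets an analyst use any connection in the project looks like the following; the project and user are placeholders. For per-connection sharing, use the console or API options below.

# Let an analyst use connections in the project to query external data.
gcloud projects add-iam-policy-binding my-project-id \
    --member="user:[email protected]" \
    --role="roles/bigquery.connectionUser"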
Select one of the following options:
Console
Go to the BigQuery page.
Connections are listed in your project, in a group called External connections.
In the Explorer pane, click your project name > External connections > connection.
In the Details pane, click Share to share a connection. Then do the following:
In the Connection permissions dialog, share the connection with other principals by adding or editing principals.
Click Save.
bq
You cannot share a connection with the bq command-line tool. To share a connection, use the Google Cloud console or the BigQuery Connections API.
API
Use the projects.locations.connections.setIamPolicy method in the BigQuery Connections REST API reference section, and supply an instance of the policy resource.
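As a sketch, the REST call might look like the following curl command. The project, location, connection name, and member are placeholders, and in practice you typically read the current policy first and then modify it rather than overwriting it.

# Set an IAM policy on the connection that grants the connection user role.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"policy": {"bindings": [{"role": "roles/bigquery.connectionUser", "members": ["user:[email protected]"]}]}}' \
    "https://bigqueryconnection.googleapis.com/v1/projects/my-project-id/locations/us/connections/myconnection:setIamPolicy"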
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
What's next
- Learn about different connection types.
- Learn about managing connections.
- Learn how to create a stored procedure for Apache Spark.
- Learn how to manage stored procedures.