Work with remote functions
A BigQuery remote function allows you to implement your function in other languages than SQL and Javascript or with the libraries or services which are not allowed in BigQuery user-defined functions.
Overview
A BigQuery remote function lets you incorporate GoogleSQL functionality with software outside of BigQuery by providing a direct integration with Cloud Run functions and Cloud Run. With BigQuery remote functions, you can deploy your functions in Cloud Run functions or Cloud Run implemented with any supported language, and then invoke them from GoogleSQL queries.
Workflow
- Create the HTTP endpoint in Cloud Run functions or Cloud Run.
- Create a remote function in BigQuery.
- Create a connection of type
CLOUD_RESOURCE
. - Create a remote function.
- Create a connection of type
- Use the remote function in a query just like any other user-defined functions.
Limitations
-
Remote functions only support one of the following data types as argument type or return type:
- Boolean
- Bytes
- Numeric
- String
- Date
- Datetime
- Time
- Timestamp
- JSON
Remote functions do not support
ARRAY
,STRUCT
,INTERVAL
, orGEOGRAPHY
types. You cannot create temporary remote functions.
You cannot create table-valued remote functions.
You cannot use remote functions when creating materialized views.
The return value of a remote function is always assumed to be non-deterministic so the result of a query calling a remote function is not cached.
You might see repeated requests with the same data to your endpoint, even after successful responses, due to transient network errors or BigQuery internal errors.
When a remote function evaluation is skipped for some rows due to short-circuiting, for example, in conditional expressions or a
MERGE
statement withWHEN [NOT] MATCHED
, batching is not used with the remote function. In this case, thecalls
field in the HTTP request body has exactly one element.If the dataset associated with the remote function is replicated to a destination region through cross-region dataset replication, the remote function can only be queried in the region that it was created in.
Create an endpoint
To create a remote function that can implement business logic, you must create an HTTP endpoint by using either Cloud Run functions or Cloud Run. The endpoint must be able to process a batch of rows in a single HTTP POST request and return the results for the batch as an HTTP response.
If you are creating the remote function by using BigQuery DataFrames, you don't have to manually create the HTTP endpoint; the service does that for you automatically.
See the Cloud Run functions tutorial and other Cloud Run functions documentation on how to write, deploy, test and maintain a Cloud Run function.
See the Cloud Run quick start and other Cloud Run documentation on how to write, deploy, test and maintain a Cloud Run service.
It's recommended that you keep the default authentication instead of allowing unauthenticated invocation of your Cloud Run function or Cloud Run service.
Input format
BigQuery sends HTTP POST requests with JSON body in the following format:
Field name | Description | Field type |
---|---|---|
requestId | Id of the request. Unique over multiple requests sent to this endpoint in a GoogleSQL query. | Always provided. String. |
caller | Job full resource name for the GoogleSQL query calling the remote function. | Always provided. String. |
sessionUser | Email of the user executing the GoogleSQL query. | Always provided. String. |
userDefinedContext | The user defined context that was used when creating the remote function in BigQuery. | Optional. A JSON object with key-value pairs. |
calls | A batch of input data. | Always provided. A JSON array.
Each element itself is a JSON array, which is a JSON encoded argument list of one remote function call. |
An example of a request:
{
"requestId": "124ab1c",
"caller": "//bigquery.googleapis.com/projects/myproject/jobs/myproject:US.bquxjob_5b4c112c_17961fafeaf",
"sessionUser": "[email protected]",
"userDefinedContext": {
"key1": "value1",
"key2": "v2"
},
"calls": [
[null, 1, "", "abc"],
["abc", "9007199254740993", null, null]
]
}
Output format
BigQuery expects the endpoint should return a HTTP response in the following format, otherwise BigQuery can't consume it and will fail the query calling the remote function.
Field name | Description | Value Range |
replies | A batch of return values. | Required for a successful response. A JSON array.
Each element corresponds to a JSON encoded return value of the external function.
Size of the array must match the size of the JSON array of |
errorMessage | Error message when the HTTP response code other than 200 is returned. For non-retryable errors, we return this as part of the BigQuery job's error message to the user. | Optional. String. Size should be less than 1KB. |
An example of a successful response:
{
"replies": [
1,
0
]
}
An example of a failed response:
{
"errorMessage": "Received but not expected that the argument 0 be null".
}
HTTP response code
Your endpoint should return the HTTP response code 200 for a successful response. When BigQuery receives any other value, BigQuery considers the response as a failure, and retries when the HTTP response code is 408, 429, 500, 503 or 504 until some internal limit.
JSON encoding of SQL data type
JSON encoding in HTTP request/response follows the existing BigQuery JSON encoding for TO_JSON_STRING function.
Sample Cloud Run function code
The following sample Python code implements adding all the integer arguments of the remote function. It handles a request with the arguments for batched invocations and returns all the result in a response.
import functions_framework
from flask import jsonify
# Max INT64 value encoded as a number in JSON by TO_JSON_STRING. Larger values are encoded as
# strings.
# See https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_encodings
_MAX_LOSSLESS=9007199254740992
@functions_framework.http
def batch_add(request):
try:
return_value = []
request_json = request.get_json()
calls = request_json['calls']
for call in calls:
return_value.append(sum([int(x) if isinstance(x, str) else x for x in call if x is not None]))
replies = [str(x) if x > _MAX_LOSSLESS or x < -_MAX_LOSSLESS else x for x in return_value]
return_json = jsonify( { "replies": replies } )
return return_json
except Exception as e:
return jsonify( { "errorMessage": str(e) } ), 400
Assuming that the function is deployed in the project my_gcf_project
in region
us-east1
as the function name remote_add
, it can be accessed via the
endpoint https://us-east1-my_gcf_project.cloudfunctions.net/remote_add
.
Sample Cloud Run code
The following sample Python code implements a web service, which can be built and deployed to Cloud Run for the same functionality.
import os
from flask import Flask, request, jsonify
# Max INT64 value encoded as a number in JSON by TO_JSON_STRING. Larger values are encoded as
# strings.
# See https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_encodings
_MAX_LOSSLESS=9007199254740992
app = Flask(__name__)
@app.route("/", methods=['POST'])
def batch_add():
try:
return_value = []
request_json = request.get_json()
calls = request_json['calls']
for call in calls:
return_value.append(sum([int(x) if isinstance(x, str) else x for x in call if x is not None]))
replies = [str(x) if x > _MAX_LOSSLESS or x < -_MAX_LOSSLESS else x for x in return_value]
return jsonify( { "replies" : replies } )
except Exception as e:
return jsonify( { "errorMessage": str(e) } ), 400
if __name__ == "__main__":
app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
See the guide on how to build and deploy the code.
Assuming that the Cloud Run service is deployed in the project
my_gcf_project
in region us-east1
as the service name remote_add
, it can
be accessed via the endpoint
https://remote_add-<project_id_hash>-ue.a.run.app
.
Create a remote function
BigQuery uses a CLOUD_RESOURCE
connection to interact with your
Cloud Run function. In order to create a remote function, you must
create a CLOUD_RESOURCE
connection. If you are creating the remote function by
using BigQuery DataFrames
and you have been granted the
Project IAM Admin (roles/resourcemanager.projectIamAdmin
) role, then you
don't have to manually create the connection and grant it access; the service
does that for you automatically.
Create a connection
You must have a Cloud resource connection to connect to Cloud Run function and Cloud Run.
Select one of the following options:
Console
Go to the BigQuery page.
To create a connection, click
Add, and then click Connections to external data sources.In the Connection type list, select Vertex AI remote models, remote functions and BigLake (Cloud Resource).
In the Connection ID field, enter a name for your connection.
Click Create connection.
Click Go to connection.
In the Connection info pane, copy the service account ID for use in a later step.
bq
In a command-line environment, create a connection:
bq mk --connection --location=REGION --project_id=PROJECT_ID \ --connection_type=CLOUD_RESOURCE CONNECTION_ID
The
--project_id
parameter overrides the default project.Replace the following:
REGION
: your connection regionPROJECT_ID
: your Google Cloud project IDCONNECTION_ID
: an ID for your connection
When you create a connection resource, BigQuery creates a unique system service account and associates it with the connection.
Troubleshooting: If you get the following connection error, update the Google Cloud SDK:
Flags parsing error: flag --connection_type=CLOUD_RESOURCE: value should be one of...
Retrieve and copy the service account ID for use in a later step:
bq show --connection PROJECT_ID.REGION.CONNECTION_ID
The output is similar to the following:
name properties 1234.REGION.CONNECTION_ID {"serviceAccountId": "connection-1234-9u56h9@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}
Terraform
Use the
google_bigquery_connection
resource.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
The following example creates a Cloud resource connection named
my_cloud_resource_connection
in the US
region:
To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.
Prepare Cloud Shell
- Launch Cloud Shell.
-
Set the default Google Cloud project where you want to apply your Terraform configurations.
You only need to run this command once per project, and you can run it in any directory.
export GOOGLE_CLOUD_PROJECT=PROJECT_ID
Environment variables are overridden if you set explicit values in the Terraform configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also called a root module).
-
In Cloud Shell, create a directory and a new
file within that directory. The filename must have the
.tf
extension—for examplemain.tf
. In this tutorial, the file is referred to asmain.tf
.mkdir DIRECTORY && cd DIRECTORY && touch main.tf
-
If you are following a tutorial, you can copy the sample code in each section or step.
Copy the sample code into the newly created
main.tf
.Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.
- Review and modify the sample parameters to apply to your environment.
- Save your changes.
-
Initialize Terraform. You only need to do this once per directory.
terraform init
Optionally, to use the latest Google provider version, include the
-upgrade
option:terraform init -upgrade
Apply the changes
-
Review the configuration and verify that the resources that Terraform is going to create or
update match your expectations:
terraform plan
Make corrections to the configuration as necessary.
-
Apply the Terraform configuration by running the following command and entering
yes
at the prompt:terraform apply
Wait until Terraform displays the "Apply complete!" message.
- Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Set up access
You must give the new connection read-only access to your Cloud Run function or Cloud Run service. It is not recommended to allow unauthenticated invocation for your Cloud Run function or Cloud Run service.
To grant roles, follow these steps:
Go to the IAM & Admin page.
Click
Add.The Add principals dialog opens.
In the New principals field, enter the service account ID that you copied earlier.
In the Select a role field, select one of the following options:
- If you are using a 1st-gen Cloud Run function, choose Cloud Function, and then select Cloud Function Invoker role.
- If you are using a 2nd-gen Cloud Run function, choose Cloud Run, and then select Cloud Run Invoker role.
- If you are using a Cloud Run service, choose Cloud Run, and then select Cloud Run Invoker role.
Click Save.
Create a remote function
To create a remote function:
SQL
Run the following
CREATE FUNCTION
statement in BigQuery:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE FUNCTION
PROJECT_ID.DATASET_ID
.remote_add(x INT64, y INT64) RETURNS INT64 REMOTE WITH CONNECTIONPROJECT_ID.LOCATION.CONNECTION_NAME
OPTIONS ( endpoint = 'ENDPOINT_URL' )Replace the following:
DATASET_ID
: the ID of your BigQuery dataset.ENDPOINT_URL
: the URL of your Cloud Run function or Cloud Run remote function endpoint.
Click
Run.
For more information about how to run queries, see Run an interactive query.
BigQuery DataFrames
- Enable the required APIs and make sure you have been granted the required roles, as described in the Requirements section of Remote functions.
Use the
remote_function
decorator:import bigframes.pandas as bpd # Set BigQuery DataFrames options bpd.options.bigquery.project = your_gcp_project_id bpd.options.bigquery.location = "us" # BigQuery DataFrames gives you the ability to turn your custom scalar # functions into a BigQuery remote function. It requires the GCP project to # be set up appropriately and the user having sufficient privileges to use # them. One can find more details about the usage and the requirements via # `help` command. help(bpd.remote_function) # Read a table and inspect the column of interest. df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins") df["body_mass_g"].head(10) # Define a custom function, and specify the intent to turn it into a remote # function. It requires a BigQuery connection. If the connection is not # already created, BigQuery DataFrames will attempt to create one assuming # the necessary APIs and IAM permissions are setup in the project. In our # examples we will be letting the default connection `bigframes-default-connection` # be used. We will also set `reuse=False` to make sure we don't # step over someone else creating remote function in the same project from # the exact same source code at the same time. Let's try a `pandas`-like use # case in which we want to apply a user defined scalar function to every # value in a `Series`, more specifically bucketize the `body_mass_g` value # of the penguins, which is a real number, into a category, which is a # string. @bpd.remote_function( float, str, reuse=False, ) def get_bucket(num: float) -> str: if not num: return "NA" boundary = 4000 return "at_or_above_4000" if num >= boundary else "below_4000" # Then we can apply the remote function on the `Series`` of interest via # `apply` API and store the result in a new column in the DataFrame. df = df.assign(body_mass_bucket=df["body_mass_g"].apply(get_bucket)) # This will add a new column `body_mass_bucket` in the DataFrame. You can # preview the original value and the bucketized value side by side. df[["body_mass_g", "body_mass_bucket"]].head(10) # The above operation was possible by doing all the computation on the # cloud. For that, there is a google cloud function deployed by serializing # the user code, and a BigQuery remote function created to call the cloud # function via the latter's http endpoint on the data in the DataFrame. # The BigQuery remote function created to support the BigQuery DataFrames # remote function can be located via a property `bigframes_remote_function` # set in the remote function object. print(f"Created BQ remote function: {get_bucket.bigframes_remote_function}") # The cloud function can be located via another property # `bigframes_cloud_function` set in the remote function object. print(f"Created cloud function: {get_bucket.bigframes_cloud_function}") # Warning: The deployed cloud function may be visible to other users with # sufficient privilege in the project, so the user should be careful about # having any sensitive data in the code that will be deployed as a remote # function. # Let's continue trying other potential use cases of remote functions. Let's # say we consider the `species`, `island` and `sex` of the penguins # sensitive information and want to redact that by replacing with their hash # code instead. Let's define another scalar custom function and decorate it # as a remote function. The custom function in this example has external # package dependency, which can be specified via `packages` parameter. @bpd.remote_function( str, str, reuse=False, packages=["cryptography"], ) def get_hash(input: str) -> str: from cryptography.fernet import Fernet # handle missing value if input is None: input = "" key = Fernet.generate_key() f = Fernet(key) return f.encrypt(input.encode()).decode() # We can use this remote function in another `pandas`-like API `map` that # can be applied on a DataFrame df_redacted = df[["species", "island", "sex"]].map(get_hash) df_redacted.head(10)
You need to have the permission bigquery.routines.create
on the dataset where
you create the remote function, and the bigquery.connections.delegate
permission (available from the BigQuery Connection Admin role) on the connection
that is used by the remote function.
Providing user defined context
You can specify user_defined_context
in OPTIONS
as a form of key-value
pairs, which will be part of every HTTP request to the endpoint. With user
defined context, you can create multiple remote functions but re-use a single
endpoint, that provides different behaviors based on the context passed to it.
The following examples create two remote functions to encrypt and decrypt
BYTES
data using the same endpoint.
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.encrypt(x BYTES)
RETURNS BYTES
REMOTE WITH CONNECTION `PROJECT_ID.LOCATION.CONNECTION_NAME`
OPTIONS (
endpoint = 'ENDPOINT_URL',
user_defined_context = [("mode", "encryption")]
)
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.decrypt(x BYTES)
RETURNS BYTES
REMOTE WITH CONNECTION `PROJECT_ID.LOCATION.CONNECTION_NAME`
OPTIONS (
endpoint = 'ENDPOINT_URL',
user_defined_context = [("mode", "decryption")]
)
Limiting number of rows in a batch request
You can specify max_batching_rows
in OPTIONS
as the maximum number of rows
in each HTTP request, to avoid
Cloud Run functions timeout. If not
specified, BigQuery will decide how many rows are included in a
batch.
Use a remote function in a query
Make sure you have granted the permission on your Cloud Run function, so that it is accessible to BigQuery's service account associated with the connection of the remote function.
You also need to have the permission bigquery.routines.get
on the dataset
where the remote function is, and the bigquery.connections.use
permission,
which you can get through the BigQuery Connection User
role, on
the connection which is used by the remote function.
You can use a remote function in a query just like a user defined function.
For example, you can use the remote_add
function in the example query:
SELECT
val,
`PROJECT_ID.DATASET_ID`.remote_add(val, 2)
FROM
UNNEST([NULL,2,3,5,8]) AS val;
This example produces the following output:
+------+-----+
| val | f0_ |
+------+-----+
| NULL | 2 |
| 2 | 4 |
| 3 | 5 |
| 5 | 7 |
| 8 | 10 |
+------+-----+
Supported regions
There are two types of locations in BigQuery:
A region is a specific geographic place, such as London.
A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.
Single regions
In a BigQuery single region dataset, you can only create a remote function that uses a Cloud Run function deployed in the same region. For example:
- A remote function in BigQuery single region
us-east4
can only use a Cloud Run function inus-east4
.
So for single regions, remote functions are only supported in regions that support both Cloud Run functions and BigQuery.
Multi-regions
In a BigQuery multi-region (US
, EU
) dataset, you can only
create a remote function that uses a Cloud Run function deployed in a
region within the same large geographic area (US, EU). For example:
- A remote function in BigQuery
US
multi-region can only use a Cloud Run function deployed in any single region in the US geographic area, such asus-central1
,us-east4
,us-west2
, etc. - A remote function in BigQuery
EU
multi-region can only use a Cloud Run function deployed in any single region in member states of the European Union, such aseurope-north1
,europe-west3
, etc.
For more information about BigQuery regions and multi-regions, see the Dataset Locations page. For more information about Cloud Run functions regions, see the Cloud Run functions Locations page.
Connections
For either a single-region location or multi-region location, you can only
create a remote function in the same location as the connection you use. For
example, to create a remote function in the US
multi-region, use a connection
located in the US
multi-region.
Pricing
Standard BigQuery pricing applies.
In addition, costs may be incurred for Cloud Run functions and Cloud Run by using this feature. Please review the Cloud Run functions and Cloud Run pricing pages for details.
Using VPC Service Controls
VPC Service Controls is a Google Cloud feature that allows
you to set up a secure perimeter to guard against data exfiltration. To use
VPC Service Controls with remote functions for additional security, or to use
endpoints with internal traffic
ingress settings,
follow the
VPC Service Controls guide to:
Create a service perimeter.
Add the BigQuery project of the query using the remote function into the perimeter.
Add the endpoint project into the perimeter and set
Cloud Functions API
orCloud Run API
in the restricted services based on your endpoint type. For more details, see Cloud Run functions VPC Service Controls and Cloud Run VPC Service Controls.
Best practices for remote functions
Prefilter your input: If your input can be easily filtered down before being passed to a remote function, your query will likely be faster and cheaper.
Keep your Cloud Run function scalable. Scalability is a function of minimum instances, maximum instances, and concurrency.
- Where possible, use the default value for your Cloud Run function's maximum number of instances.
- Note that there is no default limit for 1st gen HTTP Cloud Run functions. To avoid unbounded scaling events with 1st gen HTTP Cloud Run functions while testing or in production, we recommend setting a limit, for example, 3000.
Follow other Cloud Run function tips for better performance. Remote function queries interacting with a high latency Cloud Run function might fail due to timeout.
Implement your endpoint to return a right HTTP response code and payload for a failed response.
To minimize retries from BigQuery, use HTTP response codes other than 408, 429, 500, 503 and 504 for a failed response, and make sure to catch all exceptions in your function code. Otherwise, the HTTP service framework may automatically return 500 for any uncaught exceptions. You might still see retried HTTP requests when BigQuery retries a failed data partition or query.
Your endpoint should return a JSON payload in the defined format for a failed response. Even not strictly required, it helps BigQuery distinguish whether the failed response is from your function implementation or the infrastructure of Cloud Run functions/Cloud Run. For the latter, BigQuery may retry with a different internal limit.
Quotas
For information about remote function quotas, see Quotas and limits.