Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale. Vertex AI Vector Search allows users to search for semantically similar items using vector embeddings.
You can integrate your Cloud Spanner database with Vector Search to perform vector similarity search on your Spanner data. The general workflow is as follows:
- Generate and store vector embeddings in Spanner. You can manage them similarly to your operational data.
- Export and upload embeddings into a Vector Search index using the workflow presented on this page.
- Query the Vector Search index for similar items. You can query using a public endpoint or through VPC peering.
Exporting embeddings from Cloud Spanner to Vertex AI Vector Search is achieved by using the Cloud Workflow provided in this repository. For instructions on how to get started immediately, see Before you begin.
Figure: Export and sync Spanner data into Vector Search workflow.
This tutorial uses billable components of Google Cloud, including:
- Cloud Spanner: Store embeddings and operational data in Spanner.
- Vertex AI: Generate embeddings using models served by Vertex AI.
- Cloud Dataflow: Use a Dataflow template to export the embeddings from Spanner to Cloud Storage.
- Google Cloud Storage (GCS): Store exported embeddings from Spanner in a GCS bucket in the input JSON format expected by Vector Search.
- Cloud Workflow: Orchestrate these two steps for the end-to-end flow:
- Export embeddings from Spanner to GCS as JSON.
- Build the Vector Search index from the JSON files in GCS.
- Cloud Scheduler: Used to trigger the Cloud Workflow.
- Ensure that your account has the required permissions.
- Generate and store embeddings in your Spanner database as ARRAY<FLOAT64>. For more details, see Spanner schema.
To set up a periodic batch export from Spanner to a Vertex AI Vector Search index:
Follow the instructions on the Create an index page. In the folder that is passed to contentsDeltaUri, create an empty file called empty.json. This creates an empty index.
If you already have an index, you can skip this step. The workflow will overwrite your index.
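If you prefer to create the placeholder file from the command line, here is a minimal sketch; the bucket and folder names are placeholders for the contentsDeltaUri location you configured for the index:
# Create an empty file and upload it to the contentsDeltaUri folder (placeholder path).
touch empty.json
gcloud storage cp empty.json gs://<bucket_name>/<contents_delta_folder>/empty.json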
There are multiple ways to clone a Git repository. One way is to run the following command using the GitHub CLI:
gh repo clone cloudspannerecosystem/spanner-ai
cd spanner-ai/vertex-vector-search/workflows
This folder contains two files:
- batch-export.yaml: This is the workflow definition.
- sample-batch-input.json: This is a sample of the workflow input parameters.
First, copy the sample JSON:
cp sample-batch-input.json input.json
Then edit input.json with details for your project. See the Parameters in the input.json file section for more information.
Deploy the workflow YAML file to your Google Cloud project. You can configure the region or location where the workflow will run when it is executed.
gcloud workflows deploy vector-export-workflow \
  --source=batch-export.yaml [--location=<cloud region>] [--service-account=<service_account>]
The workflow is now visible on the Workflows page in the Google Cloud console.
Note: You can also create and deploy the workflow from the Google Cloud console. Follow the prompts in the Cloud console. For the workflow definition, copy and paste the contents of batch-export.yaml.
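To confirm the deployment from the command line as well, you can describe the workflow; the location flag mirrors the one used in the deploy command:
# Verify that the workflow exists and inspect its active revision.
gcloud workflows describe vector-export-workflow [--location=<cloud region>]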
Run the following command to execute the workflow:
gcloud workflows execute \
vector-export-workflow --data="$(cat input.json)" \
[--location=<cloud region>]
The execution shows up in the Executions tab in Workflows, where you can monitor it. For more information, see Monitor Workflows and Dataflow jobs.
Note: You can also execute the workflow from the console using the Execute button. Follow the prompts and, for the input, copy and paste the contents of your customized input.json.
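You can also monitor executions from the command line; the execution ID below is a placeholder taken from the list output:
# List recent executions of the workflow and their states.
gcloud workflows executions list vector-export-workflow [--location=<cloud region>] --limit=5
# Show the state, result, and any error of a specific execution.
gcloud workflows executions describe <execution_id> \
  --workflow=vector-export-workflow [--location=<cloud region>]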
Once the workflow executes successfully, schedule it periodically using Cloud Scheduler. This prevents your index from becoming stale as your embeddings change.
gcloud scheduler jobs create http vector-export-workflow \
  --message-body="{ argument : $(cat input.json) }" \
  --schedule="0 * * * *" --time-zone="America/Los_Angeles" \
  --uri=<invocation_url> [--oauth-service-account-email=<service_account>]
The schedule argument accepts unix-cron format. The time-zone argument must be a name from the tz database. See scheduler help for more information. The invocation_url can be determined from the workflow details page in the console by clicking the Details tab.
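To verify the schedule and trigger a one-off run without waiting for the next cron tick, the following sketch can be used; the job name matches the create command above and the location flag is an assumption:
# Inspect the scheduler job, including its schedule and target URI.
gcloud scheduler jobs describe vector-export-workflow --location=<cloud region>
# Force an immediate run of the job to verify the end-to-end trigger.
gcloud scheduler jobs run vector-export-workflow --location=<cloud region>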
For production environments, we strongly recommend creating a new service account and granting it one or more IAM roles that contain the minimum permissions required to manage the service. You can also choose to use different service accounts for different services, as described below.
The following roles are needed to complete the instructions on this page.
- Cloud Scheduler Service Account:
- By default uses the Compute Engine default service account.
- If you use a manually configured service account, you must include the following roles:
- Cloud Scheduler Service Agent role.
- To trigger the workflow: Workflows Invoker.
- Cloud Workflow Service Account:
- By default uses the Compute Engine default service account.
- If you use a manually configured service account, you must include the following roles:
- To trigger the Dataflow job: Dataflow Admin, Dataflow Worker.
- To impersonate the Dataflow worker service account: Service Account User.
- To write logs: Logs Writer.
- To trigger the Vertex AI Vector Search rebuild: Vertex AI User.
- Dataflow Worker Service Account:
- By default uses the Compute Engine default service account.
- If you use a manually configured service account, you must include the following roles:
- To manage the Dataflow job: Dataflow Admin, Dataflow Worker.
- To read data from Spanner: Cloud Spanner Database Reader.
- To write to the selected GCS bucket: GCS Storage Bucket Owner.
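As a sketch of this setup, the following grants a dedicated, hypothetical service account the project-level roles listed for the Dataflow Worker Service Account; adapt the member and role list in the same way for the Scheduler and Workflow service accounts:
# Hypothetical dedicated service account for the Dataflow workers.
SA=dataflow-worker-sa@<project_id>.iam.gserviceaccount.com
# Grant the Dataflow and Spanner roles listed above.
for ROLE in roles/dataflow.admin roles/dataflow.worker roles/spanner.databaseReader; do
  gcloud projects add-iam-policy-binding <project_id> \
    --member="serviceAccount:${SA}" --role="${ROLE}"
done
# Also grant the GCS bucket role from the list above on the output bucket.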
Input to the workflow is provided using a JSON file. The included sample-batch-input.json contains both required and optional parameters. The value for each field in the sample contains the description of the parameter. Copy sample-batch-input.json and customize it according to the descriptions in the file. Delete the optional parameters that you don’t want to pass.
The required parameters are organized by product. There are 4 sections - dataflow, gcs, spanner and vertex - one for each component that needs to be configured. Outside of these sections, we have location and project_id. These apply to all sections. The location and project_id arguments can be overridden in any section if you need to run in a different location or use resources from different projects.
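The authoritative field names and layout come from sample-batch-input.json; purely as an illustration of the structure described above, a skeleton input.json might look like this (all values are placeholders, and the contents of each section are covered in the sections that follow):
# Illustrative skeleton only; start from sample-batch-input.json for the real field names.
cat > input.json <<'EOF'
{
  "project_id": "<project_id>",
  "location": "<cloud region>",
  "dataflow": { },
  "gcs": { },
  "spanner": { },
  "vertex": { }
}
EOF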
Enter the instance_id, database_id, and table_name. The columns_to_export parameter is used to list the columns to export, as well as which Vector Search field the column should map to if the column name differs from the field name expected by Vector Search. The id and embedding fields are required in the update index request. The restricts and crowding_tag fields are optional.
The format of the columns_to_export parameter is a comma-separated list of fields in the following form:
<spanner_column_name> [: <vertex_field_name>]
For example, if the Spanner table contains the columns item_id, embedding, and crowding, the columns_to_export parameter needs to contain the following columns and aliases:
item_id: id, embedding, crowding: crowding_tag
Since the embedding column name matches the Vector Search field, it does not need to be mapped.
- temp_location: Google Cloud Storage location to store temporary files generated by the Dataflow job, in the form gs://<bucket_name>/<folder_name>/.
- output_folder: Google Cloud Storage location to store the JSON files generated by the Dataflow job, in the form gs://<bucket_name>/<folder_name>/. To conserve storage space in Google Cloud Storage (GCS), it is recommended that you configure a two-week TTL (time-to-live) rule for the parent folder of any subfolders created by workflow runs. This ensures that any exported embeddings in those subfolders are automatically deleted after two weeks. For more information on how to configure the TTL for an object, see Object Lifecycle Management; a sketch of such a rule follows this list.
- vector_search_index_id: The Vertex AI Vector Search index which needs to be updated.
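As a sketch of the two-week TTL recommendation for the output_folder location, the following lifecycle rule deletes objects under the folder after 14 days; the bucket and folder names are placeholders, and the matchesPrefix condition assumes the workflow subfolders share that folder as a prefix:
# Delete exported embedding files under the output folder after 14 days.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 14, "matchesPrefix": ["<folder_name>/"]}
    }
  ]
}
EOF
gcloud storage buckets update gs://<bucket_name> --lifecycle-file=lifecycle.json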
The following parameters are optional to the workflow.
For the spanner section:
- project_id: GCP Project ID which contains the table with vector embeddings. By default, it is derived from the project_id specified at the root level of the JSON in the required parameters.
- data_boost_enabled: Boolean parameter. The default setting is False. Set to True to execute the data export with near-zero impact on existing workloads on the provisioned Spanner instance.
For the dataflow section:
- service_account_email: The Dataflow Worker Service Account email. By default, the Compute Engine default service account of your project is used as the worker service account. The Compute Engine default service account has broad access to the resources of your project, which makes it easy to get started with Dataflow. However, for production workloads, we recommend that you create a new service account with all the roles listed in the Permissions section.
- project_id: GCP Project ID on which the job runs. By default, it is derived from the project_id specified at the root level of the JSON in the required parameters.
- location: Project region from where the job runs. By default, it is derived from the location specified at the root level of the JSON in the required parameters.
- max_workers: The maximum number of workers to run the job. The default is 1000.
- num_workers: The initial number of workers to start the job. Generally not required when auto-scaling is enabled.
- job_name_prefix: Dataflow job name prefix. The default value is spanner-vectors-export. This prefix can be used to filter jobs in the Dataflow console.
For the gcs section:
- output_file_prefix: Exported JSON file name prefix. The default value is vector-embeddings.
For the vertex section:
- project_id: GCP Project on which the Vertex AI Vector Search index is built and deployed. By default, it is derived from the project_id specified at the root level of the JSON in the required parameters.
- location: Project region in which the Vertex AI Vector Search index is built and deployed. By default, it is derived from the location specified at the root level of the JSON in the required parameters.
Vertex AI Vector Search accepts the following arguments when creating or updating the index.
- id (required): A string.
- embedding (required): An array of floats.
- restricts (optional): An array of objects, with each object being a nested structure that provides the namespace and the allowlist/denylist for the datapoint.
- crowding_tag (optional): A string.
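For reference, one way to spot-check the exported files is to look at a single datapoint; the path glob, subfolder layout, and example values below are illustrative assumptions, but the fields match the list above:
# Peek at one exported datapoint (one JSON object per line in the exported files).
gcloud storage cat 'gs://<bucket_name>/<folder_name>/**/vector-embeddings*.json' | head -n 1
# Example output (illustrative values):
# {"id": "item_1", "embedding": [0.1, -0.2, 0.3],
#  "restricts": [{"namespace": "category", "allow": ["toys"]}], "crowding_tag": "brand_42"}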
When defining your Spanner schema, you must have columns that will contain the data for the required arguments. The names of the columns do not need to match the names of the Vector Search arguments. If the column names are different, you can alias them in the Cloud Workflow columns_to_export parameter as described in Spanner Parameters.
The data type for each of the columns in the Spanner schema should be as shown below.
- id: Any data type that can be converted to a string.
- embedding: ARRAY<FLOAT64>.
- restricts: JSON.
- crowding_tag: STRING.
Note: The table may contain columns that are not relevant for the export and sync workflow. Columns not specified in the columns_to_export parameter are ignored.
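Tying this together with the column names used in the earlier columns_to_export example, a hypothetical table could be defined as follows; the instance, database, table name, and the extra description column are placeholders:
# Hypothetical schema: item_id maps to id, embedding matches, crowding maps to crowding_tag.
gcloud spanner databases ddl update <database_id> --instance=<instance_id> \
  --ddl='CREATE TABLE Items (
    item_id INT64 NOT NULL,
    embedding ARRAY<FLOAT64>,
    restricts JSON,
    crowding STRING(MAX),
    description STRING(MAX)
  ) PRIMARY KEY (item_id)'
With this schema, the earlier columns_to_export value exports item_id, embedding, and crowding, while the remaining columns are ignored.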
After you have deployed a Workflow, you can check the status of your workflow execution in the Executions tab of the Workflows page in the Cloud console. On the Execution details page, you can view the results of the execution including any output, the execution ID and state, and the current or final step of the workflow execution. Useful information is also printed to the log at the bottom of the page. If there are errors, they are shown in the log and Output section. For more information on debugging errors, see Debug Workflows.
After the workflow starts the Dataflow job, you can check the status of your job execution from the Jobs dashboard in the Cloud console. Find the relevant job by filtering for the job_name_prefix parameter that you set in input.json. For troubleshooting tips, see Pipeline troubleshooting and debugging.
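The same filter can be applied from the command line; the region placeholder and the default job name prefix below are assumptions based on the parameters section:
# List recent Dataflow jobs whose names start with the configured prefix.
gcloud dataflow jobs list --region=<cloud region> \
  --filter="name:spanner-vectors-export" --limit=10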
After the export completes, the workflow triggers the update of the Vector Search index. This is a long running operation and regular updates are logged in the Workflows Execution page until completion.
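Once that operation finishes, you can confirm the index was refreshed from the command line; the index ID and region are the values from your input.json:
# Check the index after the rebuild; its update time should reflect the new import.
gcloud ai indexes describe <vector_search_index_id> \
  --region=<cloud region> --project=<project_id>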