Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale. Vertex AI Vector Search allows users to search for semantically similar items using vector embeddings.
You can integrate your Cloud Spanner database with Vector Search to perform vector similarity search on your Spanner data. The general workflow is as follows:
- Generate and store vector embeddings in Spanner. You can manage them similarly to your operational data.
- Export and upload embeddings into a Vector Search index using the workflow presented on this page.
- Query the Vector Search index for similar items. You can query using a public endpoint or through VPC peering.
Exporting embeddings from Cloud Spanner to Vertex AI Vector Search is achieved by using the Cloud Workflow provided in this repository. For instructions on how to get started immediately, see Before you begin.
Figure: Export and sync Spanner data into Vector Search workflow.
This tutorial uses billable components of Google Cloud, including:
- Cloud Spanner: Store embeddings and operational data in Spanner.
- Vertex AI: Generate embeddings using models served by Vertex AI.
- Cloud Dataflow: Use a Dataflow template to export the embeddings from Spanner to Cloud Storage.
- Google Cloud Storage (GCS): Store exported embeddings from Spanner in a GCS bucket in the input JSON format expected by Vector Search.
- Cloud Workflow: Orchestrate these two steps for the end-to-end flow:
- Export embeddings from Spanner to GCS as JSON.
- Build the Vector Search index from the JSON files in GCS.
- Cloud Scheduler: Used to trigger the Cloud Workflow.
- Ensure that your account has the required permissions.
- Generate and store embeddings in your Spanner database as ARRAY<FLOAT64>. For more details, see Spanner schema.
To set up a periodic batch export from Spanner to a Vertex AI Vector Search index:
Follow the instructions on the Create an index page. In the folder that is passed to contentsDeltaUri, create an empty file called empty.json. This creates an empty index.
If you already have an index, you can skip this step. The workflow will overwrite your index.
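If you prefer to create the placeholder file from the command line, here is a minimal sketch; the bucket and folder names are placeholders for the contentsDeltaUri location you configured for the index:
# Create an empty file and upload it to the contentsDeltaUri folder (placeholder path).
touch empty.json
gcloud storage cp empty.json gs://<bucket_name>/<contents_delta_folder>/empty.json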
There are multiple ways to clone a Git repository. One way is to run the following command using the GitHub CLI:
gh repo clone cloudspannerecosystem/spanner-ai
cd spanner-ai/vertex-vector-search/workflows
This folder contains two files:
- batch-export.yaml: This is the workflow definition.
- sample-batch-input.json: This is a sample of the workflow input parameters.
First, copy the sample JSON:
cp sample-batch-input.json input.json
Then edit input.json with details for your project. See the Parameters in the input.json file section for more information.
Deploy the workflow YAML file to your Google Cloud project. You can configure the region or location where the workflow will run when it is executed.
gcloud workflows deploy vector-export-workflow \
  --source=batch-export.yaml [--location=<cloud region>] [--service-account=<service_account>]
The workflow is now visible on the Workflows page in the Google Cloud console.
Note: You can also create and deploy the workflow from the Google Cloud console. Follow the prompts in the Cloud console. For the workflow definition, copy and paste the contents of batch-export.yaml.
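To confirm the deployment from the command line as well, you can describe the workflow; the location flag mirrors the one used in the deploy command:
# Verify that the workflow exists and inspect its active revision.
gcloud workflows describe vector-export-workflow [--location=<cloud region>]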
Run the following command to execute the workflow:
gcloud workflows execute \
vector-export-workflow --data="$(cat input.json)" \
[--location=<cloud region>]
The execution shows up in the Executions tab in Workflows, where you can monitor it. For more information, see Monitor Workflows and Dataflow jobs.
Note: You can also execute the workflow from the console using the Execute button. Follow the prompts and, for the input, copy and paste the contents of your customized input.json.
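You can also monitor executions from the command line; the execution ID below is a placeholder taken from the list output:
# List recent executions of the workflow and their states.
gcloud workflows executions list vector-export-workflow [--location=<cloud region>] --limit=5
# Show the state, result, and any error of a specific execution.
gcloud workflows executions describe <execution_id> \
  --workflow=vector-export-workflow [--location=<cloud region>]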
Once the workflow executes successfully, schedule it periodically using Cloud Scheduler. This prevents your index from becoming stale as your embeddings change.
gcloud scheduler jobs create http vector-export-workflow \
  --message-body="{ argument : $(cat input.json) }" \
  --schedule="0 * * * *" --time-zone="America/Los_Angeles" \
  --uri=<invocation_url> [--oauth-service-account-email=<service_account>]
The schedule argument accepts unix-cron format. The time-zone argument must be a name from the tz database. See scheduler help for more information. The invocation_url can be determined from the workflow details page in the console by clicking the Details tab.
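To verify the schedule and trigger a one-off run without waiting for the next cron tick, the following sketch can be used; the job name matches the create command above and the location flag is an assumption:
# Inspect the scheduler job, including its schedule and target URI.
gcloud scheduler jobs describe vector-export-workflow --location=<cloud region>
# Force an immediate run of the job to verify the end-to-end trigger.
gcloud scheduler jobs run vector-export-workflow --location=<cloud region>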
For production environments, we strongly recommend creating a new service account and granting it one or more IAM roles that contain the minimum permissions required to manage the service. You can also choose to use different service accounts for different services, as described below.
The following roles are needed to complete the instructions on this page.
- Cloud Scheduler Service Account:
- By default uses the Compute Engine default service account.
- If you use a manually configured service account, you must include the following roles:
- Cloud Scheduler Service Agent role.
- To trigger the workflow: Workflows Invoker.
- Cloud Workflow Service Account:
- By default uses the Compute Engine default service account.
- If you use a manually configured service account, you must include the following roles:
- To trigger the Dataflow job: Dataflow Admin, Dataflow Worker.
- To impersonate the Dataflow worker service account: Service Account User.
- To write logs: Logs Writer.
- To trigger the Vertex AI Vector Search rebuild: Vertex AI User.
- Dataflow Worker Service Account:
- By default uses the Compute Engine default service account.
- If you use a manually configured service account, you must include the following roles:
- To manage the Dataflow job: Dataflow Admin, Dataflow Worker.
- To read data from Spanner: Cloud Spanner Database Reader.
- To write to the selected GCS bucket: GCS Storage Bucket Owner.
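As a sketch of this setup, the following grants a dedicated, hypothetical service account the project-level roles listed for the Dataflow Worker Service Account; adapt the member and role list in the same way for the Scheduler and Workflow service accounts:
# Hypothetical dedicated service account for the Dataflow workers.
SA=dataflow-worker-sa@<project_id>.iam.gserviceaccount.com
# Grant the Dataflow and Spanner roles listed above.
for ROLE in roles/dataflow.admin roles/dataflow.worker roles/spanner.databaseReader; do
  gcloud projects add-iam-policy-binding <project_id> \
    --member="serviceAccount:${SA}" --role="${ROLE}"
done
# Also grant the GCS bucket role from the list above on the output bucket.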
Input to the workflow is provided using a JSON file. The included sample-batch-input.json contains both required and optional parameters. The value for each field in the sample contains the description of the parameter. Copy sample-batch-input.json and customize it according to the descriptions in the file. Delete the optional parameters that you don’t want to pass.
The required parameters are organized by product. There are 4 sections - dataflow, gcs, spanner and vertex - one for each component that needs to be configured. Outside of these sections, we have location and project_id. These apply to all sections. The location and project_id arguments can be overridden in any section if you need to run in a different location or use resources from different projects.
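The authoritative field names and layout come from sample-batch-input.json; purely as an illustration of the structure described above, a skeleton input.json might look like this (all values are placeholders, and the contents of each section are covered in the sections that follow):
# Illustrative skeleton only; start from sample-batch-input.json for the real field names.
cat > input.json <<'EOF'
{
  "project_id": "<project_id>",
  "location": "<cloud region>",
  "dataflow": { },
  "gcs": { },
  "spanner": { },
  "vertex": { }
}
EOF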
Enter the instance_id, database_id, and table_name. The columns_to_export parameter is used to list the columns to export, as well as which Vector Search field the column should map to if the column name differs from the field name expected by Vector Search. The id and embedding fields are required in the update index request. The restricts and crowding_tag fields are optional.
The format of the columns_to_export parameter is a comma-separated list of fields in the following form:
<spanner_column_name> [: <vertex_field_name>]
For example, if the Spanner table contains the columns item_id, embedding, and crowding, the columns_to_export parameter needs to contain the following columns and aliases:
item_id: id, embedding, crowding: crowding_tag
Since the embedding column name matches the Vector Search field, it does not need to be mapped.
- temp_location: Google Cloud Storage location to store temporary files generated by the Dataflow job, in the form gs://<bucket_name>/<folder_name>/.
- output_folder: Google Cloud Storage location to store the JSON files generated by the Dataflow job, in the form gs://<bucket_name>/<folder_name>/. To conserve storage space in Google Cloud Storage (GCS), it is recommended that you configure a two-week TTL (time-to-live) rule for the parent folder of any subfolders created by workflow runs. This ensures that any exported embeddings in those subfolders are automatically deleted after two weeks. For more information on how to configure the TTL for an object, see Object Lifecycle Management; a sketch of such a rule follows this list.
- vector_search_index_id: The Vertex AI Vector Search index which needs to be updated.
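As a sketch of the two-week TTL recommendation for the output_folder location, the following lifecycle rule deletes objects under the folder after 14 days; the bucket and folder names are placeholders, and the matchesPrefix condition assumes the workflow subfolders share that folder as a prefix:
# Delete exported embedding files under the output folder after 14 days.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 14, "matchesPrefix": ["<folder_name>/"]}
    }
  ]
}
EOF
gcloud storage buckets update gs://<bucket_name> --lifecycle-file=lifecycle.json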
The following parameters are optional to the workflow.
For the spanner section:
- project_id: GCP Project ID which contains the table with vector embeddings. By default, it is derived from the project_id specified at the root level of the JSON in the required parameters.
- data_boost_enabled: Boolean parameter. The default setting is False. Set to True to execute the data export with near-zero impact on existing workloads on the provisioned Spanner instance.
For the dataflow section:
- service_account_email: The Dataflow Worker Service Account email. By default, the Compute Engine default service account of your project is used as the worker service account. The Compute Engine default service account has broad access to the resources of your project, which makes it easy to get started with Dataflow. However, for production workloads, we recommend that you create a new service account with all the roles listed in the Permissions section.
- project_id: GCP Project ID on which the job runs. By default, it is derived from the project_id specified at the root level of the JSON in the required parameters.
- location: Project region from where the job runs. By default, it is derived from the location specified at the root level of the JSON in the required parameters.
- max_workers: The maximum number of workers to run the job. The default is 1000.
- num_workers: The initial number of workers to start the job. Generally not required when auto-scaling is enabled.
- job_name_prefix: Dataflow job name prefix. The default value is spanner-vectors-export. This prefix can be used to filter jobs in the Dataflow console.
For the gcs section:
- output_file_prefix: Exported JSON file name prefix. The default value is vector-embeddings.
For the vertex section:
- project_id: GCP Project on which the Vertex AI Vector Search index is built and deployed. By default, it is derived from the project_id specified at the root level of the JSON in the required parameters.
- location: Project region in which the Vertex AI Vector Search index is built and deployed. By default, it is derived from the location specified at the root level of the JSON in the required parameters.
Vertex AI Vector Search accepts the following arguments when creating or updating the index.
- id (required): A string.
- embedding (required): An array of floats.
- restricts (optional): An array of objects, with each object being a nested structure that provides the namespace and the allowlist/denylist for the datapoint.
- crowding_tag (optional): A string.
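For reference, one way to spot-check the exported files is to look at a single datapoint; the path glob, subfolder layout, and example values below are illustrative assumptions, but the fields match the list above:
# Peek at one exported datapoint (one JSON object per line in the exported files).
gcloud storage cat 'gs://<bucket_name>/<folder_name>/**/vector-embeddings*.json' | head -n 1
# Example output (illustrative values):
# {"id": "item_1", "embedding": [0.1, -0.2, 0.3],
#  "restricts": [{"namespace": "category", "allow": ["toys"]}], "crowding_tag": "brand_42"}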
When defining your Spanner schema, you must have columns that will contain the data for the required arguments. The names of the columns do not need to match the names of the Vector Search arguments. If the column names are different, you can alias them in the Cloud Workflow columns_to_export parameter as described in Spanner Parameters.
The data type for each of the columns in the Spanner schema should be as shown below.
- id: Any data type that can be converted to a string.
- embedding: ARRAY<FLOAT64>.
- restricts: JSON.
- crowding_tag: STRING.
Note: The table may contain columns that are not relevant for the export and sync workflow. Columns not specified in the columns_to_export parameter are ignored.
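Tying this together with the column names used in the earlier columns_to_export example, a hypothetical table could be defined as follows; the instance, database, table name, and the extra description column are placeholders:
# Hypothetical schema: item_id maps to id, embedding matches, crowding maps to crowding_tag.
gcloud spanner databases ddl update <database_id> --instance=<instance_id> \
  --ddl='CREATE TABLE Items (
    item_id INT64 NOT NULL,
    embedding ARRAY<FLOAT64>,
    restricts JSON,
    crowding STRING(MAX),
    description STRING(MAX)
  ) PRIMARY KEY (item_id)'
With this schema, the earlier columns_to_export value exports item_id, embedding, and crowding, while the remaining columns are ignored.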
After you have deployed a Workflow, you can check the status of your workflow execution in the Executions tab of the Workflows page in the Cloud console. On the Execution details page, you can view the results of the execution including any output, the execution ID and state, and the current or final step of the workflow execution. Useful information is also printed to the log at the bottom of the page. If there are errors, they are shown in the log and Output section. For more information on debugging errors, see Debug Workflows.
After the workflow starts the Dataflow job, you can check the status of your job execution from the Jobs dashboard in the Cloud console. Find the relevant job by filtering for the job_name_prefix parameter that you set in input.json. For troubleshooting tips, see Pipeline troubleshooting and debugging.
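The same filter can be applied from the command line; the region placeholder and the default job name prefix below are assumptions based on the parameters section:
# List recent Dataflow jobs whose names start with the configured prefix.
gcloud dataflow jobs list --region=<cloud region> \
  --filter="name:spanner-vectors-export" --limit=10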
After the export completes, the workflow triggers the update of the Vector Search index. This is a long running operation and regular updates are logged in the Workflows Execution page until completion.
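Once that operation finishes, you can confirm the index was refreshed from the command line; the index ID and region are the values from your input.json:
# Check the index after the rebuild; its update time should reflect the new import.
gcloud ai indexes describe <vector_search_index_id> \
  --region=<cloud region> --project=<project_id>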