Get image descriptions using visual captioning

Visual captioning lets you generate a relevant description for an image. You can use this information in several ways:

  • Get more detailed metadata about images for storing and searching.
  • Generate automated captioning to support accessibility use cases.
  • Receive quick descriptions of products and visual assets.

Sample captioned image

Image source: Santhosh Kumar on Unsplash (cropped)

Caption (short-form): a blue shirt with white polka dots is hanging on a hook

Supported languages

Visual captioning is available in the following languages:

  • English (en)
  • French (fr)
  • German (de)
  • Italian (it)
  • Spanish (es)

Performance and limitations

The following limits apply when you use this model:

Limit | Value
Maximum number of API requests (short-form) per minute per project | 500
Maximum number of tokens returned in response (short-form) | 64 tokens
Maximum number of tokens accepted in request (VQA short-form only) | 80 tokens
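
A client that sends many requests can stay under the per-minute request limit with a client-side throttle. The following is a minimal sketch, not part of any Vertex AI library: the MinuteRateLimiter class and its method names are hypothetical, and the default of 500 calls per 60 seconds is taken from the limits listed above.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Sliding-window limiter: at most `max_calls` calls per `period` seconds.

    Hypothetical helper for illustration; it only tracks calls made by
    this process, not calls made elsewhere in the same project.
    """

    def __init__(self, max_calls=500, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self._calls = deque()  # timestamps of calls inside the current window

    def wait_time(self):
        """Return the seconds to wait before the next call (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record(self):
        """Record that a request was just sent."""
        self._calls.append(self.clock())
```

Before each request, sleep for the number of seconds that wait_time() returns, send the request, and then call record().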

The following service latency estimates apply when you use this model. These values are meant to be illustrative and are not a promise of service:

Latency | Value
API requests (short-form) | 1.5 seconds

Locations

A location is a region you can specify in a request to control where data is stored at rest. For a list of available regions, see Generative AI on Vertex AI locations.

Responsible AI safety filtering

The image captioning and Visual Question Answering (VQA) model doesn't support user-configurable safety filters. However, Imagen's overall safety filtering is applied to the following data:

  • User input
  • Model output

As a result, your output may differ from the sample output if Imagen applies these safety filters. Consider the following examples.

Filtered input

If the input is filtered, the response is similar to the following:

{
  "error": {
    "code": 400,
    "message": "Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] generic::invalid_argument: Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394 [google.rpc.error_details_ext] { message: \"Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394\" }"
      }
    ]
  }
}

Filtered output

If fewer responses are returned than the sample count you specify, Responsible AI filtering removed the missing responses. For example, the following is a response to a request with "sampleCount": 2, where one of the responses is filtered out:

{
  "predictions": [
    "cappuccino"
  ]
}

If all the output is filtered, the response is an empty object similar to the following:

{}
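
The three response shapes described in this section (an error object, fewer predictions than requested, and an empty object) can be told apart in client code. The following is a minimal sketch that works on an already-parsed JSON body; the function name and the status labels are hypothetical, and an "error" object can also come from causes other than safety filtering.

```python
def classify_caption_response(response, requested_count):
    """Return (status, captions) for a parsed JSON response body.

    status is one of "error", "all_filtered", "partially_filtered",
    or "complete". These labels are illustrative, not API values.
    """
    if "error" in response:
        # Filtered input surfaces as an error object (code 400 in the sample).
        return "error", []
    captions = list(response.get("predictions", []))
    if not captions:
        # An empty object means every candidate caption was filtered.
        return "all_filtered", []
    if len(captions) < requested_count:
        # Some, but not all, candidate captions were filtered.
        return "partially_filtered", captions
    return "complete", captions
```

For example, calling the function with the single-caption response shown above and a requested count of 2 yields the "partially_filtered" status along with the one surviving caption.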

Get short-form image captions

Use the following samples to generate short-form captions for an image.

REST

For more information about imagetext model requests, see the imagetext model API reference.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Google Cloud project ID.
  • LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
  • B64_IMAGE: The image to get captions for. The image must be specified as a base64-encoded byte string. Size limit: 10 MB.
  • RESPONSE_COUNT: The number of image captions you want to generate. Accepted integer values: 1-3.
  • LANGUAGE_CODE: One of the supported language codes. Languages supported:
    • English (en)
    • French (fr)
    • German (de)
    • Italian (it)
    • Spanish (es)

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict

Request JSON body:

{
  "instances": [
    {
      "image": {
          "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT,
    "language": "LANGUAGE_CODE"
  }
}
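
If you prefer to generate request.json from a local image instead of pasting a base64 string by hand, the body above can be assembled in a few lines of Python. This is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes that the 10 MB limit applies to the raw image bytes and that 10 MB means 10 × 1024 × 1024 bytes; neither detail is stated on this page.

```python
import base64
import json

SUPPORTED_LANGUAGES = {"en", "fr", "de", "it", "es"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB


def build_caption_request(image_bytes, response_count=1, language="en"):
    """Build the JSON request body as a Python dict."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MB size limit")
    if not 1 <= response_count <= 3:
        raise ValueError("response_count must be an integer from 1 to 3")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # The API expects the image as a base64-encoded string.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "instances": [{"image": {"bytesBase64Encoded": encoded}}],
        "parameters": {"sampleCount": response_count, "language": language},
    }


def save_request(body, path="request.json"):
    """Write the request body to a file for use with curl or PowerShell."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(body, f)
```

To use it, read the image with open("input-image.png", "rb"), pass the bytes to build_caption_request, and pass the result to save_request. Validating the count and language locally catches mistakes before a request is sent.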

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content

The following sample responses are for a request with "sampleCount": 2. The response returns two prediction strings.

English (en):

{
  "predictions": [
    "a yellow mug with a sheep on it sits next to a slice of cake",
    "a cup of coffee with a heart shaped latte art next to a slice of cake"
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID",
  "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID",
  "modelDisplayName": "MODEL_DISPLAYNAME",
  "modelVersionId": "1"
}

Spanish (es):

{
  "predictions": [
    "una taza de café junto a un plato de pastel de chocolate",
    "una taza de café con una forma de corazón en la espuma"
  ]
}

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

In this sample, you use the load_from_file method to load a local file as the base Image to caption. After you specify the base image, you call the get_captions method on the ImageTextModel and print the output.

import vertexai
from vertexai.preview.vision_models import Image, ImageTextModel

# TODO(developer): Update and uncomment the lines below
# PROJECT_ID = "your-project-id"
# input_file = "input-image.png"

vertexai.init(project=PROJECT_ID, location="us-central1")

model = ImageTextModel.from_pretrained("imagetext@001")
source_img = Image.load_from_file(location=input_file)

captions = model.get_captions(
    image=source_img,
    # Optional parameters
    language="en",
    number_of_results=2,
)

print(captions)
# Example response:
# ['a cat with green eyes looks up at the sky']

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

In this sample, you call the predict method on a PredictionServiceClient. The service returns captions for the provided image.

/**
 * TODO(developer): Update these variables before running the sample.
 */
const projectId = process.env.CAIP_PROJECT_ID;
const location = 'us-central1';
const inputFile = 'resources/cat.png';

const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction Service Client library
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: `${location}-aiplatform.googleapis.com`,
};

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function getShortFormImageCaptions() {
  const fs = require('fs');
  // Configure the parent resource
  const endpoint = `projects/${projectId}/locations/${location}/publishers/google/models/imagetext@001`;

  const imageFile = fs.readFileSync(inputFile);
  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const instance = {
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(instance);
  const instances = [instanceValue];

  const parameter = {
    // Optional parameters
    language: 'en',
    sampleCount: 2,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  const predictions = response.predictions;
  if (predictions.length === 0) {
    console.log(
      'No captions were generated. Check the request parameters and image.'
    );
  } else {
    predictions.forEach(prediction => {
      console.log(prediction.stringValue);
    });
  }
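}

// Top-level await isn't available in CommonJS modules (this file uses
// require), so handle the returned promise explicitly.
getShortFormImageCaptions().catch(err => {
  console.error(err);
  process.exitCode = 1;
});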

Use parameters for image captioning

When you get image captions, you can set several parameters depending on your use case.

Number of results

Use the number of results parameter to limit the number of captions returned for each request you send. For more information, see the imagetext (image captioning) model API reference.

Seed number

The seed number is a number you add to a request to make generated descriptions deterministic. Adding a seed number to your request ensures that you get the same predictions (descriptions) each time. However, the image captions aren't necessarily returned in the same order. For more information, see the imagetext (image captioning) model API reference.
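
Because captions for a fixed seed can come back in a different order, a reproducibility check should compare two responses as unordered collections. The following is a minimal sketch. Both helper names are hypothetical, and it assumes the request parameter is named "seed"; this page doesn't name the field, so confirm it in the API reference.

```python
from collections import Counter


def with_seed(parameters, seed):
    """Return a copy of the request parameters with a seed added.

    Assumption: the request field is called "seed". Confirm the name in
    the imagetext model API reference before relying on it.
    """
    updated = dict(parameters)
    updated["seed"] = seed
    return updated


def same_captions(first, second):
    """Compare two caption lists while ignoring the order of the captions."""
    return Counter(first) == Counter(second)
```

Using Counter rather than set keeps duplicates significant, so two responses only match when each caption appears the same number of times in both.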

What's next

Read articles about Imagen and other Generative AI on Vertex AI products: