Is your feature request related to a problem? Please describe.
I'm frustrated that VertexAI grounding is not supported when the input is non-text (e.g., a prompt that includes an image).
Describe the solution you'd like
Is there currently a convenient workaround for this? I'd like to be able to ask, for example:

input("What is this item made of?" + [image]) -> grounded search over the documents for what items like this are usually made of -> output(text)
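For illustration, here is a minimal sketch of the call shape I'm asking for, written against the Vertex AI Python SDK (google-cloud-aiplatform). The project, bucket, image URI, and model name are assumptions on my part; as far as I can tell, this combination currently fails or ignores the tool because grounding only accepts text input:

```python
# A minimal sketch of the requested call shape. Project, bucket, and
# image URI are hypothetical; today this is expected to fail or skip
# grounding, since grounding is restricted to text-only input.
import vertexai
from vertexai.generative_models import GenerativeModel, Part, Tool, grounding

vertexai.init(project="my-project", location="us-central1")  # hypothetical project

# Grounding tool backed by Google Search; a Vertex AI Search datastore
# would be the analogue for grounding "in the document".
search_tool = Tool.from_google_search_retrieval(grounding.GoogleSearchRetrieval())

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [
        Part.from_uri("gs://my-bucket/item.jpg", mime_type="image/jpeg"),  # hypothetical image
        "What is this item made of?",
    ],
    tools=[search_tool],
)
print(response.text)
```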
Describe alternatives you've considered
There is obviously a way of making two separate calls (sketched below):
- Query the model about what the item in the image is
- Pass that text output to a grounding-enabled model that accepts text-only input
However, that increases costs massively.
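Under the same assumptions as above (the model name, image URI, and prompts are illustrative), the workaround looks like this; both calls are billed, which is where the extra cost comes from:

```python
# A sketch of the two-call workaround; model name, prompts, and the
# image URI are illustrative, not an endorsed implementation.
import vertexai
from vertexai.generative_models import GenerativeModel, Part, Tool, grounding

vertexai.init(project="my-project", location="us-central1")  # hypothetical project
model = GenerativeModel("gemini-1.5-pro")

# Call 1: multimodal, no grounding -- identify the item in the image.
identified = model.generate_content([
    Part.from_uri("gs://my-bucket/item.jpg", mime_type="image/jpeg"),  # hypothetical
    "In one sentence, what is the item in this image?",
])

# Call 2: text-only, with grounding -- ask what such an item is made of.
search_tool = Tool.from_google_search_retrieval(grounding.GoogleSearchRetrieval())
answer = model.generate_content(
    f"What is an item like this usually made of? Item: {identified.text}",
    tools=[search_tool],
)
print(answer.text)
```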
Additional context
No response