Is your feature request related to a problem? Please describe.
I'm frustrated that VertexAI grounding is not supported when the input is non-text (e.g., a prompt that includes an image).
Describe the solution you'd like
Is there currently a convenient workaround for this? I'd like to be able to ask, for example:

input("What is this item made of?" + [image]) -> grounded search over the documents for what items like this are usually made of -> output(text)
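For illustration, here is a minimal sketch of the call shape I'm asking for, written against the Vertex AI Python SDK (google-cloud-aiplatform). The project, bucket, image URI, and model name are assumptions on my part; as far as I can tell, this combination currently fails or ignores the tool because grounding only accepts text input:

```python
# A minimal sketch of the requested call shape. Project, bucket, and
# image URI are hypothetical; today this is expected to fail or skip
# grounding, since grounding is restricted to text-only input.
import vertexai
from vertexai.generative_models import GenerativeModel, Part, Tool, grounding

vertexai.init(project="my-project", location="us-central1")  # hypothetical project

# Grounding tool backed by Google Search; a Vertex AI Search datastore
# would be the analogue for grounding "in the document".
search_tool = Tool.from_google_search_retrieval(grounding.GoogleSearchRetrieval())

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [
        Part.from_uri("gs://my-bucket/item.jpg", mime_type="image/jpeg"),  # hypothetical image
        "What is this item made of?",
    ],
    tools=[search_tool],
)
print(response.text)
```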
Describe alternatives you've considered
There is obviously a way of making two separate calls (sketched below):
- Query the model about what the item in the image is
- Pass that text output to a grounding-enabled model that accepts text-only input
However, that increases costs massively.
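Under the same assumptions as above (the model name, image URI, and prompts are illustrative), the workaround looks like this; both calls are billed, which is where the extra cost comes from:

```python
# A sketch of the two-call workaround; model name, prompts, and the
# image URI are illustrative, not an endorsed implementation.
import vertexai
from vertexai.generative_models import GenerativeModel, Part, Tool, grounding

vertexai.init(project="my-project", location="us-central1")  # hypothetical project
model = GenerativeModel("gemini-1.5-pro")

# Call 1: multimodal, no grounding -- identify the item in the image.
identified = model.generate_content([
    Part.from_uri("gs://my-bucket/item.jpg", mime_type="image/jpeg"),  # hypothetical
    "In one sentence, what is the item in this image?",
])

# Call 2: text-only, with grounding -- ask what such an item is made of.
search_tool = Tool.from_google_search_retrieval(grounding.GoogleSearchRetrieval())
answer = model.generate_content(
    f"What is an item like this usually made of? Item: {identified.text}",
    tools=[search_tool],
)
print(answer.text)
```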
Additional context
No response