[ICML 2024] CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers.
framework transformer image-captioning visual-reasoning multimodal-learning visual-question-answering model-acceleration efficient-deep-learning vision-language-transformer image-text-retrieval text-image-retrieval token-ensemble token-matching
Updated Oct 4, 2023