vision-language-model

BAAI-Agents / Cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents in acing any computer task by enabling strong reasoning abilities, self-improvement, and skill curation in a standardized general environment with minimal requirements.

ai gcc multimodality vlm cradle computer-control lmm grounding ai-agent large-language-models llm generative-ai vision-language-model ai-agents-framework general-computer-control personoid foundation-agent

Updated Nov 7, 2024
Python

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

ocr computer-vision artificial-intelligence text-recognition document text-detection document-analysis end-to-end-ocr multimodal scene-text-recognition multimodal-deep-learning scene-text-detection vision-language document-understanding scene-text-detection-recognition document-recognition document-intelligence documentai vision-language-transformer vision-language-model

Updated Nov 22, 2024
C++

NVlabs / prismer

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

vqa image-captioning language-model multi-task-learning vision-and-language multi-modal-learning vision-language-model

Updated Jan 17, 2024
Python

illuin-tech / colpali

The code used to train and run inference with the ColPali architecture.

information-retrieval vision-language-model retrieval-augmented-generation colpali

Updated Nov 24, 2024
Python

llm-jp / awesome-japanese-llm

Overview of Japanese LLMs (日本語LLMまとめ)

japanese generative-model japanese-language language-models language-model generative-models multimodal vision-and-language vision-language foundation-models large-language-models llm llms generative-ai large-language-model vision-language-model japanese-llm japanese-language-model llm-japanese

Updated Nov 16, 2024
TypeScript

PKU-YuanGroup / Chat-UniVi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

video-understanding image-understanding large-language-models vision-language-model

Updated Oct 16, 2024
Python

mbzuai-oryx / groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

vision-and-language lmm foundation-models vision-language-model llm-agent

Updated Nov 23, 2024
Python

SunzeY / AlphaCLIP

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

machine-learning deep-learning vision-and-language vision-language vision-transformer vision-language-model

Updated Jul 30, 2024
Jupyter Notebook

huangwl18 / VoxPoser

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

robotics motion-planning robotic-manipulation embodied-ai foundation-models large-language-models vision-language-model

Updated May 8, 2024
Python

FoundationVision / Groma

[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization

llama multimodal grounding foundation-models large-language-models llm mllm vision-language-model llama2

Updated Jun 7, 2024
Python

zubair-irshad / Awesome-Robotics-3D

A curated list of 3D vision papers relating to the robotics domain in the era of large models (i.e., LLMs/VLMs), inspired by awesome-computer-vision, including papers, code, and related websites.

computer-vision robotics navigation benchmarks simulations manipulation scene-graph grasping nerf 3d pointclouds vlm diffusion-models pretraining policy-learning foundation-models llm vision-language-model gaussian-splatting

Updated Nov 4, 2024

AIDC-AI / Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

chatbot multimodality multimodal vision-language-model multimodal-large-language-models vision-language-learning qwen llama3

Updated Nov 4, 2024
Python

AlaaLab / InstructCV

[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"

generative-model text-to-image multi-task-learning diffusion-models stable-diffusion vision-language-model

Updated Apr 27, 2024
Python

Here are 216 public repositories matching this topic...

haotian-liu / LLaVA

OpenGVLab / InternVL

QwenLM / Qwen-VL

dvlab-research / MGM

InternLM / InternLM-XComposer

jingyi0000 / VLM_survey

deepseek-ai / DeepSeek-VL

BAAI-Agents / Cradle

AlibabaResearch / AdvancedLiterateMachinery

NVlabs / prismer

illuin-tech / colpali

llm-jp / awesome-japanese-llm

PKU-YuanGroup / Chat-UniVi

mbzuai-oryx / groundingLMM

SunzeY / AlphaCLIP

huangwl18 / VoxPoser

FoundationVision / Groma

zubair-irshad / Awesome-Robotics-3D

AIDC-AI / Ovis

AlaaLab / InstructCV
