Stars
Aggregation framework for annotating datasets in computer vision tasks (detection, segmentation, video captioning, etc.)
MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for evaluating foundation models in the Russian language.
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
MINT-1T: A one trillion token multimodal interleaved dataset.
Framework-agnostic sliced/tiled inference + interactive UI + error analysis plots
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
YOLOv10: Real-Time End-to-End Object Detection [NeurIPS 2024]
GPT4V-level open-source multi-modal model based on Llama3-8B
Fast, modern C++ DSP framework, FFT, Sample Rate Conversion, FIR/IIR/Biquad Filters (SSE, AVX, AVX-512, ARM NEON)
Evaluation of the Optical Character Recognition (OCR) capabilities of GPT-4V(ision)
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Implementation of Nougat Neural Optical Understanding for Academic Documents
Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
[ICLR 2024] Official PyTorch implementation of FasterViT: Fast Vision Transformers with Hierarchical Attention
MiVOLO age & gender transformer neural network
[CVPR 2024] Real-Time Open-Vocabulary Object Detection
Mixture-of-Experts for Large Vision-Language Models
[ICML 2024] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
A high-throughput and memory-efficient inference and serving engine for LLMs
OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]
We write your reusable computer vision tools. 💜
Paper list for sign language research, including sign language recognition (SLR), sign language translation (SLT), and other interesting work. Quick-start your awesome work with us!! 🤟🤟🤟
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.