amusi/CVPR2025-Papers-with-Code

CVPR 2025 Papers and Open-Source Projects Collection (Papers with Code)

CVPR 2025 decisions are now available on OpenReview! Acceptance rate: 22.1% = 2878 / 13008

Note 1: Everyone is welcome to submit an issue to share CVPR 2025 papers and open-source projects!

Note 2: For papers from past top CV conferences, other high-quality CV papers, and comprehensive roundups, see: https://github.com/amusi/daily-paper-computer-vision

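Scan the QR code to join the [CVer Academic Exchange Group] and get the latest work from CVPR 2025 and beyond! It is the largest computer vision AI community on Knowledge Planet (知识星球). It is updated daily and shares the newest learning materials on computer vision, AIGC, diffusion models, multimodality, deep learning, autonomous driving, medical imaging, remote sensing, and related areas as soon as they appear. Join now and start learning!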

[CVPR 2025 Open-Source Papers: Table of Contents]

3DGS(Gaussian Splatting)

Agent

SpiritSight Agent: Advanced GUI Agent with One Look

Avatars

Backbone

Building Vision Models upon Heat Conduction

LSNet: See Large, Focus Small

CLIP

Mamba

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

MambaIC: State Space Models for High-Performance Learned Image Compression

Embodied AI

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

GAN

OCR

NeRF

DETR

Mr. DETR: Instructive Multi-Route Training for Detection Transformers

Prompt

Multimodal Large Language Models (MLLM)

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

Retrieval-Augmented Personalization for Multimodal Large Language Models

BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

MMRL: Multi-Modal Representation Learning for Vision-Language Models

PAVE: Patching and Adapting Video Large Language Models

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

Large Language Models (LLM)

NAS

ReID (Re-Identification)

From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization

AirRoom: Objects Matter in Room Reidentification

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

Diffusion Models

TinyFusion: Diffusion Transformers Learned Shallow

DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture

Vision Transformer

Vision-Language

NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

MMRL: Multi-Modal Representation Learning for Vision-Language Models

Object Detection

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Mr. DETR: Instructive Multi-Route Training for Detection Transformers

Anomaly Detection

Object Tracking

Multiple Object Tracking as ID Prediction

Omnidirectional Multi-Object Tracking

Medical Image

BrainMVP: Multi-modal Vision Pre-training for Medical Image Analysis

Medical Image Segmentation

Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

Autonomous Driving

LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes

3D Point Cloud

Unlocking Generalization Power in LiDAR Point Cloud Registration

3D Object Detection

3D Semantic Segmentation

Low-level Vision

Super-Resolution

AESOP: Auto-Encoded Supervision for Perceptual Image Super-Resolution

Denoising

Image Denoising

3D Human Pose Estimation

Reconstructing Humans with a Biomechanically Accurate Skeleton

3D Visual Grounding

ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding

Image Generation

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

PAR: Parallelized Autoregressive Visual Generation

Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Video Generation

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

X-Dyna: Expressive Dynamic Human Image Animation

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Image Editing

Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

h-Edit: Effective and Flexible Diffusion-Based Editing via Doob’s h-Transform

Video Editing

3D Generation

Generative Gaussian Splatting for Unbounded 3D City Generation

StdGEN: Semantic-Decomposed 3D Character Generation from Single Images

3D Reconstruction

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

Human Motion Generation

SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance

Video Understanding

Temporal Grounding Videos like Flipping Manga

Embodied AI

Universal Actions for Enhanced Embodied Foundation Models

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

Knowledge Distillation

Depth Estimation

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

MonSter: Marry Monodepth to Stereo Unleashes Power

Stereo Matching

MonSter: Marry Monodepth to Stereo Unleashes Power

Low-light Image Enhancement

HVI: A New Color Space for Low-light Image Enhancement

ReDDiT: Efficient Diffusion as Low Light Enhancer

Image Compression

MambaIC: State Space Models for High-Performance Learned Image Compression

Scene Graph Generation

Style Transfer

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Image Quality Assessment

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

Video Quality Assessment

Compressive Sensing

Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing

Datasets

Objaverse++: Curated 3D Object Dataset with Quality Annotations

Others

DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry

Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation

EVOS: Efficient Implicit Neural Training via EVOlutionary Selector