Timezone: America/Los_Angeles
SAT 28 SEP
11 p.m.
(ends 9:00 AM)
SUN 29 SEP
midnight
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
1:30 a.m.
4 a.m.
5 a.m.
Workshop:
(ends 9:00 AM)
Workshop:
(ends 9:00 AM)
Workshop:
(ends 9:00 AM)
Workshop:
(ends 9:00 AM)
6:30 a.m.
11 p.m.
(ends 9:00 AM)
MON 30 SEP
midnight
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
Workshop:
(ends 4:00 AM)
Tutorial:
(ends 4:00 AM)
1:30 a.m.
4 a.m.
5 a.m.
Workshop:
(ends 9:00 AM)
Workshop:
(ends 9:00 AM)
Workshop:
(ends 9:00 AM)
6:30 a.m.
10 p.m.
(ends 9:30 AM)
11 p.m.
(ends 12:00 AM)
TUE 1 OCT
midnight
Orals 12:00-1:20
[12:00]
Towards Scene Graph Anticipation
[12:10]
OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation
[12:20]
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
[12:30]
Bi-directional Contextual Attention for 3D Dense Captioning
[12:40]
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
[12:50]
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting
[1:00]
A Fair Ranking and New Model for Panoptic Scene Graph Generation
[1:10]
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
(ends 1:30 AM)
Orals 12:00-1:20
[12:00]
Making Large Language Models Better Planners with Reasoning-Decision Alignment
[12:10]
MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping
[12:20]
M^2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation
[12:30]
H-V2X: A Large Scale Highway Dataset for BEV Perception
[12:40]
Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
[12:50]
DriveLM: Driving with Graph Visual Question Answering
[1:00]
RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
[1:10]
Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks
(ends 1:30 AM)
Orals 12:00-1:20
[12:00]
Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection
[12:10]
Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging
[12:20]
SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow
[12:30]
Photon Inhibition for Energy-Efficient Single-Photon Imaging
[12:40]
Minimalist Vision with Freeform Pixels
[12:50]
Flying with Photons: Rendering Novel Views of Propagating Light
[1:00]
A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging
[1:10]
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
(ends 1:30 AM)
Demonstrations 12:00-3:30
(ends 3:30 AM)
1:30 a.m.
(ends 3:30 AM)
3 a.m.
3:30 a.m.
4:30 a.m.
Orals 4:30-6:20
[4:30]
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
[4:40]
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
[4:50]
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
[5:00]
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
[5:10]
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
[5:20]
LLMGA: Multimodal Large Language Model based Generation Assistant
[5:30]
Accelerating Image Generation with Sub-path Linear Approximation Model
[5:40]
SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation
[5:50]
Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture
[6:00]
Zero-Shot Detection of AI-Generated Images
[6:10]
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
(ends 6:30 AM)
Orals 4:30-6:20
[4:30]
Efficient Bias Mitigation Without Privileged Information
[4:40]
Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation
[4:50]
MobileNetV4: Universal Models for the Mobile Ecosystem
[5:00]
Momentum Auxiliary Network for Supervised Local Learning
[5:10]
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
[5:20]
Dataset Enhancement with Instance-Level Augmentations
[5:30]
Adaptive Parametric Activation
[5:40]
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
[5:50]
Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation
[6:00]
CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
[6:10]
On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
(ends 6:30 AM)
Orals 4:30-6:20
[4:30]
Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition
[4:40]
COMO: Compact Mapping and Odometry
[4:50]
Smoothness, Synthesis, and Sampling: Re-thinking Unsupervised Multi-View Stereo with DIV Loss
[5:00]
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation
[5:10]
SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
[5:20]
Six-Point Method for Multi-Camera Systems with Reduced Solution Space
[5:30]
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
[5:40]
Grounding Image Matching in 3D with MASt3R
[5:50]
ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images
[6:00]
Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection
[6:10]
Camera Calibration using a Collimator System
(ends 6:30 AM)
5:30 a.m.
Demonstrations 5:30-9:00
(ends 9:00 AM)
6:30 a.m.
Keynote:
Lourdes Agapito · Vittorio Ferrari
(ends 7:30 AM)
7:30 a.m.
Posters 7:30-9:30
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models
Enhancing Source-Free Domain Adaptive Object Detection with Low-confidence Pseudo Label Distillation
(ends 9:30 AM)
9:30 a.m.
11 p.m.
(ends 9:30 AM)
WED 2 OCT
midnight
(ends 3:30 AM)
Orals 12:00-1:20
[12:00]
PetFace: A Large-Scale Dataset and Benchmark for Animal Identification
[12:10]
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
[12:20]
Towards Model-Agnostic Dataset Condensation by Heterogeneous Models
[12:30]
Parrot Captions Teach CLIP to Spot Text
[12:40]
Towards Open-ended Visual Quality Comparison
[12:50]
VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking
[1:00]
Insect Identification in the Wild: The AMI Dataset
[1:10]
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description
(ends 1:30 AM)
Orals 12:00-1:20
[12:00]
PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
[12:10]
Self-Supervised Video Desmoking for Laparoscopic Surgery
[12:20]
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos
[12:30]
Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction
[12:40]
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration
[12:50]
Revisiting Adaptive Cellular Recognition Under Domain Shifts: A Contextual Correspondence View
[1:00]
SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images
[1:10]
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
(ends 1:30 AM)
Orals 12:00-1:20
[12:00]
HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
[12:10]
PointLLM: Empowering Large Language Models to Understand Point Clouds
[12:20]
RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
[12:30]
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment
[12:40]
KeypointDETR: An End-to-End 3D Keypoint Detector
[12:50]
Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather
[1:00]
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
[1:10]
Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
(ends 1:30 AM)
1:30 a.m.
(ends 3:30 AM)
3:30 a.m.
4:30 a.m.
Orals 4:30-6:20
[4:30]
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
[4:40]
Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
[4:50]
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
[5:00]
FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information
[5:10]
RaFE: Generative Radiance Fields Restoration
[5:20]
Watch Your Steps: Local Image and Scene Editing by Text Instructions
[5:30]
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
[5:40]
RPBG: Towards Robust Neural Point-based Graphics in the Wild
[5:50]
Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
[6:00]
Learning 3D-aware GANs from Unposed Images with Template Feature Field
[6:10]
MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
(ends 6:30 AM)
Orals 4:30-6:20
[4:30]
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
[4:40]
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
[4:50]
Efficient Neural Video Representation with Temporally Coherent Modulation
[5:00]
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
[5:10]
Video Editing via Factorized Diffusion Distillation
[5:20]
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
[5:30]
Audio-Synchronized Visual Animation
[5:40]
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
[5:50]
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
[6:00]
ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
[6:10]
Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
(ends 6:30 AM)
Orals 4:30-6:20
[4:30]
AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
[4:40]
Sapiens: Foundation for Human Vision Models
[4:50]
POET: Prompt Offset Tuning for Continual Human Action Adaptation
[5:00]
Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
[5:10]
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
[5:20]
UGG: Unified Generative Grasping
[5:30]
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
[5:40]
Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
[5:50]
LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment
[6:00]
Controllable Human-Object Interaction Synthesis
[6:10]
NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
(ends 6:30 AM)
5:30 a.m.
Demonstrations 5:30-9:00
(ends 9:00 AM)
6:30 a.m.
Keynote:
Sandra Wachter
(ends 7:30 AM)
7:30 a.m.
11 p.m.
(ends 9:30 AM)
THU 3 OCT
midnight
Demonstrations 12:00-3:30
(ends 3:30 AM)
Orals 12:00-1:20
[12:00]
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
[12:10]
AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
[12:20]
CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model
[12:30]
Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
[12:40]
Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels
[12:50]
ActionVOS: Actions as Prompts for Video Object Segmentation
[1:00]
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
[1:10]
Diffusion Models for Open-Vocabulary Segmentation
(ends 1:30 AM)
Orals 12:00-1:20
[12:00]
Robust Fitting on a Gate Quantum Computer
[12:10]
Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
[12:20]
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
[12:30]
MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
[12:40]
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
[12:50]
Faceptor: A Generalist Model for Face Perception
[1:00]
A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
[1:10]
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
(ends 1:30 AM)
Orals 12:00-1:20
[12:00]
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
[12:10]
Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization
[12:20]
Emergent Visual-Semantic Hierarchies in Image-Text Representations
[12:30]
Learning Multimodal Latent Generative Models with Energy-Based Prior
[12:40]
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
[12:50]
SINDER: Repairing the Singular Defects of DINOv2
[1:00]
Denoising Vision Transformers
[1:10]
Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking
(ends 1:30 AM)
1:30 a.m.
(ends 3:30 AM)
3:30 a.m.
4:30 a.m.
Orals 4:30-6:20
[4:30]
Controlling the World by Sleight of Hand
[4:40]
Pyramid Diffusion for Fine 3D Large Scene Generation
[4:50]
FMBoost: Boosting Latent Diffusion with Flow Matching
[5:00]
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
[5:10]
Exact Diffusion Inversion via Bidirectional Integration Approximation
[5:20]
Tackling Structural Hallucination in Image Translation with Local Diffusion
[5:30]
Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
[5:40]
Adversarial Diffusion Distillation
[5:50]
Arc2Face: A Foundation Model for ID-Consistent Human Faces
[6:00]
Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning
[6:10]
OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
(ends 6:30 AM)
Orals 4:30-6:20
[4:30]
E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation
[4:40]
Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
[4:50]
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
[5:00]
MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment
[5:10]
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
[5:20]
LongVLM: Efficient Long Video Understanding via Large Language Models
[5:30]
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
[5:40]
Towards Neuro-Symbolic Video Understanding
[5:50]
Classification Matters: Improving Video Action Detection with Class-Specific Attention
[6:00]
DEVIAS: Learning Disentangled Video Representations of Action and Scene
[6:10]
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
(ends 6:30 AM)
Orals 4:30-6:20
[4:30]
GiT: Towards Generalist Vision Transformer through Universal Language Interface
[4:40]
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
[4:50]
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
[5:00]
MMBENCH: Is Your Multi-Modal Model an All-around Player?
[5:10]
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
[5:20]
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
[5:30]
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
[5:40]
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
[5:50]
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
[6:00]
uCAP: An Unsupervised Prompting Method for Vision-Language Models
[6:10]
BRAVE: Broadening the visual encoding of vision-language models
(ends 6:30 AM)
5:30 a.m.
Demonstrations 5:30-9:00
(ends 9:00 AM)
6:30 a.m.
7:30 a.m.
10:30 a.m.
11 p.m.
(ends 3:30 AM)
11:30 p.m.
Orals 11:30-1:10
[11:30]
On the Topology Awareness and Generalization Performance of Graph Neural Networks
[11:40]
Improving Knowledge Distillation via Regularizing Feature Direction and Norm
[11:50]
Spline-based Transformers
[12:00]
Anytime Continual Learning for Open Vocabulary Classification
[12:10]
Weighted Ensemble Models Are Strong Continual Learners
[12:20]
COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
[12:30]
Echoes of the Past: Boosting Long-tail Recognition via Reflective Learning
[12:40]
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild
[12:50]
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
[1:00]
HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
(ends 1:30 AM)
Orals 11:30-1:00
[11:30]
Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
[11:40]
Adversarial Robustification via Text-to-Image Diffusion Models
[11:50]
Flatness-aware Sequential Learning Generates Resilient Backdoors
[12:00]
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks
[12:10]
Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
[12:20]
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
[12:30]
Privacy-Preserving Adaptive Re-Identification without Image Transfer
[12:40]
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
[12:50]
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
(ends 1:30 AM)
Orals 11:30-1:10
[11:30]
A Direct Approach to Viewing Graph Solvability
[11:40]
Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
[11:50]
Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
[12:00]
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
[12:10]
Physics-Based Interaction with 3D Objects via Video Generation
[12:20]
Shape from Heat Conduction
[12:30]
Rasterized Edge Gradients: Handling Discontinuities Differentially
[12:40]
ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
[12:50]
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
[1:00]
Model Stock: All we need is just a few fine-tuned models
(ends 1:30 AM)
FRI 4 OCT