

Trending Papers

by AK and the research community

Submitted by Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI · Nov 27, 2025
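The CFG Augmentation the summary refers to builds on standard classifier-free guidance, which combines a conditional and an unconditional prediction. A minimal sketch of that standard combination rule (not the paper's distillation procedure; `cfg_combine` and the toy values are illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: extrapolate the conditional
    prediction away from the unconditional one by guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# With w = 1 we recover the conditional prediction;
# with w > 1 the conditional signal is amplified.
eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, -1.0])
print(cfg_combine(eps_u, eps_c, 1.0))  # [ 1. -1.]
print(cfg_combine(eps_u, eps_c, 3.0))  # [ 3. -3.]
```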
Submitted by Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI · Nov 27, 2025
Submitted by taesiri

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Wan-Move enhances motion control in video generative models by integrating motion-aware features into latent space, enabling high-quality and scalable video synthesis.

TongyiLab · Dec 9, 2025
Submitted by Zuica96

Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform

Visionary is an open web-native platform enabling real-time rendering of 3D Gaussian Splatting and meshes with efficient GPU-based inference, supporting dynamic content and generative models.

  • 24 authors
· Dec 9, 2025
Submitted by akhaliq

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Live Avatar uses a 14-billion-parameter diffusion model with Timestep-forcing Pipeline Parallelism and Rolling Sink Frame Mechanism to achieve real-time, high-fidelity avatar generation.

Quark · Dec 4, 2025
Submitted by nuojohnchen

PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

PaperDebugger is an in-editor academic writing assistant that integrates large language models, enabling direct interaction within LaTeX editors for document state management, revision, and literature search.

Submitted by taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

  • 5 authors
· Dec 8, 2025

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

  • 5 authors
· Oct 8, 2024
Submitted by taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Nov 20, 2025
Submitted by kenshinn

TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

TwinFlow is a 1-step generative model framework that enhances inference efficiency without requiring fixed pretrained teacher models or standard adversarial networks, achieving high performance on text-to-image tasks and scaling efficiently.

inclusionAI · Dec 3, 2025
Submitted by taesiri

LongCat-Image Technical Report

LongCat-Image is a bilingual open-source foundation model for image generation that addresses multilingual text rendering, photorealism, and deployment efficiency through rigorous data curation, compact design, and comprehensive open-source support.

LongCat · Dec 8, 2025
Submitted by RuoyuFeng

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Semantic-First Diffusion (SFD) enhances image generation by asynchronously denoising semantic and texture latents, improving convergence and quality.

Submitted by wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

  • 18 authors
· Sep 27, 2024
Submitted by taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Oct 16, 2025
Submitted by taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

  • 61 authors
· Sep 26, 2025
Submitted by Alicezrzhao

Composing Concepts from Images and Videos via Concept-prompt Binding

Bind & Compose uses Diffusion Transformers with hierarchical binders and temporal strategies to accurately compose complex visual concepts from images and videos.

MMLab@HKUST · Dec 10, 2025
Submitted by akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

  • 5 authors
· Mar 20, 2024

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

  • 5 authors
· Feb 8, 2025

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

  • 9 authors
· Oct 23, 2024
Submitted by akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

  • 5 authors
· Apr 28, 2025
Submitted by xandergos

Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation

Terrain Diffusion uses diffusion models and a novel algorithm called InfiniteDiffusion to generate realistic, seamless, and boundless procedural worlds with constant-time random access.

  • 1 author
· Dec 9, 2025
Submitted by taesiri

SAM 3D: 3Dfy Anything in Images

SAM 3D is a generative model that reconstructs 3D objects from single images using a multi-stage training framework that includes synthetic pretraining and real-world alignment, achieving high performance in human preference tests.

AI at Meta · Nov 20, 2025

TradingAgents: Multi-Agents LLM Financial Trading Framework

TradingAgents, a multi-agent framework that uses large language models to simulate the roles in a real-world trading firm, improves performance metrics such as cumulative return and Sharpe ratio.

  • 4 authors
· Dec 28, 2024
Submitted by fengerhu

MobiAgent: A Systematic Framework for Customizable Mobile Agents

MobiAgent, a comprehensive mobile agent system, achieves state-of-the-art performance in real-world mobile scenarios through its MobiMind-series models, AgentRR framework, and MobiFlow benchmarking suite, while also reducing data annotation costs.

  • 10 authors
· Aug 30, 2025
Submitted by SereinH

RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards

RealGen is a photorealistic text-to-image framework that uses an LLM for prompt optimization and a diffusion model for image generation, enhanced by a Detector Reward mechanism and RealBench for automated evaluation.

  • 10 authors
· Nov 29, 2025

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

  • 5 authors
· Jan 20, 2025
Submitted by wenyi

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

A vision-language model (VLM) named GLM-4.1V-Thinking, developed with a reasoning-centric training framework, achieves state-of-the-art performance across various tasks, including STEM problem solving, video understanding, and long document understanding, outperforming larger models on many benchmarks.

  • 77 authors
· Jul 1, 2025
Submitted by Gynjn

Multi-view Pyramid Transformer: Look Coarser to See Broader

MVP, a scalable multi-view transformer architecture, efficiently reconstructs large 3D scenes from multiple images using dual hierarchies and achieves state-of-the-art quality.

  • 6 authors
· Dec 8, 2025
Submitted by hiyouga

Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

A framework called Easy Dataset synthesizes fine-tuning data from unstructured documents using a GUI and LLMs, improving domain-specific performance of LLMs while maintaining general knowledge.

  • 7 authors
· Jul 5, 2025
Submitted by akhaliq

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS, a native GUI agent model using screenshots as input, outperforms commercial models in various benchmarks through enhanced perception, unified action modeling, system-2 reasoning, and iterative training with reflective online traces.

  • 35 authors
· Jan 21, 2025
Submitted by taesiri

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

DAComp is a benchmark of 210 tasks that evaluates the capabilities of agents in real-world data engineering and data analysis workflows, revealing significant deficiencies in both areas.

ByteDance Seed · Dec 3, 2025
Submitted by AdinaY

Depth Anything 3: Recovering the Visual Space from Any Views

Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.

ByteDance Seed · Nov 13, 2025
Submitted by Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Submitted by dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

  • 6 authors
· Oct 26, 2025
Submitted by daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training LLMs in various agents, using a hierarchical RL algorithm and decoupling execution from training to handle complex interactions.

  • 8 authors
· Aug 5, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

  • 16 authors
· Apr 21, 2023
Submitted by KaituoFeng

OneThinker: All-in-one Reasoning Model for Image and Video

OneThinker, an all-in-one multimodal reasoning model, unifies image and video understanding across various tasks using RL and demonstrates strong performance and knowledge transfer.

  • 14 authors
· Dec 2, 2025

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

  • 11 authors
· Jun 28, 2020
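The gradient-bucketing idea behind DDP can be illustrated without any distributed setup: flatten per-parameter gradients, split the flat vector into fixed-size buckets, and average each bucket across replicas as soon as it is ready. This is a toy sketch of the concept (`bucket_allreduce` is illustrative, not PyTorch internals; real DDP launches each bucket's allreduce asynchronously so it overlaps with the rest of the backward pass):

```python
import numpy as np

def bucket_allreduce(grads_per_replica, bucket_size):
    """Average gradients across replicas one fixed-size bucket at a time,
    mimicking one allreduce launch per bucket."""
    flats = [np.concatenate([g.ravel() for g in grads])
             for grads in grads_per_replica]
    n = flats[0].size
    out = np.empty(n)
    ops = 0
    for start in range(0, n, bucket_size):
        end = min(start + bucket_size, n)
        # One simulated allreduce (mean across replicas) per bucket.
        out[start:end] = np.mean([f[start:end] for f in flats], axis=0)
        ops += 1
    return out, ops

# Two replicas, two parameters each -> one flat vector of length 3,
# reduced in two buckets of size <= 2.
r0 = [np.array([1.0, 2.0]), np.array([3.0])]
r1 = [np.array([3.0, 4.0]), np.array([5.0])]
avg, n_ops = bucket_allreduce([r0, r1], bucket_size=2)
print(avg, n_ops)  # [2. 3. 4.] 2
```

Fewer, larger buckets amortize communication latency; smaller buckets start communicating earlier, which is the computation-communication overlap trade-off the paper studies.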
Submitted by Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

GigaAI · Oct 22, 2025
Submitted by Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

  • 47 authors
· Apr 14, 2025

InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write

InkSight converts offline handwriting to digital ink using novel reading and writing priors, effectively derendering handwritten text in diverse and challenging conditions.

  • 7 authors
· Feb 8, 2024
Submitted by zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab · Oct 19, 2025
Submitted by taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

  • 23 authors
· Aug 22, 2025

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

A simple head-specific sigmoid gate applied after Scaled Dot-Product Attention improves performance, stability, and scaling in large models, mitigating 'attention sink' and enhancing long-context extrapolation.

  • 13 authors
· May 10, 2025
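The mechanism in the summary, a head-specific sigmoid gate applied to the output of scaled dot-product attention, is simple enough to sketch directly. A minimal numpy illustration (not the paper's exact formulation; the choice to compute the gate from the query and the parameter shapes are assumptions for the sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_sdpa(q, k, v, w_gate, b_gate):
    """Scaled dot-product attention followed by an elementwise sigmoid
    gate with separate parameters per head.
    q, k, v: (heads, seq, d); w_gate: (heads, d, d); b_gate: (heads, 1, d)."""
    h, s, d = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (h, s, s)
    out = softmax(scores) @ v                        # plain SDPA output
    # Query-dependent sigmoid gate; it can suppress a head entirely,
    # adding non-linearity and output sparsity after attention.
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate + b_gate)))
    return gate * out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 4, 8)) for _ in range(3))
y = gated_sdpa(q, k, v, np.zeros((2, 8, 8)), np.zeros((2, 1, 8)))
print(y.shape)  # (2, 4, 8)
```

With zero gate weights and a large positive bias the gate saturates at 1 and the module reduces to plain attention; a large negative bias silences the head, which is how such a gate can remove the "attention sink" pressure on individual heads.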
Submitted by Owen777

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

LucidFlux, a caption-free UIR framework using a diffusion transformer, achieves robust image restoration through adaptive conditioning and SigLIP features without text prompts.

W2GenAI Lab · Sep 26, 2025
Submitted by gaomingqi

SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos

SAM-Body4D is a training-free framework that enhances 3D human mesh recovery from videos by ensuring temporal consistency and robustness to occlusions through masklet generation and refinement.

  • 3 authors
· Dec 9, 2025

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

A vision-action policy using correlated noise for flow matching and learnable mixed-layer attention wins the 2025 BEHAVIOR Challenge with high performance across diverse household tasks.

  • 3 authors
· Dec 7, 2025
Submitted by XiangpengYang

Unified Video Editing with Temporal Reasoner

VideoCoF, a Chain-of-Frames approach, improves video editing precision and instruction-to-region mapping by using reasoning tokens without requiring user-provided masks.

Submitted by nielsr

Back to Basics: Let Denoising Generative Models Denoise

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. By mapping predictions back onto the data manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
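The distinction the abstract draws, regressing the clean image versus regressing the noise, amounts to a different target in the same diffusion training step. A toy sketch under standard DDPM-style notation (`diffusion_losses` and `implied_eps` are illustrative, not the paper's code):

```python
import numpy as np

def diffusion_losses(x0, noise, alpha, model_x, model_eps):
    """x_t = sqrt(alpha)*x0 + sqrt(1-alpha)*noise. An x-prediction model
    regresses the clean (on-manifold) data x0; an eps-prediction model
    regresses the high-dimensional, off-manifold noise."""
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1 - alpha) * noise
    loss_x = np.mean((model_x(x_t) - x0) ** 2)        # predict clean data
    loss_eps = np.mean((model_eps(x_t) - noise) ** 2)  # predict noise
    return loss_x, loss_eps

def implied_eps(x_t, x_hat, alpha):
    """The noise prediction implied by a clean-data prediction x_hat:
    the two targets are linked by the forward process, but the
    regression target (and its geometry) differs."""
    return (x_t - np.sqrt(alpha) * x_hat) / np.sqrt(1 - alpha)
```

A perfect clean-data prediction implies the exact noise, so the two parameterizations are algebraically interchangeable; the paper's point is that they behave very differently for under-capacity networks in high dimensions.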

Submitted by Paranioar

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

NEO, a novel family of native Vision-Language Models, addresses fundamental constraints and integrates vision and language within a unified framework, achieving competitive performance with limited data.

SenseTime · Oct 16, 2025