new

Get trending papers in your email inbox once a day!

Get trending papers in your email inbox!

Trending Papers

byAK and the research community

Trending Papers

Submitted by

taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks-GAIA, HLE, BrowseComp, and BrowseComp-ZH-the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

54 authors

· Published on Nov 14, 2025

GitHub 3.91k arXiv Page

Submitted by

taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks-GAIA, HLE, BrowseComp, and BrowseComp-ZH-the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

54 authors

· Nov 14, 2025

GitHub 3.91k arXiv Page

Submitted by

taesiri

LTX-2: Efficient Joint Audio-Visual Foundation Model

LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.

29 authors

· Published on Jan 6, 2026

GitHub 1.72k arXiv Page

Submitted by

taesiri

LTX-2: Efficient Joint Audio-Visual Foundation Model

LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.

29 authors

· Jan 6, 2026

GitHub 1.72k arXiv Page

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

VideoRAG enhances large language models for multi-modal video processing with a dual-channel architecture that integrates textual knowledge grounding and multi-modal context encoding.

6 authors

· Published on Feb 3, 2025

GitHub 2.23k arXiv Page

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

VideoRAG enhances large language models for multi-modal video processing with a dual-channel architecture that integrates textual knowledge grounding and multi-modal context encoding.

6 authors

· Feb 3, 2025

GitHub 2.23k arXiv Page

Submitted by

hongyuw

BitNet b1.58 2B4T Technical Report

BitNet b1.58 2B4T, a 1-bit Large Language Model with 2 billion parameters, matches the performance of full-precision models while improving computational efficiency.

8 authors

· Published on Apr 16, 2025

GitHub 25.6k arXiv Page

Submitted by

hongyuw

BitNet b1.58 2B4T Technical Report

BitNet b1.58 2B4T, a 1-bit Large Language Model with 2 billion parameters, matches the performance of full-precision models while improving computational efficiency.

8 authors

· Apr 16, 2025

GitHub 25.6k arXiv Page

Submitted by

JiaaqiLiu

SimpleMem: Efficient Lifelong Memory for LLM Agents

To support reliable long-term interaction in complex environments, LLM agents require memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units; (2) Recursive Memory Consolidation, an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy; and (3) Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.

8 authors

· Published on Jan 5, 2026

GitHub 428 arXiv Page

Submitted by

JiaaqiLiu

SimpleMem: Efficient Lifelong Memory for LLM Agents

To support reliable long-term interaction in complex environments, LLM agents require memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units; (2) Recursive Memory Consolidation, an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy; and (3) Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.

8 authors

· Jan 5, 2026

GitHub 428 arXiv Page

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

ibm-granite

IBM Granite · Published on Mar 14, 2025

GitHub 49.6k arXiv Page

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

ibm-granite

IBM Granite · Mar 14, 2025

GitHub 49.6k arXiv Page

Submitted by

buaahsh

BitNet Distillation

BitNet Distillation fine-tunes large language models to 1.58-bit precision using SubLN, multi-head attention distillation, and continual pre-training, achieving comparable performance with significant memory and inference speed improvements.

MicrosoftResearch

Microsoft Research · Published on Oct 15, 2025

GitHub 25.6k arXiv Page

Submitted by

buaahsh

BitNet Distillation

BitNet Distillation fine-tunes large language models to 1.58-bit precision using SubLN, multi-head attention distillation, and continual pre-training, achieving comparable performance with significant memory and inference speed improvements.

MicrosoftResearch

Microsoft Research · Oct 15, 2025

GitHub 25.6k arXiv Page

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors

· Published on Nov 17, 2025

GitHub 14.8k arXiv Page

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors

· Nov 17, 2025

GitHub 14.8k arXiv Page

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Bitnet.cpp enhances edge inference for ternary LLMs using a novel mixed-precision matrix multiplication library, achieving significant speed improvements over baselines.

10 authors

· Published on Feb 17, 2025

GitHub 25.6k arXiv Page

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Bitnet.cpp enhances edge inference for ternary LLMs using a novel mixed-precision matrix multiplication library, achieving significant speed improvements over baselines.

10 authors

· Feb 17, 2025

GitHub 25.6k arXiv Page

Submitted by

sliuau

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.

nvidia

NVIDIA · Published on Jan 8, 2026

GitHub 64 arXiv Page

Submitted by

sliuau

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.

nvidia

NVIDIA · Jan 8, 2026

GitHub 64 arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors

· Published on Mar 20, 2024

GitHub 65.3k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors

· Mar 20, 2024

GitHub 65.3k arXiv Page

Submitted by

taesiri

UniVideo: Unified Understanding, Generation, and Editing for Videos

UniVideo, a dual-stream framework combining a Multimodal Large Language Model and a Multimodal DiT, extends unified modeling to video generation and editing, achieving state-of-the-art performance and supporting task composition and generalization.

KlingTeam

Kling Team · Published on Oct 9, 2025

GitHub 192 arXiv Page

Submitted by

taesiri

UniVideo: Unified Understanding, Generation, and Editing for Videos

UniVideo, a dual-stream framework combining a Multimodal Large Language Model and a Multimodal DiT, extends unified modeling to video generation and editing, achieving state-of-the-art performance and supporting task composition and generalization.

KlingTeam

Kling Team · Oct 9, 2025

GitHub 192 arXiv Page

Submitted by

unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

MicrosoftResearch

Microsoft Research · Published on Aug 26, 2025

GitHub 20.1k arXiv Page

Submitted by

unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

MicrosoftResearch

Microsoft Research · Aug 26, 2025

GitHub 20.1k arXiv Page

Submitted by

XuGuo699

DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

A novel video face swapping framework combines image face swapping techniques with diffusion transformers and curriculum learning to achieve superior identity preservation and visual realism.

ByteDance

ByteDance · Published on Jan 4, 2026

GitHub 325 arXiv Page

Submitted by

XuGuo699

DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

A novel video face swapping framework combines image face swapping techniques with diffusion transformers and curriculum learning to achieve superior identity preservation and visual realism.

ByteDance

ByteDance · Jan 4, 2026

GitHub 325 arXiv Page

Submitted by

taesiri

NitroGen: An Open Foundation Model for Generalist Gaming Agents

NitroGen is a vision-action foundation model trained on extensive gameplay data that demonstrates strong cross-game generalization and effective transfer learning capabilities.

nvidia

NVIDIA · Published on Jan 4, 2026

GitHub 1.56k arXiv Page

Submitted by

taesiri

NitroGen: An Open Foundation Model for Generalist Gaming Agents

NitroGen is a vision-action foundation model trained on extensive gameplay data that demonstrates strong cross-game generalization and effective transfer learning capabilities.

nvidia

NVIDIA · Jan 4, 2026

GitHub 1.56k arXiv Page

Submitted by

rajkumarrawal

Recursive Language Models

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.

MIT

Massachusetts Institute of Technology · Published on Dec 31, 2025

GitHub 841 arXiv Page

Submitted by

rajkumarrawal

Recursive Language Models

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.

MIT

Massachusetts Institute of Technology · Dec 31, 2025

GitHub 841 arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

PagedAttention algorithm and vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

9 authors

· Published on Sep 12, 2023

GitHub 67.2k arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

PagedAttention algorithm and vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

9 authors

· Sep 12, 2023

GitHub 67.2k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle

PaddlePaddle · Published on Oct 16, 2025

GitHub 67.7k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle

PaddlePaddle · Oct 16, 2025

GitHub 67.7k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors

· Published on Dec 28, 2024

GitHub 27.8k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors

· Dec 28, 2024

GitHub 27.8k arXiv Page

Submitted by

taesiri

HunyuanVideo 1.5 Technical Report

HunyuanVideo 1.5 is a lightweight video generation model with state-of-the-art visual quality and motion coherence, using a DiT architecture with SSTA and an efficient video super-resolution network.

81 authors

· Published on Nov 24, 2025

GitHub 2.99k arXiv Page

Submitted by

taesiri

HunyuanVideo 1.5 Technical Report

HunyuanVideo 1.5 is a lightweight video generation model with state-of-the-art visual quality and motion coherence, using a DiT architecture with SSTA and an efficient video super-resolution network.

81 authors

· Nov 24, 2025

GitHub 2.99k arXiv Page

Submitted by

yantaiyang05

InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

InfiniteVGGT enables continuous 3D visual geometry understanding through a causal transformer with adaptive memory management, outperforming existing streaming methods in long-term stability while introducing a new benchmark for extended evaluation.

AutoLab · Published on Jan 5, 2026

GitHub 179 arXiv Page

Submitted by

yantaiyang05

InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

InfiniteVGGT enables continuous 3D visual geometry understanding through a causal transformer with adaptive memory management, outperforming existing streaming methods in long-term stability while introducing a new benchmark for extended evaluation.

AutoLab · Jan 5, 2026

GitHub 179 arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors

· Published on Feb 7, 2025

GitHub 62.7k arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors

· Feb 7, 2025

GitHub 62.7k arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors

· Published on Apr 28, 2025

GitHub 45.3k arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors

· Apr 28, 2025

GitHub 45.3k arXiv Page

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus

14 authors

· Published on Jun 29, 2025

GitHub 477 arXiv Page

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus

14 authors

· Jun 29, 2025

GitHub 477 arXiv Page

Submitted by

Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI

Tongyi-MAI · Published on Nov 27, 2025

GitHub 8.8k arXiv Page

Submitted by

Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI

Tongyi-MAI · Nov 27, 2025

GitHub 8.8k arXiv Page

Submitted by

wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors

· Published on Sep 27, 2024

GitHub 51.8k arXiv Page

Submitted by

wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors

· Sep 27, 2024

GitHub 51.8k arXiv Page

Submitted by

Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI

Tongyi-MAI · Published on Nov 27, 2025

GitHub 8.8k arXiv Page

Submitted by

Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI

Tongyi-MAI · Nov 27, 2025

GitHub 8.8k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors

· Published on Sep 26, 2025

GitHub 51.8k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors

· Sep 26, 2025

GitHub 51.8k arXiv Page

Submitted by

taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

5 authors

· Published on Dec 8, 2025

GitHub 13.8k arXiv Page

Submitted by

taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

5 authors

· Dec 8, 2025

GitHub 13.8k arXiv Page

Submitted by

amael-apple

Sharp Monocular View Synthesis in Less Than a Second

SHARP synthesizes photorealistic views from a single image using a 3D Gaussian representation, achieving state-of-the-art results with rapid processing.

apple

Apple · Published on Dec 11, 2025

GitHub 6.68k arXiv Page

Submitted by

amael-apple

Sharp Monocular View Synthesis in Less Than a Second

SHARP synthesizes photorealistic views from a single image using a 3D Gaussian representation, achieving state-of-the-art results with rapid processing.

apple

Apple · Dec 11, 2025

GitHub 6.68k arXiv Page

Submitted by

zgzaacm

Video Generation Models Are Good Latent Reward Models

PRFL optimizes video generation preferences in latent space, improving alignment with human preferences while reducing memory consumption and training time.

Tencent-Hunyuan

Tencent Hunyuan · Published on Nov 26, 2025

GitHub 31 arXiv Page

Submitted by

zgzaacm

Video Generation Models Are Good Latent Reward Models

PRFL optimizes video generation preferences in latent space, improving alignment with human preferences while reducing memory consumption and training time.

Tencent-Hunyuan

Tencent Hunyuan · Nov 26, 2025

GitHub 31 arXiv Page

Submitted by

leo1117

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

NextFlow is a unified decoder-only autoregressive transformer that processes interleaved text-image tokens, enabling fast multimodal generation through novel next-token and next-scale prediction strategies.

ByteDance

ByteDance · Published on Jan 5, 2026

GitHub 174 arXiv Page

Submitted by

leo1117

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

NextFlow is a unified decoder-only autoregressive transformer that processes interleaved text-image tokens, enabling fast multimodal generation through novel next-token and next-scale prediction strategies.

ByteDance

ByteDance · Jan 5, 2026

GitHub 174 arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors

· Published on Feb 8, 2025

GitHub 17.6k arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors

· Feb 8, 2025

GitHub 17.6k arXiv Page

Submitted by

taesiri

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Youtu-LLM is a lightweight language model optimized for computational efficiency and agentic intelligence through a compact architecture, STEM-focused training curriculum, and scalable mid-training strategies for planning and reasoning tasks.

tencent

Tencent · Published on Dec 31, 2025

GitHub 207 arXiv Page

Submitted by

taesiri

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Youtu-LLM is a lightweight language model optimized for computational efficiency and agentic intelligence through a compact architecture, STEM-focused training curriculum, and scalable mid-training strategies for planning and reasoning tasks.

tencent

Tencent · Dec 31, 2025

GitHub 207 arXiv Page

Submitted by

zbhpku

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

DataFlow is an LLM-driven data preparation framework that enhances data quality and reproducibility for various tasks, improving LLM performance with automatically generated pipelines.

PekingUniversity

Peking University · Published on Dec 18, 2025

GitHub 2.26k arXiv Page

Submitted by

zbhpku

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

DataFlow is an LLM-driven data preparation framework that enhances data quality and reproducibility for various tasks, improving LLM performance with automatically generated pipelines.

PekingUniversity

Peking University · Dec 18, 2025

GitHub 2.26k arXiv Page

Submitted by

akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

24 authors

· Published on Jul 23, 2024

GitHub 66.4k arXiv Page

Submitted by

akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

24 authors

· Jul 23, 2024

GitHub 66.4k arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors

· Published on Jan 20, 2025

GitHub 21.8k arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors

· Jan 20, 2025

GitHub 21.8k arXiv Page

Submitted by

taesiri

SAM Audio: Segment Anything in Audio

SAM Audio, a diffusion transformer-based foundation model, achieves superior performance in general audio separation using unified text, visual, and temporal span prompts across various audio types.

AI at Meta · Published on Dec 19, 2025

GitHub 2.94k arXiv Page

Submitted by

taesiri

SAM Audio: Segment Anything in Audio

SAM Audio, a diffusion transformer-based foundation model, achieves superior performance in general audio separation using unified text, visual, and temporal span prompts across various audio types.

AI at Meta · Dec 19, 2025

GitHub 2.94k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors

· Published on Oct 8, 2024

GitHub 27.2k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors

· Oct 8, 2024

GitHub 27.2k arXiv Page

Submitted by

Lost-Cloud

MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

MindWatcher is a tool-integrated reasoning agent that combines interleaved thinking and multimodal chain-of-thought reasoning to autonomously perform complex decision-making tasks through coordinated tool invocation without human prompts.

LiAuto-Foundation-Model

LiAuto Foundation Model · Published on Dec 29, 2025

GitHub 33 arXiv Page

Submitted by

Lost-Cloud

MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

MindWatcher is a tool-integrated reasoning agent that combines interleaved thinking and multimodal chain-of-thought reasoning to autonomously perform complex decision-making tasks through coordinated tool invocation without human prompts.

LiAuto-Foundation-Model

LiAuto Foundation Model · Dec 29, 2025

GitHub 33 arXiv Page

Submitted by

Yuppie1204

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

NeoVerse is a scalable 4D world model that enables pose-free reconstruction and novel-trajectory video generation from monocular videos with state-of-the-art performance.

6 authors

· Published on Jan 1, 2026

GitHub 237 arXiv Page

Submitted by

Yuppie1204

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

NeoVerse is a scalable 4D world model that enables pose-free reconstruction and novel-trajectory video generation from monocular videos with state-of-the-art performance.

6 authors

· Jan 1, 2026

GitHub 237 arXiv Page

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin, a multimodal document image parsing model, uses heterogeneous anchor prompting to achieve state-of-the-art performance on diverse page-level and element-level tasks through an efficient analyze-then-parse paradigm.

13 authors

· Published on May 20, 2025

GitHub 8.5k arXiv Page

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin, a multimodal document image parsing model, uses heterogeneous anchor prompting to achieve state-of-the-art performance on diverse page-level and element-level tasks through an efficient analyze-then-parse paradigm.

13 authors

· May 20, 2025

GitHub 8.5k arXiv Page

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

mmGRPO, a multi-module extension of GRPO, enhances accuracy in modular AI systems by optimizing LM calls and prompts across various tasks.

13 authors

· Published on Aug 6, 2025

GitHub 31.4k arXiv Page

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

mmGRPO, a multi-module extension of GRPO, enhances accuracy in modular AI systems by optimizing LM calls and prompts across various tasks.

13 authors

· Aug 6, 2025

GitHub 31.4k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

hkuds

Data Intelligence Lab@HKU · Published on Oct 14, 2025

GitHub 12k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

hkuds

Data Intelligence Lab@HKU · Oct 14, 2025

GitHub 12k arXiv Page

Submitted by

Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors

· Published on Apr 14, 2025

Submitted by

Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors

· Apr 14, 2025

Submitted by

taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors

· Published on Aug 22, 2025

GitHub 15.3k arXiv Page

Submitted by

taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors

· Aug 22, 2025

GitHub 15.3k arXiv Page

Submitted by

akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

8 authors

· Published on Jul 25, 2024

GitHub 15.3k arXiv Page

Submitted by

akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

8 authors

· Jul 25, 2024

GitHub 15.3k arXiv Page

Multi-Agent Software Development through Cross-Team Collaboration

Cross-Team Collaboration improves software quality by enabling multiple LLM agent teams to propose and communicate decisions.

8 authors

· Published on Jun 13, 2024

GitHub 28.1k arXiv Page

Multi-Agent Software Development through Cross-Team Collaboration

Cross-Team Collaboration improves software quality by enabling multiple LLM agent teams to propose and communicate decisions.

8 authors

· Jun 13, 2024

GitHub 28.1k arXiv Page

Scaling Large-Language-Model-based Multi-Agent Collaboration

Multi-agent collaboration networks enhance collective intelligence, outperforming baselines across various topologies and showing emergent abilities earlier than neural scaling laws suggest.

10 authors

· Published on Jun 11, 2024

GitHub 28.1k arXiv Page

Scaling Large-Language-Model-based Multi-Agent Collaboration

Multi-agent collaboration networks enhance collective intelligence, outperforming baselines across various topologies and showing emergent abilities earlier than neural scaling laws suggest.

10 authors

· Jun 11, 2024

GitHub 28.1k arXiv Page

Submitted by

Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

open-gigaai

GigaAI · Published on Oct 22, 2025

GitHub 1.47k arXiv Page

Submitted by

Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

open-gigaai

GigaAI · Oct 22, 2025

GitHub 1.47k arXiv Page