Dahua Lin
SS4D: Native 4D Generative Model via Structured Spacetime Latents
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generat…
More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in r…
SIM-CoT: Supervised Implicit Chain-of-Thought
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue…
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., …
MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming hu…
Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless co…
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy …
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on …
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail…
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimoda…
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
End-to-end human animation with rich multi-modal conditions (e.g., text, image, and audio) has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a gl…
Video World Models with Long-term Spatial Memory
Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maint…
Consultant Decoding: Yet Another Synergistic Mechanism
The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates requi…
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primaril…
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoni…
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, …
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-fr…
Visual Agentic Reinforcement Fine-Tuning
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source res…
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that su…
Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. By modifying responses of unknown questions in the training data to refusal responses such as "I don't know", RAIT enhance…
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evalua…
MM-IFEngine: Towards Multimodal Instruction Following
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it correctly. Existing multimodal instruction following training data …
HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the sca…
Multi-identity Human Image Animation with Structural Video Diffusion
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while…
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we…
LEGION: Learning to Ground and Explain for Synthetic Image Detection
The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection metho…
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabiliti…
Long Context Tuning for Video Generation
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots…