Dahua Lin
SS4D: Native 4D Generative Model via Structured Spacetime Latents
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generat…
More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in r…
SIM-CoT: Supervised Implicit Chain-of-Thought
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue…
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., …
MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming hu…
Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless co…
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy …
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on …
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail…
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimoda…
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
End-to-end human animation with rich multi-modal conditions (e.g., text, image, and audio) has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a gl…
Video World Models with Long-term Spatial Memory
Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maint…
Consultant Decoding: Yet Another Synergistic Mechanism
The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates requi…
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primaril…
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoni…
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, …
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-fr…
Visual Agentic Reinforcement Fine-Tuning
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source res…
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that su…
Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. By modifying responses of unknown questions in the training data to refusal responses such as "I don't know", RAIT enhance…
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evalua…
MM-IFEngine: Towards Multimodal Instruction Following
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it correctly. Existing multimodal instruction following training data …
HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the sca…
Multi-identity Human Image Animation with Structural Video Diffusion
Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while…
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we…
LEGION: Learning to Ground and Explain for Synthetic Image Detection
The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection metho…
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabiliti…
Long Context Tuning for Video Generation
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots…