Junchi Yan
ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning Open
Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical …
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers Open
Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query de…
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization Open
The reasoning pattern of large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work posit…
The LLM Era Demands Natural-Language-Aligned Theorem Provers for Mathematics Open
Physics-informed Neural-operator Predictive Control for Drag Reduction in Turbulent Flows Open
Assessing turbulence control effects for wall friction numerically is a significant challenge since it requires expensive simulations of turbulent fluid dynamics. We instead propose an efficient deep reinforcement learning (RL) framework f…
Structure Alignment-driven Cross-Graph Modeling for Functional RNA Design Open
RNAs are critical for biological processes, with their biological functions closely tied to their three-dimensional structures. RNA inverse folding, the design of RNA sequences that fold into target 3D structures, is a complex challenge du…
Fast Multi-objective RNA Optimization with Autoregressive Reinforcement Learning Open
Codon optimization is essential in mRNA vaccine development, while existing tools face limitations in computational efficiency, sequence diversity, and universality. To address these challenges, we develop RNAJog (RNA Joint Optimization…
Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency Open
Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap has multiple causes, e.g., sampling bias and noise. In t…
BiQAP: Neural Bi-level Optimization-based Framework for Solving Quadratic Assignment Problems Open
Reinvent the Operation not the Architecture: Quantum-inspired High-order Product for Compatible and Improved LLMs Training Open
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs Open
Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer rem…
SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence Open
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple spatial capabilities, even for handling simple and normal ta…
Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space Open
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each b…
New Evidence of the Two-Phase Learning Dynamics of Neural Networks Open
Understanding how deep neural networks learn remains a fundamental challenge in modern machine learning. A growing body of evidence suggests that training dynamics undergo a distinct phase transition, yet our understanding of this transiti…
Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving Open
As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based pr…
Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions Open
The rise of foundation models paves the way for generalist robot policies in the physical world. Existing methods relying on text-only instructions often struggle to generalize to unseen scenarios. We argue that interleaved image-text inpu…
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving Open
Mathematical geometric problem solving (GPS) demands verifiable logical coherence and multimodal reasoning capabilities. While large language models (LLMs) have shown rapid progress in GPS, their advancement is hindered by the lack of reli…
Int2Planner: An Intention-based Multi-modal Motion Planner for Integrated Prediction and Planning Open
Motion planning is a critical module in autonomous driving, with the primary challenge of uncertainty caused by interactions with other participants. As most previous methods treat prediction and planning as separate tasks, it is difficult…
On the Cone Effect in the Learning Dynamics Open
Understanding the learning dynamics of neural networks is a central topic in the deep learning community. In this paper, we take an empirical perspective to study the learning dynamics of neural networks in real-world settings. Specificall…
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving Open
End-to-end autonomous driving (E2E-AD) has emerged as a trend in the field of autonomous driving, promising a data-driven, scalable approach to system design. However, existing E2E-AD methods usually adopt the sequential paradigm of percep…
Rethinking Video Tokenization: A Conditioned Diffusion-based Approach Open
Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, its training process often relies on complex multi-stage tra…
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training Open
Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is imp…
Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection Open
Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To e…
Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances Open
With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging ta…
Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization Open
Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process requiring denoising across multiple noise levels incurs substantial overhead. We p…
PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection Open
With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to e…
Efficient Packaging Line Object Counting by Cross-Frame Association With Wavelet Convolutions and Trajectory Compensation Open
Real-time object counting in industrial pipelines is critical for improving efficiency and accuracy in industries like manufacturing and logistics. This paper introduces a novel multi-object association method, namely a tracking method, whi…
Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives Open
Knowledge-Empowered, Collaborative, and Co-Evolving AI Models: The Post-LLM Roadmap Open
Large language models (LLMs) have significantly advanced artificial intelligence (AI) by excelling in tasks such as understanding, generation, and reasoning across multiple modalities. Despite these achievements, LLMs have inherent limitat…