Lean Wang
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce…
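The snippet is cut off before the DSA details, but the general idea behind this family of sparse-attention mechanisms is that each query attends to only a small, cheaply selected subset of past tokens instead of the full context. Below is a minimal sketch of top-k token selection driven by a lightweight indexer projection; the function name, shapes, and scoring scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Illustrative top-k sparse attention for a single head and sequence.

    q, k, v:      [T, d] query/key/value states
    idx_q, idx_k: [T, d_i] small "indexer" projections used only to score
                  which past tokens each query keeps (d_i << d)
    top_k:        number of past tokens each query attends to
    """
    T, d = q.shape

    # Cheap relevance scores from the low-dimensional indexer projections.
    index_scores = idx_q @ idx_k.t()                         # [T, T]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    index_scores = index_scores.masked_fill(~causal, float("-inf"))

    # Each query keeps only its top-k highest-scoring past tokens.
    keep = index_scores.topk(min(top_k, T), dim=-1).indices  # [T, k]

    # Full attention restricted to the selected positions (dense here for
    # clarity; a real kernel would never compute the unselected entries).
    scores = (q @ k.t()) / d ** 0.5
    scores = scores.masked_fill(~causal, float("-inf"))
    select = torch.full_like(scores, float("-inf"))
    select.scatter_(1, keep, 0.0)                            # 0 where kept, -inf elsewhere
    probs = F.softmax(scores + select, dim=-1)
    return probs @ v                                         # [T, d]
```

The sketch still materializes the full score matrix for readability; the efficiency argument rests on kernels that only ever compute attention over the selected tokens.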
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable…
DHGRPO: Domain-Induced, Hierarchical Group Relative Policy Optimization
DHGRPO (Domain-Induced Hierarchical Group Relative Policy Optimization) is a mathematically grounded extension of Group Relative Policy Optimization (GRPO) that mitigates group-level failure modes in preference-based fine-tuning of large language models…
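For context on the baseline being extended: GRPO forms advantages by standardizing rewards within a group of responses sampled for the same prompt, with no learned value function. The sketch below shows that group-relative computation plus one plausible reading of the "domain-induced, hierarchical" idea (normalize within a prompt group, then rescale per domain); the domain-level step and all names are assumptions for illustration, not the paper's formulation.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within each prompt group.

    rewards: [G, N] rewards for N sampled responses to each of G prompts.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)                    # [G, N]

def hierarchical_advantages(rewards_by_domain, domain_weights=None):
    """Hypothetical hierarchical variant: normalize within each prompt group,
    then rescale per domain so that no single domain dominates the update.

    rewards_by_domain: dict mapping a domain name to a [G, N] reward tensor.
    domain_weights:    optional dict of per-domain scaling factors.
    """
    out = {}
    for domain, rewards in rewards_by_domain.items():
        adv = group_relative_advantages(rewards)
        w = 1.0 if domain_weights is None else domain_weights.get(domain, 1.0)
        out[domain] = w * adv
    return out
```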
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantly…
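The redundancy claim in the title suggests a simple way to exploit it: drop patch tokens whose features barely change between consecutive frames, keeping detail only where the scene actually moves. The sketch below is an illustrative similarity-threshold filter, not the paper's method; the threshold and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def drop_static_tokens(prev_frame, curr_frame, sim_threshold=0.95):
    """Keep only the visual tokens of the current frame that changed enough.

    prev_frame, curr_frame: [P, d] patch features of two consecutive frames.
    Returns the kept tokens and their patch indices.
    """
    sim = F.cosine_similarity(prev_frame, curr_frame, dim=-1)   # [P]
    keep = sim < sim_threshold                                  # "changed" patches only
    return curr_frame[keep], keep.nonzero(as_tuple=True)[0]
```

On a mostly static stream, a filter like this discards the large majority of tokens, which is the intuition behind the title's redundancy figure.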
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses a significant challenge. Sparse attention offers a promising direction for improving…
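Hardware-aligned sparse attention typically operates on contiguous blocks of keys and values rather than on individual tokens, so that the selected memory stays coalesced. Below is a minimal sketch of blockwise top-k selection; the block size, the mean-pooled block scoring, the omission of causal masking, and all names are illustrative assumptions rather than NSA's actual design.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_blocks=4):
    """Each query attends only to its top-scoring key/value blocks.

    q: [Tq, d]; k, v: [Tk, d], with Tk assumed divisible by block_size.
    Causal masking is omitted for brevity.
    """
    d = q.shape[-1]
    n_blocks = k.shape[0] // block_size
    k_blocks = k.view(n_blocks, block_size, d)
    v_blocks = v.view(n_blocks, block_size, d)

    # Score each block by a cheap summary of its keys (mean pooling).
    block_keys = k_blocks.mean(dim=1)                        # [n_blocks, d]
    block_scores = q @ block_keys.t()                        # [Tq, n_blocks]
    sel = block_scores.topk(min(top_blocks, n_blocks), dim=-1).indices

    out = torch.empty_like(q)
    for i in range(q.shape[0]):          # per-query gather, written for clarity not speed
        ks = k_blocks[sel[i]].reshape(-1, d)                 # [top_blocks*block_size, d]
        vs = v_blocks[sel[i]].reshape(-1, d)
        attn = F.softmax(ks @ q[i] / d ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out
```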
DeepSeek-V3 Technical Report
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention…
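The 671B-total / 37B-activated split is the defining property of an MoE layer: a router sends each token to a small number of experts, so only a fraction of the parameters (here roughly 37/671 ≈ 5.5%) participates in any one forward pass. A minimal top-k routing sketch follows; the sizes, the softmax over selected gates, and the names are assumptions for illustration, not the DeepSeek-V3 router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer with top-k routing (illustrative)."""

    def __init__(self, d_model=256, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: [tokens, d_model]
        gate_logits = self.router(x)                         # [tokens, n_experts]
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                 # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only the selected experts run for each token, which is why the activated parameter count can be a small fraction of the total.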
Temporal Reasoning Transfer from Text to Video
Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation…
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, serves as a crucial component in MLLMs. However, measuring the effectiveness of projectors in vision-language alignment remains u…
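A projector that only compresses visual tokens can be as simple as pooling the patch grid down to a coarser grid before projecting into the LLM embedding space, leaving semantic abstraction to the language model itself. The sketch below shows such a compression-only design (2D adaptive average pooling followed by a linear projection), in the spirit of decoupling compression from abstraction; all dimensions and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingProjector(nn.Module):
    """Compress a ViT patch grid, then project into the LLM embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096, grid_in=24, grid_out=12):
        super().__init__()
        self.grid_in = grid_in
        self.pool = nn.AdaptiveAvgPool2d(grid_out)   # 24x24 -> 12x12 patches
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_tokens):                 # [B, grid_in*grid_in, vision_dim]
        b, n, c = patch_tokens.shape
        x = patch_tokens.transpose(1, 2).reshape(b, c, self.grid_in, self.grid_in)
        x = self.pool(x)                             # [B, C, grid_out, grid_out]
        x = x.flatten(2).transpose(1, 2)             # [B, grid_out*grid_out, C]
        return self.proj(x)                          # [B, grid_out*grid_out, llm_dim]
```

With these example sizes, 576 patch tokens are reduced to 144 (4x compression) without any learned query tokens.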
Towards Codable Watermarking for Injecting Multi-bits Information to LLMs
As large language models (LLMs) generate texts with increasing fluency and realism, there is a growing need to identify the source of texts to prevent the abuse of LLMs. Text watermarking techniques have proven reliable in distinguishing w…
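A common way to make a watermark carry a multi-bit message rather than a single yes/no signal is to let the payload bits influence which part of the vocabulary is favored at each decoding step, so a detector that knows the key can read the bits back out. The sketch below is a simplified bit-keyed green-list bias over logits; the hashing scheme, bias strength, and function names are illustrative assumptions, not the paper's codable watermarking scheme.

```python
import torch

def biased_logits(logits, prev_token, message_bits, step, key=1234, delta=2.0):
    """Bias next-token logits toward a pseudo-random 'green list' chosen by
    the current payload bit (illustrative multi-bit watermark sketch).

    logits:       [V] next-token logits
    prev_token:   previous token id (int), seeds the green list as in hash-based schemes
    message_bits: list of 0/1 payload bits; the bit for this step is message_bits[step % len]
    """
    vocab = logits.shape[0]
    bit = message_bits[step % len(message_bits)]

    # The seed depends on the key, the local context, and the bit being embedded,
    # so a detector can test both hypotheses (bit = 0 vs bit = 1) at each position.
    gen = torch.Generator().manual_seed(key * 1_000_003 + prev_token * 2 + bit)
    green = torch.rand(vocab, generator=gen) < 0.5            # ~half the vocabulary

    return logits + delta * green.float()
```

Detection re-derives both candidate green lists at each position and checks which hypothesis the observed tokens favor, recovering the bits without access to the model.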
Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
In-context learning (ICL) emerges as a promising capability of large language models (LLMs) by providing them with demonstration examples to perform diverse tasks. However, the underlying mechanism of how LLMs learn from the provided context…
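The "anchor" view can be probed directly from attention maps: measure how much the label-word positions attend to the surrounding demonstration text in shallow layers, and how much the prediction position attends to the label words in deep layers. The sketch below computes those two aggregates from a stack of attention matrices; the tensor layout and the exact metrics are assumptions for illustration rather than the paper's saliency analysis.

```python
import torch

def anchor_flow_scores(attn, label_pos, target_pos):
    """Aggregate attention flow through label-word "anchor" positions.

    attn:       [layers, heads, T, T] attention weights (row = query, column = key)
    label_pos:  positions of the label words in the demonstrations
    target_pos: position where the prediction is made (usually T - 1)
    """
    attn = attn.mean(dim=1)                                  # average heads -> [layers, T, T]
    label_pos = torch.as_tensor(label_pos)

    # Shallow-layer picture: how much each label word attends to the
    # surrounding (non-label) demonstration text, i.e. aggregates it.
    non_label = torch.ones(attn.shape[-1], dtype=torch.bool)
    non_label[label_pos] = False
    text_to_label = attn[:, label_pos, :][:, :, non_label].sum(-1).mean(-1)   # [layers]

    # Deep-layer picture: how much the prediction position attends to the anchors.
    label_to_target = attn[:, target_pos, :][:, label_pos].sum(-1)            # [layers]

    return text_to_label, label_to_target
```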
Gradient Knowledge Distillation for Pre-trained Language Models
Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning in…
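Beyond matching output distributions, a gradient-level distillation objective asks the student's gradients to point in the same direction as the teacher's. The sketch below combines a standard KL output-matching term with a cosine mismatch between the two models' task-loss gradients with respect to shared input embeddings; the weighting, the choice of gradient target, and the shared-embedding setup are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gradient_kd_loss(student, teacher, embeds, labels, alpha=0.5, temp=2.0):
    """Response-based KD plus an illustrative gradient-alignment term.

    student, teacher: callables mapping input embeddings [B, T, d] to logits [B, T, V]
                      (teacher parameters are assumed frozen)
    embeds:           shared input embeddings [B, T, d]
    labels:           [B, T] token ids used to form each model's task loss
    """
    embeds = embeds.detach().requires_grad_(True)
    s_logits = student(embeds)
    t_logits = teacher(embeds)      # keep the graph: we need d(teacher loss)/d(embeds)

    # 1) Usual output-matching KD term (teacher distribution treated as a constant).
    kd = F.kl_div(
        F.log_softmax(s_logits / temp, dim=-1),
        F.softmax(t_logits.detach() / temp, dim=-1),
        reduction="batchmean",
    ) * temp ** 2

    # 2) Align gradients of each model's task loss w.r.t. the shared input embeddings.
    s_task = F.cross_entropy(s_logits.flatten(0, 1), labels.flatten())
    t_task = F.cross_entropy(t_logits.flatten(0, 1), labels.flatten())
    s_grad = torch.autograd.grad(s_task, embeds, create_graph=True)[0]
    t_grad = torch.autograd.grad(t_task, embeds)[0].detach()
    grad_align = 1 - F.cosine_similarity(s_grad.flatten(1), t_grad.flatten(1), dim=-1).mean()

    return kd + alpha * grad_align
```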