Kaipeng Zhang
Brackish water irrigation boosts honeysuckle (Lonicera japonica Thunb.)-salt tolerance by regulating sodium partitioning and potassium homeostasis: implications for coastal saline soil
Agricultural development in coastal saline-alkali lands is constrained by freshwater scarcity. Utilizing brackish water for irrigation presents a viable pathway to alleviate this pressure. Honeysuckle (Lonicera japonica Thunb…
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet exist…
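As a rough illustration of the thinking-with-images pattern described above, here is a minimal sketch of a tool loop in which a model may request an image crop before answering. The `ask_model` function and the JSON tool-call format are assumptions for illustration, not the benchmark's actual protocol.

```python
# Minimal sketch of a thinking-with-images loop: the model may emit a
# tool call asking to crop/zoom the image before giving a final answer.
import json
from PIL import Image

def ask_model(image: Image.Image, prompt: str) -> str:
    """Placeholder: send (image, prompt) to a multimodal model, return its reply."""
    raise NotImplementedError

def run_tir_episode(image: Image.Image, question: str, max_steps: int = 3) -> str:
    prompt = (f"{question}\nIf you need a closer look, reply with "
              '{"tool": "crop", "box": [left, top, right, bottom]}; '
              "otherwise reply with the final answer.")
    for _ in range(max_steps):
        reply = ask_model(image, prompt)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply                            # plain text: final answer
        if call.get("tool") == "crop":
            image = image.crop(tuple(call["box"]))  # zoom in, then re-ask
        else:
            return reply
    return ask_model(image, question)               # force an answer at the limit
```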
From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: First, image generation models output rasterized images lacking semantic structure, making i…
Comparative Proteomics Analysis Reveals Differential Immune Responses of Paralichthys olivaceus to Edwardsiella tarda Infection Under High and Low Temperature
Fluctuating water temperatures and bacterial pathogens such as Edwardsiella tarda pose a serious threat to mariculture, resulting in significant economic losses within the flounder industry. A previous study revealed that elevated temperat…
Physiological Response Mechanisms of Triplophysa strauchii Under Salinity Stress
Salinity is a critical environmental factor for fish survival, yet how Triplophysa strauchii, a characteristic fish of Northwest China, physiologically responds to salinity stress remains poorly understood. This study aimed to determ…
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
Recent advancements have shown that the Mixture of Experts (MoE) approach significantly enhances the capacity of large language models (LLMs) and improves performance on downstream tasks. Building on these promising results, multi-modal la…
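For readers unfamiliar with the MoE mechanism this abstract builds on, below is a minimal sketch of a top-k gated MoE layer in PyTorch. The layer sizes, expert MLP shape, and routing details are illustrative assumptions, not EvoMoE's actual architecture.

```python
# Minimal top-k gated Mixture-of-Experts layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int = 512, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts)             # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.gate(x)                             # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)        # route to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens sent to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```

Only k experts run per token, so capacity grows with the expert count while per-token compute stays roughly constant.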
SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model
Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models like GPT-4o-image integrate high-level reasoning, improving semantic understanding and…
IA-T2I: Internet-Augmented Text-to-Image Generation
Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied in the text prompt is uncertain. For example, a T2I model released in February would struggle to generate a…
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a h…
Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
This paper investigates the role of the explicit thinking process in rule-based reinforcement fine-tuning (RFT) for MLLMs. We first propose CLS-RL for MLLM image classification, using verifiable rewards for fine-tuning. Experiments show CLS-RL…
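To make "verifiable rewards" concrete, here is a minimal sketch of the kind of rule-based reward such fine-tuning typically uses for classification: a format check plus an exact-match accuracy check, computed from the model's text alone with no learned judge. The tag names and weights are illustrative assumptions, not CLS-RL's published reward.

```python
# Minimal rule-based (verifiable) reward for classification-style RFT.
import re

def verifiable_reward(response: str, label: str) -> float:
    # Format reward: the response must wrap its prediction in <answer> tags.
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    format_reward = 0.1                       # small bonus for obeying the format
    # Accuracy reward: exact (case-insensitive) match against the class label.
    pred = m.group(1).strip().lower()
    accuracy_reward = 1.0 if pred == label.strip().lower() else 0.0
    return format_reward + accuracy_reward

print(verifiable_reward("I think... <answer>tabby cat</answer>", "Tabby Cat"))  # 1.1
```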
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
DeepSeek R1 and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimo…
LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, motivated by its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use s…
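The core substitution behind a linear DiT is replacing softmax attention, whose cost is quadratic in sequence length, with a kernelized linear attention that reorders the computation. A minimal sketch follows; the ELU+1 feature map is one common choice and an assumption here, not necessarily the paper's.

```python
# Linear attention: softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V),
# reordered so cost is linear in sequence length N instead of quadratic.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (batch, heads, N, dim). Returns (batch, heads, N, dim)."""
    phi_q = F.elu(q) + 1          # positive feature map (one common choice)
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", phi_k, v)               # O(N d^2) summary
    z = torch.einsum("bhnd,bhd->bhn", phi_q, phi_k.sum(dim=2))   # normalizer
    return torch.einsum("bhnd,bhde->bhne", phi_q, kv) / (z.unsqueeze(-1) + eps)

q = k = v = torch.randn(1, 4, 256, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([1, 4, 256, 32])
```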
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
Frequency-shifted Interferometry Lidar System for Simultaneous Ranging and Velocimetry
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding…
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchma…
Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. Vision Transformers (ViTs) have advanced global modeling through self-attention …
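As background for the state space model mentioned above: an SSM processes a sequence with a linear recurrence whose per-step cost is constant, which is what makes long-range modeling cheap relative to self-attention. Below is a minimal discretized SSM scan; the dimensions and fixed (non-selective) parameters are illustrative assumptions.

```python
# Minimal discretized state space model scan: h_t = A h_{t-1} + B x_t,
# y_t = C h_t. Linear in sequence length, unlike quadratic self-attention.
import numpy as np

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray):
    """x: (N, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                     # one recurrence step per token
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)               # (N, d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 16))
A = 0.9 * np.eye(8)                   # stable fixed transition (illustrative)
B = rng.normal(size=(8, 16)) * 0.1
C = rng.normal(size=(4, 8))
print(ssm_scan(x, A, B, C).shape)     # (1024, 4)
```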
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Large language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers …
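For context on the QAT mechanism: during training, weights are "fake-quantized" in the forward pass while gradients flow through via a straight-through estimator, so the model learns to tolerate low-bit weights. A minimal sketch follows; the uniform min-max quantizer is a common baseline and an assumption here, not EfficientQAT's specific scheme.

```python
# Fake quantization with a straight-through estimator (STE): forward uses
# quantized weights, backward pretends quantization is the identity.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax    # uniform min-max scale
    zero = w.min()
    w_q = ((w - zero) / scale).round().clamp(0, qmax) * scale + zero
    return w + (w_q - w).detach()     # STE: value of w_q, gradient of w

w = torch.randn(64, 64, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()                        # gradients reach w despite round()
print(w.grad.shape)                    # torch.Size([64, 64])
```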
SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge
Large vision-language models (LVLMs), such as the LLaVA series, lack up-to-date knowledge because the resources required make frequent updating impractical, and they therefore fail in many cases. For example, if…
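The plug-and-play idea this abstract describes amounts to retrieval augmentation: fetch fresh web results for the query and prepend them to the model's prompt. A minimal sketch follows; `web_search` and `ask_lvlm` are hypothetical placeholders, not the framework's actual API.

```python
# Minimal retrieval-augmented prompting loop (illustrative).
def web_search(query: str, top_k: int = 3) -> list[str]:
    """Placeholder: return text snippets from a search engine."""
    raise NotImplementedError

def ask_lvlm(image_path: str, prompt: str) -> str:
    """Placeholder: query a large vision-language model."""
    raise NotImplementedError

def answer_with_fresh_knowledge(image_path: str, question: str) -> str:
    snippets = web_search(question)                 # fetch up-to-date context
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (f"Use the web snippets below (they may postdate your training "
              f"data) to answer.\nSnippets:\n{context}\n\nQuestion: {question}")
    return ask_lvlm(image_path, prompt)
```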
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimod…
Adapting LLaMA Decoder to Vision Transformer
This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with …
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models
This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopt…
TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features to distinguish different text descrip…
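As background for the mechanism: CLIP classifies by comparing the image encoder's global (class-token) embedding against text embeddings of candidate labels. The sketch below uses the open-source `transformers` CLIP interface to show this global-feature matching; it is a generic zero-shot baseline and the input file is hypothetical, not TagCLIP itself.

```python
# Generic CLIP zero-shot classification: the image's global embedding is
# matched against text embeddings of the candidate labels.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
image = Image.open("photo.jpg")                    # hypothetical input file

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image          # (1, num_labels)
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))        # label -> probability
```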
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text co…
B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions
Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attac…
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering. While their performance is impressive, the computational footprint due to their vast num…
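To give a flavor of what blockwise sparsity allocation means in practice, the sketch below applies magnitude pruning with a separate sparsity ratio per transformer block. Note the simplification: BESA learns the allocation, whereas the ratios here are hand-set purely for illustration.

```python
# Blockwise magnitude pruning sketch: each block gets its own sparsity ratio.
import torch

def prune_block(weights: dict[str, torch.Tensor], sparsity: float) -> None:
    """Zero out the smallest-magnitude entries of every matrix in one block."""
    for name, w in weights.items():
        k = int(w.numel() * sparsity)              # number of entries to drop
        if k == 0:
            continue
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).to(w.dtype))  # in-place mask

blocks = [{"attn": torch.randn(256, 256), "mlp": torch.randn(1024, 256)}
          for _ in range(4)]
ratios = [0.3, 0.5, 0.6, 0.4]                      # per-block allocation (assumed)
for block, r in zip(blocks, ratios):
    prune_block(block, r)
for i, block in enumerate(blocks):
    frac = float((block["mlp"] == 0).float().mean())
    print(f"block {i}: mlp sparsity = {frac:.2f}")
```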
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses cha…
T3M: Text Guided 3D Human Motion Synthesis from Speech
Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, l…