Can Qin
BLIP3o-NEXT: Next Frontier of Native Image Generation
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demon…
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation…
CoDA: Coding LM via Diffusion Adaptation
Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully o…
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the L…
Design and Adaptability Analysis of Integrated Pressurization–Gas Lifting Multifunctional Compressor for Enhanced Shale Gas Production Flexibility
Shale gas development has made significant contributions to the increase in natural gas production capacity in recent years, particularly in promoting the transformation of the energy structure and enhancing energy autonomy. However, with …
Two-Stage Rapid Expansion Optimization Method for Complex Natural Gas Pipeline Networks Integrating Congestion Identification and Multiple Expansion Modes
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
Video large language models (VLLMs) have significantly advanced recently in processing complex video content, yet their inference efficiency remains constrained because of the high computational cost stemming from the thousands of visual t…
A Data Augmentation Method and the Embedding Mechanism for Detection of Pulmonary Nodules on Small Samples
Lung Computed Tomography (CT) screening for pulmonary nodules provides an effective method for early diagnosis. The deep-learning-based computer-aided detection (CAD) system effectively identifies and precisely localizes suspicious pulmona…
Too much social media? Unveiling the effects of determinants in social media fatigue
Introduction: With the boom in social media, many people spend a lot of time on these platforms. Among them, some developed negative emotions, such as fatigue, depression, or disinterest in communicating, and used social media temporarily o…
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering
Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on b…
M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking
3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked object…
Multi-Period Optimal Configuration and Scheduling of Natural Gas Storage Facilities: A Holistic Approach to Ensure Pipeline Network Supply Stability and Economy
M3SOT: Multi-frame, Multi-field, Multi-space 3D Single Object Tracking
3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked object…
Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection
Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications,…
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when…
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations
Existing instance segmentation models learn task-specific information using manual mask annotations from base (training) categories. These mask annotations require tremendous human effort, limiting the scalability to annotate novel (new) c…
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. However, the tight coupling between the current text encoder and image decoder in T2I m…
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Incorporating human feedback has been shown to be crucial to align text generated by large language models to human preferences. We hypothesize that state-of-the-art instructional image editing models, where outputs are generated based on …
Image as Set of Points
What is an image and how to extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolutional operation in local region; Vision Transformers (ViTs…
Making Reconstruction-based Method Great Again for Video Anomaly Detection
Anomaly detection in videos is a significant yet challenging problem. Previous approaches based on deep neural networks employ either reconstruction-based or prediction-based approaches. Nevertheless, existing reconstruction-based methods …
Why is the State of Neural Network Pruning so Confusing? On the Fairness, Comparison Setup, and Trainability in Network Pruning
The state of neural network pruning has been noticed to be unclear and even confusing for a while, largely due to "a lack of standardized benchmarks and metrics" [3]. To standardize benchmarks, first, we need to answer: what kind of compar…
Unveiling the power of transfer learning towards efficient artificial intelligence
Large-scale models, abundant data, and dense computation are the pivotal pillars of deep neural networks. The present-day deep learning models have made significant strides in various areas such as Computer Vision (CV), Natural Language Pr…
A Close Look at Spatial Modeling: From Attention to Convolution
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesti…
Detection and ranging of small targets on water based on binocular camera and improved YOLOv5 algorithm
In order to meet the needs of intelligent ships to capture and grasp small targets while navigating on water and to be able to sense and avoid small targets, a water target detection method based on the YOLOv5-s algorithm is proposed, and …
MemREIN: Rein the Domain Shift for Cross-Domain Few-Shot Learning
Few-shot learning aims to enable models to generalize to new categories (query instances) with only limited labeled samples (support instances) from each category. The metric-based mechanism is a promising direction that compares feature embeddi…
Recent Advances on Neural Network Pruning at Initialization
Neural network pruning typically removes connections or neurons from a pretrained, converged model, whereas a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first…
Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework
Point cloud analysis is challenging due to irregularity and unordered data structure. To capture the 3D geometries, prior works mainly rely on exploring sophisticated local geometric extractors using convolution, graph, or attention mechan…
Self-directed online machine learning for topology optimization
Semi-Supervised Domain Adaptive Structure Learning
Semi-supervised domain adaptation (SSDA) is quite a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains. Unfortunately, a simple combination of domain…
SLA$^2$P: Self-supervised Anomaly Detection with Adversarial Perturbation
Anomaly detection is a fundamental yet challenging problem in machine learning due to the lack of label information. In this work, we propose a novel and powerful framework, dubbed as SLA$^2$P, for unsupervised anomaly detection. After ext…