Pengchuan Zhang
Pilot-scale Research on Advanced Treatment of Surface Water from Beijing-Hangzhou Grand Canal with Seasonably Variable Temperature Based on Hollow Fiber Nanofiltration
As the first pilot-scale study based on hollow fiber nanofiltration (HFNF) technology using natural surface water as feed streams in China, the research aims to explore and develop low-cost, high-efficiency operating technologie…
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assignin…
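The core idea, scoring every token of a response rather than assigning one scalar per sequence, can be sketched in a few lines. The module below is a hypothetical illustration, not the authors' implementation; the hidden size and the sigmoid head are assumptions.

    import torch
    import torch.nn as nn

    class TokenLevelRewardHead(nn.Module):
        """Scores every token's hidden state instead of pooling to one scalar."""
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.score = nn.Linear(hidden_dim, 1)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq_len, hidden_dim) from a frozen VLM backbone.
            # Returns a per-token reward in [0, 1] of shape (batch, seq_len);
            # low values would flag, e.g., individual hallucinated tokens.
            return torch.sigmoid(self.score(hidden_states)).squeeze(-1)

    head = TokenLevelRewardHead(hidden_dim=768)  # 768 is an assumed hidden size
    rewards = head(torch.randn(2, 16, 768))
    print(rewards.shape)                         # torch.Size([2, 16])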
Learning Video Context as Interleaved Multimodal Sequences
Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying characters, relationships, and reasons). In this paper, we introduce M…
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct…
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic…
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (gen…
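For context, the CLIPScore baseline mentioned above is conventionally computed as a scaled, clipped cosine similarity between CLIP's image and text embeddings. A minimal recipe using the public Hugging Face CLIP checkpoint follows; this is the metric being critiqued, not the paper's proposed alternative.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image: Image.Image, caption: str) -> float:
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # CLIPScore is commonly reported as 2.5 * max(cosine similarity, 0).
        return 2.5 * max(float((img * txt).sum()), 0.0)

    # e.g. clip_score(Image.open("photo.jpg"), "a dog on a beach")  # path is a placeholder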
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
This study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks achieved by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated pe…
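The prompting pattern under study can be illustrated with a toy example; the particular decomposition below is hypothetical, not taken from the paper.

    # A hypothetical Chain-of-Thought prompt for a vision-language query:
    # the task is split into localization, comparison, and answer steps.
    question = "Is the person on the left taller than the lamp?"
    cot_prompt = (
        f"Question: {question}\n"
        "Let's think step by step:\n"
        "1. Locate the person on the left side of the image.\n"
        "2. Locate the lamp and compare the two heights.\n"
        "3. State the final answer.\n"
        "Answer:"
    )
    print(cot_prompt)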
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks including …
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a sh…
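The adaptive-tokenization idea can be sketched as follows: cut the frame-feature sequence where semantics jump, then pool each segment into one token. The greedy thresholding below is a simplified stand-in for the dynamic-programming-based kernel temporal segmentation (KTS) the title refers to; the threshold value is an assumption.

    import numpy as np

    def adaptive_tokens(frame_feats: np.ndarray, threshold: float = 0.3) -> np.ndarray:
        # frame_feats: (num_frames, dim), assumed L2-normalized.
        sims = (frame_feats[:-1] * frame_feats[1:]).sum(axis=1)  # neighbor cosine
        boundaries = np.where(1.0 - sims > threshold)[0] + 1     # cut on large drops
        segments = np.split(frame_feats, boundaries)
        return np.stack([seg.mean(axis=0) for seg in segments])  # one token per segment

    feats = np.random.randn(120, 512)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    tokens = adaptive_tokens(feats)
    print(tokens.shape)  # (num_segments, 512): segment count varies with content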
UniVTG: Towards Unified Video-Language Temporal Grounding
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Mos…
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn t…
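"Fusion in the backbone" can be illustrated with a cross-attention block inserted inside a unimodal encoder layer, so tokens of one modality attend to the other. The dimensions and wiring below are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim: int = 768, heads: int = 12):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
            # x: tokens of one modality; context: tokens of the other modality.
            fused, _ = self.attn(query=self.norm(x), key=context, value=context)
            return x + fused  # residual; could be gated off for unimodal inference

    video_tokens = torch.randn(2, 196, 768)
    text_tokens = torch.randn(2, 32, 768)
    video_tokens = CrossModalFusion()(video_tokens, text_tokens)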
Parameter-Efficient Model Adaptation for Vision Transformers
In computer vision, strong transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model param…
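One widely used parameter-efficient alternative is a LoRA-style low-rank update to a frozen linear layer, sketched below. This is a generic example of the family of methods the paper studies, not necessarily the variant it proposes.

    import torch
    import torch.nn as nn

    class LowRankAdapter(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False        # the pretrained weight stays frozen
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)     # adapter starts as a no-op

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.up(self.down(x))

    adapted = LowRankAdapter(nn.Linear(768, 768), rank=8)
    out = adapted(torch.randn(4, 768))  # only the rank-8 factors are trainable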
Revisiting the Role of Language Priors in Vision-Language Models
Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study generative VLMs that are trained for next-word …
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlig…
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to…
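A common recipe for distilling such a model is to match the student's image-text similarity distribution to the teacher's. The loss below is a generic sketch of that idea, not necessarily DIME-FM's exact objective or data setup; the temperature is an assumption.

    import torch
    import torch.nn.functional as F

    def similarity_distill_loss(s_img, s_txt, t_img, t_txt, tau: float = 0.07):
        # s_*: student features, t_*: teacher features; each (B, D).
        s_logits = F.normalize(s_img, dim=-1) @ F.normalize(s_txt, dim=-1).T / tau
        t_logits = F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).T / tau
        # KL between teacher and student row-wise similarity distributions.
        return F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1), reduction="batchmean")

    loss = similarity_distill_loss(torch.randn(8, 256), torch.randn(8, 256),
                                   torch.randn(8, 512), torch.randn(8, 512))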
Unifying Tracking and Image-Video Object Detection
Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights by learning-based, data-driven approaches. On the other han…
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and …
GLIPv2: Unifying Localization and Vision-Language Understanding
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies …
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Combining multiple datasets can boost performance on many computer vision tasks, but a similar trend has not been observed in object detection, due to two inconsistencies among detection datasets: taxonom…
Grounded Language-Image Pre-training
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unificati…
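The key reformulation is that detection becomes phrase grounding: category names are concatenated into a text prompt, and each candidate region is scored against the prompt's word features. A minimal sketch with made-up shapes, showing the idea rather than GLIP's full architecture:

    import torch

    region_feats = torch.randn(100, 256)   # features of 100 candidate boxes
    word_feats = torch.randn(7, 256)       # tokens of a prompt like "person. bicycle. car."
    alignment_logits = region_feats @ word_feats.T  # (100, 7) region-word scores

    # Each region is classified by the prompt words it aligns with, so new
    # categories require only editing the text prompt, not retraining a head.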
K-LITE: Learning Transferable Visual Models with External Knowledge
The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability…
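The knowledge-augmentation idea amounts to enriching a bare class name with external text before it reaches the text encoder. In the sketch below, the definitions dict is a stand-in for a WordNet/Wiktionary lookup; the prompt template is an assumption.

    # Enrich a class-name prompt with an external-knowledge gloss.
    definitions = {
        "ptarmigan": "a grouse of cold regions whose plumage turns white in winter",
    }

    def knowledge_prompt(label: str) -> str:
        gloss = definitions.get(label)
        base = f"a photo of a {label}"
        return f"{base}, {gloss}" if gloss else base

    print(knowledge_prompt("ptarmigan"))
    # a photo of a ptarmigan, a grouse of cold regions whose plumage turns white in winter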
Missingness Bias in Model Debugging
Missingness, or the absence of features from an input, is a concept fundamental to many model debugging tools. However, in computer vision, pixels cannot simply be removed from an image. One thus tends to resort to heuristics such as black…
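The paper's observation in miniature: with a vision transformer, a "missing" patch can be dropped from the token sequence outright, instead of being painted black or gray, which itself shifts the input distribution. The shapes below are illustrative.

    import torch

    tokens = torch.randn(1, 196, 768)   # 14x14 patch tokens of one image
    removed = {0, 1, 14, 15}            # patches to treat as missing
    keep = torch.tensor([i for i in range(196) if i not in removed])
    ablated = tokens[:, keep, :]        # truly remove the patches as tokens
    print(ablated.shape)                # torch.Size([1, 192, 768])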
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datas…
Unified Contrastive Learning in Image-Text-Label Space
Visual recognition has recently been learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with web-crawled image-text pairs. While supervised learning may result in a more discrimina…
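The unifying idea can be sketched as a multi-positive contrastive loss: two image-text pairs that share a label count as positives for each other, while pure image-text pairs keep unique labels. This is a generic approximation of the objective, not a verbatim reproduction.

    import torch
    import torch.nn.functional as F

    def unified_contrastive(img, txt, labels, tau: float = 0.07):
        # img, txt: (B, D) L2-normalized features; labels: (B,) class ids.
        logits = img @ txt.T / tau                           # (B, B) similarities
        pos = (labels[:, None] == labels[None, :]).float()   # label-induced positives
        targets = pos / pos.sum(dim=1, keepdim=True)         # normalize per row
        loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss_t2i = -(targets * F.log_softmax(logits.T, dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_i2t + loss_t2i)

    img = F.normalize(torch.randn(8, 256), dim=-1)
    txt = F.normalize(torch.randn(8, 256), dim=-1)
    loss = unified_contrastive(img, txt, torch.randint(0, 4, (8,)))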
RegionCLIP: Region-based Language-Image Pretraining
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize…
Florence: A New Foundation Model for Computer Vision
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on …
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, th…
Image Scene Graph Generation (SGG) Benchmark
There is a surge of interest in image scene graph generation (object, attribute, and relationship detection) due to the need to build fine-grained image understanding models that go beyond object detection. Due to the lack of a good benc…