Pengchuan Zhang
Pilot-scale Research on Advanced Treatment of Surface Water from Beijing-Hangzhou Grand Canal with Seasonably Variable Temperature Based on Hollow Fiber Nanofiltration
As the first pilot-scale study based on hollow fiber nanofiltration (HFNF) technology using natural surface water as feed streams in China, the research aims to explore and develop low-cost, high-efficiency operating technologie…
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assignin…
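The core idea, scoring every token of a response rather than assigning one scalar per sequence, can be sketched in a few lines. The module below is a hypothetical illustration, not the authors' implementation; the hidden size and the sigmoid head are assumptions.

    import torch
    import torch.nn as nn

    class TokenLevelRewardHead(nn.Module):
        """Scores every token's hidden state instead of pooling to one scalar."""
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.score = nn.Linear(hidden_dim, 1)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq_len, hidden_dim) from a frozen VLM backbone.
            # Returns a per-token reward in [0, 1] of shape (batch, seq_len);
            # low values would flag, e.g., individual hallucinated tokens.
            return torch.sigmoid(self.score(hidden_states)).squeeze(-1)

    head = TokenLevelRewardHead(hidden_dim=768)  # 768 is an assumed hidden size
    rewards = head(torch.randn(2, 16, 768))
    print(rewards.shape)                         # torch.Size([2, 16])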
Learning Video Context as Interleaved Multimodal Sequences
Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying characters, relationships, and reasons). In this paper, we introduce M…
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct…
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic…
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (gen…
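For context, the CLIPScore baseline mentioned above is conventionally computed as a scaled, clipped cosine similarity between CLIP's image and text embeddings. A minimal recipe using the public Hugging Face CLIP checkpoint follows; this is the metric being critiqued, not the paper's proposed alternative.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image: Image.Image, caption: str) -> float:
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # CLIPScore is commonly reported as 2.5 * max(cosine similarity, 0).
        return 2.5 * max(float((img * txt).sum()), 0.0)

    # e.g. clip_score(Image.open("photo.jpg"), "a dog on a beach")  # path is a placeholder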
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
This study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks achieved by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated pe…
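The prompting pattern under study can be illustrated with a toy example; the particular decomposition below is hypothetical, not taken from the paper.

    # A hypothetical Chain-of-Thought prompt for a vision-language query:
    # the task is split into localization, comparison, and answer steps.
    question = "Is the person on the left taller than the lamp?"
    cot_prompt = (
        f"Question: {question}\n"
        "Let's think step by step:\n"
        "1. Locate the person on the left side of the image.\n"
        "2. Locate the lamp and compare the two heights.\n"
        "3. State the final answer.\n"
        "Answer:"
    )
    print(cot_prompt)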
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks including …
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a sh…
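The adaptive-tokenization idea can be sketched as follows: cut the frame-feature sequence where semantics jump, then pool each segment into one token. The greedy thresholding below is a simplified stand-in for the dynamic-programming-based kernel temporal segmentation (KTS) the title refers to; the threshold value is an assumption.

    import numpy as np

    def adaptive_tokens(frame_feats: np.ndarray, threshold: float = 0.3) -> np.ndarray:
        # frame_feats: (num_frames, dim), assumed L2-normalized.
        sims = (frame_feats[:-1] * frame_feats[1:]).sum(axis=1)  # neighbor cosine
        boundaries = np.where(1.0 - sims > threshold)[0] + 1     # cut on large drops
        segments = np.split(frame_feats, boundaries)
        return np.stack([seg.mean(axis=0) for seg in segments])  # one token per segment

    feats = np.random.randn(120, 512)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    tokens = adaptive_tokens(feats)
    print(tokens.shape)  # (num_segments, 512): segment count varies with content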
UniVTG: Towards Unified Video-Language Temporal Grounding
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Mos…
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn t…
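"Fusion in the backbone" can be illustrated with a cross-attention block inserted inside a unimodal encoder layer, so tokens of one modality attend to the other. The dimensions and wiring below are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim: int = 768, heads: int = 12):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
            # x: tokens of one modality; context: tokens of the other modality.
            fused, _ = self.attn(query=self.norm(x), key=context, value=context)
            return x + fused  # residual; could be gated off for unimodal inference

    video_tokens = torch.randn(2, 196, 768)
    text_tokens = torch.randn(2, 32, 768)
    video_tokens = CrossModalFusion()(video_tokens, text_tokens)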
Parameter-Efficient Model Adaptation for Vision Transformers
In computer vision, strong transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model param…
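One widely used parameter-efficient alternative is a LoRA-style low-rank update to a frozen linear layer, sketched below. This is a generic example of the family of methods the paper studies, not necessarily the variant it proposes.

    import torch
    import torch.nn as nn

    class LowRankAdapter(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False        # the pretrained weight stays frozen
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)     # adapter starts as a no-op

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.up(self.down(x))

    adapted = LowRankAdapter(nn.Linear(768, 768), rank=8)
    out = adapted(torch.randn(4, 768))  # only the rank-8 factors are trainable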
Revisiting the Role of Language Priors in Vision-Language Models
Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study generative VLMs that are trained for next-word …
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlig…
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to…
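A common recipe for distilling such a model is to match the student's image-text similarity distribution to the teacher's. The loss below is a generic sketch of that idea, not necessarily DIME-FM's exact objective or data setup; the temperature is an assumption.

    import torch
    import torch.nn.functional as F

    def similarity_distill_loss(s_img, s_txt, t_img, t_txt, tau: float = 0.07):
        # s_*: student features, t_*: teacher features; each (B, D).
        s_logits = F.normalize(s_img, dim=-1) @ F.normalize(s_txt, dim=-1).T / tau
        t_logits = F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).T / tau
        # KL between teacher and student row-wise similarity distributions.
        return F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1), reduction="batchmean")

    loss = similarity_distill_loss(torch.randn(8, 256), torch.randn(8, 256),
                                   torch.randn(8, 512), torch.randn(8, 512))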
Unifying Tracking and Image-Video Object Detection
Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights by learning-based, data-driven approaches. On the other han…
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and …
GLIPv2: Unifying Localization and Vision-Language Understanding
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies …
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Combining multiple datasets can boost performance on many computer vision tasks, but a similar trend has not been observed in object detection, due to two inconsistencies among detection datasets: taxonom…
Grounded Language-Image Pre-training
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unificati…
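The key reformulation is that detection becomes phrase grounding: category names are concatenated into a text prompt, and each candidate region is scored against the prompt's word features. A minimal sketch with made-up shapes, showing the idea rather than GLIP's full architecture:

    import torch

    region_feats = torch.randn(100, 256)   # features of 100 candidate boxes
    word_feats = torch.randn(7, 256)       # tokens of a prompt like "person. bicycle. car."
    alignment_logits = region_feats @ word_feats.T  # (100, 7) region-word scores

    # Each region is classified by the prompt words it aligns with, so new
    # categories require only editing the text prompt, not retraining a head.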
K-LITE: Learning Transferable Visual Models with External Knowledge
The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability…
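The knowledge-augmentation idea amounts to enriching a bare class name with external text before it reaches the text encoder. In the sketch below, the definitions dict is a stand-in for a WordNet/Wiktionary lookup; the prompt template is an assumption.

    # Enrich a class-name prompt with an external-knowledge gloss.
    definitions = {
        "ptarmigan": "a grouse of cold regions whose plumage turns white in winter",
    }

    def knowledge_prompt(label: str) -> str:
        gloss = definitions.get(label)
        base = f"a photo of a {label}"
        return f"{base}, {gloss}" if gloss else base

    print(knowledge_prompt("ptarmigan"))
    # a photo of a ptarmigan, a grouse of cold regions whose plumage turns white in winter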
Missingness Bias in Model Debugging
Missingness, or the absence of features from an input, is a concept fundamental to many model debugging tools. However, in computer vision, pixels cannot simply be removed from an image. One thus tends to resort to heuristics such as black…
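The paper's observation in miniature: with a vision transformer, a "missing" patch can be dropped from the token sequence outright, instead of being painted black or gray, which itself shifts the input distribution. The shapes below are illustrative.

    import torch

    tokens = torch.randn(1, 196, 768)   # 14x14 patch tokens of one image
    removed = {0, 1, 14, 15}            # patches to treat as missing
    keep = torch.tensor([i for i in range(196) if i not in removed])
    ablated = tokens[:, keep, :]        # truly remove the patches as tokens
    print(ablated.shape)                # torch.Size([1, 192, 768])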
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datas…
Unified Contrastive Learning in Image-Text-Label Space
Visual recognition has recently been learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with web-crawled image-text pairs. While supervised learning may result in a more discrimina…
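The unifying idea can be sketched as a multi-positive contrastive loss: two image-text pairs that share a label count as positives for each other, while pure image-text pairs keep unique labels. This is a generic approximation of the objective, not a verbatim reproduction.

    import torch
    import torch.nn.functional as F

    def unified_contrastive(img, txt, labels, tau: float = 0.07):
        # img, txt: (B, D) L2-normalized features; labels: (B,) class ids.
        logits = img @ txt.T / tau                           # (B, B) similarities
        pos = (labels[:, None] == labels[None, :]).float()   # label-induced positives
        targets = pos / pos.sum(dim=1, keepdim=True)         # normalize per row
        loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss_t2i = -(targets * F.log_softmax(logits.T, dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_i2t + loss_t2i)

    img = F.normalize(torch.randn(8, 256), dim=-1)
    txt = F.normalize(torch.randn(8, 256), dim=-1)
    loss = unified_contrastive(img, txt, torch.randint(0, 4, (8,)))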
RegionCLIP: Region-based Language-Image Pretraining
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize…
Florence: A New Foundation Model for Computer Vision
Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on …
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, th…
Image Scene Graph Generation (SGG) Benchmark
There is a surge of interest in image scene graph generation (object, attribute, and relationship detection) due to the need to build fine-grained image understanding models that go beyond object detection. Due to the lack of a good benc…