Explanipedia

HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Qiong Yang, Sicong Yao, Weixuan Chen, Shenghao Fu, Detao Bai, et al. · 2025

With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinfo…

A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection

Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, et al. · 2025

Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, e.g., CLIP, to inherit its generalizable recognition ability so that d…

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, et al. · 2025

Computer science Philosophy

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating image-level detailed cap…

HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding

Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Sicong Yao, et al. · 2025

Computer science Philosophy

In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the ab…

Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models

Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, et al. · 2024

Computer science Geography

Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application on object detection is largely overlooked, especially without fine-tuning them. In this work, …

DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

Junkai Yan, Yipeng Gao, Qize Yang, Xihan Wei, Xuansong Xie, et al. · 2024

Computer science

Text-to-3D generation, which synthesizes 3D assets according to an overall text description, has significantly progressed. However, a challenge arises when the specific appearances need customizing at designated viewpoints but referring so…

Xihan Wei