Xihan Wei
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinfo…
A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, e.g., CLIP, to inherit its generalizable recognition ability so that d…
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating image-level detailed cap…
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the ab…
Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models
Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application on object detection is largely overlooked, especially without fine-tuning them. In this work, …
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation
Text-to-3D generation, which synthesizes 3D assets according to an overall text description, has significantly progressed. However, a challenge arises when the specific appearances need customizing at designated viewpoints but referring so…