Hui Xiong
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities li…
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus beco…
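The abstract above mentions sparse frame sampling as a way to fit long videos into a limited input context. As a generic illustration only (this is not the VSI paper's method; the function name and frame budget are assumptions), uniform sparse sampling can be sketched as:

```python
# Minimal sketch of uniform sparse frame sampling for long-video input.
# Hypothetical helper; the real VSI method selects keyframes, not just
# uniformly spaced frames.

def sample_frame_indices(num_frames: int, budget: int) -> list[int]:
    """Pick at most `budget` frame indices from a video of `num_frames`
    frames, evenly spaced and in temporal order."""
    if num_frames <= budget:
        return list(range(num_frames))
    step = num_frames / budget  # fractional stride between picks
    return [int(i * step) for i in range(budget)]

# A 100-frame clip reduced to an 8-frame budget:
print(sample_frame_indices(100, 8))  # → [0, 12, 25, 37, 50, 62, 75, 87]
```

The point of such sampling is that MLLM cost grows with the number of frames fed in, so the budget, not the video length, bounds compute.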
Filter-And-Refine: A MLLM Based Cascade System for Industrial-Scale Video Content Moderation
Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with c…
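The title describes a filter-and-refine cascade. In general terms (a sketch of the generic cascade pattern, not the paper's actual system; the scorer names and thresholds are assumptions), the idea is to let a cheap first-stage model discard easy negatives so an expensive model only sees the remainder:

```python
# Generic two-stage filter-and-refine cascade. Scorers and thresholds
# are placeholders, not the paper's components.

def cascade(items, cheap_score, expensive_score, filter_threshold=0.2):
    """Return items flagged by the expensive model, after a cheap
    high-recall filter drops clearly benign items early."""
    flagged = []
    for item in items:
        if cheap_score(item) < filter_threshold:
            continue  # clearly benign: skip the costly stage
        if expensive_score(item) >= 0.5:  # high-precision second stage
            flagged.append(item)
    return flagged

# Toy scorers: risk proportional to the item value.
risk = lambda x: x / 10
print(cascade([1, 2, 3, 6, 9], risk, risk))  # → [6, 9]
```

The cascade trades a small recall loss at the filter stage for a large reduction in expensive-model invocations, which is what makes it viable at industrial scale.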
Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains…
Deep Generative Architectures for Automated Music Composition: Optimizing Neural Structures and Multimodal Inputs for Style-Conscious Melody and Harmony Generation
This study explores the application of deep generative models in the field of intelligent composition, focusing on the impact of network architecture optimization and multimodal input integration on music style fidelity and emotional expre…
ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research
Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than exp…
On the Transferability and Discriminability of Representation Learning in Unsupervised Domain Adaptation
In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adve…
LLMs as Better Recommenders with Natural Language Collaborative Signals: A Self-Assessing Retrieval Approach
Incorporating collaborative information (CI) effectively is crucial for leveraging LLMs in recommendation tasks. Existing approaches often encode CI using soft tokens or abstract identifiers, which introduces a semantic misalignment with t…
GCAL: Adapting Graph Models to Evolving Domain Shifts
This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling …
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
Large language models (LLMs) excel at complex tasks thanks to advances in their reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and efficiency, often encouraging unnecessarily long reas…
From Events to Enhancement: A Survey on Event-Based Imaging Technologies
Event cameras, offering high dynamic range and low latency, have emerged as disruptive technologies in imaging. Despite growing research on leveraging these benefits for different imaging tasks, a comprehensive study of recent advances and…
Unleashing the Power of Large Language Model for Denoising Recommendation
Recommender systems are crucial for personalizing user experiences but often depend on implicit feedback data, which can be noisy and misleading. Existing denoising studies involve incorporating auxiliary information or learning strategies…
A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook
Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artifi…
TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning
Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspect…
Cognitive Disentanglement for Referring Multi-Object Tracking
As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language…
From Understanding to Excelling: Template-Free Algorithm Design through Structural-Functional Co-Evolution
Large language models (LLMs) have greatly accelerated the automation of algorithm generation and optimization. However, current methods such as EoH and FunSearch mainly rely on predefined templates and expert-specified functions that focus…
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic …