Gaowen Liu
GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that system…
Physics-informed machine learning-based real-time long-horizon temperature fields prediction in metallic additive manufacturing
Real-time long-horizon temperature prediction in wire arc additive manufacturing is critical for process control and quality assurance. However, finite element methods are computationally expensive, and the existing data-driven models suff…
Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos
We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various moti…
Efficient Multimodal Dataset Distillation via Generative Models
Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the im…
A Content-dependent Watermark for Safeguarding Image Attribution
The rapid growth of digital and AI-generated images has amplified the need for secure and verifiable methods of image attribution. While digital watermarking offers more robust protection than metadata-based approaches--which can be easily…
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limite…
How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench
Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $…
A multi-model management approach for power system transient stability assessment based on multi-moment feature clustering
Collision- and Reachability-Aware Multi-Robot Control with Grounded LLM Planners
Large language models (LLMs) have demonstrated strong performance in various robot control tasks. However, their deployment in real-world applications remains constrained. Even state-of-the-art LLMs, such as GPT-o4mini, frequently produce i…
Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel v…
VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization
We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic …
UniMuMo: Unified Text, Music, and Motion Generation
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired…
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
Conditional diffusion models have gained increasing attention since their impressive results for cross-modal synthesis, where the strong alignment between conditioning input and generated output can be achieved by training a time-condition…
Towards Vector Optimization on Low-Dimensional Vector Symbolic Architecture
Vector Symbolic Architecture (VSA) is emerging in machine learning due to its efficiency, but it is hindered by issues of hyperdimensionality and accuracy. As a promising mitigation, the Low-Dimensional Computing (LDC) method significan…
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning
Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules and precedents while balancing deductive and anal…
Discrete Element-Based Design of a High-Speed Rotary Tiller for Saline-Alkali Land and Verification of Optimal Tillage Parameters
Aiming at the saline soil in Binhai New Area, which is solid and sclerotic, and addressing the problem of poor quality and low efficiency of traditional rotary tillage, this research designed a high-speed rotary tiller that can realize the…
Deep learning-based novel aluminum furniture design style recognition and key technology research
This study explores a pioneering research effort focusing on the use of deep learning techniques to achieve high-precision automatic recognition of aluminum furniture design styles, and proposes an innovative convolutional neural network (…
Gaze-Based Map Interaction Method Driven by Generative Large Models
How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on tau-bench
Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images cont…
A kind of efficient drilling holes technology modularistically for aircraft beam products
UniMuMo: Unified Text, Music and Motion Generation
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired…
Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check
In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality…
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results, …
Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
This paper investigates how to efficiently deploy vision transformers on edge devices for small workloads. Recent methods reduce the latency of transformer neural networks by removing or merging tokens, with small accuracy degradation. How…
Riemannian Multinomial Logistics Regression for SPD Neural Networks
Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference
As Large Language Models (LLMs) demonstrate extensive capability in learning from documents, LLM unlearning becomes an increasingly important research area to address concerns of LLMs in terms of privacy, copyright, etc. A conventional LLM…
Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enabl…