Aston Zhang
OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents
Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks in a fully autonomo…
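The "adaptive interaction" in the title suggests a simple mental model: gate each step on how confident the agent is, and hand control back to the user below a threshold instead of over-executing. A minimal sketch of that gating idea; the Action fields, score, and threshold are hypothetical illustrations, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "scroll"
    target: str        # description of the UI element to act on
    confidence: float  # hypothetical step-level score in [0, 1]

def execute_or_ask(action: Action, threshold: float = 0.8) -> str:
    """Gate autonomous execution on a per-step confidence score.

    Below the (hypothetical) threshold, defer to the user rather than
    execute an uncertain step fully autonomously.
    """
    if action.confidence >= threshold:
        return f"EXECUTE {action.kind} on {action.target}"
    return f"ASK USER before {action.kind} on {action.target}"

print(execute_or_ask(Action("tap", "Pay Now button", 0.55)))
```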
A Systematic Examination of Preference Learning through the Lens of Instruction-Following
Preference learning is a widely adopted post-training technique that aligns large language models (LLMs) to human preferences and improves specific downstream task capabilities. In this work we systematically investigate how specific attri…
Law of the Weakest Link: Cross Capabilities of Large Language Models
The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for …
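Taken literally, a weakest-link effect means performance on a task mixing several capabilities tracks the minimum of the individual capability scores rather than their average. A toy illustration with made-up numbers (not results from the paper):

```python
# Hypothetical per-capability scores; not data from the paper.
scores = {"reasoning": 62.0, "coding": 71.0, "image_recognition": 48.0}

# Weakest-link reading: cross-capability performance is bounded by the
# weakest individual capability, not pulled up by the strong ones.
weakest = min(scores.values())
average = sum(scores.values()) / len(scores)
print(f"weakest link: {weakest:.1f} vs average: {average:.1f}")
# A task mixing all three would be expected to land near 48, not 60.
```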
Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by enviro…
In-Context Learning with Iterative Demonstration Selection
Spurred by advancements in scale, large language models (LLMs) have demonstrated strong few-shot learning ability via in-context learning (ICL). However, the performance of ICL has been shown to be highly sensitive to the selection of few-…
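A common baseline behind such selection methods is to retrieve the training examples closest to the test input in an embedding space; the paper's iterative procedure refines the choice further, but the retrieval core looks roughly like this sketch (embeddings and texts are placeholders):

```python
import numpy as np

def select_demonstrations(test_vec: np.ndarray,
                          pool_vecs: np.ndarray,
                          pool_texts: list[str],
                          k: int = 4) -> list[str]:
    """Pick the k pool examples most cosine-similar to the test input.

    A generic similarity-based baseline; iterative selection methods
    repeat and refine choices like this rather than retrieving once.
    """
    pool_norm = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    test_norm = test_vec / np.linalg.norm(test_vec)
    sims = pool_norm @ test_norm          # cosine similarity to each example
    top = np.argsort(-sims)[:k]
    return [pool_texts[i] for i in top]
```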
You Only Look at Screens: Multimodal Chain-of-Action Agents
Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LL…
Automated Few-shot Classification with Instruction-Finetuned Language Models
A particularly successful class of approaches for few-shot learning combines language models with prompts -- hand-crafted task descriptions that complement data samples. However, designing prompts by hand for each task commonly requires do…
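For context, the hand-crafted prompts in question are just task descriptions wrapped around labeled examples; a manual baseline might be assembled as below (template wording and labels are illustrative, not the paper's automatically generated prompts):

```python
def build_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Wrap labeled examples and a query in a hand-written task description.

    Automating exactly this hand-crafting step is the paper's goal; the
    template here is the kind of manual prompt it aims to replace.
    """
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

print(build_prompt([("Great battery life.", "positive"),
                    ("Screen cracked in a week.", "negative")],
                   "Fast shipping and works as advertised."))
```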
Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
Transformers are central in modern natural language processing and computer vision applications. Despite recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length), dealing with ultra long seq…
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-t…
AIM: Adapting Image Models for Efficient Video Action Recognition
Recent vision transformer based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model could be computationall…
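Parameter-efficient methods in this family freeze the pre-trained image backbone and train only small inserted modules; a generic bottleneck adapter (illustrative, not AIM's exact design) looks like this in PyTorch:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    residual add. Only these parameters train; the backbone stays frozen."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

x = torch.randn(2, 197, 768)  # (batch, tokens, dim) from a frozen ViT block
print(Adapter(768)(x).shape)  # torch.Size([2, 197, 768])
```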
Learning Multimodal Data Augmentation in Feature Space
The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, th…
SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning
Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effecti…
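Plain prompt tuning, the core that SPT builds on, prepends a handful of trainable embeddings to the frozen model's input embeddings. A minimal sketch of that shared mechanism; SPT's semi-parametric additions are not shown:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to frozen input embeddings."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

embeds = torch.randn(4, 32, 768)          # embeddings from a frozen model
print(SoftPrompt(20, 768)(embeds).shape)  # torch.Size([4, 52, 768])
```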
PHNNs: Lightweight Neural Networks via Parameterized Hypercomplex Convolutions
Hypercomplex neural networks have proven to reduce the overall number of parameters while ensuring valuable performance by leveraging the properties of Clifford algebras. Recently, hypercomplex linear layers have been further improved by i…
Automatic Chain of Thought Prompting in Large Language Models
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. On…
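The two paradigms can be made concrete in a few lines: one appends a trigger phrase such as "Let's think step by step", the other prepends demonstrations that include reasoning chains. Auto-CoT's contribution is generating those demonstrations automatically; the sketch below only shows the two prompt shapes, with the model call left out:

```python
def zero_shot_cot(question: str) -> str:
    """Paradigm 1: append a trigger phrase and let the model reason."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot(demos: list[tuple[str, str]], question: str) -> str:
    """Paradigm 2: prepend demonstrations that contain reasoning chains.

    Auto-CoT generates such demos with the zero-shot trigger instead of
    writing them by hand; `demos` would hold (question, generated chain)
    pairs sampled for diversity.
    """
    parts = [f"Q: {q}\nA: {chain}" for q, chain in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```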
Partial and Asymmetric Contrastive Learning for Out-of-Distribution Detection in Long-Tailed Recognition
Existing out-of-distribution (OOD) detection methods are typically benchmarked on training sets with balanced class distributions. However, in real-world applications, it is common for the training sets to have long-tailed distributions. I…
Removing Batch Normalization Boosts Adversarial Training
Adversarial training (AT) defends deep neural networks against adversarial attacks. One challenge that limits its practical application is the performance degradation on clean samples. A major bottleneck identified by previous works is the…
Lightweight Convolutional Neural Networks By Hypercomplex Parameterization
Hypercomplex neural networks have proven to reduce the overall number of parameters while ensuring valuable performance by leveraging the properties of Clifford algebras. Recently, hypercomplex linear layers have been further improved by …
Dive into Deep Learning
This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition, figures, math, an…
Controllable and Diverse Text Generation in E-commerce
In E-commerce, a key challenge in text generation is to find a good trade-off between word diversity and accuracy (relevance) in order to make generated text appear more natural and human-like. In order to improve the relevance of generate…
Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with $1/n$ Parameters
Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with Quaternions" (4D hypercomplex numbers), which replace real-valued matrix multiplications in full…
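Concretely, the parameterization replaces a dense weight $W \in \mathbb{R}^{d \times k}$ with a sum of Kronecker products, $W = \sum_{i=1}^{n} A_i \otimes S_i$, so parameters drop from $dk$ to roughly $dk/n$. A NumPy sketch with illustrative shapes:

```python
import numpy as np

def phm_weight(A: np.ndarray, S: np.ndarray) -> np.ndarray:
    """Build W = sum_i kron(A_i, S_i) from n pairs of small factors.

    A: (n, n, n) "rule" matrices; S: (n, d//n, k//n) weight blocks.
    Parameter count is n^3 + d*k/n versus d*k for a dense layer,
    about a 1/n reduction when d*k dominates.
    """
    return sum(np.kron(A[i], S[i]) for i in range(A.shape[0]))

n, d, k = 4, 8, 12                 # n = 4 mirrors the quaternion case
A = np.random.randn(n, n, n)
S = np.random.randn(n, d // n, k // n)
print(phm_weight(A, S).shape)      # (8, 12): a full d x k weight matrix
```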
On Orthogonality Constraints for Transformers
Orthogonality constraints encourage matrices to be orthogonal for numerical stability. These plug-and-play constraints, which can be conveniently incorporated into model training, have been studied for popular architectures in natural lang…
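One standard soft version of such a constraint adds a penalty pushing $W^\top W$ toward the identity; a sketch of that regularizer (the exact constraints studied in the paper may differ):

```python
import torch

def orthogonality_penalty(W: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality: ||W^T W - I||_F^2, zero iff the columns of W
    are orthonormal. Added to the task loss with a small coefficient."""
    gram = W.T @ W
    eye = torch.eye(gram.size(0), device=W.device, dtype=W.dtype)
    return ((gram - eye) ** 2).sum()

W = torch.randn(512, 64, requires_grad=True)
loss = orthogonality_penalty(W)
loss.backward()  # gradients nudge W toward orthonormal columns
```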
Learning User Representations with Hypercuboids for Recommender Systems
Modeling user interests is crucial in real-world recommender systems. In this paper, we present a new user interest representation model for personalized recommendation. Specifically, the key novelty behind our model is that it explicitly …
ControlVAE: Tuning, Analytical Properties, and Performance Analysis
This paper reviews the novel concept of controllable variational autoencoder (ControlVAE), discusses its parameter tuning to meet application needs, derives its key analytic properties, and offers useful extensions and applications. Contro…
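The controllable part hinges on treating the KL term's weight as the output of a feedback controller that tracks a target KL value. A simplified proportional-integral sketch in that spirit; the paper's controller form, gains, and clamping differ:

```python
class PIBetaController:
    """Tune the KL weight beta each step so the observed KL tracks a
    set point: beta rises when KL overshoots the target and falls when
    KL undershoots it. Gains here are illustrative, not the paper's."""

    def __init__(self, kl_target: float, kp: float = 0.01, ki: float = 0.0001):
        self.kl_target, self.kp, self.ki = kl_target, kp, ki
        self.integral = 0.0

    def step(self, kl_observed: float) -> float:
        error = self.kl_target - kl_observed   # negative when KL too large
        self.integral += error
        beta = 1.0 - self.kp * error - self.ki * self.integral
        return min(max(beta, 0.0), 1.0)        # keep beta in [0, 1]

ctrl = PIBetaController(kl_target=3.0)
for kl in (6.0, 4.5, 3.4, 3.1):                # hypothetical KL readings
    print(round(ctrl.step(kl), 4))
```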
Text Style Transfer: A Review and Experimental Evaluation
The stylistic properties of text have intrigued computational linguistics researchers in recent years. Specifically, researchers have investigated the Text Style Transfer (TST) task, which aims to change the stylistic properties of the tex…