Suha Kwak
Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection
Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representa…
GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning
Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies revealed that such fine-tuned VLMs are vulnerab…
GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation
Improving robustness of the Segment Anything Model (SAM) to input degradations is critical for its deployment in high-stakes applications such as autonomous driving and robotics. Our approach to this challenge prioritizes three key aspects…
TestDG: Test-time Domain Generalization for Continual Test-time Adaptation
This paper studies continual test-time adaptation (CTTA), the task of adapting a model to constantly changing unseen domains in testing while preserving previously learned knowledge. Existing CTTA methods mostly focus on adaptation to the …
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and t…
Enhancing Cost Efficiency in Active Learning with Candidate Set Query
This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our me…
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challeng…
Improving Text-based Person Search via Part-level Cross-modal Correspondence
Text-based person search is the task of finding person images that are the most relevant to the natural language text description given as query. The main challenge of this task is a large gap between the target images and text queries, wh…
ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation
Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated…
Bootstrapping Top-down Information for Self-modulating Slot Attention
Object-centric learning (OCL) aims to learn representations of individual objects within visual scenes without manual supervision, facilitating efficient and effective visual reasoning. Traditional OCL methods primarily employ bottom-up ap…
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing m…
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which red…
Online Temporal Action Localization with Memory-Augmented Transformer
Online temporal action localization (On-TAL) is the task of identifying multiple action instances given a streaming video. Since existing methods take as input only a video segment of fixed size per iteration, they are limited in consideri…
Classification Matters: Improving Video Action Detection with Class-Specific Attention
Video action detection (VAD) aims to detect actors and classify their actions in a video. We find that VAD suffers more from classification than from localization of actors. Hence, we analyze how prevailing methods form features for cl…
FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions
Robust semantic segmentation under adverse conditions is crucial in real-world applications. To address this challenging task in practical scenarios where labeled normal condition images are not accessible in training, we propose FREST, a …
Extreme Point Supervised Instance Segmentation
This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points, of each object. These points are readily available in the modern bounding box ann…
Active Label Correction for Semantic Segmentation with Foundation Models
Training and validating models for semantic segmentation require datasets with pixel-wise annotations, which are notoriously labor-intensive. Although useful priors such as foundation models or crowdsourced datasets are available, they are…
Activity Grammars for Temporal Action Segmentation
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an u…
Towards More Practical Group Activity Detection: A New Benchmark and Model
Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both da…
Active Learning for Semantic Segmentation with Multi-class Label Query
This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an…
Learning Unified Distance Metric Across Diverse Data Distributions with Parameter-Efficient Transfer Learning
A common practice in metric learning is to train and test an embedding model for each dataset. This dataset-specific approach fails to simulate real-world scenarios that involve multiple heterogeneous distributions of data. In this regard,…
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Referring image segmentation, the task of segmenting any arbitrary entities described in free-form texts, opens up a variety of vision applications. However, manual labeling of training data for this task is prohibitively costly, leading t…
SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems
Data imbalance in training data often leads to biased predictions from trained models, which in turn causes ethical and social issues. A straightforward solution is to carefully curate training data, but given the enormous scale of modern …
PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Also, a recent study has demonstrated the cross-modal transferability phenome…
Adaptive Superpixel for Active Learning in Semantic Segmentation
Learning semantic segmentation requires pixel-wise annotations, which can be time-consuming and expensive. To reduce the annotation cost, we propose a superpixel-based active learning (AL) framework, which collects a dominant label per sup…
Human Pose Estimation in Extremely Low-Light Conditions
We study human pose estimation in extremely low-light images. This task is challenging due to the difficulty of collecting real low-light images with accurate labels, and severely corrupted inputs that degrade prediction quality significan…
HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
Supervision for metric learning has long been given in the form of equivalence between human-labeled classes. Although this type of supervision has been a basis of metric learning for decades, we argue that it hinders further advances in t…
Learning to Detect Semantic Boundaries with Image-level Class Labels
This paper presents the first attempt to learn semantic boundary detection using image-level class labels as supervision. Our method starts by estimating coarse areas of object classes through attentions drawn by an image classification ne…
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: An image often exhibits various situations, and a caption can be coupled with diverse images. Set-based embedding has been studied …
Cross-Domain Ensemble Distillation for Domain Generalization
Domain generalization is the task of learning models that generalize to unseen target domains. We propose a simple yet effective method for domain generalization, named cross-domain ensemble distillation (XDED), that learns domain-invarian…