Mubarak Shah
Cross-View Open-Vocabulary Object Detection in Aerial Imagery
Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling mode…
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the comple…
GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues lik…
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the…
MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark
We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, …
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
Although recent large multimodal models (LMMs) demonstrate impressive progress on vision-language tasks, their alignment with human-centered (HC) principles, such as fairness, ethics, inclusivity, empathy, and robustness, remains poorly un…
Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generat…
On Transfer-based Universal Attacks in Pure Black-box Setting
Despite their impressive performance, deep visual models are susceptible to transferable black-box adversarial attacks. Principally, these attacks craft perturbations in a target model-agnostic manner. However, surprisingly, we find that e…
GAEA: A Geolocation Aware Conversational Assistant
Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyo…
VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality mi…
SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models
Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to…
A Note on Exact State Visit Probabilities in Two-State Markov Chains
In this note we derive the exact probability that a specific state in a two-state Markov chain is visited exactly $k$ times after $N$ transitions. We provide a closed-form solution for $\mathbb{P}(N_l = k \mid N)$, considering initial stat…
ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition
Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bi…
A guided approach for cross-view geolocalization estimation with land cover semantic segmentation
Geolocalization is a crucial process that leverages environmental information and contextual data to accurately identify a position. In particular, cross-view geolocalization utilizes images from various perspectives, such as satellite and…
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds
Traditional jailbreaks have successfully exposed vulnerabilities in LLMs, primarily relying on discrete combinatorial optimization, while more recent methods focus on training LLMs to generate adversarial prompts. However, both approaches …
CityGuessr: City-Level Video Geo-Localization on a Global Scale
Video geolocalization is an important problem. Given just a video, ascertaining where it was captured has many practical uses. The problem of worldwide geolocalization has been tackled before, but only using th…
Investigating Memorization in Video Diffusion Models
Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior resear…
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, lea…
Perceptions of Subject Specialists Regarding the Relationship between Principal Leadership Skills and School Effectiveness
The purpose of this study is to analyze the relationship between leadership skills and school effectiveness. The study takes a quantitative approach and uses a correlational research design. The target populat…
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition
Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annot…
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
Temporal video alignment aims to synchronize key events, such as object interactions or action phase transitions, in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approa…
Large Language Models: A Comprehensive Survey of its Applications, Challenges, Limitations, and Future Prospects
Within the vast expanse of computerized language processing, a revolutionary entity known as Large Language Models (LLMs) has emerged, wielding immense power in its capacity to comprehend intricate linguistic patterns and conjure coherent …
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods…