Mubarak Shah
Wildlife Action Recognition using Deep Learning Open
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales Open
State space models (SSMs) have emerged as a competitive alternative to transformers in various tasks. Their linear complexity and hidden-state recurrence make them particularly attractive for modeling long sequences, whereas attention beco…
Exploring Multi-Agent Reinforcement Learning for Cell Mechanics Open
Cross-View Open-Vocabulary Object Detection in Aerial Imagery Open
Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling mode…
Agentic Large-Language-Model Systems in Medicine: A Systematic Review and Taxonomy Open
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications Open
Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the comple…
GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space Open
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues lik…
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos Open
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the…
On Transfer-based Universal Attacks in Pure Black-box Setting Open
Despite their impressive performance, deep visual models are susceptible to transferable black-box adversarial attacks. Principally, these attacks craft perturbations in a target model-agnostic manner. However, surprisingly, we find that e…
VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment Open
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality mi…
SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models Open
Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to…
Large Language Models: A Comprehensive Survey of its Applications, Challenges, Limitations, and Future Prospects Open
A Note on Exact State Visit Probabilities in Two-State Markov Chains Open
In this note we derive the exact probability that a specific state in a two-state Markov chain is visited exactly $k$ times after $N$ transitions. We provide a closed-form solution for $\mathbb{P}(N_l = k \mid N)$, considering initial stat…
ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition Open
Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bi…
A guided approach for cross-view geolocalization estimation with land cover semantic segmentation Open
Geolocalization is a crucial process that leverages environmental information and contextual data to accurately identify a position. In particular, cross-view geolocalization utilizes images from various perspectives, such as satellite and…
Emotional intelligence and its impact on postgraduate students: A study at the University of Kashmir Open
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model Open
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds Open
Traditional jailbreaks have successfully exposed vulnerabilities in LLMs, primarily relying on discrete combinatorial optimization, while more recent methods focus on training LLMs to generate adversarial prompts. However, both approaches …
CityGuessr: City-Level Video Geo-Localization on a Global Scale Open
Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it was captured from can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only using th…
Investigating Memorization in Video Diffusion Models Open
Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior resear…
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning Open
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, lea…
Perceptions of Subject Specialists Regarding the Relationship between Principal Leadership Skills and School Effectiveness Open
The purpose of this current study is to analyze the relationship between leadership skills and school effectiveness. Employing a quantitative approach, a co-relational research design has been used for the instant study. The target populat…
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition Open
Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annot…
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets Open
Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approa…
Large Language Models: A Comprehensive Survey of its Applications, Challenges, Limitations, and Future Prospects Open
Within the vast expanse of computerized language processing, a revolutionary entity known as Large Language Models (LLMs) has emerged, wielding immense power in its capacity to comprehend intricate linguistic patterns and conjure coherent …
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers Open
Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods…
Unifying Video Self-Supervised Learning across Families of Tasks: A Survey Open
Video self-supervised learning (VideoSSL) offers significant potential for reducing annotation costs and enhancing a wide range of downstream tasks in video understanding. The ultimate goal of VideoSSL is to achieve human-level video intel…
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs Open
Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this fi…