Abdelrahman Shaker
MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories…
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support lo…
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and i…
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sop…
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an eff…
GroupMamba: Efficient Group-Based Visual State Space Model
State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challeng…
UNETR++: Delving Into Efficient and Accurate 3D Medical Image Segmentation
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture lon…
Efficient Video Object Segmentation via Modulated Cross-Attention Memory
Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand t…
PALO: A Polyglot Large Multimodal Model for 5B People
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanis…
GLaMM: Pixel Grounding Large Multimodal Model
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually …
Learnable Weight Initialization for Volumetric Medical Image Segmentation
Hybrid volumetric medical image segmentation models, combining the advantages of local convolution and global attention, have recently received considerable attention. While mainly focusing on architectural modifications, most existing hyb…
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks. Such models are trained on massive datasets comprising billions of public image-t…
SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially f…
Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored Arabic LLM
Climate change is one of the most significant challenges we face together as a society. Creating awareness and educating policy makers about the wide-ranging impact of climate change is an essential step towards a sustainable future. Recently…
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture lon…
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications
In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to bui…
INSTA-YOLO: Real-Time Instance Segmentation
Instance segmentation has recently gained huge attention in various computer vision applications. It aims at providing different IDs to different objects in the scene, even if they belong to the same class. This is useful in various scenari…
Generalization of Convolutional Neural Networks for ECG Classification Using Generative Adversarial Networks
Electrocardiograms (ECGs) play a vital role in the clinical diagnosis of heart diseases. An ECG record of the heart signal over time can be used to discover numerous arrhythmias. Our work is based on 15 different classes from the MIT-BIH a…