Mustafa Jarrar
Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition Open
We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types…
Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset Open
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explic…
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs Open
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries…
ImageEval 2025: The First Arabic Image Captioning Shared Task Open
WojoodOntology: Ontology-Driven LLM Prompting for Unified Information Extraction Tasks Open
WojoodRelations: Arabic Relation Extraction Corpus and Modeling Open
The AraGenEval Shared Task on Arabic Authorship Style Transfer and AI Generated Text Detection Open
NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task Open
Active Learning for Multidialectal Arabic POS Tagging Open
SinaTools: Open Source Toolkit for Arabic Natural Language Processing Open
We introduce SinaTools, an open-source Python package for Arabic natural language processing and understanding. SinaTools is a unified package allowing people to integrate it into their system workflow, offering solutions for various tasks…
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition Open
In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclu…
ArabicNLU 2024: The First Arabic Natural Language Understanding Shared Task Open
This paper presents an overview of the Arabic Natural Language Understanding (ArabicNLU 2024) shared task, focusing on two subtasks: Word Sense Disambiguation (WSD) and Location Mention Disambiguation (LMD). The task aimed to evaluate the …
Event-Arguments Extraction Corpus and Modeling using BERT for Arabic Open
Event-argument extraction is a challenging task, particularly in Arabic due to sparse linguistic resources. To fill this gap, we introduce the Hadath corpus (550k tokens) as an extension of Wojood, enriched with event-argument annotatio…
The FIGNEWS Shared Task on News Media Narratives Open
We present an overview of the FIGNEWS shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. The shared task addresses bias and propaganda annotation in multilingual news posts. We focus on the early days…
AraFinNLP 2024: The First Arabic Financial NLP Shared Task Open
The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Dete…
WojoodNER 2024: The Second Arabic Named Entity Recognition Shared Task Open
We present WojoodNER-2024, the second Arabic Named Entity Recognition (NER) Shared Task. In WojoodNER-2024, we focus on fine-grained Arabic NER. We provided participants with a new Arabic fine-grained NER dataset called WojoodFine, annotat…
Sina at FigNews 2024: Multilingual Datasets Annotated with Bias and Propaganda Open
The proliferation of bias and propaganda on social media is an increasingly significant concern, leading to the development of techniques for automatic detection. This article presents a multilingual corpus of 12,000 Facebook posts fully …
Are Large Language Models the New Interface for Data Pipelines? Open
A Language Model is a term that encompasses various types of models designed to understand and generate human communication. Large Language Models (LLMs) have gained significant attention due to their ability to process text with human-lik…
Qabas: An Open-Source Arabic Lexicographic Database Open
We present Qabas, a novel open-source Arabic lexicon designed for NLP applications. The novelty of Qabas lies in its synthesis of 110 lexicons. Specifically, Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons.…
NLU-STR at SemEval-2024 Task 1: Generative-based Augmentation and Encoder-based Scoring for Semantic Textual Relatedness Open
Semantic textual relatedness is a broader concept than semantic similarity. It measures the extent to which two chunks of text convey similar meaning or topics, or share related concepts or contexts. This notion of relatedness can be applied…
Alma: Fast Lemmatizer and POS Tagger for Arabic Open
ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic Open
This paper presents ArBanking77, a large Arabic dataset for intent detection in the banking domain. Our dataset was arabized and localized from the original English Banking77 dataset, which consists of 13,083 queries to ArBanking77 dat…
SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks Open
SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens, which are all sense-annotated. The corpus is annotated using two different sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how tokens and s…
Arabic Fine-Grained Entity Recognition Open
Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fin…
Nabra: Syrian Arabic Dialects with Morphological Annotations Open
This paper presents Nabra, a corpus of Syrian Arabic dialects with morphological annotations. A team of Syrian natives collected more than 6K sentences containing about 60K words from several sources including social media posts, scripts …
WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task Open
We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering novel NER datasets (i.e., Wojood) and the definition of subtasks designed to facilitate …