Explanipedia

SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models Open

Margaret Mitchell, Giuseppe Attanasio, Ioana Baldini, Miruna Clinciu, Jordan Clive, et al. · 2025

Computer science Philosophy

Large Language Models (LLMs) reproduce and exacerbate the social biases present in their training data, and resources to quantify this issue are limited. While research has attempted to identify and mitigate such biases, most efforts have …

Documenting Geographically and Contextually Diverse Language Data Sources Open

Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, et al. · 2024

Geography

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data …

Helping Cancer Patients to Choose the Best Treatment: Towards Automated Data-Driven and Personalized Information Presentation of Cancer Treatment Options Open

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, et al. · 2024

Computer science Engineering Biology

When a person is diagnosed with cancer, difficult decisions about treatments need to be made. In this chapter, we describe an interdisciplinary research project which aims to automatically generate personalized descriptions of treatment op…

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Open

Hugo Laurençon, Lucile Saulnier, Thomas J. Wang, Christopher Akiki, A. Villanova del Moral, et al. · 2023

Computer science Political science Geography

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, w…

Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets Open

Yousef Altaher, Ali Fadel, Mazen Alotaibi, Mazen Alyazidi, Mishari Al-Mutairi, et al. · 2022

Computer science Philosophy Economics

Masader (Alyafeai et al., 2021) created a metadata structure to be used for cataloguing Arabic NLP datasets. However, developing an easy way to explore such a catalogue is a challenging task. In order to give the optimal experience for use…

Data Governance in the Age of Large-Scale Data-Driven Language Technology Open

Yacine Jernite, Huu Du Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, et al. · 2022

Computer science Political science Business

The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to glob…

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources Open

Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, et al. · 2022

Computer science Sociology

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect …

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings Open

Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, et al. · 2022

Computer science Sociology Psychology

Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases…

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources Open

Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani · 2021

Computer science Philosophy

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadat…

Automatic Construction of Knowledge Graphs from Text and Structured Data: A Preliminary Literature Review Open

Maraim Masoud, Bianca Pereira, John P. McCrae, Paul Buitelaar · 2021

Computer science

Knowledge graphs have been shown to be an important data structure for many applications, including chatbot development, data integration, and semantic search. In the enterprise domain, such graphs need to be constructed based on both stru…

Towards Machine Translation for the Kurdish Language Open

Sina Ahmadi, Maraim Masoud · 2020

Computer science Engineering Physics

Machine translation is the task of translating texts from one language to another using computers. It has been one of the major tasks in natural language processing and computational linguistics and has been motivating to facilitate human …

Aspects of Terminological and Named Entity Knowledge within Rule-Based Machine Translation Models for Under-Resourced Neural Machine Translation Scenarios Open

Daniel Torregrosa, Nivranshu Pasricha, Maraim Masoud, Bharathi Raja Chakravarthi, J. A. Pajares, et al. · 2020

Computer science Chemistry

Rule-based machine translation is a machine translation paradigm where linguistic knowledge is encoded by an expert in the form of rules that translate text from source to target language. While this approach grants extensive control over …

Back-translation approach for code-switching machine translation: A case study Open

Maraim Masoud, Daniel Torregrosa, Paul Buitelaar, Mihael Arčan · 2019

Political science Engineering Business

Recently, machine translation has demonstrated significant progress in terms of translation quality. However, most of the research has focused on translating with pure monolingual texts in the source and the target side of the parallel cor…

Leveraging rule-based machine translation knowledge for under-resourced neural machine translation models Open

Daniel Torregrosa, Nivranshu Pasricha, Maraim Masoud, Bharathi Raja Chakravarthi, J. A. Pajares, et al. · 2019

Computer science Engineering Political science

Rule-based machine translation is a machine translation paradigm where linguistic knowledge is encoded by an expert in the form of rules that translate from source to target language. While this approach grants total control over the outpu…

Maraim Masoud