Marta R. Costa‐jussà

Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification Open

Samuel J. Bell, Eduardo Sánchez, David C. Dale, Mikel Artetxe, Marta R. Costa‐jussà · 2025

Multilingual toxicity detection remains a significant challenge due to the scarcity of training data and resources for many languages. While prior work has leveraged the translate-test paradigm to support cross-lingual transfer across a ra…

Improving Language and Modality Transfer in Translation by Character-level Modeling Open

Ioannis Tsiamas, David Dale, Marta R. Costa‐jussà · 2025

Current translation systems, despite being highly multilingual, cover only 5% of the world's languages. Expanding language coverage to the long-tail of low-resource languages requires data-efficient methods that rely on cross-lingual and c…

BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation Open

The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa‐jussà, et al. · 2025

BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages. Each of these source languages is representative of the …

Joint speech and text machine translation for up to 100 languages Open

Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, et al. · 2025

Computer science Biology Engineering

Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of …

Large Concept Models: Language Modeling in a Sentence Representation Space Open

the KSS Cave Studies Team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, et al. · 2024

Computer science Philosophy Political science

LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp con…

LCFO: Long Context and Long Form Output Dataset and Benchmarking Open

Marta R. Costa‐jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Jen-Hsiang Chuang, et al. · 2024

Computer science Business Geography

This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k wo…

2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset Open

Marta R. Costa‐jussà, Bokai Yu, Pierre Andrews, Belen Alastruey, Necati Cihan Camgöz, et al. · 2024

Computer science Mathematics Philosophy

We introduce the first highly multilingual speech and American Sign Language (ASL) comprehension dataset by extending BELEBELE. Our dataset covers 74 spoken languages at the intersection of BELEBELE and FLEURS, and one sign language (ASL).…

Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension and Text Generation Open

Marta R. Costa‐jussà, Joy Chen, Ifeoluwanimi Adebara, Joe Chuang, Christophe Ropers, et al. · 2024

Computer science Philosophy

The purpose of this work is to share an English-Yorùbá evaluation dataset for open-book reading comprehension and text generation to assess the performance of models both in a high- and a low-resource language. The dataset contains 358 qu…

On the Role of Speech Data in Reducing Toxicity Detection Bias Open

Samuel J. Bell, Mariano Coria Meglioli, Megan Richards, Eduardo Sánchez, Christophe Ropers, et al. · 2024

Computer science Business Medicine

Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-…

On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task Open

Javier Ferrando, Marta R. Costa‐jussà · 2024

Computer science Philosophy Engineering

Several algorithms implemented by language models have recently been successfully reverse-engineered. However, these findings have been concentrated on specific tasks and models, leaving it unclear how universal circuits are across differ…

Unveiling the Role of Pretraining in Direct Speech Translation Open

Belen Alastruey, Gerard I. Gállego, Marta R. Costa‐jussà · 2024

Psychology Computer science Philosophy

Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists of pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this stu…

Linguini: A benchmark for language-agnostic linguistic reasoning Open

Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, et al. · 2024

Computer science Psychology Philosophy

We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resour…

Towards Massive Multilingual Holistic Bias Open

Xiaoqing Ellen Tan, Prangthip Hansanti, Carleigh Wood, Bokai Yu, Christophe Ropers, et al. · 2024

Computer science Psychology

In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight la…

A Primer on the Inner Workings of Transformer-based Language Models Open

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa‐jussà · 2024

Computer science Chemistry Engineering

The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical in…

Pushing the Limits of Zero-shot End-to-End Speech Translation Open

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa‐jussà · 2024

Computer science Philosophy Chemistry

Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems, thus hindering their performance. Prior work has attempted to mitigate these challenges by lev…

Spirit LM: Interleaved Spoken and Written Language Model Open

Tu Anh Nguyen, Benjamin Müller, Bokai Yu, Marta R. Costa‐jussà, Maha Elbayad, et al. · 2024

Computer science Philosophy

We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speec…

Towards Red Teaming in Multimodal and Multilingual Translation Open

Christophe Ropers, David C. Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, et al. · 2024

Computer science Philosophy Chemistry

Assessing performance in Natural Language Processing is becoming increasingly complex. One particular challenge is the potential for evaluation datasets to overlap with training data, either directly or indirectly, which can lead to skewed…

MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector Open

Marta R. Costa‐jussà, Mariano Coria Meglioli, Pierre Andrews, David C. Dale, Prangthip Hansanti, et al. · 2024

Computer science

Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual…
