Marta R. Costa‐jussà
Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification
Multilingual toxicity detection remains a significant challenge due to the scarcity of training data and resources for many languages. While prior work has leveraged the translate-test paradigm to support cross-lingual transfer across a ra…
Improving Language and Modality Transfer in Translation by Character-level Modeling
Current translation systems, despite being highly multilingual, cover only 5% of the world's languages. Expanding language coverage to the long-tail of low-resource languages requires data-efficient methods that rely on cross-lingual and c…
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages. Each of these source languages is representative of the …
Joint speech and text machine translation for up to 100 languages
Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of …
Large Concept Models: Language Modeling in a Sentence Representation Space
LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp con…
LCFO: Long Context and Long Form Output Dataset and Benchmarking
This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k wo…
2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset
We introduce the first highly multilingual speech and American Sign Language (ASL) comprehension dataset by extending BELEBELE. Our dataset covers 74 spoken languages at the intersection of BELEBELE and FLEURS, and one sign language (ASL).…
Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension and Text Generation
The purpose of this work is to share an English-Yorùbá evaluation dataset for open-book reading comprehension and text generation to assess the performance of models in both a high- and a low-resource language. The dataset contains 358 qu…
On the Role of Speech Data in Reducing Toxicity Detection Bias
Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-…
On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task
Several algorithms implemented by language models have recently been successfully reverse-engineered. However, these findings have been concentrated on specific tasks and models, leaving it unclear how universal circuits are across differ…
Unveiling the Role of Pretraining in Direct Speech Translation
Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists of pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this stu…
Linguini: A benchmark for language-agnostic linguistic reasoning
We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resour…
Towards Massive Multilingual Holistic Bias
In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight la…
A Primer on the Inner Workings of Transformer-based Language Models
The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical in…
Pushing the Limits of Zero-shot End-to-End Speech Translation
Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems, thus hindering their performance. Prior work has attempted to mitigate these challenges by lev…
Spirit LM: Interleaved Spoken and Written Language Model
We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speec…
Towards Red Teaming in Multimodal and Multilingual Translation
Assessing performance in Natural Language Processing is becoming increasingly complex. One particular challenge is the potential for evaluation datasets to overlap with training data, either directly or indirectly, which can lead to skewed…
MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector
Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual…