Tom Kocmi
Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the cap…
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgment. However, these results are often obtained by averaging predictions across large test sets without any insights into the strengths and…
Preliminary WMT24 Ranking of General MT Systems and LLMs
This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to int…
AI-Assisted Human Evaluation of Machine Translation
Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. In the recently adopted annotation protocol,…
Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done b…
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of …
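The mechanics of such a challenge set are easy to sketch: each item pairs a correct translation with a minimally perturbed incorrect one, and a metric passes the item if it prefers the correct side. A minimal Python sketch, where `metric` is a hypothetical callable and the item schema is assumed for illustration rather than taken from the paper:

```python
# Minimal sketch: scoring an MT metric on a contrastive challenge set.
# A challenge item pairs a correct translation with a minimally perturbed
# incorrect one; the metric "passes" if it scores the correct one higher.
from collections import defaultdict

def challenge_accuracy(items, metric):
    """items: list of dicts with keys source, good, bad, phenomenon.
    metric: hypothetical callable (source, hypothesis) -> float."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        total[it["phenomenon"]] += 1
        if metric(it["source"], it["good"]) > metric(it["source"], it["bad"]):
            passed[it["phenomenon"]] += 1
    return {ph: passed[ph] / total[ph] for ph in total}

# Toy usage with precomputed scores standing in for a real metric.
scores = {("s1", "The dog sleeps"): 0.9, ("s1", "The cat sleeps"): 0.4}
metric = lambda src, hyp: scores[(src, hyp)]
items = [{"source": "s1", "good": "The dog sleeps",
          "bad": "The cat sleeps", "phenomenon": "mistranslation"}]
print(challenge_accuracy(items, metric))  # {'mistranslation': 1.0}
```

Reporting accuracy per phenomenon, rather than one averaged number, is what gives the insight into metric behaviour that the abstract points to.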
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic in…
GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large la…
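In an MQM-style protocol, the final score is a severity-weighted penalty over the detected error spans. A sketch of that aggregation step, with the weights and the penalty cap chosen for illustration rather than taken from the paper:

```python
# Sketch: turning LLM-annotated error spans into an MQM-style score.
# The severity weights (minor=1, major=5, critical=10) and the cap at 25
# are assumptions for illustration, not the paper's exact constants.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(error_spans, cap=25):
    """error_spans: list of (category, severity) tuples parsed from the
    LLM output; returns a non-positive quality score."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for _, sev in error_spans)
    return -min(penalty, cap)

spans = [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
print(mqm_score(spans))  # -6
```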
SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be pres…
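The windowing itself is simple to sketch: slide a fixed-size window of sentences over each document and score the concatenated window in place of a lone sentence. A minimal version, assuming an illustrative window size and stride rather than the paper's tuned values:

```python
# Sketch of SLIDE-style input construction: a fixed-size window of
# sentences slides over each document, and the concatenated chunk is what
# a quality-estimation metric scores instead of an isolated sentence.
def sliding_windows(doc_sentences, window=3, stride=1):
    if len(doc_sentences) <= window:
        return [" ".join(doc_sentences)]
    return [" ".join(doc_sentences[i:i + window])
            for i in range(0, len(doc_sentences) - window + 1, stride)]

doc = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
for chunk in sliding_windows(doc):
    print(chunk)
```

The surrounding sentences supply the context that a reference translation would otherwise disambiguate, which is the mechanism behind the abstract's claim.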
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based …
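A rough sketch of what such a zero-shot prompt can look like; the wording below is illustrative rather than the paper's exact template, and the LLM call itself is left abstract:

```python
# Sketch of a GEMBA-style zero-shot scoring prompt, supporting both the
# with-reference and reference-free modes mentioned in the abstract.
# `ask_llm` is a hypothetical function standing in for a real LLM call.
def gemba_da_prompt(src_lang, tgt_lang, source, hypothesis, reference=None):
    ref_part = f'"{reference}" is a reference translation.\n' if reference else ""
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        f"on a continuous scale from 0 to 100, where 0 means no meaning "
        f"preserved and 100 means a perfect translation.\n"
        f'{src_lang} source: "{source}"\n'
        f"{ref_part}"
        f'{tgt_lang} translation: "{hypothesis}"\n'
        f"Score:"
    )

prompt = gemba_da_prompt("German", "English", "Der Hund schläft",
                         "The dog sleeps")
# score = float(ask_llm(prompt))  # hypothetical LLM call
```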
Poor Man's Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations wit…
eBLEU: Unexpectedly Good Machine Translation Evaluation Using Simple Word Embeddings
We propose eBLEU, a metric inspired by BLEU that uses embedding similarities instead of string matches. We introduce meaning diffusion vectors to enable matching n-grams of semantically similar words in a BLEU-like algorithm, using …
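The core idea, exact n-gram matches relaxed into embedding similarities, can be sketched in a few lines. The toy embedding table and the unigram-only, soft-precision aggregation below are simplifications for illustration, not the full eBLEU algorithm:

```python
# Sketch of the eBLEU idea: replace BLEU's exact matches with embedding
# cosine similarity, so "dog"/"hound" can partially match. The embedding
# table is a toy stand-in for real word vectors, and the aggregation is
# simplified (unigrams only, no brevity penalty).
import numpy as np

EMB = {"the": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0]),
       "hound": np.array([0.1, 0.95]), "sleeps": np.array([0.7, 0.7])}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def soft_precision(hyp_tokens, ref_tokens):
    """Credit each hypothesis token with its best similarity against any
    reference token, then average (a soft analogue of clipped precision)."""
    return sum(max(cos(EMB[h], EMB[r]) for r in ref_tokens)
               for h in hyp_tokens) / len(hyp_tokens)

print(soft_precision(["the", "hound", "sleeps"], ["the", "dog", "sleeps"]))
```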
Poor Man's Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jiang, Mrinmaya Sachan. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.
Cometoid: Distilling Strong Reference-based Machine Translation Metrics into Even Stronger Quality Estimation Metrics
This paper describes our submissions to the 2023 Conference on Machine Translation (WMT-23) Metrics shared task. Knowledge distillation is commonly used to create smaller student models that mimic a larger teacher model while reducing the mo…
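The distillation setup can be sketched abstractly: a reference-based teacher labels (source, hypothesis) pairs, and a reference-free student regresses onto those labels, so that no reference is needed at inference time. Everything below (the toy features, the Ridge student) is a stand-in for the neural metrics actually used:

```python
# Sketch of metric distillation: a reference-based teacher produces labels
# for (source, hypothesis) pairs; the student never sees the references.
from sklearn.linear_model import Ridge

def featurize(source, hypothesis):
    # Toy features; a real student would encode the texts with a
    # pretrained multilingual model.
    return [len(source.split()), len(hypothesis.split()),
            len(set(source.split()) & set(hypothesis.split()))]

def distill(triples, teacher):
    """triples: list of (source, hypothesis, reference).
    teacher: hypothetical callable (source, hypothesis, reference) -> float."""
    X = [featurize(s, h) for s, h, _ in triples]
    y = [teacher(s, h, r) for s, h, r in triples]
    return Ridge().fit(X, y)

# student = distill(parallel_data, teacher_metric)   # hypothetical data
# score = student.predict([featurize(src, hyp)])[0]  # reference-free
```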
Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies
The WMT 2023 Terminology Shared Task investigates progress in machine translation of texts with specialized vocabulary. The participants were given the source text and segment-level terminology dictionaries for three language pairs: Chines…
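A naive adherence check along the lines such a task evaluates can be sketched as follows, assuming simple substring matching (real evaluation must handle inflection and term variants):

```python
# Sketch: measure what fraction of the required target terms from a
# segment-level terminology dictionary actually appear in the output.
# Substring matching is a deliberate simplification for illustration.
def term_recall(output, term_dict):
    """term_dict maps source terms to required target terms."""
    required = list(term_dict.values())
    hit = sum(1 for term in required if term.lower() in output.lower())
    return hit / len(required) if required else 1.0

print(term_recall("The gearbox was inspected.",
                  {"Getriebe": "gearbox", "Welle": "shaft"}))  # 0.5
```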
Evaluating Metrics for Document-context Evaluation in Machine Translation
We describe our submission of a new metric, SLIDE (Raunak et al., 2023), to the WMT 2023 metrics task. SLIDE is a reference-free quality-estimation metric that works by constructing a fixed sentence-length window over the documents in a te…
Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent
Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, George Foster. Proceedings of the Eighth…
Searching for a higher power in the human evaluation of MT
In MT evaluation, pairwise comparisons are conducted to identify the better system. In conducting the comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost-effective way to spend…
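The budget question is ultimately one of statistical power. A simulation sketch of that trade-off, with the effect size and annotator noise as illustrative assumptions rather than empirical values:

```python
# Sketch of a power analysis for pairwise MT evaluation: simulate DA
# judgments for two systems with a true quality gap and count how often a
# significance test detects the gap at a given judgment budget.
import numpy as np
from scipy.stats import ttest_ind

def estimated_power(n_judgments, true_gap=2.0, noise=25.0,
                    alpha=0.05, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        a = rng.normal(50.0, noise, n_judgments)
        b = rng.normal(50.0 + true_gap, noise, n_judgments)
        if ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / trials

for n in (100, 500, 2000):
    print(n, estimated_power(n))
```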
The Reality of Multi-Lingual Machine Translation
Our book "The Reality of Multi-Lingual Machine Translation" discusses the benefits and perils of using more than two languages in machine translation systems. While focused on the particular task of sequence-to-sequence processing and mult…
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developmen…
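A common system-level check in this spirit is pairwise accuracy: over all pairs of systems, how often does the metric's score difference agree in sign with the human one? A minimal sketch with made-up scores:

```python
# Sketch of pairwise system-level accuracy: the fraction of system pairs
# where the metric's score delta agrees in sign with the human delta.
from itertools import combinations

def pairwise_accuracy(human_scores, metric_scores):
    """Both arguments map system name -> system-level score."""
    agree = total = 0
    for a, b in combinations(human_scores, 2):
        h = human_scores[a] - human_scores[b]
        m = metric_scores[a] - metric_scores[b]
        if h == 0:
            continue  # skip human ties
        total += 1
        agree += (h > 0) == (m > 0)
    return agree / total

human = {"sysA": 72.1, "sysB": 70.4, "sysC": 69.8}
metric = {"sysA": 0.84, "sysB": 0.86, "sysC": 0.79}
print(pairwise_accuracy(human, metric))  # 2 of 3 pairs agree
```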