Anja Belz
What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?
Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate inductio…
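As a rough illustration of the kind of targeted synthetic example the title refers to, the sketch below (hypothetical; not the paper's data pipeline) builds repeated-pattern sequences of the form [A][B] … [A] → [B], the pattern that induction heads are known to complete.

```python
import random

def make_induction_example(vocab_size=1000, prefix_len=32, seed=None):
    """Build one synthetic sequence that rewards induction-style copying:
    a random prefix is followed by a repeat of itself, so predicting a token
    in the second half only requires looking up what followed the same token
    earlier in the context. (Illustrative sketch, not the paper's actual code.)"""
    rng = random.Random(seed)
    prefix = [rng.randrange(vocab_size) for _ in range(prefix_len)]
    return prefix + prefix  # second half is predictable via induction

# Example: after seeing ... [a, b, c, a] the model should predict b next.
print(make_induction_example(vocab_size=50, prefix_len=8, seed=0))
```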
The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be mislea…
Enhancing Study-Level Inference from Clinical Trial Papers via Reinforcement Learning-Based Numeric Reasoning
Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level…
Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges
Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments
Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies
HEDS 3.0: The Human Evaluation Data Sheet Version 3.0
This paper presents version 3.0 of the Human Evaluation Datasheet (HEDS). This update is the result of our experience using HEDS in the context of numerous recent human evaluation experiments, including reproduction studies, and of feedbac…
From image to language and back again
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings tog…
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of o…
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well-resourced languages. We explore the extent to which pretrained large language models (LLMs) c…
Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods
As the cost of training ever larger language models has grown, so has the interest in reusing previously learnt knowledge. Transfer learning methods have shown how reusing non-task-specific knowledge can help in subsequent task-specific le…
Differences in Semantic Errors Made by Different Types of Data-to-text Systems
DCU-NLG-PBN at the GEM’24 Data-to-Text Task: Open-Source LLM PEFT-Tuning for Effective Data-to-Text Generation
Filling Gaps in Wikipedia: Leveraging Data-to-Text Generation to Improve Encyclopedic Coverage of Underrepresented Groups
Front Matter
DCU-NLG-Small at the GEM’24 Data-to-Text Task: Rule-based generation and post-processing with T5-Base
(Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems
Common Flaws in Running Human Evaluation Experiments in NLP
While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we d…
The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background, Overall Aims, and Summaries of Taught Units
QCET: An Interactive Taxonomy of Quality Criteria for Comparable and Repeatable Evaluation of NLP Systems
Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate
LLMs like GPT perform strongly on tasks involving English, which dominates their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of…
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which…
PEFT-Ref: A Modular Reference Architecture and Typology for Parameter-Efficient Finetuning Techniques
Recent parameter-efficient finetuning (PEFT) techniques aim to reduce the considerable cost of fully finetuning large pretrained language models (PLMs). As different PEFT techniques proliferate, it is becoming difficult to compare the…
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. Howev…
Generating Irish Text with a Flexible Plug-and-Play Architecture
In this paper, we describe M-FleNS, a multilingual flexible plug-and-play architecture designed to accommodate neural and symbolic modules, and initially instantiated with rule-based modules. We focus on using M-FleNS for the specific purp…
The 2023 WebNLG shared task on low resource languages. Overview and evaluation results (WebNLG 2023)
The WebNLG task consists of mapping a knowledge graph to a text verbalising the content of that graph. The 2017 WebNLG edition required participating systems to generate English text from a set of DBpedia triples, while the 2020 WebN…
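For readers unfamiliar with the task format, a WebNLG-style input is a small set of subject–predicate–object triples paired with a reference verbalisation. The toy instance below is illustrative only and is not taken from the shared task data.

```python
# Toy WebNLG-style data-to-text instance (illustrative, not from the actual dataset).
triples = [
    ("Dublin", "country", "Republic_of_Ireland"),
    ("Dublin", "populationTotal", "554554"),
]

# A system must verbalise the triple set as fluent text, e.g.:
reference = "Dublin, which has a population of 554,554, is in the Republic of Ireland."

def naive_verbalise(triples):
    """Extremely naive baseline: one clause per triple, joined into a sentence list."""
    clauses = [f"{s.replace('_', ' ')} {p} {o.replace('_', ' ')}" for s, p, o in triples]
    return ". ".join(clauses) + "."

print(naive_verbalise(triples))
```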
Mod-D2T: A Multi-layer Dataset for Modular Data-to-Text Generation
Rule-based text generators lack the coverage and fluency of their neural counterparts, but have two big advantages over them: (i) they are entirely controllable and do not hallucinate; and (ii) they can fully explain how an output was gene…
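As a hedged illustration of the controllability and traceability that the abstract highlights, a toy template-based generator can record which rule produced each output span. This is a generic sketch, not the Mod-D2T pipeline or any system from the paper.

```python
# Toy rule-based data-to-text generator (generic sketch, not the Mod-D2T system).
# Every output sentence records the rule that produced it, so generation is fully traceable.
RULES = {
    "capital": lambda d: f"{d['city']} is the capital of {d['country']}.",
    "population": lambda d: f"It has a population of about {d['population']:,}.",
}

def generate(data, plan):
    trace = []
    for rule_name in plan:  # the document plan fixes content selection and order
        trace.append({"rule": rule_name, "text": RULES[rule_name](data)})
    return " ".join(step["text"] for step in trace), trace

text, trace = generate(
    {"city": "Dublin", "country": "Ireland", "population": 554554},  # toy values
    plan=["capital", "population"],
)
print(text)  # every sentence can be attributed to exactly one rule via `trace`
```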
Towards a Consensus Taxonomy for Annotating Errors in Automatically Generated Text
Error analysis aims to provide insights into system errors at different levels of granularity. NLP as a field has a long-standing tradition of analysing and reporting errors which is generally considered good practice. There are existing err…