Barbara Plank
Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxo…
Human-centered LLMs for Inclusive Language Technology: The Need to Embrace Variation Holistically in NLP
From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP
Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly o…
BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we i…
Crossing Domains without Labels: Distant Supervision for Term Extraction
Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with doma…
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model's chain-of-t…
Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora
Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavar…
Toward more realistic career path prediction: evaluation and methods
Predicting career trajectories is a complex yet impactful task, offering significant benefits for personalized career counseling, recruitment optimization, and workforce planning. However, effective career path prediction (CPP) modeling fa…
A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation
Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an eval…
Reason to Rote: Rethinking Memorization in Reasoning
Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization…
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs
Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive m…
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and m…
ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining exist…
Tracing Multilingual Factual Knowledge Acquisition in Pretraining
Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consist…
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods of LLM outputs, either automated metrics or h…
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies Between Model Predictions and Human Responses in VQA
Large vision-language models struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit high uncertainty. In this study, we focus on a Visual Question Answering (VQA) task and …
Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning
Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their k…
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge t…
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Recent works have shown that Large Language Models (LLMs) have a tendency to memorize patterns and biases present in their training data, raising important questions about how such memorized content influences model behavior. One such conc…
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encounterin…
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-cen…
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges
Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their…
Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study
Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, t…
Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection
Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have n…
RAcQUEt: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Evaluating Large Language Models for Cross-Lingual Retrieval