Barbara Plank
Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxo…
Human-centered LLMs for Inclusive Language Technology: The Need to Embrace Variation Holistically in NLP
From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP
Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly o…
BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we i…
Crossing Domains without Labels: Distant Supervision for Term Extraction
Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with doma…
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model's chain-of-t…
Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora
Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavar…
Toward more realistic career path prediction: evaluation and methods
Predicting career trajectories is a complex yet impactful task, offering significant benefits for personalized career counseling, recruitment optimization, and workforce planning. However, effective career path prediction (CPP) modeling fa…
A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation
Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an eval…
Reason to Rote: Rethinking Memorization in Reasoning
Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization…
MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs
Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive m…
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and m…
ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining exist…
Tracing Multilingual Factual Knowledge Acquisition in Pretraining
Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consist…
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods of LLM outputs, either automated metrics or h…
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies Between Model Predictions and Human Responses in VQA
Large vision-language models struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit high uncertainty. In this study, we focus on a Visual Question Answering (VQA) task and …
Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning
Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their k…
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge t…
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Recent works have shown that Large Language Models (LLMs) have a tendency to memorize patterns and biases present in their training data, raising important questions about how such memorized content influences model behavior. One such conc…
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encounterin…
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-cen…
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges
Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their…
Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study
Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, t…
Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection
Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have n…
RAcQUEt: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Evaluating Large Language Models for Cross-Lingual Retrieval