Jonathan Herzig
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful conte…
DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains u…
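The abstract is cut off above, but the setting it describes is concrete enough to illustrate. Below is a minimal sketch of a retrieval-augmented answering step with an explicit conflict-detection pass; the `llm` callable and both prompts are placeholders for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of surfacing conflicts among retrieved sources before
# answering. `llm` is a placeholder for any text-completion callable; the
# prompt wording is illustrative only, not the one used in the paper.
from typing import Callable, List

def answer_with_conflict_check(question: str, sources: List[str],
                               llm: Callable[[str], str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    verdict = llm(
        "Do the following sources give conflicting answers to the question?\n"
        f"Question: {question}\nSources:\n{numbered}\nAnswer yes or no."
    )
    if verdict.strip().lower().startswith("yes"):
        # Surface the disagreement instead of silently picking one source.
        return llm(
            "The sources below disagree. Summarize each distinct answer and "
            f"attribute it to its source.\nQuestion: {question}\nSources:\n{numbered}"
        )
    return llm(f"Answer the question using the sources.\n"
               f"Question: {question}\nSources:\n{numbered}")
```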
Inside-Out: Hidden Factual Knowledge in LLMs
This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly def…
Advancing Sustainable Prototyping of Future Aircraft Cabin Designs Through Extended Reality Technologies
This paper explores the virtual development of cabin concepts for hydrogen-powered aircraft, emphasizing sustainable, safe, and comfortable transport solutions. It examines how digital technologies can accelerate product development by inv…
Applied Design Thinking in urban air mobility: creating the airtaxi cabin design of the future from a user perspective
Design Thinking is essential for user-centered cabin design concepts in future transportation vehicles, as it facilitates the identification of user needs, creative problem-solving and iterative development to ensure optimal user experienc…
Distinguishing Ignorance from Error in LLM Hallucinations
Large language models (LLMs) are susceptible to hallucinations -- factually incorrect outputs -- leading to a large body of work on detecting and mitigating such cases. We argue that it is important to distinguish between two types of hall…
DoubleDipper: Improving Long-Context LLMs via Context Recycling
Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In this work, we propose DoubleDipper, a novel In-Context-Learning method that automatically generates few…
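The abstract truncates before the method details; reading "context recycling" as building few-shot demonstrations from passages of the long input itself, a minimal sketch might look like the following. The sampling strategy and the `llm` callable are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of "context recycling": build few-shot QA demonstrations from passages
# of the long input itself, then answer the real question against the full
# context. The exact recipe (random passage sampling, one generated QA pair per
# passage) is an assumption, and `llm` is a placeholder completion callable.
import random
from typing import Callable, List

def recycled_fewshot_prompt(passages: List[str], question: str,
                            llm: Callable[[str], str], k: int = 3) -> str:
    demos = []
    for p in random.sample(passages, min(k, len(passages))):
        qa = llm("Write one question answerable from this passage, then its "
                 "answer, formatted as 'Q: ...' and 'A: ...' on separate lines.\n"
                 f"Passage: {p}")
        demos.append(f"Passage: {p}\n{qa}")
    context = "\n\n".join(passages)
    return "\n\n".join(demos) + f"\n\nPassage: {context}\nQ: {question}\nA:"
```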
TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts. To better evaluate this setting and facilitate modeling efforts, we introduce TACT - Text And Calculations through …
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating f…
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
Large language models (LLMs) are prone to hallucinations, which sparked a widespread effort to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's generation, typically computing represent…
Create Your Own Cabin - Connecting Multidisciplinary User Perspectives and a Future Rescue Helicopter Concept Within the Applied XR Co-design Process
The German air rescue and healthcare system is facing a number of changes and challenges that require a rethink in the development of new rescue helicopter concepts. As part of its research activities, the German Aerospace Centre is, ther…
Representation Surgery: Theory and Practice of Affine Steering
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one nat…
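As a concrete instance of intervening on representations, a widely used affine steering function shifts a hidden state along the difference between mean activations of desired and undesired behavior. The sketch below is this generic mean-difference illustration, not necessarily the exact family of steering functions the paper analyzes.

```python
# One common affine steering function: shift a hidden state along the
# difference between mean representations of desired and undesired behavior.
# Generic illustration; the paper studies affine steering more broadly.
import numpy as np

def affine_steer(h: np.ndarray, mu_desired: np.ndarray,
                 mu_undesired: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Apply h -> h + alpha * (mu_desired - mu_undesired)."""
    return h + alpha * (mu_desired - mu_undesired)

# Toy usage: steer a random 8-d "hidden state" toward the desired mean.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
steered = affine_steer(h, mu_desired=np.ones(8), mu_undesired=np.zeros(8),
                       alpha=0.5)
```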
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literatu…
Multilingual Instruction Tuning With Just a Pinch of Multilinguality
As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of …
Applying an interior VR co-design approach for the medical deployment vehicle of the future
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-us…
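To make the evaluated pattern concrete: a few-shot tool-use harness lets the model emit a tool call in its text, executes it, and feeds the result back before generation continues. The `Calculator(...)` syntax and the `llm` callable below are invented for this toy illustration.

```python
# Toy illustration of the tool-use loop evaluated in this line of work: the
# model emits a tool call in text, the harness executes it, and the result is
# appended before generation continues. `llm` is a placeholder callable and
# the Calculator(...) call syntax is invented for the example.
import re
from typing import Callable

def generate_with_calculator(prompt: str, llm: Callable[[str], str],
                             max_steps: int = 5) -> str:
    text = prompt
    for _ in range(max_steps):
        out = llm(text)
        call = re.search(r"Calculator\(([^)]+)\)", out)
        if call is None:
            return out  # no tool call: generation is final
        # Evaluate toy arithmetic only; builtins are stripped for safety.
        result = eval(call.group(1), {"__builtins__": {}})
        text += out[:call.end()] + f" = {result}\n"
    return text
```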
Applied design thinking in urban air mobility: creating the airtaxi cabin design of the future from a user perspective
In the course of developing digital and future aviation cabin concepts at the German Aerospace Center, the exploration of user-centered and acceptance-enhancing methods plays a central role. The challenge here is to identify the flexible r…
Evaluating and Modeling Attribution for Cross-Lingual Question Answering
Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those who do not speak these languages. The leap forward in …
Benjamin Muller, John Wieting, Jonathan Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Soares, Roee Aharoni, Jonathan Herzig, Xinyi Wang. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, th…
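A rough sketch of the distillation recipe the abstract gestures at: a large teacher model labels (document, model-generated summary) pairs for factual consistency, yielding synthetic NLI-style data for training a smaller student. The prompt and the `teacher_llm` callable below are placeholders, not the paper's exact setup.

```python
# Sketch of teacher-labeled synthetic data for factual consistency: a large
# teacher model judges whether each model-generated summary is supported by
# its source document, producing NLI-style training examples for a student.
# `teacher_llm` is a placeholder callable; the prompt is illustrative.
from typing import Callable, List, Tuple

def build_synthetic_nli_data(pairs: List[Tuple[str, str]],
                             teacher_llm: Callable[[str], str]):
    data = []
    for doc, summary in pairs:
        verdict = teacher_llm(
            "Is every fact in the summary supported by the document? "
            f"Answer yes or no.\nDocument: {doc}\nSummary: {summary}"
        )
        label = 1 if verdict.strip().lower().startswith("yes") else 0
        data.append({"premise": doc, "hypothesis": summary, "label": label})
    return data  # fine-tune an NLI student on these examples
```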
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we stud…
mFACE: Multilingual Summarization with Factual Consistency Evaluation
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsiste…
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM…
QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs
Existing benchmarks for open-domain question answering (ODQA) typically focus on questions whose answers can be extracted from a single paragraph. By contrast, many natural questions, such as "What players were drafted by the Brooklyn Nets…
Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing
Despite their strong performance on many tasks, pre-trained language models have been shown to struggle on out-of-distribution compositional generalization. Meanwhile, recent work has shown considerable improvements on many NLP tasks from …
TRUE: Re-evaluating Factual Consistency Evaluation
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cy…
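For context, a minimal NLI-based consistency check of the kind such meta-evaluations compare: treat the grounding text as premise and the generated text as hypothesis, and score entailment. The model choice below is an off-the-shelf example, not one of the specific systems evaluated in the paper.

```python
# Minimal NLI-based factual consistency scorer: premise = grounding document,
# hypothesis = generated text, score = entailment probability. Uses an
# off-the-shelf MNLI model as an example, not the paper's specific systems.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1)[0]
    # Look up the entailment index from the config rather than hard-coding it.
    idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[idx].item()
```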
Learning To Retrieve Prompts for In-Context Learning
In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few training examples as its input, and directly decodes the output without any update…
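The abstract is truncated, but the core scoring idea behind learned prompt retrieval can be sketched: rank each candidate training example by how likely a scoring LM finds the test target when that example is prepended as a demonstration. `lm_loglikelihood` below is a placeholder for log p(target | prompt); in the full method, top- and bottom-scored candidates would supervise a dense retriever (not shown).

```python
# Sketch of candidate scoring for learned prompt retrieval: each training
# example (x, y) is ranked by how much it helps a scoring LM produce the
# target when prepended as a demonstration. `lm_loglikelihood` is a
# placeholder for log p_LM(target | prompt).
from typing import Callable, List, Tuple

def score_candidates(test_input: str, target: str,
                     candidates: List[Tuple[str, str]],
                     lm_loglikelihood: Callable[[str, str], float]):
    scored = []
    for cand_x, cand_y in candidates:
        prompt = f"{cand_x}\n{cand_y}\n{test_input}\n"
        scored.append((lm_loglikelihood(prompt, target), (cand_x, cand_y)))
    # Highest-scoring demonstrations come first.
    return sorted(scored, reverse=True)
```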