Philippe Laban
Flipping the Dialogue: Training and Evaluating User Language Models
Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants -- optimized to prod…
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning
Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to form…
LLMs Get Lost In Multi-Turn Conversation
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also by helping them define, explore, and refine what they need thro…
Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing
Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring sessi…
Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic: seemingly simple capabilities such as object counting or length comparison, which are essential for relevant …
SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits
Detecting factual inconsistencies in summarization is critical, yet existing benchmarks lack the necessary challenge and interpretability for robust evaluation. In this paper, we introduce SummExecEdit, a novel pipeline and benchmark lever…
CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments
Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personal…
Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation fra…
Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
Large Language Model (LLM)-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the appearance of Answer Engines: LLM…
Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore t…
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In thi…
Art or Artifice? Large Language Models and the False Promise of Creativity
Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test o…
Prompt Leakage effect and defense strategies for multi-turn LLM interactions
Prompt leakage poses a significant security and privacy threat in LLM applications. Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. A systematic evaluation of prompt lea…
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verif…
Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment
The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop expe…
Salespeople vs SalesBot: Exploring the Role of Educational Value in Conversational Recommender Systems
Making big purchases requires consumers to research or consult a salesperson to gain domain expertise. However, existing conversational recommender systems (CRS) often overlook users' lack of background knowledge, focusing solely on gather…
Automatic and Human-AI Interactive Text Generation
In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks that take a piece of text as input and then generate a revision that is improved according to some specific criteria (e.g., readab…
Beyond the Chat: Executable and Verifiable Text-Editing with LLMs
Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing. However, standard chat-based conversational interfaces do not support transparency and verifia…