Cunxiang Wang
AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs
Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks ar…
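A back-of-envelope illustration of the quadratic attention cost mentioned above: full self-attention compares every token with every other token, so the score matrix grows with the square of the sequence length. This is a generic sketch, not code from the paper:

```python
def attention_score_entries(n_tokens: int) -> int:
    # Full self-attention scores every (query, key) token pair,
    # so the score matrix holds n_tokens * n_tokens entries: O(N^2).
    return n_tokens * n_tokens

# Doubling the audio length (in tokens) quadruples the attention cost.
print(attention_score_entries(1000))  # prints 1000000
print(attention_score_entries(2000))  # prints 4000000
```

This is why long-form audio, which easily produces tens of thousands of tokens, quickly becomes expensive for standard Transformer attention.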
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Incons…
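The score-comparison inconsistency mentioned above can be made concrete with a small hedged sketch: a judge scores each response pointwise but then prefers the lower-scored response in a head-to-head comparison. The helper below is illustrative only, not the TrustJudge implementation:

```python
def is_score_comparison_inconsistent(score_a: float, score_b: float,
                                     pairwise_winner: str) -> bool:
    """Flag the case where pointwise scores and a pairwise preference
    contradict each other. `pairwise_winner` is "A" or "B"; equal
    pointwise scores are treated as consistent with either preference.
    (Illustrative helper, not the paper's code.)"""
    if score_a == score_b:
        return False
    pointwise_winner = "A" if score_a > score_b else "B"
    return pointwise_winner != pairwise_winner

# The judge scored A 8/10 and B 6/10, yet preferred B head-to-head:
print(is_score_comparison_inconsistent(8, 6, "B"))  # prints True
```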
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through mu…
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through …
Exploring the Evolution of Physics Cognition in Video Generation: A Survey
Video generation has witnessed significant progress recently, especially with the rapid advancement of diffusion models. Despite this, their deficiencies in physical cognition have gradually received widespread attention - g…
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, resulting in highly inaccurate and uninterpre…
LongSafety: Evaluating Long-Context Safety of Large Language Models
As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored…
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to im…
Game-Based Learning: Its Impact on Asian Teenagers’ Motivation for English Learning
Game-based learning, which applies games to help learners learn, is an effective tool to improve learners’ motivation. Due to its complex operation mechanism and the influence of many factors, it is often difficult for many learners to…
Unlocking Recursive Thinking of LLMs: Alignment via Refinement
Training Language Model to Critique for Better Refinement
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and …
Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they f…
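One simple way to operationalize a key-point-recall metric like the one in this paper's title: measure the fraction of gold key points that a generated response covers. The naive substring matching below stands in for the semantic matching a real evaluator (e.g. an LLM judge) would use; it is an illustrative sketch, not the paper's metric:

```python
def key_point_recall(key_points: list[str], response: str) -> float:
    # Fraction of reference key points that appear in the response.
    # Case-insensitive substring matching is a crude stand-in for
    # semantic matching; illustrative only.
    if not key_points:
        return 0.0
    hits = sum(1 for kp in key_points if kp.lower() in response.lower())
    return hits / len(key_points)

print(key_point_recall(["Paris", "Seine"],
                       "Paris lies on the Seine."))  # prints 1.0
```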
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses …
Nash CoT: Multi-Path Inference with Preference Equilibrium
Chain of thought (CoT) is a reasoning framework that can enhance the performance of Large Language Models (LLMs) on complex inference tasks. In particular, among various studies related to CoT, multi-path inference stands out as a simple y…
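Multi-path inference is commonly aggregated by majority vote over the final answers of several sampled reasoning paths (the self-consistency baseline). A minimal sketch of that baseline, which Nash CoT replaces with a preference-equilibrium criterion:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Aggregate final answers from multiple sampled CoT paths by
    # picking the most frequent one (self-consistency baseline;
    # not Nash CoT's preference-equilibrium aggregation).
    return Counter(answers).most_common(1)[0][0]

# Three sampled reasoning paths, two of which agree:
print(majority_vote(["42", "42", "41"]))  # prints 42
```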
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
Recent advancements in Large Language Models (LLMs) have pushed the boundaries of natural language processing, especially in long-context understanding. However, the evaluation of these models' long-context abilities remains a challenge du…
Knowledge Conflicts for LLMs: A Survey
This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of kn…
How Likely Do LLMs with CoT Mimic Human Reasoning?
Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved …
Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions
Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., internal reasoning such as generate-then-read). In contrast, for questions that fal…
$R^3$: "This is My SQL, Are You With Me?" A Consensus-Based Multi-Agent System for Text-to-SQL Tasks
Large Language Models (LLMs) have demonstrated strong performance on various tasks. To unleash their power on the Text-to-SQL task, we propose $R^3$ (Review-Rebuttal-Revision), a consensus-based multi-agent system for Text-to-SQL tasks. $R…
TRAMS: Training-free Memory Selection for Long-range Language Modeling
The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Though several specific transformer architectures have been designed to tackle issues of long-range dependencies…
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the prob…
A Survey on Evaluation of Large Language Models
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their eva…
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automa…
RFiD: Towards Rational Fusion-in-Decoder for Open-Domain Question Answering
Open-Domain Question Answering (ODQA) systems necessitate a reader model capable of generating answers by simultaneously referring to multiple passages. Although representative models like Fusion-in-Decoder (FiD) have been proposed to addr…
Exploiting Abstract Meaning Representation for Open-Domain Question Answering
The Open-Domain Question Answering (ODQA) task involves retrieving and subsequently generating answers from fine-grained relevant passages within a database. Current systems leverage Pretrained Language Models (PLMs) to model the relations…
Evaluating Open-QA Evaluation
This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, indicating that hu…