Cunxiang Wang
AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs
Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks ar…
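A back-of-envelope illustration of the quadratic attention cost mentioned above: full self-attention compares every token with every other token, so the score matrix grows with the square of the sequence length. This is a generic sketch, not code from the paper:

```python
def attention_score_entries(n_tokens: int) -> int:
    # Full self-attention scores every (query, key) token pair,
    # so the score matrix holds n_tokens * n_tokens entries: O(N^2).
    return n_tokens * n_tokens

# Doubling the audio length (in tokens) quadruples the attention cost.
print(attention_score_entries(1000))  # prints 1000000
print(attention_score_entries(2000))  # prints 4000000
```

This is why long-form audio, which easily produces tens of thousands of tokens, quickly becomes expensive for standard Transformer attention.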
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Incons…
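The score-comparison inconsistency mentioned above can be made concrete with a small hedged sketch: a judge scores each response pointwise but then prefers the lower-scored response in a head-to-head comparison. The helper below is illustrative only, not the TrustJudge implementation:

```python
def is_score_comparison_inconsistent(score_a: float, score_b: float,
                                     pairwise_winner: str) -> bool:
    """Flag the case where pointwise scores and a pairwise preference
    contradict each other. `pairwise_winner` is "A" or "B"; equal
    pointwise scores are treated as consistent with either preference.
    (Illustrative helper, not the paper's code.)"""
    if score_a == score_b:
        return False
    pointwise_winner = "A" if score_a > score_b else "B"
    return pointwise_winner != pairwise_winner

# The judge scored A 8/10 and B 6/10, yet preferred B head-to-head:
print(is_score_comparison_inconsistent(8, 6, "B"))  # prints True
```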
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through mu…
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through …
Exploring the Evolution of Physics Cognition in Video Generation: A Survey
Video generation has witnessed significant progress recently, especially with the rapid advancement of diffusion models. Despite this, their deficiencies in physical cognition have gradually received widespread attention - g…
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, resulting in highly inaccurate and uninterpre…
LongSafety: Evaluating Long-Context Safety of Large Language Models
As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored…
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to im…
Game-Based Learning: Its Impact on Asian Teenagers’ Motivation for English Learning
Game-based learning, which applies games to help learners learn, is an effective tool to improve learners’ motivation. Due to its complex operation mechanism and the influence of many factors, it is often difficult for many learners to…
Unlocking Recursive Thinking of LLMs: Alignment via Refinement
Training Language Model to Critique for Better Refinement
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and …
Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they f…
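One simple way to operationalize a key-point-recall metric like the one in this paper's title: measure the fraction of gold key points that a generated response covers. The naive substring matching below stands in for the semantic matching a real evaluator (e.g. an LLM judge) would use; it is an illustrative sketch, not the paper's metric:

```python
def key_point_recall(key_points: list[str], response: str) -> float:
    # Fraction of reference key points that appear in the response.
    # Case-insensitive substring matching is a crude stand-in for
    # semantic matching; illustrative only.
    if not key_points:
        return 0.0
    hits = sum(1 for kp in key_points if kp.lower() in response.lower())
    return hits / len(key_points)

print(key_point_recall(["Paris", "Seine"],
                       "Paris lies on the Seine."))  # prints 1.0
```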
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation
Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses …
Nash CoT: Multi-Path Inference with Preference Equilibrium
Chain of thought (CoT) is a reasoning framework that can enhance the performance of Large Language Models (LLMs) on complex inference tasks. In particular, among various studies related to CoT, multi-path inference stands out as a simple y…
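Multi-path inference is commonly aggregated by majority vote over the final answers of several sampled reasoning paths (the self-consistency baseline). A minimal sketch of that baseline, which Nash CoT replaces with a preference-equilibrium criterion:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Aggregate final answers from multiple sampled CoT paths by
    # picking the most frequent one (self-consistency baseline;
    # not Nash CoT's preference-equilibrium aggregation).
    return Counter(answers).most_common(1)[0][0]

# Three sampled reasoning paths, two of which agree:
print(majority_vote(["42", "42", "41"]))  # prints 42
```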
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
Recent advancements in Large Language Models (LLMs) have pushed the boundaries of natural language processing, especially in long-context understanding. However, the evaluation of these models' long-context abilities remains a challenge du…
Knowledge Conflicts for LLMs: A Survey
This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of kn…
How Likely Do LLMs with CoT Mimic Human Reasoning?
Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved …
Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions
Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., internal reasoning such as generate-then-read). In contrast, for questions that fal…
$R^3$: "This is My SQL, Are You With Me?" A Consensus-Based Multi-Agent System for Text-to-SQL Tasks
Large Language Models (LLMs) have demonstrated strong performance on various tasks. To unleash their power on the Text-to-SQL task, we propose $R^3$ (Review-Rebuttal-Revision), a consensus-based multi-agent system for Text-to-SQL tasks. $R…
TRAMS: Training-free Memory Selection for Long-range Language Modeling
The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Though several specific transformer architectures have been designed to tackle issues of long-range dependencies…
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital. We define the Factuality Issue as the prob…
A Survey on Evaluation of Large Language Models
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their eva…
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automa…
RFiD: Towards Rational Fusion-in-Decoder for Open-Domain Question Answering
Open-Domain Question Answering (ODQA) systems necessitate a reader model capable of generating answers by simultaneously referring to multiple passages. Although representative models like Fusion-in-Decoder (FiD) have been proposed to addr…
Exploiting Abstract Meaning Representation for Open-Domain Question Answering
The Open-Domain Question Answering (ODQA) task involves retrieving and subsequently generating answers from fine-grained relevant passages within a database. Current systems leverage Pretrained Language Models (PLMs) to model the relations…
Evaluating Open-QA Evaluation
This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, indicating that hu…