James Glass
Unleashing the Power of Digital Voice for Early Symptom Detection for Alzheimer's Disease and Related Dementias
Background Analysis of digital voice (dVoice) is emerging as an inclusive approach to detecting the earliest preclinical symptoms of Alzheimer's disease (AD) and related dementias (ADRD) because of the widespread penetration of recording d…
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose T…
Towards Unsupervised Speech Recognition at the Syllable-Level
Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning…
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth respon…
USAD: Universal Speech and Audio Representation via Distillation
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a uni…
RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-st…
Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers bu…
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
We propose Omni-R1 which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new State-of-the-Art performance on the recent MMAU and MMAR benc…
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences w…
Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios invo…
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training te…
Obfuscation via pitch-shifting for balancing privacy and diagnostic utility in voice-based cognitive assessment
INTRODUCTION Digital voice analysis is an emerging tool for differentiating cognitive states, but it poses privacy risks as automated systems may inadvertently identify speakers. METHODS We developed a computational framework to evaluate t…
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which ma…
Machine learning for privacy-protected voice analysis in dementia assessment
Background The prevalence of cognitive impairments, such as mild cognitive impairment (MCI) and Alzheimer's disease (AD), has surged, necessitating rapid, cost‐effective, and non‐invasive diagnostic tools. Speech, as a rich source of cogni…
Differential Privacy Preserving Voice Conversion for Audio Health Data
Background Speech is a predominant mode of human communication. Speech digital recordings are inexpensive to record and contain rich health related information. Deep learning, a key method, excels in detecting intricate patterns, however, …
Obfuscation via pitch-shifting for balancing privacy and diagnostic utility in voice-based cognitive assessment
Introduction Digital voice analysis is gaining traction as a tool to differentiate cognitively normal from impaired individuals. However, voice data poses privacy risks due to the potential identification of speakers by automated systems. …
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Doub…
A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
Neural Audio Codecs, initially designed as a compression technique, have gained more attention recently for speech generation. Codec models represent each audio frame as a sequence of tokens, i.e., discrete embeddings. The discrete and low…
Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback
Building effective dense retrieval systems remains difficult when relevance supervision is not available. Recent work has looked to overcome this challenge by using a Large Language Model (LLM) to generate hypothetical documents that can b…
Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains
Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA) due to their structured representation of knowledge. Existing research on the utilization of KG for large language models (LLMs) prevalently relies …
Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models
Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling. The ideal neural audio codec should maintain content, paralingui…
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
When asked to summarize articles or answer questions given a passage, large language models (LLMs) can hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context. This paper describes…
Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer
Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative than manual approaches. We propose ALS longitudinal speech transformer (ALST), a neural network-based autom…
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constr…
Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers
Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontextualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing me…
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, spee…