Mark Hasegawa-Johnson
TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models
Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In thi…
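The retrieval step named in the title can be pictured with a short sketch: embed a first-pass transcript of the query utterance, then pick the k candidate examples whose transcripts lie nearest in embedding space. This is a minimal illustration assuming a generic text encoder; `embed`, `select_icl_examples`, and the toy data are hypothetical stand-ins, not the paper's code.

```python
# Minimal sketch of KNN-based in-context example selection (not the
# paper's exact pipeline). `embed` is a hypothetical text-embedding
# function; any sentence-level encoder could stand in for it.
import numpy as np

def embed(texts):
    """Hypothetical text encoder: returns one L2-normalized vector per text."""
    rng = np.random.default_rng(0)           # stand-in for a real model
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def select_icl_examples(query_hyp, pool_hyps, pool_examples, k=4):
    """Pick the k pool examples whose (first-pass) transcripts are
    nearest to the query's transcript in embedding space."""
    q = embed([query_hyp])[0]
    P = embed(pool_hyps)
    sims = P @ q                              # cosine similarity (unit vectors)
    top = np.argsort(-sims)[:k]
    return [pool_examples[i] for i in top]

# Usage: retrieve examples for a first-pass ASR hypothesis of the query.
pool_hyps = ["turn on the lights", "play some jazz", "what's the weather"]
pool_examples = [("audio_0.wav", pool_hyps[0]),
                 ("audio_1.wav", pool_hyps[1]),
                 ("audio_2.wav", pool_hyps[2])]
print(select_icl_examples("turn off the lights", pool_hyps, pool_examples, k=2))
```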
V(is)owel: An Interactive Vowel Chart to Understand What Makes Visual Pronunciation Effective in Second Language Learning
Visual feedback speeds up learners' improvement of pronunciation in a second language. Combining visuals with audio allows speakers to see sounds and pronunciation differences that they are unable to hear. Prior studies have tested di…
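For readers unfamiliar with vowel charts, the sketch below plots vowels by their first two formants with both axes reversed, the standard acoustic layout such a tool visualizes; the formant values are rough textbook averages for illustration only, not data from the study.

```python
# Minimal sketch of an acoustic vowel chart like the one a tool such as
# V(is)owel visualizes: F1 on the vertical axis, F2 on the horizontal,
# both reversed so the plot matches the articulatory vowel space.
# Formant values are rough adult-male averages, for illustration only.
import matplotlib.pyplot as plt

vowels = {"i": (270, 2290), "u": (300, 870), "ɑ": (730, 1090), "æ": (660, 1720)}

fig, ax = plt.subplots()
for v, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)
    ax.annotate(v, (f2, f1), textcoords="offset points", xytext=(5, 5))
ax.invert_xaxis()                 # front vowels (high F2) on the left
ax.invert_yaxis()                 # high vowels (low F1) at the top
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.set_title("Acoustic vowel space (illustrative values)")
plt.show()
```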
ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization
We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or comp…
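A minimal sketch of the selection idea the abstract states, assuming per-token policy logits are available: tokens whose policy probability is low are flagged as preference-critical, and a per-token loss is restricted to them. The threshold and all names here are illustrative, not taken from the paper.

```python
# Sketch of confidence-based critical-token selection: use the training
# policy's own per-token confidence to pick "preference-critical" tokens,
# so a preference loss (e.g., DPO-style) can be applied to those only.
import torch

def select_critical_tokens(logits, token_ids, threshold=0.5):
    """Mark tokens whose policy probability falls below `threshold`."""
    logprobs = torch.log_softmax(logits, dim=-1)                    # (T, V)
    tok_logprobs = logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    confidence = tok_logprobs.exp()                                 # (T,)
    return confidence < threshold                                   # bool mask

# Usage: restrict a per-token preference loss to the selected tokens.
T, V = 6, 100
logits = torch.randn(T, V)
tokens = torch.randint(0, V, (T,))
mask = select_critical_tokens(logits, tokens)
per_token_loss = torch.randn(T).abs()          # stand-in per-token loss
loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
print(mask, loss)
```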
Convolution-Augmented Transformers for Enhanced Speaker-Independent Dysarthric Speech Recognition
Dysarthria is a motor speech disorder characterized by muscle movement difficulties that complicate verbal communication. It poses significant challenges to Automatic Speech Recognition (ASR) systems due to data scarcity and speaker variab…
LIMMITS'24: Multi-Speaker, Multi-Lingual INDIC TTS With Voice Cloning
The Multi-speaker, Multi-lingual Indic Text to Speech (TTS) with voice cloning (LIMMITS'24) challenge is organized as part of the ICASSP 2024 signal processing grand challenge. LIMMITS'24 aims at the development of voice cloning for the mu…
R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate
The COVID-19 pandemic has underscored the need for low-cost, scalable approaches to measuring contactless vital signs, either during initial triage at a healthcare facility or virtual telemedicine visits. Remote photoplethysmography (rPPG)…
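Downstream of ROI selection (the paper's contribution), the standard rPPG back end recovers heart rate from the ROI's mean pixel intensity over time. A minimal sketch, assuming a clean trace and the usual 0.7-4 Hz heart-rate band:

```python
# Minimal sketch of the standard rPPG back end: given the mean green-channel
# intensity of a (face) region of interest over time, find the dominant
# frequency in the plausible heart-rate band and convert to beats/min.
# ROI selection itself (the paper's focus) is not shown here.
import numpy as np

def estimate_heart_rate(roi_trace, fps, band=(0.7, 4.0)):
    x = roi_trace - np.mean(roi_trace)           # remove DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])   # 42-240 bpm
    peak = freqs[in_band][np.argmax(power[in_band])]
    return 60.0 * peak                           # Hz -> beats per minute

# Usage with a synthetic 72-bpm pulse signal sampled at 30 fps.
fps, hr_hz = 30, 1.2
t = np.arange(0, 20, 1.0 / fps)
trace = 0.05 * np.sin(2 * np.pi * hr_hz * t) + 0.01 * np.random.randn(len(t))
print(estimate_heart_rate(trace, fps))           # ~72.0
```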
Community-Supported Shared Infrastructure in Support of Speech Accessibility
Purpose: The Speech Accessibility Project (SAP) intends to facilitate research and development in automatic speech recognition (ASR) and other machine learning tasks for people with speech disabilities. The purpose of this article is to in…
Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) perfor…
LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition
Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge, wherein the target environment diverges from the original training environment. A prime exemplification is TTA for Automatic Speech Recognition (AS…
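As context, frame-level TTA for ASR is often built on entropy minimization over the model's output distributions. The sketch below shows only that baseline loop, with the language-informed correction that LI-TTA adds left out; the model and optimizer are toy stand-ins.

```python
# Sketch of the entropy-minimization loop that frame-level TTA methods for
# ASR build on (LI-TTA adds a language-model correction term on top, which
# is not reproduced here). Purely illustrative model and optimizer setup.
import torch

def tta_step(model, feats, optimizer):
    """One unsupervised adaptation step: minimize mean per-frame entropy."""
    logits = model(feats)                            # (T, V) frame logits
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1).mean()    # average over frames
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Usage with a toy linear "acoustic model".
model = torch.nn.Linear(40, 32)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
feats = torch.randn(100, 40)                         # one utterance's features
for _ in range(3):
    print(tta_step(model, feats, opt))
```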
TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human preferences. These human preference data, however, are labeled at the sequence level, creating a mismat…
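To make the sequence-level/token-level contrast concrete, here is a sketch of a policy-gradient update that consumes one continuous reward per token rather than a single scalar per sequence; the reward values and shapes are placeholders, not the paper's learned reward model.

```python
# Sketch of consuming a token-level continuous reward in a policy-gradient
# update, in contrast to a single sequence-level scalar. The reward values
# and model here are placeholders, not the paper's trained reward model.
import torch

def token_level_pg_loss(logits, token_ids, token_rewards):
    """REINFORCE-style loss where each token gets its own reward signal."""
    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # (T,)
    return -(token_rewards * tok_logp).mean()

T, V = 8, 50
logits = torch.randn(T, V, requires_grad=True)
tokens = torch.randint(0, V, (T,))
rewards = torch.tanh(torch.randn(T))      # continuous, in [-1, 1], per token
loss = token_level_pg_loss(logits, tokens, rewards)
loss.backward()
print(loss.item())
```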
Sound Tagging in Infant-centric Home Soundscapes
Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focuse…
Multimodal Respiratory Rate Estimation From Audio and Video in Emergency Department Patients
Given the recent COVID-19 pandemic, there has been a push in the medical community for reliable, remote medical care. The ubiquity of smartphone devices has brought about much interest in the estimation of patient vital signs via an audio …
Towards Unsupervised Speech Recognition Without Pronunciation Models
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech a…
Analysis of Self-Supervised Speech Models on Children’s Speech and Infant Vocalizations
To understand why self-supervised learning (SSL) models have empirically achieved strong performances on several speech-processing downstream tasks, numerous studies have focused on analyzing the encoded information of the SSL layer repres…
C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language model…
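The "text feature dispersion" in the title can be read as a statistic over the class text embeddings a tuned prompt induces. Below is a sketch of one plausible definition (mean distance from the centroid); how C-TPT weights such a term against the tuning objective is not reproduced here, and the shapes are illustrative.

```python
# Sketch of a "text feature dispersion" statistic: how spread out the class
# text embeddings produced by a tuned prompt are. A collapsed prompt yields
# near-zero dispersion; C-TPT couples a quantity like this with the usual
# test-time prompt-tuning objective (weighting not shown).
import torch

def text_feature_dispersion(class_embs):
    """Mean L2 distance of class text embeddings from their centroid."""
    embs = torch.nn.functional.normalize(class_embs, dim=-1)
    centroid = embs.mean(0, keepdim=True)
    return (embs - centroid).norm(dim=-1).mean()

# Usage: compare a spread-out prompt against a nearly collapsed one.
embs_a = torch.randn(10, 512)                       # 10 classes, one prompt
embs_b = embs_a.mean(0, keepdim=True).repeat(10, 1) + 0.01 * torch.randn(10, 512)
print(text_feature_dispersion(embs_a), text_feature_dispersion(embs_b))
```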
AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition
In Automatic Speech Recognition (ASR) systems, a recurring obstacle is the generation of narrowly focused output distributions. This phenomenon emerges as a side effect of Connectionist Temporal Classification (CTC), a robust sequence lear…
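The fix the title names can be sketched as a CTC loss combined with a maximum-entropy regularizer on the frame-level distributions. AdaMER-CTC adapts the regularization weight during training, whereas the sketch below fixes it; all shapes and values are illustrative.

```python
# Sketch of CTC training with a maximum-entropy regularizer that counters
# CTC's overly peaked frame distributions. AdaMER-CTC adapts the weight
# `beta` during training; a fixed beta is used here for illustration.
import torch
import torch.nn.functional as F

def ctc_with_entropy_reg(log_probs, targets, in_lens, tgt_lens, beta=0.1):
    """CTC loss minus beta * mean frame entropy (entropy is maximized)."""
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
    return ctc - beta * entropy

T, N, C = 50, 2, 20                       # frames, batch, classes (0 = blank)
log_probs = F.log_softmax(torch.randn(T, N, C, requires_grad=True), dim=-1)
targets = torch.randint(1, C, (N, 10))    # label sequences (no blanks)
in_lens = torch.full((N,), T)
tgt_lens = torch.full((N,), 10)
loss = ctc_with_entropy_reg(log_probs, targets, in_lens, tgt_lens)
loss.backward()
print(loss.item())
```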
Preliminary Technical Validation of LittleBeats™: A Multimodal Sensing Platform to Capture Cardiac Physiology, Motion, and Vocalizations
Across five studies, we present the preliminary technical validation of an infant-wearable platform, LittleBeats™, that integrates electrocardiogram (ECG), inertial measurement unit (IMU), and audio sensors. Each sensor modality is validat…
Preliminary Technical Validation of LittleBeats™: A Multimodal Sensing Platform to Capture Cardiac Physiology, Motion, and Vocalizations
Background: The use of wearable devices has burgeoned over the past decade, including wearables for infants and young children. Typically, such devices assess a single modality, and few have undergone scientific validation. To address this…
Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech
The Lightweight, Multi-speaker, Multi-lingual Indic Text-to-Speech (LIMMITS'23) challenge is organized as part of the ICASSP 2023 Signal Processing Grand Challenge. LIMMITS'23 aims at the development of a lightweight, multi-speaker, multi-…
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes b…
The influence of memory for and affective response to health messages on self-care behavioral intentions
Clinical test results are often presented in digital health solutions (e.g., patient portals, mobile phone apps) with limited context to help patients understand implications of this numeric information. Guided by a framework that integrat…
Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching
Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPU…
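One ingredient named in the title, positional unigram statistics, can be sketched directly: count how often each token occurs in each relative-position bucket, then compare those tables between unpaired corpora. The bucketing and distance below are illustrative guesses, not the paper's exact definitions.

```python
# Sketch of "positional unigram" statistics: the frequency of each token at
# each (bucketed) relative position within a sequence. Matching such
# statistics between unpaired speech-unit strings and text provides an
# alignment signal; the distance used here is illustrative.
import numpy as np

def positional_unigrams(seqs, vocab_size, n_pos=4):
    """counts[p, v] = frequency of token v in relative-position bucket p."""
    counts = np.zeros((n_pos, vocab_size))
    for seq in seqs:
        for i, tok in enumerate(seq):
            bucket = min(int(n_pos * i / len(seq)), n_pos - 1)
            counts[bucket, tok] += 1
    return counts / counts.sum(axis=1, keepdims=True)

corpus_a = [[1, 2, 3, 4], [1, 3, 3, 2], [1, 2, 4, 4]]   # e.g., text tokens
corpus_b = [[1, 2, 3, 4], [1, 2, 3, 3]]                 # e.g., decoded units
P, Q = positional_unigrams(corpus_a, 5), positional_unigrams(corpus_b, 5)
print(np.abs(P - Q).sum())    # total-variation-style mismatch to minimize
```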
Evaluating Users’ Experiences of a Child Multimodal Wearable Device: Mixed Methods Approach
Background: Wearable devices permit the continuous, unobtrusive collection of data from children in their natural environments and can transform our understanding of child development. Although the use of wearable devices has begun to emerg…
Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis
The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio may greatly reduce the labor of coding children's…
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
Text-to-Text Transfer Transformer (T5) has recently been considered for Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5, referred to as ByT5, recently gave promising results on word-…
Classification of Infant Sleep/Wake States: Cross-Attention among Large Scale Pretrained Transformer Networks using Audio, ECG, and IMU Data
Infant sleep is critical to brain and behavioral development. Prior studies on infant sleep/wake classification have been largely limited to reliance on expensive and burdensome polysomnography (PSG) tests in the laboratory or wearable dev…
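The fusion mechanism named in the title can be sketched with a single cross-attention layer in which one modality's embeddings query another's; the dimensions, pooling, and toy inputs below are illustrative, not the paper's architecture.

```python
# Sketch of cross-modal attention in the spirit of the title: one modality's
# embeddings query another's before classification. Dimensions and the
# single attention layer are illustrative, not the paper's model.
import torch

d = 64
audio = torch.randn(100, 1, d)   # (seq, batch, dim) audio embeddings
ecg = torch.randn(200, 1, d)     # ECG embeddings from a pretrained encoder
attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=4)

# Audio frames attend to ECG frames: queries come from one modality,
# keys/values from the other.
fused, _ = attn(query=audio, key=ecg, value=ecg)
pooled = fused.mean(dim=0)                       # (1, d) utterance vector
logits = torch.nn.Linear(d, 2)(pooled)           # sleep vs. wake
print(logits)
```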
A Theory of Unsupervised Speech Recognition
Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework is…
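As background for the theory, ASR-U is commonly posed as distribution matching: learn a generator G from speech to text so that generated transcripts are distributed like real text. A generic statement of that objective, with the usual GAN instantiation (notation here is generic, not necessarily the paper's):

```latex
% Generic ASR-U objective: match the distribution of generated transcripts
% G(X) to the distribution P_Y of unpaired text, for some divergence D.
\min_{G} \; \mathcal{D}\!\left( P_{Y} \,\|\, P_{G(X)} \right)
% GAN instantiation: a critic C tries to tell real text from G's output.
\min_{G} \max_{C} \;
\mathbb{E}_{y \sim P_{Y}}\!\left[ \log C(y) \right]
+ \mathbb{E}_{x \sim P_{X}}\!\left[ \log\!\big( 1 - C(G(x)) \big) \right]
```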