Yusuke Ijima
Voice Impression Control in Zero-Shot TTS
Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived…
One's own recorded voice is more intelligible than the voices of others in the presence of competing speech
Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
The advancements in zero-shot text-to-speech (TTS) methods, based on large-scale models, have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a ligh…
What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, sp…
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters
The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately. However, this approa…
Effect of Personal Traits on Impressions of One's Own Recorded Voice
StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models
We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Although most conventional techniques for para-/non-linguistic information recognition focus on the category classification …
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?
Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks, such as speech and speaker recognition. More recently, speech SSL models have also been shown to be beneficial in advancing…
Expressive Text-to-Speech Synthesis using Text Chat Dataset with Speaking Style Information
This paper aims to generate expressive speech for integration with robot and AI-character dialogue systems. To generate expressive speech, some researchers have proposed using labels that express specific dialogue acts and emotions (i.e.…
Perceived emotional states mediate willingness to buy from advertising speech
Previous studies have shown that stimulus-organism-response (SOR) theory can well explain the willingness to buy from stores, products, and advertising-related stimuli. However, few studies have investigated advertising speech stimulus tha…
SIMD-size aware weight regularization for fast neural vocoding on CPU
This paper proposes weight regularization for a faster neural vocoder. Pruning time-consuming DNN modules is a promising way to realize a real-time vocoder on a CPU (e.g. WaveRNN, LPCNet). Regularization that encourages sparsity is also ef…
Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification
We propose non-parallel and many-to-many voice conversion (VC) using variational autoencoders (VAEs) that constructs VC models for converting arbitrary speakers' characteristics into those of other arbitrary speakers without parallel speec…
Saxe: Text-to-Speech Synthesis Engine Applicable to Diverse Use Cases
…from input text and a speech-synthesis section that generates synthesized speech from the…
Model architectures to extrapolate emotional expressions in DNN-based text-to-speech
Estimating Sentence Final Tone Labels using Dialogue-Act Information for Text-to-Speech Synthesis within a Spoken Dialogue System
This paper proposes a novel sentence final tone labels estimation method using dialogue-act (DA) information for text-to-speech synthesis within a spoken dialogue system. Estimating appropriate sentence final tone labels is considered essent…
DNN-based Speech Synthesis using Dialogue-Act Information and Its Evaluation with Respect to Illocutionary Act Naturalness
This paper aims at improving naturalness of synthesized speech generated by a text-to-speech (TTS) system within a spoken dialogue system with respect to “how natural the system’s intention is perceived via the synthesized speech”. We call t…
V2S attack: building DNN-based voice conversion from automatic speaker verification
This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication systems. Basically, the ASV systems d…
DNN-Based Speech Synthesis Using Speaker Codes
Deep neural network (DNN)-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it has not been shown whether the synthesized speech quality can be improved by utilizing …
Similar Speaker Selection Technique Based on Distance Metric Learning Using Highly Correlated Acoustic Features with Perceptual Voice Quality Similarity
This paper analyzes the correlation between various acoustic features and perceptual voice quality similarity, and proposes a perceptually similar speaker selection technique based on distance metric learning. To analyze the relationship b…