Yukiya Hono
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Multimodal language models that process both text and speech have potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response r…
Release of Pre-Trained Models for the Japanese Language
AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-tra…
PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary conditioning signals. Recently, DDPM-based neural vocoders have gained prominence as non-au…
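A minimal sketch of the general idea in PyTorch: a sine excitation is derived from F0 and concatenated channel-wise with upsampled mel frames as auxiliary conditioning for a denoiser step. This is not the paper's implementation; the module and parameter names are illustrative assumptions, and the diffusion timestep embedding is omitted for brevity.

# Hedged sketch (not the authors' code): conditioning a denoiser step on an
# explicit periodic signal derived from F0, alongside mel-spectrogram frames.
import torch
import torch.nn as nn

def sine_excitation(f0, hop_length=256, sample_rate=24000):
    """Upsample frame-level F0 to sample rate and integrate phase into a sine wave."""
    f0_up = torch.repeat_interleave(f0, hop_length, dim=-1)      # (B, T_samples)
    phase = 2 * torch.pi * torch.cumsum(f0_up / sample_rate, dim=-1)
    return torch.sin(phase) * (f0_up > 0)                        # zero where unvoiced

class DenoiserStep(nn.Module):
    """Toy denoiser that sees noisy waveform + periodic signal + upsampled mel frames."""
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 + n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, noisy_wav, periodic, mel_up):
        x = torch.cat([noisy_wav, periodic, mel_up], dim=1)      # channel-wise conditioning
        return self.net(x)                                       # predicted noise

# Minimal shape check
B, frames, n_mels, hop = 2, 10, 80, 256
f0 = torch.full((B, frames), 200.0)
mel = torch.randn(B, n_mels, frames)
periodic = sine_excitation(f0, hop).unsqueeze(1)                 # (B, 1, frames*hop)
mel_up = torch.repeat_interleave(mel, hop, dim=-1)               # naive frame upsampling
noisy = torch.randn(B, 1, frames * hop)
eps_hat = DenoiserStep(n_mels)(noisy, periodic, mel_up)
print(eps_hat.shape)  # torch.Size([2, 1, 2560])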
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining atte…
Towards human-like spoken dialogue generation between AI agents from written dialogue
The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues ha…
UniFLG: Unified Facial Landmark Generator from Text or Speech
Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces …
Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation
This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by sco…
Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal mo…
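As a rough illustration of the concept, the sketch below shows additive attention whose alignment energies also depend on note-position features (for example, the relative position of a frame within its musical note). This is only loosely inspired by the paper; the class name, feature dimensions, and scoring form are assumptions.

# Hedged sketch (illustrative only): attention energies that include note-position terms.
import torch
import torch.nn as nn

class NotePositionAwareAttention(nn.Module):
    def __init__(self, query_dim, key_dim, pos_dim, attn_dim=128):
        super().__init__()
        self.q = nn.Linear(query_dim, attn_dim)
        self.k = nn.Linear(key_dim, attn_dim)
        self.p = nn.Linear(pos_dim, attn_dim)   # e.g. relative position inside the note
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, query, keys, note_pos):
        # query: (B, query_dim), keys: (B, N, key_dim), note_pos: (B, N, pos_dim)
        energies = self.v(torch.tanh(
            self.q(query).unsqueeze(1) + self.k(keys) + self.p(note_pos)
        )).squeeze(-1)                                            # (B, N)
        weights = torch.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # (B, key_dim)
        return context, weights

attn = NotePositionAwareAttention(query_dim=256, key_dim=256, pos_dim=4)
ctx, w = attn(torch.randn(2, 256), torch.randn(2, 50, 256), torch.rand(2, 50, 4))
print(ctx.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 50])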
Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System
This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform …
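A minimal sketch of why such a filter can sit inside a neural network: a cepstral synthesis filter applied per frame in the frequency domain using only differentiable tensor operations, so gradients flow back to the cepstral coefficients. For brevity this sketch omits the mel-frequency warping (the alpha parameter) that a true mel-cepstral, MLSA-style filter uses, and it is not the paper's implementation.

# Hedged sketch: a differentiable cepstral synthesis filter (mel warping omitted).
import torch

def cepstral_filter(excitation, cepstrum, n_fft=1024):
    """excitation: (B, T) frame of excitation; cepstrum: (B, C) cepstral coefficients."""
    # Cepstrum -> log amplitude spectrum -> amplitude spectrum (all differentiable).
    c = torch.nn.functional.pad(cepstrum, (0, n_fft - cepstrum.size(-1)))
    log_mag = torch.fft.rfft(c, n=n_fft).real           # cepstrum -> approx. log|H(w)|
    H = torch.exp(log_mag)
    # Filter the excitation by multiplying spectra and returning to the time domain.
    E = torch.fft.rfft(excitation, n=n_fft)
    y = torch.fft.irfft(E * H, n=n_fft)
    return y[..., :excitation.size(-1)]

exc = torch.randn(2, 512, requires_grad=True)
cep = (0.1 * torch.randn(2, 30)).requires_grad_()
frame = cepstral_filter(exc, cep)
frame.sum().backward()                                   # gradients flow to the cepstrum
print(frame.shape, cep.grad.shape)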
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
Recent text-to-speech (TTS) systems have achieved quality comparable to that of humans; however, their application in spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we r…
PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components
We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR waveform generation models can generate speech waveform…
PeriodNet: A Non-Autoregressive Raw Waveform Generative Model With a Structure Separating Periodic and Aperiodic Components
This paper presents PeriodNet, a non-autoregressive (non-AR) waveform generative model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR raw waveform generative models have enabled the fa…
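The sketch below illustrates one plausible reading of the separated structure, assuming a parallel arrangement: a periodic sub-generator driven by a sine excitation and an aperiodic sub-generator driven by noise, with their outputs summed. This is an assumed toy structure for illustration, not the released model.

# Hedged sketch: a parallel periodic/aperiodic generator arrangement.
import torch
import torch.nn as nn

class SubGenerator(nn.Module):
    def __init__(self, cond_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + cond_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def forward(self, excitation, cond):
        return self.net(torch.cat([excitation, cond], dim=1))

class ParallelPeriodNet(nn.Module):
    def __init__(self, cond_dim=80):
        super().__init__()
        self.periodic = SubGenerator(cond_dim)    # driven by sine excitation
        self.aperiodic = SubGenerator(cond_dim)   # driven by noise

    def forward(self, sine, noise, cond):
        return self.periodic(sine, cond) + self.aperiodic(noise, cond)

B, T, cond_dim = 2, 4096, 80
model = ParallelPeriodNet(cond_dim)
wav = model(torch.sin(torch.linspace(0, 100, T)).expand(B, 1, T),
            torch.randn(B, 1, T),
            torch.randn(B, cond_dim, T))
print(wav.shape)  # torch.Size([2, 1, 4096])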
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling the fine …
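As a rough sketch of what a multi-grained hierarchy can look like, the code below extracts a coarse utterance-level latent and conditions finer frame-level latents on it. All module names and dimensions are assumptions made for illustration, not the paper's architecture.

# Hedged sketch: a two-level (coarse + fine) latent extractor.
import torch
import torch.nn as nn

class MultiGrainedEncoder(nn.Module):
    def __init__(self, feat_dim=80, z_coarse=16, z_fine=8):
        super().__init__()
        self.coarse = nn.GRU(feat_dim, z_coarse, batch_first=True)
        self.fine = nn.GRU(feat_dim + z_coarse, z_fine, batch_first=True)

    def forward(self, mel):
        # mel: (B, T, feat_dim)
        _, h = self.coarse(mel)                                   # (1, B, z_coarse)
        z_c = h[-1]                                               # utterance-level latent
        z_c_rep = z_c.unsqueeze(1).expand(-1, mel.size(1), -1)    # broadcast over frames
        z_f, _ = self.fine(torch.cat([mel, z_c_rep], dim=-1))     # fine-grained latents
        return z_c, z_f

enc = MultiGrainedEncoder()
z_c, z_f = enc(torch.randn(2, 200, 80))
print(z_c.shape, z_f.shape)  # torch.Size([2, 16]) torch.Size([2, 200, 8])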