Yukiya Hono
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Multimodal language models that process both text and speech have potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response r…
Release of Pre-Trained Models for the Japanese Language
AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-tra…
PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary conditioning signals. Recently, DDPM-based neural vocoders have gained prominence as non-au…
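A minimal sketch of the general idea in PyTorch: a sine excitation is derived from F0 and concatenated channel-wise with upsampled mel frames as auxiliary conditioning for a denoiser step. This is not the paper's implementation; the module and parameter names are illustrative assumptions, and the diffusion timestep embedding is omitted for brevity.

# Hedged sketch (not the authors' code): conditioning a denoiser step on an
# explicit periodic signal derived from F0, alongside mel-spectrogram frames.
import torch
import torch.nn as nn

def sine_excitation(f0, hop_length=256, sample_rate=24000):
    """Upsample frame-level F0 to sample rate and integrate phase into a sine wave."""
    f0_up = torch.repeat_interleave(f0, hop_length, dim=-1)      # (B, T_samples)
    phase = 2 * torch.pi * torch.cumsum(f0_up / sample_rate, dim=-1)
    return torch.sin(phase) * (f0_up > 0)                        # zero where unvoiced

class DenoiserStep(nn.Module):
    """Toy denoiser that sees noisy waveform + periodic signal + upsampled mel frames."""
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 + n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, noisy_wav, periodic, mel_up):
        x = torch.cat([noisy_wav, periodic, mel_up], dim=1)      # channel-wise conditioning
        return self.net(x)                                       # predicted noise

# Minimal shape check
B, frames, n_mels, hop = 2, 10, 80, 256
f0 = torch.full((B, frames), 200.0)
mel = torch.randn(B, n_mels, frames)
periodic = sine_excitation(f0, hop).unsqueeze(1)                 # (B, 1, frames*hop)
mel_up = torch.repeat_interleave(mel, hop, dim=-1)               # naive frame upsampling
noisy = torch.randn(B, 1, frames * hop)
eps_hat = DenoiserStep(n_mels)(noisy, periodic, mel_up)
print(eps_hat.shape)  # torch.Size([2, 1, 2560])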
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining atte…
Towards human-like spoken dialogue generation between AI agents from written dialogue
The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues ha…
UniFLG: Unified Facial Landmark Generator from Text or Speech
Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces …
Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation
This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by sco…
Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal mo…
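As a rough illustration of the concept, the sketch below shows additive attention whose alignment energies also depend on note-position features (for example, the relative position of a frame within its musical note). This is only loosely inspired by the paper; the class name, feature dimensions, and scoring form are assumptions.

# Hedged sketch (illustrative only): attention energies that include note-position terms.
import torch
import torch.nn as nn

class NotePositionAwareAttention(nn.Module):
    def __init__(self, query_dim, key_dim, pos_dim, attn_dim=128):
        super().__init__()
        self.q = nn.Linear(query_dim, attn_dim)
        self.k = nn.Linear(key_dim, attn_dim)
        self.p = nn.Linear(pos_dim, attn_dim)   # e.g. relative position inside the note
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, query, keys, note_pos):
        # query: (B, query_dim), keys: (B, N, key_dim), note_pos: (B, N, pos_dim)
        energies = self.v(torch.tanh(
            self.q(query).unsqueeze(1) + self.k(keys) + self.p(note_pos)
        )).squeeze(-1)                                            # (B, N)
        weights = torch.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # (B, key_dim)
        return context, weights

attn = NotePositionAwareAttention(query_dim=256, key_dim=256, pos_dim=4)
ctx, w = attn(torch.randn(2, 256), torch.randn(2, 50, 256), torch.rand(2, 50, 4))
print(ctx.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 50])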
Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System
This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform …
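A minimal sketch of why such a filter can sit inside a neural network: a cepstral synthesis filter applied per frame in the frequency domain using only differentiable tensor operations, so gradients flow back to the cepstral coefficients. For brevity this sketch omits the mel-frequency warping (the alpha parameter) that a true mel-cepstral, MLSA-style filter uses, and it is not the paper's implementation.

# Hedged sketch: a differentiable cepstral synthesis filter (mel warping omitted).
import torch

def cepstral_filter(excitation, cepstrum, n_fft=1024):
    """excitation: (B, T) frame of excitation; cepstrum: (B, C) cepstral coefficients."""
    # Cepstrum -> log amplitude spectrum -> amplitude spectrum (all differentiable).
    c = torch.nn.functional.pad(cepstrum, (0, n_fft - cepstrum.size(-1)))
    log_mag = torch.fft.rfft(c, n=n_fft).real           # cepstrum -> approx. log|H(w)|
    H = torch.exp(log_mag)
    # Filter the excitation by multiplying spectra and returning to the time domain.
    E = torch.fft.rfft(excitation, n=n_fft)
    y = torch.fft.irfft(E * H, n=n_fft)
    return y[..., :excitation.size(-1)]

exc = torch.randn(2, 512, requires_grad=True)
cep = (0.1 * torch.randn(2, 30)).requires_grad_()
frame = cepstral_filter(exc, cep)
frame.sum().backward()                                   # gradients flow to the cepstrum
print(frame.shape, cep.grad.shape)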
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
Recent text-to-speech (TTS) systems have achieved quality comparable to that of humans; however, their application in spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we r…
PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components
We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR waveform generation models can generate speech waveform…
PeriodNet: A Non-Autoregressive Raw Waveform Generative Model With a Structure Separating Periodic and Aperiodic Components
This paper presents PeriodNet, a non-autoregressive (non-AR) waveform generative model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR raw waveform generative models have enabled the fa…
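The sketch below illustrates one plausible reading of the separated structure, assuming a parallel arrangement: a periodic sub-generator driven by a sine excitation and an aperiodic sub-generator driven by noise, with their outputs summed. This is an assumed toy structure for illustration, not the released model.

# Hedged sketch: a parallel periodic/aperiodic generator arrangement.
import torch
import torch.nn as nn

class SubGenerator(nn.Module):
    def __init__(self, cond_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + cond_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def forward(self, excitation, cond):
        return self.net(torch.cat([excitation, cond], dim=1))

class ParallelPeriodNet(nn.Module):
    def __init__(self, cond_dim=80):
        super().__init__()
        self.periodic = SubGenerator(cond_dim)    # driven by sine excitation
        self.aperiodic = SubGenerator(cond_dim)   # driven by noise

    def forward(self, sine, noise, cond):
        return self.periodic(sine, cond) + self.aperiodic(noise, cond)

B, T, cond_dim = 2, 4096, 80
model = ParallelPeriodNet(cond_dim)
wav = model(torch.sin(torch.linspace(0, 100, T)).expand(B, 1, T),
            torch.randn(B, 1, T),
            torch.randn(B, cond_dim, T))
print(wav.shape)  # torch.Size([2, 1, 4096])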
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling the fine …
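As a rough sketch of what a multi-grained hierarchy can look like, the code below extracts a coarse utterance-level latent and conditions finer frame-level latents on it. All module names and dimensions are assumptions made for illustration, not the paper's architecture.

# Hedged sketch: a two-level (coarse + fine) latent extractor.
import torch
import torch.nn as nn

class MultiGrainedEncoder(nn.Module):
    def __init__(self, feat_dim=80, z_coarse=16, z_fine=8):
        super().__init__()
        self.coarse = nn.GRU(feat_dim, z_coarse, batch_first=True)
        self.fine = nn.GRU(feat_dim + z_coarse, z_fine, batch_first=True)

    def forward(self, mel):
        # mel: (B, T, feat_dim)
        _, h = self.coarse(mel)                                   # (1, B, z_coarse)
        z_c = h[-1]                                               # utterance-level latent
        z_c_rep = z_c.unsqueeze(1).expand(-1, mel.size(1), -1)    # broadcast over frames
        z_f, _ = self.fine(torch.cat([mel, z_c_rep], dim=-1))     # fine-grained latents
        return z_c, z_f

enc = MultiGrainedEncoder()
z_c, z_f = enc(torch.randn(2, 200, 80))
print(z_c.shape, z_f.shape)  # torch.Size([2, 16]) torch.Size([2, 200, 8])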