Daisy Stanton
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce…
Learning the joint distribution of two sequences using little or no paired data
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exac…
Speaker Generation
This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-b…
Non-saturating GAN training as divergence minimization
Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However, so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f…
Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failure…
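One family of location-relative mechanisms studied in this line of work is GMM-based attention, in which each mixture component's mean over encoder positions can only move forward at each decoder step, encouraging monotonic text-to-audio alignment. The sketch below is illustrative only (the function name, shapes, and parameterization are assumptions, not the paper's exact formulation):

```python
import numpy as np

def gmm_attention_step(prev_means, deltas, sigmas, num_encoder_steps):
    """One step of a GMM-based location-relative attention sketch.

    prev_means: (K,) component means from the previous decoder step
    deltas:     (K,) predicted forward shifts (forced non-negative)
    sigmas:     (K,) component widths
    Returns attention weights over encoder positions and updated means.
    """
    # Means only move forward, which encourages monotonic alignment.
    means = prev_means + np.abs(deltas)
    positions = np.arange(num_encoder_steps)[None, :]        # (1, T)
    # Unnormalized mixture of Gaussians over encoder positions.
    comp = np.exp(-0.5 * ((positions - means[:, None]) / sigmas[:, None]) ** 2)
    weights = comp.sum(axis=0)
    weights = weights / weights.sum()                        # normalize to a distribution
    return weights, means
```

Because the means never move backward, attention cannot jump back and re-read earlier text, which is one way such mechanisms reduce word repetition on long out-of-domain inputs.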
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to forc…
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs betwee…
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis
Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover ex…
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on t…
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to m…
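The core mechanism described above, a reference embedding attending over a learned bank of style tokens to produce a single style embedding, can be sketched as follows. This is a minimal illustration with assumed shapes and a simple dot-product attention; it is not the paper's exact architecture:

```python
import numpy as np

def gst_style_embedding(ref_embedding, token_bank, query_proj):
    """Sketch of a global-style-token lookup.

    ref_embedding: (ref_dim,) summary of a reference utterance
    token_bank:    (num_tokens, token_dim) learned style token embeddings
    query_proj:    (ref_dim, token_dim) projection of the reference into token space
    Returns the attention-weighted style embedding and the token weights.
    """
    q = ref_embedding @ query_proj                           # (token_dim,)
    scores = token_bank @ q / np.sqrt(token_bank.shape[1])   # scaled dot-product scores
    weights = np.exp(scores - scores.max())                  # stable softmax
    weights = weights / weights.sum()
    return weights @ token_bank, weights                     # style embedding, token weights
```

At synthesis time the token weights can also be set by hand, which is what makes the tokens usable as interpretable style controls.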
Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tac…
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain…