Eric Battenberg
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce…
Learning the joint distribution of two sequences using little or no paired data
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exac…
Speaker Generation
This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-b…
librosa/librosa: 0.8.1rc2
Second release candidate for 0.8.1.
Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output wave…
Non-saturating GAN training as divergence minimization
Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However, so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f…
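For reference (this formula is not part of the abstract): the non-saturating generator objective discussed here is the standard alternative to the original minimax loss. In the usual notation (generator G, discriminator D, prior p(z)):

L_G^{\mathrm{sat}} = \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))], \qquad L_G^{\mathrm{ns}} = -\,\mathbb{E}_{z \sim p(z)}[\log D(G(z))]

Both objectives share the same fixed points, but the non-saturating form avoids vanishing generator gradients early in training, when D(G(z)) is close to zero.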
Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failure…
librosa/librosa: 0.7.2
This is primarily a bug-fix release, and most likely the last release in the 0.7 series. It includes fixes for errors in dynamic time warping (DTW) and RMS energy calculation, and several corrections to the documentation. Inverse-liftering…
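As an illustrative sketch (not taken from the release notes), the two corrected routines can be exercised as follows; the input file name and the choice of chroma features are assumptions of this example:

import librosa

# Hypothetical input file; any mono audio works here.
y, sr = librosa.load('example.wav')

# RMS energy per frame (the calculation corrected in this release).
rms = librosa.feature.rms(y=y)                      # shape: (1, n_frames)

# Dynamic time warping between two chroma sequences (the DTW fixed here).
y_fast = librosa.effects.time_stretch(y, rate=1.2)  # a stretched copy to align against
X = librosa.feature.chroma_cqt(y=y, sr=sr)
Y = librosa.feature.chroma_cqt(y=y_fast, sr=sr)
D, wp = librosa.sequence.dtw(X=X, Y=Y)              # cost matrix and warping path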
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to forc…
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs betwee…
librosa/librosa: 0.6.3
This release contains a few minor bugfixes and many improvements to documentation and usability.
librosa/librosa: 0.6.2
This minor release adds support for joblib>=0.12, and introduces new signal and time-grid generation functions.
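A small sketch of how such generators are typically used; the specific functions shown (librosa.tone and librosa.times_like) are this example's assumption about which additions the note refers to:

import librosa

sr = 22050
# Signal generation: a one-second 440 Hz test tone.
y = librosa.tone(440.0, sr=sr, duration=1.0)

# Time-grid generation: one timestamp per frame of a feature matrix.
S = librosa.feature.melspectrogram(y=y, sr=sr)
times = librosa.times_like(S, sr=sr)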
librosa/librosa: 0.6.1
0.6.1 final release. This contains no substantial changes from 0.6.1rc0. The major changes from 0.6.0 include: a new module, librosa.sequence, for Viterbi decoding; per-channel energy normalization (librosa.pcen()); as well as numerous bug-fixes…
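Both named additions can be demonstrated in a few lines; this is a minimal sketch, with the input file and the toy two-state model being assumptions of the example:

import numpy as np
import librosa
from librosa.sequence import viterbi

# Hypothetical input file.
y, sr = librosa.load('example.wav')

# Per-channel energy normalization over a mel spectrogram (librosa.pcen()).
S = librosa.feature.melspectrogram(y=y, sr=sr)
P = librosa.pcen(S, sr=sr)

# Viterbi decoding over a toy two-state model (librosa.sequence).
prob = np.array([[0.8, 0.6, 0.1],    # per-frame likelihood of state 0
                 [0.2, 0.4, 0.9]])   # per-frame likelihood of state 1
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])  # row-stochastic transition matrix
path = viterbi(prob, transition)     # most likely state sequence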
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on t…
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to m…
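To make the mechanism concrete, here is a minimal numpy sketch of attention over a bank of learned style tokens; the shapes, the single-head dot-product attention, and all names are illustrative assumptions, not the paper's exact design (which uses a reference encoder and multi-head attention):

import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 10, 256                    # bank of 10 style tokens (model parameters)
tokens = rng.normal(size=(n_tokens, d))  # trained jointly with the TTS model

# Query summarizing a reference utterance (stand-in for a reference-encoder output).
query = rng.normal(size=(d,))

# Dot-product attention over the token bank.
scores = tokens @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over tokens

style_embedding = weights @ tokens       # weighted sum that conditions synthesis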
librosa/librosa: 0.6.0
The 0.6.0 release contains no changes from the rc1 release candidate. A full list of changes is provided in the release notes.
Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tac…
Exploring Neural Transducers for End-to-End Speech Recognition
In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperfo…
Reducing Bias in Production Speech Models
Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end sp…
librosa 0.5.1
This was a minor bugfix release, and included some API enhancements. See https://librosa.github.io/librosa/changelog.html#v0-5-1 for details.
librosa 0.5.0
A python library for audio signal processing and music analysis.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, …
librosa: 0.4.1
This minor revision expands the rhythm analysis functionality, and fixes several small bugs. It is also the first release to officially support Python 3.5. For a complete list of changes, refer to the CHANGELOG.
Lasagne: First release.
Core contributors, in alphabetical order: Eric Battenberg (@ebattenberg), Sander Dieleman (@benanne), Daniel Nouri (@dnouri), Eben Olson (@ebenolson), Aäron van den Oord (@avdnoord), Colin Raffel (@craffel), Jan Schlüter (@f0k), Søren Kaae Sønder…
librosa: Audio and Music Signal Analysis in Python
This document describes version 0.4.0 of librosa: a Python package for audio and music signal processing. At a high level, librosa provides implementations of a variety of common functions used throughout the field of music information ret…
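A minimal sketch of the kind of MIR workflow the package supports; the input file name is a placeholder:

import librosa

y, sr = librosa.load('example.wav')                 # hypothetical input file

# Beat tracking: global tempo estimate and beat-event frames.
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beats, sr=sr)

# Chromagram: a standard harmonic feature for MIR tasks.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)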