Eric Battenberg
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce…
Learning the joint distribution of two sequences using little or no paired data
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exac…
Speaker Generation
This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-b…
librosa/librosa: 0.8.1rc2
Second release candidate for 0.8.1.
Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output wave…
Non-saturating GAN training as divergence minimization
Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However, so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f…
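For reference (this formula is not part of the abstract): the non-saturating generator objective discussed here is the standard alternative to the original minimax loss. In the usual notation (generator G, discriminator D, prior p(z)):

L_G^{\mathrm{sat}} = \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))], \qquad L_G^{\mathrm{ns}} = -\,\mathbb{E}_{z \sim p(z)}[\log D(G(z))]

Both objectives share the same fixed points, but the non-saturating form avoids vanishing generator gradients early in training, when D(G(z)) is close to zero.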
Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failure…
librosa/librosa: 0.7.2
This is primarily a bug-fix release, and most likely the last release in the 0.7 series. It includes fixes for errors in dynamic time warping (DTW) and RMS energy calculation, and several corrections to the documentation. Inverse-liftering…
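As an illustrative sketch (not taken from the release notes), the two corrected routines can be exercised as follows; the input file name and the choice of chroma features are assumptions of this example:

import librosa

# Hypothetical input file; any mono audio works here.
y, sr = librosa.load('example.wav')

# RMS energy per frame (the calculation corrected in this release).
rms = librosa.feature.rms(y=y)                      # shape: (1, n_frames)

# Dynamic time warping between two chroma sequences (the DTW fixed here).
y_fast = librosa.effects.time_stretch(y, rate=1.2)  # a stretched copy to align against
X = librosa.feature.chroma_cqt(y=y, sr=sr)
Y = librosa.feature.chroma_cqt(y=y_fast, sr=sr)
D, wp = librosa.sequence.dtw(X=X, Y=Y)              # cost matrix and warping path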
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to forc…
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs betwee…
librosa/librosa: 0.6.3
This release contains a few minor bugfixes and many improvements to documentation and usability.
librosa/librosa: 0.6.2
This minor release adds support for joblib>=0.12, and introduces new signal and time-grid generation functions.
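A small sketch of how such generators are typically used; the specific functions shown (librosa.tone and librosa.times_like) are this example's assumption about which additions the note refers to:

import librosa

sr = 22050
# Signal generation: a one-second 440 Hz test tone.
y = librosa.tone(440.0, sr=sr, duration=1.0)

# Time-grid generation: one timestamp per frame of a feature matrix.
S = librosa.feature.melspectrogram(y=y, sr=sr)
times = librosa.times_like(S, sr=sr)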
librosa/librosa: 0.6.1
0.6.1 final release. This contains no substantial changes from 0.6.1rc0. The major changes from 0.6.0 include: a new module, librosa.sequence, for Viterbi decoding; per-channel energy normalization (librosa.pcen()); as well as numerous bug-fixes…
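Both named additions can be demonstrated in a few lines; this is a minimal sketch, with the input file and the toy two-state model being assumptions of the example:

import numpy as np
import librosa
from librosa.sequence import viterbi

# Hypothetical input file.
y, sr = librosa.load('example.wav')

# Per-channel energy normalization over a mel spectrogram (librosa.pcen()).
S = librosa.feature.melspectrogram(y=y, sr=sr)
P = librosa.pcen(S, sr=sr)

# Viterbi decoding over a toy two-state model (librosa.sequence).
prob = np.array([[0.8, 0.6, 0.1],    # per-frame likelihood of state 0
                 [0.2, 0.4, 0.9]])   # per-frame likelihood of state 1
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])  # row-stochastic transition matrix
path = viterbi(prob, transition)     # most likely state sequence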
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on t…
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to m…
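To make the mechanism concrete, here is a minimal numpy sketch of attention over a bank of learned style tokens; the shapes, the single-head dot-product attention, and all names are illustrative assumptions, not the paper's exact design (which uses a reference encoder and multi-head attention):

import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 10, 256                    # bank of 10 style tokens (model parameters)
tokens = rng.normal(size=(n_tokens, d))  # trained jointly with the TTS model

# Query summarizing a reference utterance (stand-in for a reference-encoder output).
query = rng.normal(size=(d,))

# Dot-product attention over the token bank.
scores = tokens @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over tokens

style_embedding = weights @ tokens       # weighted sum that conditions synthesis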
librosa/librosa: 0.6.0
The 0.6.0 release contains no changes from the rc1 release candidate. A full list of changes is provided in the release notes.
Uncovering Latent Style Factors for Expressive Speech Synthesis
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tac…
Exploring Neural Transducers for End-to-End Speech Recognition
In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperfo…
Reducing Bias in Production Speech Models
Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end sp…
librosa 0.5.1
This was a minor bugfix release, and included some API enhancements. See https://librosa.github.io/librosa/changelog.html#v0-5-1 for details.
librosa 0.5.0
A python library for audio signal processing and music analysis.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, …
librosa: 0.4.1
This minor revision expands the rhythm analysis functionality, and fixes several small bugs. It is also the first release to officially support Python 3.5. For a complete list of changes, refer to the CHANGELOG.
Lasagne: First release.
Core contributors, in alphabetical order: Eric Battenberg (@ebattenberg), Sander Dieleman (@benanne), Daniel Nouri (@dnouri), Eben Olson (@ebenolson), Aäron van den Oord (@avdnoord), Colin Raffel (@craffel), Jan Schlüter (@f0k), Søren Kaae Sønder…
librosa: Audio and Music Signal Analysis in Python
This document describes version 0.4.0 of librosa: a Python package for audio and music signal processing. At a high level, librosa provides implementations of a variety of common functions used throughout the field of music information ret…
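A minimal sketch of the kind of MIR workflow the package supports; the input file name is a placeholder:

import librosa

y, sr = librosa.load('example.wav')                 # hypothetical input file

# Beat tracking: global tempo estimate and beat-event frames.
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beats, sr=sr)

# Chromagram: a standard harmonic feature for MIR tasks.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)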