Simon King
Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS?
Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which …
Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information
Quantization in SSL speech models (e.g., HuBERT) improves compression and performance in tasks like language modeling, resynthesis, and text-to-speech but often discards prosodic and paralinguistic information (e.g., emotion, prominence). …
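The quantisation step this abstract refers to is, in its simplest form, a nearest-neighbour lookup against a codebook. A minimal numpy sketch of that assignment, with random vectors standing in for real SSL features and a trained codebook (the dimensions are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))   # 200 frames of 64-dim "SSL" features (stand-in)
codebook = rng.normal(size=(50, 64))    # 50 codebook vectors (stand-in for k-means centroids)

# Assign each frame to its nearest codebook entry (Euclidean distance).
dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
units = dists.argmin(axis=1)            # one discrete unit per frame, shape (200,)
```

It is exactly this many-to-one mapping that discards within-cluster variation, which is how prosodic and paralinguistic detail can be lost.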
Learning Nonlinear Dynamics in Physical Modelling Synthesis using Neural Ordinary Differential Equations
Modal synthesis methods are a long-standing approach for modelling distributed musical systems. In some cases extensions are possible in order to handle geometric nonlinearities. One such case is the high-amplitude vibration of a string, w…
Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023
Do Discrete Self-Supervised Representations of Speech Capture Tone Distinctions?
Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretiz…
Voice Conversion-based Privacy through Adversarial Information Hiding
Privacy-preserving voice conversion aims to remove only the attributes of speech audio that convey identity information, keeping other speech characteristics intact. This paper presents a mechanism for privacy-preserving voice conversion t…
Hierarchical Intonation Modelling for Speech Synthesis using Legendre Polynomial Coefficients
Synthetic speech quality is now close to parity with human speech for isolated read speech utterances. There has therefore been a resurgence of interest in using speech synthesis for speech science research. However, many speech synthesis …
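The title's technique, representing an intonation contour by Legendre polynomial coefficients, can be sketched directly with numpy's Legendre utilities. The synthetic F0 contour and polynomial degree below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from numpy.polynomial import legendre

# Synthetic F0 contour over 100 frames: a falling trend plus an accent-like bump.
t = np.linspace(-1, 1, 100)                       # Legendre polynomials are defined on [-1, 1]
f0 = 120 - 15 * t + 30 * np.exp(-2 * (t + 0.5) ** 2)

coeffs = legendre.legfit(t, f0, deg=4)            # 5 coefficients summarise the whole contour
f0_hat = legendre.legval(t, coeffs)               # reconstruct the contour from them

rmse = float(np.sqrt(np.mean((f0 - f0_hat) ** 2)))
```

A handful of coefficients gives a smooth, low-dimensional description of the contour's shape, which is what makes such a parameterisation attractive for modelling intonation hierarchically.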
Natural language guidance of high-fidelity text-to-speech with synthetic annotations
Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on referenc…
Synthesising turn-taking cues using natural conversational data
As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood and ma…
Differentiable Grey-box Modelling of Phaser Effects using Frame-based Spectral Processing
Machine learning approaches to modelling analog audio effects have seen intensive investigation in recent years, particularly in the context of non-linear time-invariant effects such as guitar amplifiers. For modulation effects such as pha…
Controllable Speaking Styles Using a Large Language Model
Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Co…
Ensemble prosody prediction for expressive speech synthesis
Generating expressive speech with rich and varied prosody continues to be a challenge for Text-to-Speech. Most efforts have focused on sophisticated neural architectures intended to better model the data distribution. Yet, in evaluations i…
Do Prosody Transfer Models Transfer Prosody?
Some recent models for Text-to-Speech synthesis aim to transfer the prosody of a reference utterance to the generated target synthetic speech. This is done by using a learned embedding of the reference utterance, which is used to condition…
Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing
Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fa…
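The mel-spectrogram extraction this abstract mentions can be sketched in plain numpy: window the waveform, take FFT magnitudes, and apply a triangular mel filterbank. The window, FFT, and hop sizes and the filterbank construction below are common illustrative defaults, not the paper's configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):          # rising edge
            fb[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge
            fb[i, k] = (right - k) / max(right - centre, 1)
    return fb

sr, n_fft, hop, n_mels = 16000, 1024, 256, 80
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t)             # 1 s test tone as a stand-in signal

# Framed, windowed STFT magnitudes.
frames = np.stack([wave[s:s + n_fft] * np.hanning(n_fft)
                   for s in range(0, len(wave) - n_fft, hop)])
mag = np.abs(np.fft.rfft(frames, axis=1))      # (n_frames, n_fft // 2 + 1)

mel = mel_filterbank(n_mels, n_fft, sr) @ mag.T  # (n_mels, n_frames)
```

The forward extraction really is this simple and fast; the hard, lossy part the paper targets is the inverse, recovering a waveform from the magnitude-only mel representation.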
Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech
Sequence-to-sequence speech synthesis models are notorious for gross errors such as skipping and repetition, commonly associated with failures in the attention mechanism. While a lot has been done to improve attention and decrease errors, t…
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data …
ADEPT: A Dataset for Evaluating Prosody Transfer
Text-to-speech is now able to achieve near-human naturalness and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in …
ADEPT: A Dataset for Evaluating Prosody Transfer
The ADEPT dataset consists of prosodically-varied natural speech samples for evaluating prosody transfer in English text-to-speech models. The samples include global variations reflecting emotion and interpersonal attitude, and local varia…
CSTR NAM TIMIT Plus
CSTR NAM TIMIT Plus (Version 0.8) RELEASE May 2012 The Centre for Speech Technology Research University of Edinburgh Copyright (c) 2012 Junichi Yamagishi [email protected] Overview This CSTR NAM TIMIT Plus corpus includes a parallel wh…
Using previous acoustic context to improve Text-to-Speech synthesis
Many speech synthesis datasets, especially those derived from audiobooks, naturally comprise sequences of utterances. Nevertheless, such data are commonly treated as individual, unordered utterances both when training a model and at infere…
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning
Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech pr…
Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification
Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voic…
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech (TTS) synthesis. We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion sc…
Enriched communication across the lifespan
Speech is a hugely efficient means of communication: a reduced capacity in listening or speaking creates a significant barrier to social inclusion at all points through the lifespan, in education, work and at home. Hearing devices and spee…