Jan Skoglund
SynBAD: Synthetic Binaural Audio Dataset
The SynBAD dataset contains synthetic binaural renders of various audio contents. The dataset samples were generated by applying specific Head Related Transfer Functions (HRTF) from subject D2 of the SADIE II database to various audio cont…
Binamix -- A Python Library for Generating Binaural Audio Datasets
The increasing demand for spatial audio in applications such as virtual reality, immersive media, and spatial audio research necessitates robust solutions to generate binaural audio data sets for use in testing and validation. Binamix is a…
Perceptual Audio Coding: A 40-Year Historical Perspective
In the history of audio and acoustic signal processing, perceptual audio coding has certainly excelled as a bright success story by its ubiquitous deployment in virtually all digital media devices, such as computers, tablets, mobile phones…
SCOREQ: Speech Quality Assessment with Contrastive Regression
In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state of the art no-reference …
Neural Speech and Audio Coding: Modern AI Technology Meets Traditional Codecs
This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs …
NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment
This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature e…
A Comparison Of Deep Learning MOS Predictors For Speech Synthesis Quality
Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this…
Context-Based Evaluation of the Opus Audio Codec for Spatial Audio Content in Virtual Reality
This paper discusses the evaluation of Opus-compressed Ambisonic audio content through listening tests conducted in a virtual reality environment. The aim of this study was to investigate the effect that Opus compression has on the Basic Au…
LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residua…
Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the…
Speech quality assessment with WARP‐Q: From similarity to subsequence dynamic time warp cost
Speech coding has been shown to achieve good speech quality using either waveform matching or parametric reconstruction. For very low bit rate streams, recently developed generative speech models can reconstruct high‐quality wideband speec…
Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While …
SoundStream: An End-to-End Neural Audio Codec
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convol…
Speech quality estimation with deep lattice networks
Intrusive subjective speech quality estimation of mean opinion score (MOS) often involves mapping a raw similarity score extracted from differences between the clean and degraded utterance onto MOS with a fitted mapping function. More rece…
Warp-Q: Quality Prediction for Generative Neural Speech Codecs
Good speech quality has been achieved using waveform matching and parametric reconstruction coders. Recently developed very low bit rate generative codecs can reconstruct high quality wideband speech with bit streams less than 3 kb/s. Thes…
Generative Speech Coding with Predictive Variance Regularization
The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the disto…
VISQOL: The Virtual Speech Quality Objective Listener
A model of human speech quality perception has been developed to provide an objective measure for predicting subjective quality assessments. The Virtual Speech Quality Objective Listener (ViSQOL) model is a signal based full reference metr…
Handling Background Noise in Neural Speech Generation
Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise,…
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis
The voice mode of the Opus audio coder can compress wideband speech at bit rates ranging from 6 kb/s to 40 kb/s. However, Opus is at its core a waveform matching coder, and as the rate drops below 10 kb/s, quality degrades quickly. As the …
AMBIQUAL: Towards a Quality Metric for Headphone Rendered Compressed Ambisonic Spatial Audio
Spatial audio is essential for creating a sense of immersion in virtual environments. Efficient encoding methods are required to deliver spatial audio over networks without compromising Quality of Service (QoS). Streaming service providers…
ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric
The 12th International Conference on Quality of Multimedia Experience (QoMEX), Athlone, Ireland (held online due to coronavirus outbreak), 26-28 May 2020
Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders
This study compares the performances of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates ar…
ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric
Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions, in terms of …
Salient Speech Representations Based on Cloned Networks
We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are…
A Real-Time Wideband Neural Vocoder at 1.6kb/s Using LPCNet
Neural speech synthesis algorithms are a promising new approach for coding speech at very low bitrate. They have so far demonstrated quality that far exceeds traditional vocoders, at the cost of very high complexity. In this work, we prese…
Generative Speech Enhancement Based on Cloned Networks
We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the…
Auditory Localization in Low-Bitrate Compressed Ambisonic Scenes
The increasing popularity of Ambisonics as a spatial audio format for streaming services poses new challenges to existing audio coding techniques. Immersive audio delivered to mobile devices requires an efficient bitrate compression that d…