Gil Keren
Efficient Streaming LLM for Speech Recognition
Recent works have shown that prompting large language models with audio encodings can unlock speech recognition capabilities. However, existing techniques do not scale efficiently, especially while handling long-form streaming audio inputs…
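The core idea can be sketched minimally: audio-encoder outputs are prepended to the text-token embeddings as a prompt, and long-form audio is handled in fixed-size chunks that can be streamed. All names and shapes below are illustrative assumptions, not the paper's actual architecture.

```python
def build_llm_input(audio_embeddings, text_embeddings):
    """Prompt the LLM with audio: prepend encoder outputs to the
    text-token embeddings (both are lists of vectors)."""
    return audio_embeddings + text_embeddings

def chunk_frames(frames, chunk_size):
    """Split encoder frames into fixed-size chunks so long-form audio
    can be consumed chunk by chunk instead of re-encoded whole."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 encoder frames
text = [[1.0, 0.0], [0.0, 1.0]]               # 2 text-token embeddings
assert len(build_llm_input(audio, text)) == 5  # audio prefix + text tokens
assert [len(c) for c in chunk_frames(audio, 2)] == [2, 1]
```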
M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications, including targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independentl…
Faster Speech-LLaMA Inference with Multi-token Prediction
Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data …
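A back-of-the-envelope sketch of why multi-token prediction speeds up inference: if each (expensive) LLM forward pass emits k tokens via k prediction heads, a T-token transcript needs ceil(T/k) passes instead of T. This toy code is illustrative only, not the paper's decoder.

```python
import math

def decoder_calls(num_tokens, k):
    """Number of LLM forward passes needed to emit num_tokens tokens
    when each pass predicts k tokens at once."""
    return math.ceil(num_tokens / k)

def multi_token_greedy_step(head_logits):
    """head_logits: one score dict (token -> score) per prediction head.
    Each head proposes its argmax, so a single pass yields k tokens."""
    return [max(d, key=d.get) for d in head_logits]

assert decoder_calls(100, 1) == 100  # standard next-token decoding
assert decoder_calls(100, 4) == 25   # 4 heads: 4x fewer passes
heads = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
assert multi_token_greedy_step(heads) == ["a", "b"]
```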
Token-Weighted RNN-T for Learning from Flawed Data
ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens…
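The de-emphasis idea can be illustrated with a token-weighted negative log-likelihood. Note this sketch uses a simple per-token cross-entropy-style objective rather than the full RNN-T lattice loss, and the weighting scheme shown is only an example.

```python
import math

def token_weighted_nll(step_log_probs, targets, weights):
    """Token-weighted negative log-likelihood.

    step_log_probs: per-step dict mapping token -> log-probability
    targets:        target token sequence
    weights:        per-token weight; w < 1 de-emphasizes suspect tokens
                    (e.g. likely transcription errors), while w = 1
                    everywhere recovers standard cross-entropy.
    """
    return -sum(w * lp[t] for lp, t, w in zip(step_log_probs, targets, weights))

probs = [{"a": math.log(0.5), "b": math.log(0.5)},
         {"a": math.log(0.9), "b": math.log(0.1)}]
loss_full = token_weighted_nll(probs, ["a", "b"], [1.0, 1.0])
loss_down = token_weighted_nll(probs, ["a", "b"], [1.0, 0.2])
assert loss_down < loss_full  # the suspect second token contributes less
```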
Towards Selection of Text-to-speech Data to Augment ASR Training
This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, wh…
Text Generation with Speech Synthesis for ASR Data Augmentation
Aiming at reducing the reliance on expensive human annotations, data synthesis for Automatic Speech Recognition (ASR) has remained an active area of research. While prior work mainly focuses on synthetic speech generation for ASR data augm…
A Token-Wise Beam Search Algorithm for RNN-T
Standard Recurrent Neural Network Transducer (RNN-T) decoding algorithms for speech recognition iterate over the time axis, decoding one time step before moving on to the next. These algorithms result in a larg…
Improving Fast-slow Encoder based Transducer with Streaming Deliberation
This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its late…
Scaling ASR Improves Zero and Few Shot Learning
With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to …
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion
How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-do…
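The trie-based biasing component can be sketched as follows: a trie is built over dynamic contextual phrases (e.g. contact names), and hypotheses whose suffix stays on the trie receive a shallow-fusion-style score bonus. The scoring rule below is a toy assumption, not the paper's exact formulation.

```python
def build_trie(phrases):
    """Build a character trie over biasing phrases (e.g. contact names)."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-phrase marker
    return root

def bias_score(trie, hypothesis, bonus=1.0):
    """Reward a hypothesis whose characters trace a path in the trie,
    i.e. it is a prefix of some biasing phrase; otherwise no bonus."""
    node = trie
    for ch in hypothesis:
        if ch not in node:
            return 0.0
        node = node[ch]
    return bonus

trie = build_trie(["joe", "joanna"])
assert bias_score(trie, "jo") == 1.0   # prefix of both biasing phrases
assert bias_score(trie, "jim") == 0.0  # falls off the trie
```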
Alignment Restricted Streaming Recurrent Neural Network Transducer
There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal …
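One way to picture alignment restriction: given a reference frame alignment for each target token, only lattice nodes (t, u) within a window around that token's aligned frame are kept valid during the loss computation. Window sizes and the mask construction below are illustrative assumptions.

```python
def valid_lattice_nodes(alignment, num_frames, left=2, right=2):
    """For each target token u with reference frame alignment[u], mark
    lattice nodes (t, u) as valid only when t falls within a window
    around the alignment; everything else is pruned from the loss."""
    valid = set()
    for u, a in enumerate(alignment):
        for t in range(max(0, a - left), min(num_frames, a + right + 1)):
            valid.add((t, u))
    return valid

nodes = valid_lattice_nodes(alignment=[3, 7], num_frames=10)
assert (3, 0) in nodes      # token 0 near its aligned frame 3
assert (9, 0) not in nodes  # token 0 far from its alignment: pruned
```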
New Avenues in Audio Intelligence: Towards Holistic Real-life Audio Understanding
Computer audition (i.e., intelligent audio) has made great strides in recent years; however, it is still far from achieving holistic hearing abilities, which more appropriately mimic human-like understanding. Within an audio scene, a human…
Deep Shallow Fusion for RNN-T Personalization
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and exc…
Contextual RNN-T for Open Domain ASR
End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciatio…
N-HANS: Introducing the Augsburg Neuro-Holistic Audio-eNhancement System
N-HANS is a Python toolkit for in-the-wild audio enhancement, including speech, music, and general audio denoising, separation, and selective noise or source suppression. The functionalities are realised based on two neural network models …
Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement
The use of deep learning (DL) architectures for speech enhancement has recently improved the robustness of voice applications under diverse noise conditions. These improvements are usually evaluated based on the perceptual quality of the en…
A Walkthrough for the Principle of Logit Separation
We consider neural network training, in applications in which there are many possible classes, but at test-time, the task is a binary classification task of determining whether the given example belongs to a specific class. We define the S…
Single-Channel Speech Separation with Auxiliary Speaker Embeddings
We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker em…
Scaling Speech Enhancement in Unseen Environments with Noise Embeddings
We address the problem of speech enhancement generalisation to unseen environments by performing two manipulations. First, we embed an additional recording from the environment alone, and use this embedding to alter activations in the main…
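One plausible realization of "use this embedding to alter activations" is FiLM-style conditioning, where a small network maps the noise embedding to a per-channel scale and shift. The sketch below assumes that mechanism and supplies the scale/shift directly; it is not necessarily the paper's exact method.

```python
def condition_activations(activations, gamma, beta):
    """Modulate hidden activations with a scale (gamma) and shift (beta)
    derived from the environment/noise embedding. In practice gamma and
    beta would be produced by a small conditioning network; here they
    are given directly for illustration."""
    return [g * a + b for a, g, b in zip(activations, gamma, beta)]

h = [1.0, -2.0, 0.5]
out = condition_activations(h, gamma=[1.0, 0.0, 2.0], beta=[0.0, 0.1, 0.0])
assert out == [1.0, 0.1, 1.0]  # channels scaled/shifted per the embedding
```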
Emotion Recognition in Speech with Latent Discriminative Representations Learning
Despite significant recent advances in the field of affective computing, learning meaningful representations for emotion recognition remains quite challenging. In this paper, we propose a novel feature learning approach named Latent Discriminat…
Calibrated Prediction Intervals for Neural Network Regressors
Ongoing developments in neural network models are continually advancing the state of the art in terms of system accuracy. However, the predicted labels should not be regarded as the only core output; also important is a well-calibrated est…
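A common way to obtain prediction intervals from a regressor, and to check whether they are well calibrated, is sketched below: the network outputs a mean and a log-variance per example, a Gaussian interval is formed around the mean, and calibration is measured as empirical coverage. This setup is a standard heteroscedastic-regression assumption, not necessarily the paper's exact construction.

```python
import math

def prediction_interval(mean, log_var, z=1.96):
    """95% Gaussian interval from a predicted mean and log-variance."""
    sigma = math.exp(0.5 * log_var)
    return mean - z * sigma, mean + z * sigma

def empirical_coverage(intervals, targets):
    """Calibration check: the fraction of targets inside their predicted
    intervals should match the nominal confidence level."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, targets))
    return hits / len(targets)

lo, hi = prediction_interval(0.0, 0.0)  # log_var = 0 means sigma = 1
assert abs(lo + 1.96) < 1e-9 and abs(hi - 1.96) < 1e-9
assert empirical_coverage([(-1, 1), (-1, 1)], [0.0, 2.0]) == 0.5
```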
Weakly Supervised One-Shot Detection with Attention Siamese Networks
Neural network models that are not conditioned on class identities were shown to facilitate knowledge transfer between classes and to be well-suited for one-shot learning tasks. Following this motivation, we further explore and establish s…
Weakly Supervised One-Shot Detection with Attention Similarity Networks
Neural network models that are not conditioned on class identities were shown to facilitate knowledge transfer between classes and to be well-suited for one-shot learning tasks. Following this motivation, we further explore and establish s…
CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms
The adage that there is no data like more data is not new in affective computing; however, with recent advances in deep learning technologies, such as end-to-end learning, the need for extracting big data is greater than ever. Multimedia r…
Fast Single-Class Classification and the Principle of Logit Separation
We consider neural network training, in applications in which there are many possible classes, but at test-time, the task is a binary classification task of determining whether the given example belongs to a specific class, where the class…
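The test-time setting can be made concrete with a small sketch: computing an exact class probability requires a softmax over all classes (expensive when there are millions), whereas the single-class task only needs to threshold one logit. Thresholding a lone logit is only reliable if training kept positive and negative logits separated across examples, which is the point of the Principle of Logit Separation. The numbers below are illustrative.

```python
import math

def softmax_prob(logits, cls):
    """Exact class probability: requires normalizing over ALL classes,
    which is costly when there are very many of them."""
    z = sum(math.exp(v) for v in logits.values())
    return math.exp(logits[cls]) / z

def single_class_decision(logits, cls, threshold):
    """Fast single-class test: look only at the one logit and threshold
    it, skipping the full normalization entirely."""
    return logits[cls] >= threshold

logits = {"dog": 2.0, "cat": -1.0, "car": -3.0}
assert softmax_prob(logits, "dog") > 0.9
assert single_class_decision(logits, "dog", threshold=0.0)
assert not single_class_decision(logits, "car", threshold=0.0)
```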
Tunable Sensitivity to Large Errors in Neural Network Training
When humans learn a new concept, they might ignore examples that they cannot make sense of at first, and only later focus on such examples, when they are more useful for learning. We propose incorporating this idea of tunable sensitivity f…
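One toy way to realize tunable sensitivity (not the paper's exact loss) is to cap each example's contribution so that very large errors do not dominate early training, with a parameter that restores full sensitivity later:

```python
def tunable_loss(errors, k):
    """Toy tunable-sensitivity objective: squared error, but each
    example's contribution is capped at k**2, so errors larger than k
    are temporarily de-emphasized. Raising k over the course of
    training restores full sensitivity to large errors."""
    return sum(min(e * e, k * k) for e in errors)

errs = [0.5, 10.0]
assert tunable_loss(errs, k=1.0) == 0.25 + 1.0      # large error capped
assert tunable_loss(errs, k=100.0) == 0.25 + 100.0  # full sensitivity
```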