Athanasios Mouchtaris
Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models
The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameter…
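The snippet above mentions tensor-train (TT) networks as a low-rank compression tool. As a rough illustration of the general TT idea (not the Saten method itself, whose details are truncated here), a weight matrix can be reshaped into a higher-order tensor and factorized into a chain of small 3-way cores via sequential SVDs; the function names and rank-truncation scheme below are illustrative assumptions:

```python
import numpy as np

def tt_decompose(w, shape, max_rank):
    """Factorize matrix `w`, reshaped to `shape`, into tensor-train cores
    via sequential truncated SVDs (TT-SVD sketch). Each core has shape
    (r_prev, shape[k], r_next); boundary ranks are 1."""
    d = len(shape)
    rest = np.asarray(w, dtype=float).reshape(shape)
    cores = []
    r = 1
    for k in range(d - 1):
        # Split off mode k: rows = (previous rank * current mode), cols = rest.
        rest = rest.reshape(r * shape[k], -1)
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        rk = min(max_rank, len(s))  # truncate to the requested TT rank
        cores.append(u[:, :rk].reshape(r, shape[k], rk))
        rest = s[:rk, None] * vt[:rk]  # carry the remainder forward
        r = rk
    cores.append(rest.reshape(r, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into the full tensor (for sanity checks)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))
```

With `max_rank` large enough the reconstruction is exact; compression comes from choosing ranks well below the full SVD ranks, trading accuracy for parameter count.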
Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition
Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare words in End-to-End Automatic Speech Recognition (E2E ASR) systems like neural transducers. These approach…
Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers
Streaming speech recognition architectures are employed for low-latency, real-time applications. Such architectures are often characterized by their causality. Causal architectures emit tokens at each frame, relying only on current and pas…
Accelerator-Aware Training for Transducer-Based Speech Recognition
Machine learning model weights and activations are represented in full-precision during training. This leads to performance degradation in runtime when deployed on neural network accelerator (NNA) chips, which leverage highly parallelized …
Context-Aware Transformer Transducer for Speech Recognition
End-to-end (E2E) automatic speech recognition (ASR) systems often have difficulty recognizing uncommon words that appear infrequently in the training data. One promising method to improve the recognition accuracy on such rare words is t…
FANS: Fusing ASR and NLU for on-device SLU
Spoken language understanding (SLU) systems translate voice input commands to semantics which are encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models where the first one ma…
Multi-Channel Transformer Transducer for Speech Recognition
Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end A…
End-to-End Spoken Language Understanding for Generalized Voice Assistants
End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic struct…
CoDERT: Distilling Encoder Representations with Co-Learning for Transducer-Based Speech Recognition
We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information…
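The snippet above describes compressing an RNN-T via knowledge distillation on encoder outputs. As a generic sketch of the distillation ingredient (not the CoDERT recipe itself, whose co-learning details are truncated here), a small student encoder can be trained to match a teacher's frame-level distributions via a temperature-softened KL term; the function names, shapes, and temperature scaling below are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Frame-level KL(teacher || student) with temperature smoothing.

    Both inputs have shape (frames, vocab). A higher temperature exposes
    more of the teacher's 'dark knowledge' in low-probability classes;
    the T**2 factor keeps gradient magnitudes comparable across T.
    Returns a scalar averaged over frames.
    """
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)
```

In practice this term would be combined with the usual transducer training loss; the weighting between the two is a tuning choice.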
Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small …
Sparsification via Compressed Sensing for Automatic Speech Recognition
In order to achieve high accuracy for machine learning (ML) applications, it is essential to employ models with a large number of parameters. Certain applications, such as Automatic Speech Recognition (ASR), however, require real-time inte…
End-to-End Multi-Channel Transformer for Speech Recognition
Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the …
End-to-End Neural Transformer Based Spoken Language Understanding
Spoken language understanding (SLU) refers to the process of inferring semantic information from audio signals. While neural transformers consistently deliver the best performance among state-of-the-art neural architectures in …
Semantic Complexity in End-to-End Spoken Language Understanding
End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previ…
Streaming End-to-End Bilingual ASR Systems with Joint Language Identification
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, i…
Convolutional Neural Networks for Video Quality Assessment
Video Quality Assessment (VQA) is a very challenging task due to its highly subjective nature. Moreover, many factors influence VQA. Compression of video content, while necessary for minimising transmission and storage requirements, introd…
Acoustic Beamforming in Front of a Reflective Plane
In this paper, we consider the problem of beamforming with a planar microphone array placed in front of a wall of the room, so that the microphone array plane is perpendicular to that of the wall. While this situation is very likely to occ…
Normalization of Partly Overlapping Audio Recordings from the Same Event Based on Relative Signal Powers
Exploiting correlations in the audio, several works in the past have demonstrated the ability to automatically match and synchronize user-generated video or audio files of the same event. Such tools solve for the unknown starting and endin…
Maximum component elimination in mixing of user generated audio recordings
User generated content is gradually being recognized not only for its remarkable potential to enrich professionally broadcast content, but also as a means to provide acceptable-quality audiovisual content for public events where professiona…
A subjective evaluation on mixtures of crowdsourced audio recordings
Exploiting correlations in the audio, several works in the past have demonstrated the ability to automatically match and synchronize User Generated Recordings (UGRs) of the same event. Considering a small number of synchronized UGRs, we fo…
Perpendicular Cross-Spectra Fusion for Sound Source Localization With a Planar Microphone Array
Multiple sound source localization in reverberant environments stands as one of the most difficult challenges for many applications related to microphone array signal processing. In this paper, we describe Perpendicular Cross-Spectra Fusio…
Synchronization Ambiguity In Audio Content Generated By Users Attending The Same Public Event
Exploiting correlations in the audio, several works in the past have demonstrated the ability to automatically match and synchronize User Generated Recordings (UGRs) of the same event. The synchronization process is of fundamental importan…
DOA estimation with histogram analysis of spatially constrained active intensity vectors
The active intensity vector (AIV) is a common descriptor of the sound field. In microphone array processing, the AIV is commonly approximated with beamforming operations and utilized as a direction of arrival (DOA) estimator. Howeve…
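The snippet above describes the active intensity vector as a per-bin DOA estimator. As a rough illustration of the baseline idea (not the paper's spatially constrained variant, whose details are truncated here), the AIV of a first-order (B-format) capture is the real part of the cross-spectra between the omni channel W and the dipole channels X and Y, and its angle yields an azimuth estimate per time-frequency bin; the channel naming and sign convention below are illustrative assumptions that vary across ambisonic formats:

```python
import numpy as np

def aiv_azimuth(W, X, Y):
    """Per-bin azimuth DOA estimate (radians) from B-format STFT channels.

    The active intensity in the horizontal plane is approximated as
    I = Re{ conj(W) * [X, Y] }; its angle points toward the source
    (up to the sign convention of the ambisonic encoding).
    W, X, Y are complex arrays of identical shape (frames, bins).
    """
    Ix = np.real(np.conj(W) * X)
    Iy = np.real(np.conj(W) * Y)
    return np.arctan2(Iy, Ix)
```

The title's "histogram analysis" would then aggregate these per-bin estimates, e.g. by taking the mode of an azimuth histogram over a block of frames.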
Towards wireless acoustic sensor networks for location estimation and counting of multiple speakers in real-life conditions
Speaker localization and counting in real-life conditions remains a challenging task. The computational burden, transmission usage and synchronization issues pose several limitations. Moreover, the physical characteristics of real speakers…
A Survey of Sound Source Localization Methods in Wireless Acoustic Sensor Networks
Wireless acoustic sensor networks (WASNs) are formed by a distributed group of acoustic-sensing devices featuring audio playing and recording capabilities. Current mobile computing platforms offer great possibilities for the design of audi…
Two Datasets With User Generated Audio Recordings
We provide two open access datasets of user generated audio recordings captured with mobile devices such as smartphones and portable cameras. The provided audio files originate from two different public events, a musical concert and a foo…
Development And Evaluation Of A Digital MEMS Microphone Array For Spatial Audio
We present the design of a digital microphone array comprised of MEMS microphones and evaluate its potential for spatial audio capturing and direction-of-arrival (DOA) estimation, which is an essential part of encoding the soundscape. The d…
Improving Narrowband DOA Estimation Of Sound Sources Using The Complex Watson Distribution
Narrowband direction-of-arrival (DOA) estimates for each time-frequency (TF) point offer a parametric spatial modeling of the acoustic environment which is very commonly used in many applications, such as source separation, dereverberation…
Full-Band Quasi-Harmonic Analysis and Synthesis of Musical Instrument Sounds with Adaptive Sinusoids
Sinusoids are widely used to represent the oscillatory modes of musical instrument sounds in both analysis and synthesis. However, musical instrument sounds feature transients and instrumental noise that are poorly modeled with quasi-stati…