Frank Seide
Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT
This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurre…
Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation
Simultaneous or streaming machine translation generates translation while reading the input stream. These systems face a quality/latency trade-off, aiming to achieve high translation quality similar to non-streaming models with minimal lat…
Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is trans…
Efficient Streaming LLM for Speech Recognition
Recent works have shown that prompting large language models with audio encodings can unlock speech recognition capabilities. However, existing techniques do not scale efficiently, especially while handling long form streaming audio inputs…
Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation
We adapt the well-known beam-search algorithm for machine translation to operate in a cascaded real-time speech translation system. This proved to be more complex than initially anticipated, due to four key challenges: (1) real-time proces…
Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle con…
Effective internal language model training and fusion for factorized transducer model
The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with …
AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition
Wearable devices like smart glasses are approaching the compute capability to seamlessly generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smar…
Directional Source Separation for Robust Speech Recognition on Smart Glasses
Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequen…
DISGO: Automatic End-to-End Evaluation for Scene Text OCR
This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to the wild content and various image backgrounds. We propose to uniformly use word error rates (WER) a…
Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers
We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy. With the rise in popularity of neural-transducer type models li…
An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition
The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (M…
Federated Domain Adaptation for ASR with Full Self-Supervision
Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. While important topics such as the FL …
Achieving Human Parity on Automatic Chinese to English News Translation
Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whethe…
Marian: Fast Neural Machine Translation in C++
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of th…
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the ar…
The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired …
Achieving Human Parity in Conversational Speech Recognition
Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find tha…