Explanipedia

Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT

Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, et al. · 2025

This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurre…

Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation

Zeeshan Ahmed, Frank Seide, Zhe Liu, Rastislav Rabatin, Jachym Kolar, et al. · 2025

Simultaneous or streaming machine translation generates translation while reading the input stream. These systems face a quality/latency trade-off, aiming to achieve high translation quality similar to non-streaming models with minimal lat…

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, et al. · 2024

We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is trans…

Efficient Streaming LLM for Speech Recognition

Junteng Jia, Gil Keren, Wei Zhou, Egor Lakomkin, Xiaohui Zhang, et al. · 2024

Computer science

Recent works have shown that prompting large language models with audio encodings can unlock speech recognition capabilities. However, existing techniques do not scale efficiently, especially while handling long form streaming audio inputs…

Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation

Rastislav Rabatin, Frank Seide, Ernie Chang · 2024

Computer science Physics Biology

We adapt the well-known beam-search algorithm for machine translation to operate in a cascaded real-time speech translation system. This proved to be more complex than initially anticipated, due to four key challenges: (1) real-time proces…

Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, et al. · 2024

Computer science Mathematics

We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle con…

Effective internal language model training and fusion for factorized transducer model

Jinxi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, et al. · 2024

Computer science Physics Philosophy

The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with …

AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition

Ju Lin, Niko Moritz, Yiteng Huang, Ruiming Xie, Ming Sun, et al. · 2024

Computer science Physics Philosophy

Wearable devices like smart glasses are approaching the compute capability to seamlessly generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smar…

Directional Source Separation for Robust Speech Recognition on Smart Glasses

Tiantian Feng, Ju Lin, Yiteng Huang, Weipeng He, Kaustubh Kalgaonkar, et al. · 2023

Computer science Philosophy

Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequen…

DISGO: Automatic End-to-End Evaluation for Scene Text OCR

Mei-Yuh Hwang, Yangyang Shi, Ankit Ramchandani, Guan Pang, Praveen Krishnan, et al. · 2023

Computer science Engineering Mathematics

This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to the wild content and various image backgrounds. We propose to uniformly use word error rates (WER) a…

Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers

Manh Duc Le, Frank Seide, Yuhao Wang, Li Yang, Kjell Schubert, et al. · 2022

Computer science Engineering

We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy. With the rise in popularity of neural-transducer type models li…

An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Niko Moritz, Frank Seide, Manh Duc Le, Jay Mahadeokar, Christian Fuegen · 2022

Computer science Mathematics Physics

The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (M…

Federated Domain Adaptation for ASR with Full Self-Supervision

Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, et al. · 2022

Computer science Mathematics Physics

Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. While important topics such as the FL …

Achieving Human Parity on Automatic Chinese to English News Translation

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan H. Clark, et al. · 2018

Computer science Physics Philosophy

Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whethe…

Marian: Fast Neural Machine Translation in C++

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, et al. · 2018

Computer science Chemistry

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of th…

The Microsoft 2017 Conversational Speech Recognition System

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Michael L. Seltzer, et al. · 2017

Computer science Geography Economics

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the ar…

The Microsoft 2016 conversational speech recognition system

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Michael L. Seltzer, et al. · 2017

Computer science Geography Philosophy

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired …

Achieving Human Parity in Conversational Speech Recognition

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, et al. · 2016

Computer science Geography Economics

Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find tha…

Frank Seide