Devang Naik
Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and…
From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardle…
M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in auto-regressive generation leads to a suboptimal trade…
Knowledge Transfer For Efficient On-Device False Trigger Mitigation
In this paper, we address the task of determining whether a given utterance is directed towards a voice-enabled smart-assistant device or not. An undirected utterance is termed as a "false trigger" and false trigger mitigation (FTM) is ess…
An Efficient and Streaming Audio Visual Active Speaker Detection System
This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant str…
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
Large Language Model (LLM) inference has two phases: the prompt (or prefill) phase, which outputs the first token, and the extension (or decoding) phase, which generates subsequent tokens. In this work, we propose an efficient parallelization sc…
Weight subcloning: direct initialization of transformers using larger pretrained ones
Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretraine…
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
Since Large Language Models or LLMs have demonstrated high-quality performance on many complex language tasks, there is a great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, …
Optimize What Matters: Training DNN-HMM Keyword Spotting Model Using End Metric
Deep Neural Network-Hidden Markov Model (DNN-HMM) based methods have been successfully used for many always-on keyword spotting algorithms that detect a wake word to trigger a device. The DNN predicts the state probabilities of a given sp…
On The Role of Visual Cues in Audiovisual Speech Enhancement
We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show …
Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation
False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is a process to detect the false trigger events and …
Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement
We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show …
Multi-Task Learning for Speaker Verification and Voice Trigger Detection
Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the networ…
Lattice-Based Improvements for Voice Triggering Using Graph Neural Networks
Voice-triggered smart assistants often rely on detection of a trigger-phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant.…
Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity. While modern speech technologies rely heavily on speech recognition and natural language underst…
Leveraging Acoustic Cues and Paralinguistic Embeddings to Detect Expression from Voice
Millions of people reach out to digital assistants such as Siri every day, asking for information, making phone calls, seeking assistance, and much more. The expectation is that such assistants should understand the intent of the users que…