Maja Pantić
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained setting…
The Noor Project: fair transformer transfer learning for autism spectrum disorder recognition from speech
Early detection is crucial for managing incurable disorders, particularly autism spectrum disorder (ASD). Unfortunately, a considerable number of individuals with ASD receive a late diagnosis or remain undiagnosed. Speech holds a critical …
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from face image, and the characteristics of output speech (e.g., pace, noise level, distance, tone, place) can be controllable wit…
FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion
Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emotion. While recent advances in image generation have enabled high-quality i…
Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction
In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE met…
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external s…
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend t…
Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models
This paper investigates the under-explored area of low-rank weight training for large-scale Conformer-based speech recognition models from scratch. Our study demonstrates the viability of this training paradigm for such models, yielding se…
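The savings in low-rank training come from factorizing each weight matrix into two thin matrices whose rank is far below the layer width. A minimal, illustrative PyTorch sketch of such a factorized layer (the class name, rank value, and layer sizes are assumptions for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Factorize an (out_features x in_features) weight into two rank-r
    matrices, cutting parameters from out*in to roughly r*(in + out)."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)  # in -> r
        self.up = nn.Linear(rank, out_features)               # r -> out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# A 512->512 projection at rank 64: 2 * 512 * 64 = ~66K weights
# versus 512 * 512 = ~262K for the full-rank layer.
layer = LowRankLinear(512, 512, rank=64)
out = layer(torch.randn(2, 10, 512))  # (batch, time, features)
```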
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) sp…
Dynamic Data Pruning for Automatic Speech Recognition
The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed significant computational demands. While data prun…
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a…
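Running without future inputs means every component must be strictly causal. As a generic illustration of that constraint (not RT-LA-VocE's actual architecture), a 1-D convolution can be made causal by padding only on the past side:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Convolution that sees only past context: left-pad by
    (kernel_size - 1) so output frame t never uses inputs after t."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad the past side only
        return self.conv(F.pad(x, (self.left_pad, 0)))

frames = torch.randn(1, 80, 100)               # e.g. 100 feature frames
out = CausalConv1d(80, kernel_size=5)(frames)  # same length, causal
```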
MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
Pre-trained models have been a foundational approach in speech recognition, albeit with additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recog…
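The abstract is cut off before the method details, but weight-mask sparsification in general works by training under a binary mask over the weights. A standard magnitude-based sketch is shown below (illustrative only; MSRS's mask optimization may differ):

```python
import torch
import torch.nn as nn

def magnitude_mask(layer: nn.Linear, density: float) -> torch.Tensor:
    """Zero all but the largest-magnitude `density` fraction of weights
    in place and return the binary mask that was applied."""
    w = layer.weight.data
    k = max(1, int(density * w.numel()))
    # Threshold at the k-th largest |w|, i.e. the (n - k + 1)-th smallest.
    thresh = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    mask = (w.abs() >= thresh).float()
    w.mul_(mask)
    return mask

layer = nn.Linear(256, 256)
mask = magnitude_mask(layer, density=0.1)  # ~90% of weights zeroed
```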
EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars
Head avatars animated by visual signals have gained popularity, particularly in cross-driving synthesis where the driver differs from the animated character, a challenging but highly practical approach. The recently presented MegaPortraits…
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations e…
Large-Scale Unsupervised Audio Pre-Training for Video-to-Speech Synthesis
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Previous approaches train on data from almost exclusively audio-visual datasets, i.e., every audio sample has a corresponding video…
KAN-AV dataset for audio-visual face and speech analysis in the wild
Human-computer interaction is becoming increasingly prevalent in daily life with the adoption of intelligent devices. These devices must be capable of interacting in diverse settings, such as environments with noise, music and differing il…
Audio-visual video-to-speech synthesis with synthesized input audio
Video-to-speech synthesis involves reconstructing the speech signal of a speaker from a silent video. The implicit assumption of this task is that the sound signal is either missing or contains a high amount of noise/corruption such that i…
SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
Recent advances in deep neural networks have achieved unprecedented success in visual speech recognition. However, there remains substantial disparity between current methods and their deployment in resource-constrained devices. In this wo…
Large-scale unsupervised audio pre-training for video-to-speech synthesis
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation from the video, such as a sp…
Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models
Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results. However, the field remains underexplored regarding non-verbal communication despite evidence demonstrating…
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first…
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been…
Learning Cross-lingual Visual Speech Representations
Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works have only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised v…
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realisti…
Jointly Learning Visual and Auditory Speech Representations from Raw Data
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowl…
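In masked-prediction setups like this, the targets are typically produced by a momentum copy of the encoder whose weights are an exponential moving average (EMA) of the trained network. Reading the truncated sentence that way, here is a minimal sketch of the standard update rule (an assumption for illustration, not RAVEn's verified code):

```python
import copy
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module,
               tau: float = 0.999) -> None:
    """theta_target <- tau * theta_target + (1 - tau) * theta_online."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)

encoder = torch.nn.Linear(80, 256)   # stand-in for the real encoder
target = copy.deepcopy(encoder)      # starts as an exact copy
for p in target.parameters():
    p.requires_grad_(False)          # never updated by backprop
ema_update(encoder, target)          # called once per training step
```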
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only s…
FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection
Due to its importance in facial behaviour analysis, facial action unit (AU) detection has attracted increasing attention from the research community. Leveraging the online knowledge distillation framework, we propose the "FAN-Trans" method…
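Knowledge distillation generically trains a student against a teacher's temperature-softened output distribution. The textbook objective is sketched below (FAN-Trans's exact online-distillation loss may differ; the shapes and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened distributions,
    scaled by T^2 to keep gradient magnitudes comparable."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

s, t = torch.randn(8, 12), torch.randn(8, 12)  # e.g. 12 AU logits each
loss = distillation_loss(s, t)
```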
Streaming Audio-Visual Speech Recognition with Alignment Regularization
In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer archit…
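A hybrid CTC/attention system optimizes a weighted sum of a frame-level CTC loss and a label-level attention cross-entropy. A minimal sketch of that joint objective (the weight `lam`, blank index, and tensor shapes are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(enc_log_probs, dec_logits, targets,
                input_lengths, target_lengths, lam: float = 0.3):
    """L = lam * L_CTC + (1 - lam) * L_attention."""
    # enc_log_probs: (T, N, C) log-probs over the vocabulary, blank = 0
    # dec_logits:    (N, S, C) attention-decoder scores per output step
    l_ctc = F.ctc_loss(enc_log_probs, targets, input_lengths,
                       target_lengths, blank=0, zero_infinity=True)
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return lam * l_ctc + (1.0 - lam) * l_att

T, N, S, C = 50, 2, 12, 30
loss = hybrid_loss(torch.randn(T, N, C).log_softmax(-1),
                   torch.randn(N, S, C),
                   torch.randint(1, C, (N, S)),
                   torch.full((N,), T), torch.full((N,), S))
```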
Visual speech recognition for multiple languages in the wild
Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development …