Andrew Zisserman
Character-Centric Understanding of Animated Movies
Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"
The goal of this paper is to be able to retrieve images using a compound query that combines object instance information from an image with a natural text description of what that object is doing or where it is. For example, to retrieve a…
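As a rough illustration of compound instance-plus-text querying, the sketch below scores a gallery against a weighted sum of image and text embeddings in a shared space, using the OpenAI CLIP package as the joint embedding. The CLIP backbone, the `alpha` mixing weight, and the pre-computed `gallery_feats` matrix are assumptions for illustration, not the paper's model.

```python
# A minimal sketch of compound (instance exemplar + text) retrieval,
# assuming a CLIP joint embedding space. Requires the openai/CLIP package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(path):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

def embed_text(query):
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        feat = model.encode_text(tokens)
    return feat / feat.norm(dim=-1, keepdim=True)

def compound_score(exemplar_path, text, gallery_feats, alpha=0.5):
    # Compound query: an exemplar crop of the instance ("Fluffy") plus a
    # text description of context; alpha balances the two cues
    # (illustrative choice). gallery_feats is (N, D), L2-normalised.
    q = alpha * embed_image(exemplar_path) + (1 - alpha) * embed_text(text)
    q = q / q.norm(dim=-1, keepdim=True)
    return (gallery_feats @ q.T).squeeze(-1)  # cosine similarity per gallery image
```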
Spatiotemporal Action Recognition in Videos Using ConvLSTM with Attention: A Comparative Analysis and Implementation
This dissertation explores the application of Convolutional Long Short-Term Memory (ConvLSTM) networks for video action recognition. ConvLSTM integrates the spatial feature extraction capabilities of CNNs with the temporal modeling strengt…
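To make the ConvLSTM idea concrete, here is a minimal cell sketch in PyTorch: convolutional gates replace the fully connected gates of a standard LSTM, so the hidden and cell states keep their spatial layout. Hyperparameters and the attention module from the dissertation are not reproduced.

```python
# A minimal ConvLSTM cell: one convolution produces all four gates
# (input, forget, cell, output) over spatially structured states.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)    # convolutional cell update
        h = o * torch.tanh(c)            # spatial hidden state
        return h, c

# Usage: iterate the cell over the frames of a (dummy) video clip.
cell = ConvLSTMCell(3, 64)
x = torch.randn(2, 8, 3, 32, 32)         # (batch, time, channels, H, W)
h = torch.zeros(2, 64, 32, 32); c = torch.zeros_like(h)
for t in range(x.shape[1]):
    h, c = cell(x[:, t], (h, c))
```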
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder
Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, the manual identification of individual animals remains a significant bottleneck. This study introduces a fully self-supervised approac…
Detect+Track: robust and flexible software tools for improved tracking and behavioural analysis of fish
We introduce a novel video processing method called Detect+Track that combines a deep learning-based object detector with a template-based, object-agnostic tracker to significantly enhance the accuracy and robustness of animal tracking. App…
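The detect-then-track pattern can be sketched as follows: a detector re-seeds an object-agnostic template tracker whenever it fires, and the tracker coasts between detections. The `detect_fish` function here is a placeholder for whatever detector you have; the CSRT tracker ships with opencv-contrib-python (in some builds it lives under `cv2.legacy`).

```python
# Sketch of detect-then-track: detector output takes priority and
# re-initialises the template tracker; the tracker fills the gaps.
import cv2

def detect_fish(frame):
    """Placeholder: return a bounding box (x, y, w, h) or None."""
    return None  # plug in your detector here

cap = cv2.VideoCapture("fish.mp4")
tracker, box = None, None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    detection = detect_fish(frame)
    if detection is not None:
        tracker = cv2.TrackerCSRT_create()   # re-seed from the detection
        tracker.init(frame, detection)
        box = detection
    elif tracker is not None:
        ok, box = tracker.update(frame)      # coast on the template
cap.release()
```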
Open-World Object Counting in Videos
We introduce a new task of open-world object counting in videos: given a text description or an image example that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. Th…
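The core difficulty of the task is counting instances once across frames rather than per frame. A greedy illustration (not the paper's method): per-frame detections from any open-world detector are linked into tracks by IoU, and the count is the number of tracks.

```python
# Greedy sketch: associate per-frame boxes to existing tracks by IoU;
# unmatched boxes open new tracks; the answer is the number of tracks.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a; bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2-ax1)*(ay2-ay1) + (bx2-bx1)*(by2-by1) - inter
    return inter / union if union > 0 else 0.0

def count_unique(detections_per_frame, iou_thresh=0.5):
    tracks = []  # each track stores its last seen box
    for boxes in detections_per_frame:
        unmatched = list(range(len(tracks)))
        for box in boxes:
            best = max(unmatched, key=lambda i: iou(tracks[i], box), default=None)
            if best is not None and iou(tracks[best], box) >= iou_thresh:
                tracks[best] = box          # same instance, update position
                unmatched.remove(best)
            else:
                tracks.append(box)          # a new, previously unseen instance
    return len(tracks)
```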
Automated detection of spinal bone marrow oedema in axial spondyloarthritis: training and validation using two large phase 3 trial datasets
Objective: To evaluate the performance of machine learning (ML) models for the automated scoring of spinal MRI bone marrow oedema (BMO) in patients with axial spondyloarthritis (axSpA) and compare them with expert scoring. Methods: ML algori…
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, i…
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evid…
VoiceVector: Multimodal Enrolment Vectors for Speaker Separation
We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft speaker-…
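The two-network pattern described above can be sketched in a few lines: an enrolment network maps reference audio to a speaker embedding, which then conditions a separation network on the noisy mixture (here via a FiLM-style scale and shift). Shapes and architectures are illustrative, not those of VoiceVector.

```python
# Toy sketch of enrolment-conditioned separation on mel spectrograms.
import torch
import torch.nn as nn

class Enrolment(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.GRU(n_mels, dim, batch_first=True)
    def forward(self, ref_mel):              # reference audio: (B, T, n_mels)
        _, h = self.net(ref_mel)
        return h[-1]                          # speaker embedding: (B, dim)

class Separator(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.film = nn.Linear(dim, 2 * n_mels)   # speaker-conditioned scale/shift
        self.mask = nn.GRU(n_mels, n_mels, batch_first=True)
    def forward(self, mix_mel, spk):          # mixture (B, T, n_mels), emb (B, dim)
        scale, shift = self.film(spk).chunk(2, dim=-1)
        x = mix_mel * scale.unsqueeze(1) + shift.unsqueeze(1)
        m, _ = self.mask(x)
        return torch.sigmoid(m) * mix_mel     # masked mixture = target estimate

est = Separator()(torch.randn(2, 100, 80), Enrolment()(torch.randn(2, 50, 80)))
```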
Scaling 4D Representations
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. I…
New keypoint-based approach for recognising British Sign Language (BSL) from sequences
In this paper, we present a novel keypoint-based classification model designed to recognise British Sign Language (BSL) words within continuous signing sequences. Our model's performance is assessed using the BOBSL dataset, revealing that …
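A keypoint-based sequence classifier of this flavour can be sketched as: 2D keypoints per frame are flattened, projected, and fed to a Transformer encoder whose pooled output is classified into a sign vocabulary. Dimensions here are illustrative, not the paper's exact architecture.

```python
# Minimal keypoint-sequence classifier sketch in PyTorch.
import torch
import torch.nn as nn

class KeypointSignClassifier(nn.Module):
    def __init__(self, n_keypoints=75, dim=256, n_classes=2000):
        super().__init__()
        self.proj = nn.Linear(n_keypoints * 2, dim)     # (x, y) per keypoint
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, kps):                 # (B, T, n_keypoints, 2)
        x = self.proj(kps.flatten(2))       # (B, T, dim)
        x = self.encoder(x).mean(dim=1)     # temporal average pooling
        return self.head(x)                 # sign-class logits

logits = KeypointSignClassifier()(torch.randn(2, 32, 75, 2))
```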
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark
Following the successful 2023 edition, we organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking state-of-the-art video…
The Sound of Water: Inferring Physical Properties from Pouring Liquids
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically…
A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos
We discuss some consistent issues with how RepNet has been evaluated in various papers. As a way to mitigate these issues, we report RepNet performance results on different datasets, and release evaluation code and the RepNet checkpoint to o…
It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Long videos contain many repeating actions, events and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the probl…
Character-aware audio-visual subtitling in context
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holis…
3D-Aware Instance Segmentation and Tracking in Egocentric Videos
Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in f…
Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names
Enabling visually impaired individuals to engage with manga presents a significant challenge due to its inherently visual nature. With the goal of fostering accessibility, this paper aims to generate a dialogue transcript of a complete m…
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text pr…
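The training-free two-stage pattern described above can be sketched schematically: a VLM describes each clip with character cues injected into the prompt, and an LLM compresses the description into an AD sentence. `vlm` and `llm` below are hypothetical callables standing in for whatever off-the-shelf models are available; the prompts are illustrative, not those of AutoAD-Zero.

```python
# Schematic VLM -> LLM pipeline for training-free audio description.
def describe_clip(vlm, frames, characters):
    prompt = ("Describe the visible actions in these frames. "
              f"The characters on screen are: {', '.join(characters)}.")
    return vlm(frames, prompt)          # stage 1: dense visual description

def summarise_to_ad(llm, description):
    prompt = ("Rewrite the following description as one short "
              "audio-description sentence, using character names:\n"
              f"{description}")
    return llm(prompt)                  # stage 2: compress into AD style

# ad_text = summarise_to_ad(llm, describe_clip(vlm, frames, ["Alice", "Bob"]))
```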
TAPVid-3D: A Benchmark for Tracking Any Point in 3D
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). While point tracking in two dimensions (TAP) has many benchmarks measuring performance on real-world videos, such as TAPVid-D…
CountGD: Multi-Modal Open-World Counting
The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve the generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and…
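Counting-by-detection reduces to one schematic function: prompt an open-vocabulary detector (GroundingDINO in the paper) with text and/or visual exemplars, then count the confident, non-overlapping boxes. `detect` below is a placeholder for the detector call; the thresholds are illustrative.

```python
# Sketch: count = number of confident detections surviving a simple NMS.
def _iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def count_objects(image, detect, text=None, exemplars=None,
                  conf=0.3, iou_thresh=0.5):
    boxes, scores = detect(image, text=text, exemplars=exemplars)
    keep = []
    for b, s in sorted(zip(boxes, scores), key=lambda p: -p[1]):
        if s < conf:
            break
        # Non-maximum suppression so each instance is counted once.
        if all(_iou(b, k) < iou_thresh for k in keep):
            keep.append(b)
    return len(keep)
```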
Automated detection, labelling and radiological grading of clinical spinal MRIs
Spinal magnetic resonance (MR) scans are a vital tool for diagnosing the cause of back pain for many diseases and conditions. However, interpreting clinically useful information from these scans can be challenging, time-consuming and hard …
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and…
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal, since only ch…
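The ordering proxy task itself is a few lines of training code: shuffle the frames, have a model predict each frame's original position, and use the true permutation as the cross-entropy target, so 'time' supplies the labels for free. The frame encoder below is a stand-in, not the paper's model.

```python
# Sketch of the self-supervised ordering proxy task on one dummy sequence.
import torch
import torch.nn as nn

T = 8                                         # frames per sequence
encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
head = nn.Linear(128, T)                      # logits over the T positions
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()))

frames = torch.randn(T, 3, 64, 64)            # one (dummy) image sequence
perm = torch.randperm(T)                      # shuffle order
logits = head(encoder(frames[perm]))          # predict original index per frame
loss = nn.functional.cross_entropy(logits, perm)  # the permutation is the label
loss.backward(); opt.step()
```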
AutoAD III: The Prequel -- Back to the Pixels
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lac…
Moving Object Segmentation: All You Need Is SAM (and Flow)
The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-super…
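The "SAM on flow" idea can be sketched end to end: compute dense optical flow between two frames with OpenCV, render it as an RGB image, and hand that image to the Segment Anything automatic mask generator, so segments follow motion rather than appearance. This assumes opencv-python and segment-anything are installed and a SAM checkpoint has been downloaded; it is a sketch of the idea, not the paper's full pipeline.

```python
# Segment moving objects by running SAM on a flow visualisation.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def flow_to_rgb(prev_gray, gray):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2         # flow direction -> hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # speed -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

cap = cv2.VideoCapture("video.mp4")
_, f1 = cap.read(); _, f2 = cap.read()
prev_gray = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
gray = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY)

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
masks = SamAutomaticMaskGenerator(sam).generate(flow_to_rgb(prev_gray, gray))
```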
FlexCap: Describe Anything in Images in Controllable Detail
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descri…
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to …
A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from …
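The data-generation idea reads as a single prompting step: ask an LLM to rewrite a visual description into a description of what the clip would sound like, and use the result as a text-audio retrieval target. `llm` is a hypothetical callable for whatever language model is available; the prompt is illustrative, not the paper's.

```python
# Schematic: turn a visual caption into an audio-centric caption via an LLM.
def visual_to_audio_description(llm, visual_caption):
    prompt = ("Given this visual description of a video clip, write one "
              "sentence describing only what could be heard in the clip:\n"
              f"{visual_caption}")
    return llm(prompt)

# visual_to_audio_description(llm, "A person chops onions on a wooden board")
# -> e.g. "Rhythmic knife thuds on a wooden chopping board."
```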