Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andrew Zisserman

DeepMind

New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Dec 12, 2024

Oishi Deb, KR Prajwal, Andrew Zisserman

Figure 1 for New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Figure 2 for New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Figure 3 for New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Figure 4 for New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Abstract:In this paper, we present a novel keypoint-based classification model designed to recognise British Sign Language (BSL) words within continuous signing sequences. Our model's performance is assessed using the BOBSL dataset, revealing that the keypoint-based approach surpasses its RGB-based counterpart in computational efficiency and memory usage. Furthermore, it offers expedited training times and demands fewer computational resources. To the best of our knowledge, this is the inaugural application of a keypoint-based model for BSL word classification, rendering direct comparisons with existing works unavailable.

* International Conference on Computer Vision (ICCV) - HANDS Workshop

Via

Access Paper or Ask Questions

3D Spine Shape Estimation from Single 2D DXA

Dec 02, 2024

Emmanuelle Bourigault, Amir Jamaludin, Andrew Zisserman

Abstract:Scoliosis is traditionally assessed based solely on 2D lateral deviations, but recent studies have also revealed the importance of other imaging planes in understanding the deformation of the spine. Consequently, extracting the spinal geometry in 3D would help quantify these spinal deformations and aid diagnosis. In this study, we propose an automated general framework to estimate the 3D spine shape from 2D DXA scans. We achieve this by explicitly predicting the sagittal view of the spine from the DXA scan. Using these two orthogonal projections of the spine (coronal in DXA, and sagittal from the prediction), we are able to describe the 3D shape of the spine. The prediction is learnt from over 30k paired images of DXA and MRI scans. We assess the performance of the method on a held out test set, and achieve high accuracy.

* 13 pages

Via

Access Paper or Ask Questions

Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark

Nov 29, 2024

Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

Figure 1 for Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark

Figure 2 for Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark

Figure 3 for Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark

Figure 4 for Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark

Abstract:Following the successful 2023 edition, we organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking state-of-the-art video models and measuring the progress since last year using the Perception Test benchmark. This year, the challenge had seven tracks (up from six last year) and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities; the additional track covered hour-long video understanding and introduced a novel video QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks were: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering, and hour-long video question-answering. We summarise in this report the challenge tasks and results, and introduce in detail the novel hour-long video QA benchmark 1h-walk VQA.

* arXiv admin note: substantial text overlap with arXiv:2312.13090

Via

Access Paper or Ask Questions

The Sound of Water: Inferring Physical Properties from Pouring Liquids

Nov 18, 2024

Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman

Figure 1 for The Sound of Water: Inferring Physical Properties from Pouring Liquids

Figure 2 for The Sound of Water: Inferring Physical Properties from Pouring Liquids

Figure 3 for The Sound of Water: Inferring Physical Properties from Pouring Liquids

Figure 4 for The Sound of Water: Inferring Physical Properties from Pouring Liquids

Abstract:We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.

* 25 pages, 17 figures. Project page at https://bpiyush.github.io/pouring-water-website

Via

Access Paper or Ask Questions

A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos

Nov 13, 2024

Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

Figure 1 for A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos

Figure 2 for A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos

Figure 3 for A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos

Abstract:We discuss some consistent issues on how RepNet has been evaluated in various papers. As a way to mitigate these issues, we report RepNet performance results on different datasets, and release evaluation code and the RepNet checkpoint to obtain these results. Code URL: https://github.com/google-research/google-research/blob/master/repnet/

Via

Access Paper or Ask Questions

Automated Spinal MRI Labelling from Reports Using a Large Language Model

Oct 22, 2024

Robin Y. Park, Rhydian Windsor, Amir Jamaludin, Andrew Zisserman

Abstract:We propose a general pipeline to automate the extraction of labels from radiology reports using large language models, which we validate on spinal MRI reports. The efficacy of our labelling method is measured on five distinct conditions: spinal cancer, stenosis, spondylolisthesis, cauda equina compression and herniation. Using open-source models, our method equals or surpasses GPT-4 on a held-out set of reports. Furthermore, we show that the extracted labels can be used to train imaging models to classify the identified conditions in the accompanying MR scans. All classifiers trained using automated labels achieve comparable performance to models trained using scans manually annotated by clinicians. Code can be found at https://github.com/robinyjpark/AutoLabelClassifier.

* Accepted to Medical Image Computing and Computer Assisted Intervention (MICCAI 2024, Spotlight). 11 pages plus appendix

Via

Access Paper or Ask Questions

It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Oct 15, 2024

Toby Perrett, Tengda Han, Dima Damen, Andrew Zisserman

Figure 1 for It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Figure 2 for It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Figure 3 for It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Figure 4 for It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Abstract:Long videos contain many repeating actions, events and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the problem of unique captioning: Given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it. We propose Captioning by Discriminative Prompting (CDP), which predicts a property that can separate identically captioned clips, and use it to generate unique captions. We introduce two benchmarks for unique captioning, based on egocentric footage and timeloop movies - where repeating actions are common. We demonstrate that captions generated by CDP improve text-to-video R@1 by 15% for egocentric videos and 10% in timeloop movies.

* ACCV 2024 Oral. Project page: https://tobyperrett.github.io/its-just-another-day/

Via

Access Paper or Ask Questions

Character-aware audio-visual subtitling in context

Oct 14, 2024

Jaesung Huh, Andrew Zisserman

Figure 1 for Character-aware audio-visual subtitling in context

Figure 2 for Character-aware audio-visual subtitling in context

Figure 3 for Character-aware audio-visual subtitling in context

Figure 4 for Character-aware audio-visual subtitling in context

Abstract:This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it's said, and who is speaking, providing a more comprehensive and accurate character-aware subtitling for TV shows. Our approach brings improvements on two fronts: first, we show that audio-visual synchronisation can be used to pick out the talking face amongst others present in a video clip, and assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio, and large language model reasoning on the text transcription. This overcomes a limitation of existing methods that they are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page : https://www.robots.ox.ac.uk/~vgg/research/llr-context/

* ACCV 2024

Via

Access Paper or Ask Questions

The VoxCeleb Speaker Recognition Challenge: A Retrospective

Aug 27, 2024

Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman

Figure 1 for The VoxCeleb Speaker Recognition Challenge: A Retrospective

Figure 2 for The VoxCeleb Speaker Recognition Challenge: A Retrospective

Figure 3 for The VoxCeleb Speaker Recognition Challenge: A Retrospective

Figure 4 for The VoxCeleb Speaker Recognition Challenge: A Retrospective

Abstract:The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges. Project page : https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/workshop.html

* TASLP 2024

Via

Access Paper or Ask Questions

3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Aug 19, 2024

Yash Bhalgat, Vadim Tschernezki, Iro Laina, João F. Henriques, Andrea Vedaldi, Andrew Zisserman

Figure 1 for 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Figure 2 for 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Figure 3 for 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Figure 4 for 3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Abstract:Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by $7$ points in Association Accuracy (AssA) and $4.5$ points in IDF1 score, while reducing the number of ID switches by $73\%$ to $80\%$ across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.

Via

Access Paper or Ask Questions