Samuel Albanie

Michael Pokorny

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Feb 29, 2024

Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke

Abstract:Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.

* 9 pages, 2 figures, 9 tables, Accepted at ICASSP 2024

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

Feb 29, 2024

Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

Abstract:Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this challenge, we also introduce an efficient evaluation framework: Sort & Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples, enabling cost-effective lifelong benchmarking. Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (1000x reduction) on a single A100 GPU, with low approximation error. As such, lifelong benchmarks offer a robust, practical solution to the "benchmark exhaustion" problem.
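As a quick sanity check on the compute saving quoted in this abstract, the short Python sketch below converts both figures to GPU hours and takes their ratio. This is our own arithmetic on the two numbers stated above, not a calculation taken from the paper; the variable names are illustrative.

```python
# Figures quoted in the abstract above.
full_eval_gpu_days = 180   # cost of exhaustive evaluation
sns_gpu_hours = 5          # cost reported for Sort & Search

# Convert days to hours so both costs share a unit.
full_eval_gpu_hours = full_eval_gpu_days * 24   # 4320 GPU hours

# Ratio of the two costs.
reduction = full_eval_gpu_hours / sns_gpu_hours

print(f"{reduction:.0f}x")   # prints "864x"
```

The exact ratio is 864, so the "1000x" in the abstract is best read as an order-of-magnitude figure.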

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Dec 19, 2023

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

Abstract:Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.

* Project page: https://instructvideo.github.io/

Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

Nov 30, 2023

Jonathan Roberts, Timo Lüddecke, Rehan Sheikh, Kai Han, Samuel Albanie

Abstract:Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, particularly focusing on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks, testing their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released.

* V2: Minor formatting changes and added missing subfigure captions

Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models

Oct 16, 2023

Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge

Abstract:Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.

Simple Baselines for Interactive Video Retrieval with Questions and Answers

Aug 21, 2023

Kaiqu Liang, Samuel Albanie

Abstract:To date, the majority of video retrieval systems have been optimized for a "single-shot" scenario in which the user submits a query in isolation, ignoring previous interactions with the system. Recently, there has been renewed interest in interactive systems to enhance retrieval, but existing approaches are complex and deliver limited gains in performance. In this work, we revisit this topic and propose several simple yet effective baselines for interactive video retrieval via question-answering. We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task without access to ground truth dialogue data. Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.

* ICCV 2023, project page: https://github.com/kevinliang888/IVR-QA-baselines

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Aug 18, 2023

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

Abstract:Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, yields 32.22 mAP with just 1% data and yields 45.09 mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.

* Accepted to ICCV 2023. Code and models: https://github.com/JacobYuan7/RLIPv2

arXiVeri: Automatic table verification with GPT

Jun 13, 2023

Gyungin Shin, Weidi Xie, Samuel Albanie

Abstract:Without accurate transcription of numerical data in scientific documents, a scientist cannot draw accurate conclusions. Unfortunately, the process of copying numerical data from one paper to another is prone to human error. In this paper, we propose to meet this challenge through the novel task of automatic table verification (AutoTV), in which the objective is to verify the accuracy of numerical data in tables by cross-referencing cited sources. To support this task, we propose a new benchmark, arXiVeri, which comprises tabular data drawn from open-access academic papers on arXiv. We introduce metrics to evaluate the performance of a table verifier in two key areas: (i) table matching, which aims to identify the source table in a cited document that corresponds to a target table, and (ii) cell matching, which aims to locate shared cells between a target and source table and identify their row and column indices accurately. By leveraging the flexible capabilities of modern large language models (LLMs), we propose simple baselines for table verification. Our findings highlight the complexity of this task, even for state-of-the-art LLMs like OpenAI's GPT-4. The code and benchmark will be made publicly available.

* Tech report
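
To make the cell-matching objective described in this abstract concrete, here is a minimal Python sketch that finds numeric cells shared between a target table and a source table and reports their row and column indices. It is a naive exact-value baseline written purely for illustration: the function names (`parse_number`, `match_cells`), the tolerance parameter, and the toy tables are all assumptions, and it is not the LLM-based method or the evaluation metric proposed in the paper.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]   # a table as rows of cell strings
Cell = Tuple[int, int]    # (row index, column index)


def parse_number(text: str) -> Optional[float]:
    """Return the cell as a float, or None if it is not numeric."""
    try:
        # Strip thousands separators and a trailing percent sign.
        return float(text.replace(",", "").rstrip("%"))
    except ValueError:
        return None


def match_cells(target: Table, source: Table,
                tol: float = 1e-9) -> List[Tuple[Cell, Cell]]:
    """Pair every numeric target cell with each equal-valued source cell."""
    matches = []
    for ti, target_row in enumerate(target):
        for tj, target_cell in enumerate(target_row):
            target_value = parse_number(target_cell)
            if target_value is None:
                continue  # skip headers and other non-numeric cells
            for si, source_row in enumerate(source):
                for sj, source_cell in enumerate(source_row):
                    source_value = parse_number(source_cell)
                    if (source_value is not None
                            and abs(target_value - source_value) <= tol):
                        matches.append(((ti, tj), (si, sj)))
    return matches


# Toy tables with made-up values, for illustration only.
target = [["Method", "Score"], ["A", "10.5"], ["B", "20.0"]]
source = [["Model", "Other", "Score"], ["A", "71.0", "10.5"]]

matches = match_cells(target, source)
print(matches)   # [((1, 1), (1, 2))]
```

A value-only matcher like this cannot tell apart coincidentally equal numbers or handle differently formatted copies of the same result, which hints at why the abstract frames the task as difficult even for strong LLMs.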

GPT4GEO: How a Language Model Sees the World's Geography

May 30, 2023

Jonathan Roberts, Timo Lüddecke, Sowmen Das, Kai Han, Samuel Albanie

Abstract:Large language models (LLMs) have shown remarkable capabilities across a broad range of tasks involving question answering and the generation of coherent text and code. Comprehensively understanding the strengths and weaknesses of LLMs is beneficial for safety, downstream applications and improving performance. In this work, we investigate the degree to which GPT-4 has acquired factual geographic knowledge and is capable of using this knowledge for interpretative reasoning, which is especially important for applications that involve geographic data, such as geospatial analysis, supply chain management, and disaster response. To this end, we design and conduct a series of diverse experiments, starting from factual tasks such as location, distance and elevation estimation to more complex questions such as generating country outlines and travel networks, route finding under constraints and supply chain analysis. We provide a broad characterisation of what GPT-4 (without plugins or Internet access) knows about the world, highlighting both potentially surprising capabilities but also limitations.

Zero-shot Unsupervised Transfer Instance Segmentation

Apr 27, 2023

Gyungin Shin, Samuel Albanie, Weidi Xie

Abstract:Segmentation is a core computer vision competency, with applications spanning a broad range of scientifically and economically valuable domains. To date, however, the prohibitive cost of annotation has limited the deployment of flexible segmentation models. In this work, we propose Zero-shot Unsupervised Transfer Instance Segmentation (ZUTIS), a framework that aims to meet this challenge. The key strengths of ZUTIS are: (i) no requirement for instance-level or pixel-level annotations; (ii) an ability of zero-shot transfer, i.e., no assumption on access to a target data distribution; (iii) a unified framework for semantic and instance segmentation with solid performance on both tasks compared to state-of-the-art unsupervised methods. Compared to previous work, we show that ZUTIS achieves a gain of 2.2 mask AP on COCO-20K for instance segmentation and 14.5 mIoU on ImageNet-S with 919 categories for semantic segmentation. The code is made publicly available.

* Accepted to CVPRW 2023. Code: https://github.com/NoelShin/zutis
