Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andreas Bulling

VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images

May 06, 2024

Anna Penzkofer, Lei Shi, Andreas Bulling

Figure 1 for VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images

Figure 2 for VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images

Figure 3 for VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images

Figure 4 for VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images

Abstract:While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA - a novel 4D implementation of VSAs that implements a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyperdimensional vector space. To encode natural images, we extend the SPA to include dimensions for object's width and height in addition to their spatial location. To perform spatial queries we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark dataset and show that it can effectively encode natural images, achieving competitive performance to state-of-the-art deep learning methods for zero-shot VQA.

* To be published in the Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci'24)

Via

Access Paper or Ask Questions

SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing

Mar 26, 2024

Florian Strohm, Mihai Bâce, Markus Kaltenecker, Andreas Bulling

Figure 1 for SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing

Figure 2 for SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing

Figure 3 for SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing

Figure 4 for SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing

Abstract:We propose Semantic Facial Feature Control (SeFFeC) - a novel method for fine-grained face shape editing. Our method enables the manipulation of human-understandable, semantic face features, such as nose length or mouth width, which are defined by different groups of facial landmarks. In contrast to existing methods, the use of facial landmarks enables precise measurement of the facial features, which then enables training SeFFeC without any manually annotated labels. SeFFeC consists of a transformer-based encoder network that takes a latent vector of a pre-trained generative model and a facial feature embedding as input, and learns to modify the latent vector to perform the desired face edit operation. To ensure that the desired feature measurement is changed towards the target value without altering uncorrelated features, we introduced a novel semantic face feature loss. Qualitative and quantitative results show that SeFFeC enables precise and fine-grained control of 23 facial features, some of which could not previously be controlled by other methods, without requiring manual annotations. Unlike existing methods, SeFFeC also provides deterministic control over the exact values of the facial features and more localised and disentangled face edits.

Via

Access Paper or Ask Questions

DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Mar 26, 2024

Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai Bâce, Zhiming Hu, Andreas Bulling

Figure 1 for DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Figure 2 for DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Figure 3 for DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Figure 4 for DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Abstract:We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360{\deg} images based on a conditional score-based denoising diffusion model. Generating human gaze on 360{\deg} images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360{\deg} images as condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360{\deg} image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences.

Via

Access Paper or Ask Questions

Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Mar 26, 2024

Florian Strohm, Mihai Bâce, Andreas Bulling

Figure 1 for Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Figure 2 for Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Figure 3 for Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Figure 4 for Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Abstract:Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.

Via

Access Paper or Ask Questions

GazeMotion: Gaze-guided Human Motion Forecasting

Mar 14, 2024

Zhiming Hu, Syn Schmitt, Daniel Haeufle, Andreas Bulling

Figure 1 for GazeMotion: Gaze-guided Human Motion Forecasting

Figure 2 for GazeMotion: Gaze-guided Human Motion Forecasting

Figure 3 for GazeMotion: Gaze-guided Human Motion Forecasting

Figure 4 for GazeMotion: Gaze-guided Human Motion Forecasting

Abstract:We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods by up to 7.4% improvement in mean per joint position error. Using head direction as a proxy to gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information.

Via

Access Paper or Ask Questions

ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos

Mar 13, 2024

Lei Shi, Paul Bürkner, Andreas Bulling

Abstract:We present ActionDiffusion -- a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account in a diffusion model for procedure planning. This approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings in the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV and all metrics except accuracy on Coin dataset. We show that by adding action embeddings into the noise mask the diffusion model can better learn action temporal dependencies and increase the performances on procedure planning.

* Submitted to IROS 2024

Via

Access Paper or Ask Questions

PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Feb 29, 2024

Mayar Elfares, Pascal Reisert, Zhiming Hu, Wenwu Tang, Ralf Küsters, Andreas Bulling

Figure 1 for PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Figure 2 for PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Figure 3 for PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Figure 4 for PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Abstract:Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators' updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.

Via

Access Paper or Ask Questions

Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements

Feb 21, 2024

Jhon Charaja, Isabell Wochner, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F. B. Haeufle

Figure 1 for Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements

Figure 2 for Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements

Figure 3 for Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements

Figure 4 for Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements

Abstract:The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices.

Via

Access Paper or Ask Questions

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

Feb 20, 2024

Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling

Abstract:We present the Object Language Video Transformer (OLViT) - a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.

* COLING 2024

Via

Access Paper or Ask Questions

Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model

Dec 19, 2023

Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling

Figure 1 for Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model

Figure 2 for Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model

Figure 3 for Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model

Figure 4 for Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model

Abstract:While generating realistic body movements, e.g., for avatars in virtual reality, is widely studied in computer vision and graphics, the generation of eye movements that exhibit realistic coordination with the body remains under-explored. We first report a comprehensive analysis of the coordination of human eye and full-body movements during everyday activities based on data from the MoGaze and GIMO datasets. We show that eye gaze has strong correlations with head directions and also full-body motions and there exists a noticeable time delay between body and eye movements. Inspired by the analyses, we then present Pose2Gaze -- a novel eye-body coordination model that first uses a convolutional neural network and a spatio-temporal graph convolutional neural network to extract features from head directions and full-body poses respectively and then applies a convolutional neural network to generate realistic eye movements. We compare our method with state-of-the-art methods that predict eye gaze only from head movements for three different generation tasks and demonstrate that Pose2Gaze significantly outperforms these baselines on both datasets with an average improvement of 26.4% and 21.6% in mean angular error, respectively. Our findings underline the significant potential of cross-modal human gaze behaviour analysis and modelling.

Via

Access Paper or Ask Questions