Accomplishing household tasks such as 'bringing a cup of water' requires planning step-by-step actions by maintaining knowledge about the spatial arrangement of objects and the consequences of previous actions. Perception models of the current embodied AI agents, however, often make mistakes due to a lack of such knowledge but rely on imperfect learning of imitating agents or an algorithmic planner without knowledge about the changed environment by the previous actions. To address the issue, we propose CPEM (Context-aware Planner and Environment-aware Memory) to incorporate the contextual information of previous actions for planning and maintaining spatial arrangement of objects with their states (e.g., if an object has been moved or not) in an environment to the perception model for improving both visual navigation and object interaction. We observe that CPEM achieves state-of-the-art task success performance in various metrics using a challenging interactive instruction following benchmark both in seen and unseen environments by large margins (up to +10.70% in unseen env.). CPEM with the templated actions, named ECLAIR, also won the 1st generalist language grounding agents challenge at Embodied AI Workshop in CVPR'23.
Eye-tracking plays a crucial role in the development of virtual reality devices, neuroscience research, and psychology. Despite its significance in numerous applications, achieving an accurate, robust, and fast eye-tracking solution remains a considerable challenge for current state-of-the-art methods. While existing reflection-based techniques (e.g., "glint tracking") are considered the most accurate, their performance is limited by their reliance on sparse 3D surface data acquired solely from the cornea surface. In this paper, we rethink the way how specular reflections can be used for eye tracking: We propose a novel method for accurate and fast evaluation of the gaze direction that exploits teachings from single-shot phase-measuring-deflectometry (PMD). In contrast to state-of-the-art reflection-based methods, our method acquires dense 3D surface information of both cornea and sclera within only one single camera frame (single-shot). Improvements in acquired reflection surface points("glints") of factors $>3300 \times$ are easily achievable. We show the feasibility of our approach with experimentally evaluated gaze errors of only $\leq 0.25^\circ$ demonstrating a significant improvement over the current state-of-the-art.
The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods. More information can be found on the project website https://sites.google.com/view/estevevallsmascaro/publications/unimask-m.
In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that instead of following the prevalent pretext task to perform masked self-component reconstruction in human joints, explicit contextual motion modeling is key to the success of learning effective feature representation for 3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP) framework. To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints. Considering the high temporal redundancy of the skeleton sequence, in our MAMP, the motion information also acts as an empirical semantic richness prior that guide the masking process, promoting better attention to semantically rich temporal regions. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP pre-training substantially improves the performance of the adopted vanilla transformer, achieving state-of-the-art results without bells and whistles. The source code of our MAMP is available at https://github.com/maoyunyao/MAMP.
Multi-platform radar networks (MPRNs) are an emerging sensing technology due to their ability to provide improved surveillance capabilities over plain monostatic and bistatic systems. The design of advanced detection, localization, and tracking algorithms for efficient fusion of information obtained through multiple receivers has attracted much attention. However, considerable challenges remain. This article provides an overview on recent unconstrained and constrained localization techniques as well as multitarget tracking (MTT) algorithms tailored to MPRNs. In particular, two data-processing methods are illustrated and explored in detail, one aimed at accomplishing localization tasks the other tracking functions. As to the former, assuming a MPRN with one transmitter and multiple receivers, the angular and range constrained estimator (ARCE) algorithm capitalizes on the knowledge of the transmitter antenna beamwidth. As to the latter, the scalable sum-product algorithm (SPA) based MTT technique is presented. Additionally, a solution to combine ARCE and SPA-based MTT is investigated in order to boost the accuracy of the overall surveillance system. Simulated experiments show the benefit of the combined algorithm in comparison with the conventional baseline SPA-based MTT and the stand-alone ARCE localization, in a 3D sensing scenario.
This paper studies task-oriented, otherwise known as goal-oriented, communications, in a setting where a transmitter communicates with multiple receivers, each with its own task to complete on a dataset, e.g., images, available at the transmitter. A multi-task deep learning approach that involves training a common encoder at the transmitter and individual decoders at the receivers is presented for joint optimization of completing multiple tasks and communicating with multiple receivers. By providing efficient resource allocation at the edge of 6G networks, the proposed approach allows the communications system to adapt to varying channel conditions and achieves task-specific objectives while minimizing transmission overhead. Joint training of the encoder and decoders using multi-task learning captures shared information across tasks and optimizes the communication process accordingly. By leveraging the broadcast nature of wireless communications, multi-receiver task-oriented communications (MTOC) reduces the number of transmissions required to complete tasks at different receivers. Performance evaluation conducted on the MNIST, Fashion MNIST, and CIFAR-10 datasets (with image classification considered for different tasks) demonstrates the effectiveness of MTOC in terms of classification accuracy and resource utilization compared to single-task-oriented communication systems.
Sequential recommendation demonstrates the capability to recommend items by modeling the sequential behavior of users. Traditional methods typically treat users as sequences of items, overlooking the collaborative relationships among them. Graph-based methods incorporate collaborative information by utilizing the user-item interaction graph. However, these methods sometimes face challenges in terms of time complexity and computational efficiency. To address these limitations, this paper presents AutoSeqRec, an incremental recommendation model specifically designed for sequential recommendation tasks. AutoSeqRec is based on autoencoders and consists of an encoder and three decoders within the autoencoder architecture. These components consider both the user-item interaction matrix and the rows and columns of the item transition matrix. The reconstruction of the user-item interaction matrix captures user long-term preferences through collaborative filtering. In addition, the rows and columns of the item transition matrix represent the item out-degree and in-degree hopping behavior, which allows for modeling the user's short-term interests. When making incremental recommendations, only the input matrices need to be updated, without the need to update parameters, which makes AutoSeqRec very efficient. Comprehensive evaluations demonstrate that AutoSeqRec outperforms existing methods in terms of accuracy, while showcasing its robustness and efficiency.
Image reconstruction and captioning from brain activity evoked by visual stimuli allow researchers to further understand the connection between the human brain and the visual perception system. While deep generative models have recently been employed in this field, reconstructing realistic captions and images with both low-level details and high semantic fidelity is still a challenging problem. In this work, we propose UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity. For the first time, we unify image reconstruction and captioning from visual-evoked functional magnetic resonance imaging (fMRI) through a latent diffusion model termed Versatile Diffusion. Specifically, we transform fMRI voxels into text and image latent for low-level information and guide the backward diffusion process through fMRI-based image and text conditions derived from CLIP to generate realistic captions and images. UniBrain outperforms current methods both qualitatively and quantitatively in terms of image reconstruction and reports image captioning results for the first time on the Natural Scenes Dataset (NSD) dataset. Moreover, the ablation experiments and functional region-of-interest (ROI) analysis further exhibit the superiority of UniBrain and provide comprehensive insight for visual-evoked brain decoding.
One major challenge in reinforcement learning (RL) is the large amount of steps for the RL agent needs to converge in the training process and learn the optimal policy, especially in text-based game environments where the action space is extensive. However, non-player characters (NPCs) sometimes hold some key information about the game, which can potentially help to train RL agents faster. Thus, this paper explores how to interact and converse with NPC agents to get the key information using large language models (LLMs), as well as incorporate this information to speed up RL agent's training using knowledge graphs (KGs) and Story Shaping.
Multi-node communication, which refers to the interaction among multiple devices, has attracted lots of attention in many Internet-of-Things (IoT) scenarios. However, its huge amounts of data flows and inflexibility for task extension have triggered the urgent requirement of communication-efficient distributed data transmission frameworks. In this paper, inspired by the great superiorities on bandwidth reduction and task adaptation of semantic communications, we propose a federated learning-based semantic communication (FLSC) framework for multi-task distributed image transmission with IoT devices. Federated learning enables the design of independent semantic communication link of each user while further improves the semantic extraction and task performance through global aggregation. Each link in FLSC is composed of a hierarchical vision transformer (HVT)-based extractor and a task-adaptive translator for coarse-to-fine semantic extraction and meaning translation according to specific tasks. In order to extend the FLSC into more realistic conditions, we design a channel state information-based multiple-input multiple-output transmission module to combat channel fading and noise. Simulation results show that the coarse semantic information can deal with a range of image-level tasks. Moreover, especially in low signal-to-noise ratio and channel bandwidth ratio regimes, FLSC evidently outperforms the traditional scheme, e.g. about 10 peak signal-to-noise ratio gain in the 3 dB channel condition.