Abstract:Video action detection (spatio-temporal action localization) is usually the starting point for human-centric intelligent analysis of videos. It has high practical impact for many applications across robotics, security, healthcare, etc. The two-stage paradigm of Faster R-CNN in object detection inspires a standard paradigm of video action detection, i.e., first generating person proposals and then classifying their actions. However, none of the existing solutions can provide fine-grained action detection at the "who-when-where-what" level. This paper presents a tracking-based solution that accurately and efficiently localizes predefined key actions spatially (by predicting the associated target IDs and locations) and temporally (by predicting the time in exact frame indices). This solution won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
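The following is a minimal sketch, in plain Python, of the "who-when-where-what" output structure such a tracking-based detector could produce; the `ActionInstance` record and the `group_into_instances` helper are hypothetical illustrations, not the authors' actual interface.

```python
# Hypothetical sketch: grouping per-frame tracked detections into
# "who-when-where-what" action instances. Names are illustrative only.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class ActionInstance:
    track_id: int           # "who": persistent target ID from the tracker
    start_frame: int        # "when": first frame of the key action
    end_frame: int          # "when": last frame of the key action
    boxes: Dict[int, Box]   # "where": per-frame box, keyed by frame index
    action: str             # "what": predefined key-action label

def group_into_instances(dets: List[Tuple[int, int, Box, str]]) -> List[ActionInstance]:
    """Group per-frame (frame, track_id, box, action) detections into
    temporally contiguous action instances, one per (track, action) run."""
    instances: List[ActionInstance] = []
    open_runs: Dict[Tuple[int, str], ActionInstance] = {}
    for frame, tid, box, action in sorted(dets, key=lambda d: d[0]):
        run = open_runs.get((tid, action))
        if run is not None and frame == run.end_frame + 1:
            run.end_frame = frame            # extend the ongoing action
            run.boxes[frame] = box
        else:                                # a new action instance begins
            run = ActionInstance(tid, frame, frame, {frame: box}, action)
            open_runs[(tid, action)] = run
            instances.append(run)
    return instances
```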
Abstract:Relation reasoning in knowledge graphs (KGs) aims at predicting missing relations in incomplete triples. The dominant paradigm is to learn embeddings of relations and entities, which is limited to the transductive setting and cannot handle unseen entities in the inductive setting. Previous inductive methods, which utilize the structure of entities and triples in subgraphs to gain inductive ability, are scalable and consume fewer resources. However, to obtain better reasoning results, a model should acquire entity-independent relational semantics from latent rules and overcome the deficient supervision caused by the scarcity of rules in subgraphs. To address these issues, we propose a novel graph convolutional network (GCN)-based approach for interpretable inductive reasoning with relational path contrast, named RPC-IR. RPC-IR first extracts relational paths between two entities and learns their representations, and then innovatively introduces a contrastive strategy by constructing positive and negative relational paths. A joint training strategy considering both supervised and contrastive information is also proposed. Comprehensive experiments on three inductive datasets show that RPC-IR achieves outstanding performance compared with the latest inductive reasoning methods and can explicitly represent logical rules for interpretability.
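Below is a minimal sketch of the joint training objective described above, assuming an InfoNCE-style contrast between the embedding of a positive relational path and those of corrupted (negative) paths, combined with a supervised relation-classification loss; the cosine-similarity formulation and the `temperature` and `alpha` weights are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a joint supervised + contrastive loss for
# relational-path representations (PyTorch).
import torch
import torch.nn.functional as F

def joint_loss(anchor, positive, negatives, logits, labels,
               temperature=0.5, alpha=1.0):
    """anchor/positive: (B, d) path embeddings for the same entity pair;
    negatives: (B, K, d) embeddings of corrupted relational paths;
    logits/labels: (B, R) and (B,) for supervised relation prediction."""
    sup = F.cross_entropy(logits, labels)                               # supervised term
    pos_sim = F.cosine_similarity(anchor, positive) / temperature       # (B,)
    neg_sim = F.cosine_similarity(
        anchor.unsqueeze(1), negatives, dim=-1) / temperature           # (B, K)
    all_sim = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)         # (B, 1+K)
    # the positive path is class 0 in the InfoNCE softmax
    con = F.cross_entropy(all_sim,
                          torch.zeros(anchor.size(0), dtype=torch.long))
    return sup + alpha * con
```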
Abstract:Emotion recognition plays a vital role in human-machine interaction and daily healthcare. EEG signals have been reported to be informative and reliable for emotion recognition in recent years. However, the inter-subject variability of emotion-related EEG signals poses a great challenge for the practical use of EEG-based emotion recognition. Inspired by recent neuroscience studies on inter-subject correlation, we proposed a Contrastive Learning method for Inter-Subject Alignment (CLISA) for reliable cross-subject emotion recognition. Contrastive learning was employed to minimize inter-subject differences by maximizing the similarity of EEG signals across subjects when they received the same stimuli, in contrast to different ones. Specifically, a convolutional neural network with depthwise spatial convolution and temporal convolution layers was applied to learn inter-subject-aligned spatiotemporal representations from raw EEG signals. The aligned representations were then used to extract differential entropy features for emotion classification. The performance of the proposed method was evaluated on our THU-EP dataset with 80 subjects and the publicly available SEED dataset with 15 subjects. It achieved cross-subject emotion recognition accuracy comparable to or better than state-of-the-art methods (i.e., 72.1% and 47.0% for binary and nine-class classification, respectively, on the THU-EP dataset, and 86.3% for three-class classification on the SEED dataset). The proposed method also generalized well to unseen emotional stimuli. CLISA is therefore expected to considerably increase the practicality of EEG-based emotion recognition by operating in a "plug-and-play" manner. Furthermore, the spatiotemporal representations learned by CLISA could provide insights into the neural mechanisms of human emotion processing.
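A minimal sketch of the encoder described above follows: a temporal convolution over raw EEG followed by a depthwise spatial convolution across electrodes, whose outputs can then be contrasted across subjects. The layer sizes and the 30-channel, 250 Hz input are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a temporal + depthwise-spatial EEG encoder (PyTorch).
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    def __init__(self, n_channels=30, n_temporal=16, depth=2):
        super().__init__()
        # temporal convolution: filters slide along time within each channel
        self.temporal = nn.Conv2d(1, n_temporal, kernel_size=(1, 64),
                                  padding=(0, 32), bias=False)
        # depthwise spatial convolution: spatial filters per temporal feature map
        self.spatial = nn.Conv2d(n_temporal, n_temporal * depth,
                                 kernel_size=(n_channels, 1),
                                 groups=n_temporal, bias=False)
        self.bn = nn.BatchNorm2d(n_temporal * depth)
        self.act = nn.ELU()

    def forward(self, x):            # x: (batch, 1, channels, time)
        h = self.temporal(x)
        h = self.act(self.bn(self.spatial(h)))
        return h.squeeze(2)          # (batch, features, time)

# Representations of two subjects watching the same stimulus would form a
# positive pair for the contrastive loss; different stimuli form negatives.
enc = SpatioTemporalEncoder()
z = enc(torch.randn(8, 1, 30, 250))  # 8 clips, 30 channels, 1 s at 250 Hz
```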
Abstract:Convolutional neural networks (CNNs) have been applied to learn spatial features for high-resolution (HR) synthetic aperture radar (SAR) image classification. However, there has been little work on integrating the unique statistical distributions of SAR images, which can reveal the physical properties of terrain objects, into CNNs in a supervised feature learning framework. To address this problem, a novel end-to-end supervised classification method is proposed for HR SAR images that considers both spatial context and statistical features. First, to extract more effective spatial features from SAR images, a new deep spatial context encoder network (DSCEN) is proposed, which is a lightweight structure and can be effectively trained with a small number of samples. Meanwhile, to enhance the diversity of statistics, the nonstationary joint statistical model (NS-JSM) is adopted to form the global statistical features. Specifically, SAR images are transformed into the Gabor wavelet domain, and the resulting multi-subband magnitudes and phases are modeled by log-normal and uniform distributions, respectively. The covariance matrix is further utilized to capture the inter-scale and intra-scale nonstationary correlations between the statistical subbands, making the joint statistical features more compact and distinguishable. Considering their complementary advantages, a feature fusion network (Fusion-Net) based on group compression and smooth normalization is constructed to embed the statistical features into the spatial features and optimize the fused feature representation. As a result, our model can learn discriminative features and improve the final classification performance. Experiments on four HR SAR images validate the superiority of the proposed method over other related algorithms.
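A minimal sketch of the fusion idea described above: a statistical feature vector is compressed and normalized, then concatenated with (embedded into) the spatial features before classification. The plain linear layer standing in for group compression, the L2 normalization standing in for smooth normalization, and all dimensions are assumptions for illustration, not the paper's Fusion-Net.

```python
# Hypothetical sketch of fusing spatial and statistical SAR features (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNet(nn.Module):
    def __init__(self, spatial_dim=256, stat_dim=120, n_classes=6):
        super().__init__()
        self.compress = nn.Linear(stat_dim, 64)   # compress the statistical vector
        self.classify = nn.Linear(spatial_dim + 64, n_classes)

    def forward(self, spatial_feat, stat_feat):
        # normalize so the statistical branch cannot dominate the spatial one
        s = F.normalize(self.compress(stat_feat), dim=-1)
        fused = torch.cat([spatial_feat, s], dim=-1)
        return self.classify(fused)
```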
Abstract:Recent studies in computer vision mainly focus on natural images that depict real-world scenes, and they achieve outstanding performance on diverse tasks such as visual question answering. Diagrams are a special form of visual expression that frequently appears in education and are of great significance for learners to understand multimodal knowledge. Current research on diagrams focuses primarily on natural disciplines such as Biology and Geography, whose expressions are still similar to natural images. Another type of diagram, such as those from Computer Science, is composed of graphics containing complex topologies and relations, and research on this type of diagram is still blank. The main challenges of understanding graphic diagrams are the scarcity of data and the confusion of semantics, which are mainly reflected in the diversity of expressions. In this paper, we construct a novel dataset of graphic diagrams named Computer Science Diagrams (CSDia). It contains more than 1,200 diagrams with exhaustive annotations of objects and relations. Considering the visual noise caused by the varied expressions in diagrams, we introduce the topology of diagrams to parse their topological structure. We then propose Diagram Parsing Net (DPN), which represents a diagram through three branches: topology, visual features, and text, and apply the model to the diagram classification task to evaluate its ability to understand diagrams. The results show the effectiveness of the proposed DPN on diagram understanding.
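A minimal sketch of the three-branch design described above: topology, visual, and text features are encoded separately and concatenated for diagram classification. The branch input dimensions and the simple linear encoders are illustrative assumptions, not the paper's DPN architecture.

```python
# Hypothetical sketch of a three-branch diagram classifier (PyTorch).
import torch
import torch.nn as nn

class DiagramParsingNet(nn.Module):
    def __init__(self, topo_dim=64, vis_dim=512, txt_dim=300, n_classes=12):
        super().__init__()
        self.topo = nn.Linear(topo_dim, 128)   # e.g., graph-level topology vector
        self.vis = nn.Linear(vis_dim, 128)     # e.g., pooled CNN image feature
        self.txt = nn.Linear(txt_dim, 128)     # e.g., averaged word embeddings
        self.head = nn.Linear(3 * 128, n_classes)

    def forward(self, topo, vis, txt):
        h = torch.cat([self.topo(topo).relu(),
                       self.vis(vis).relu(),
                       self.txt(txt).relu()], dim=-1)
        return self.head(h)            # diagram-class logits
```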
Abstract:Classification is an important aspect of hyperspectral image processing and application. At present, researchers mostly use classic airborne hyperspectral imagery as benchmark datasets. However, existing datasets suffer from three bottlenecks: (1) low spatial resolution; (2) a low proportion of labeled pixels; and (3) a low degree of distinction between subclasses. In this paper, a new benchmark dataset named the Wuhan UAV-borne hyperspectral image (WHU-Hi) dataset was built for hyperspectral image classification. The WHU-Hi dataset has a high spectral resolution (nm level) and a very high spatial resolution (cm level), which we refer to here as H2 imagery. Besides, the WHU-Hi dataset has a higher pixel labeling ratio and finer subclasses. Several state-of-the-art hyperspectral image classification methods were benchmarked on the WHU-Hi dataset, and the experimental results show that WHU-Hi is a challenging dataset. We hope the WHU-Hi dataset can become a strong benchmark that accelerates future research.
Abstract:The inertial navigation system (INS) has been widely used to provide self-contained and continuous motion estimation in intelligent transportation systems. Recently, the emergence of chip-level inertial sensors has expanded the relevant applications from positioning, navigation, and mobile mapping to location-based services, unmanned systems, and transportation big data. Meanwhile, benefiting from the emergence of big data and improvements in algorithms and computing power, artificial intelligence (AI) has become a consensus tool that has been successfully applied in various fields. This article reviews the research on using AI technology to enhance inertial sensing from various aspects, including sensor design and selection, calibration and error modeling, navigation and motion-sensing algorithms, multi-sensor information fusion, system evaluation, and practical applications. Based on over 30 representative articles selected from nearly 300 related publications, this article summarizes the state of the art, advantages, and challenges in each aspect. Finally, it summarizes nine advantages and nine challenges of AI-enhanced inertial sensing and points out future research directions.
Abstract:Location is key to spatializing internet-of-things (IoT) data. However, it is challenging to use low-cost IoT devices for robust unsupervised localization (i.e., localization without training data that have known location labels). Thus, this paper proposes a deep reinforcement learning (DRL)-based unsupervised wireless-localization method. The main contributions are as follows. (1) This paper proposes an approach to model a continuous wireless-localization process as a Markov decision process (MDP) and process it within a DRL framework. (2) To alleviate the challenge of obtaining rewards when using unlabeled data (e.g., daily-life crowdsourced data), this paper presents a reward-setting mechanism that extracts robust landmark data from unlabeled wireless received signal strengths (RSS). (3) To ease the requirements for model retraining when using DRL for localization, this paper uses RSS measurements together with the agent location to construct DRL inputs. The proposed method was tested using field data from multiple Bluetooth 5 smart ear tags in a pasture. Meanwhile, the experimental verification process reflected the advantages of and the challenges in using DRL for wireless localization.
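A minimal sketch of the MDP formulation described above: the DRL state combines the current RSS measurement with the agent's location estimate, actions move the estimate on a grid, and rewards are issued only at extracted landmark fixes. The grid size, four-way action set, and rounded-RSS landmark lookup are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a wireless-localization MDP with landmark rewards.
import numpy as np

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # move the estimate N/S/E/W

class LocalizationMDP:
    def __init__(self, landmarks, grid=(50, 50)):
        self.landmarks = landmarks    # {rounded RSS signature: true grid cell}
        self.grid = np.array(grid)
        self.pos = self.grid // 2     # initial location estimate

    def state(self, rss):
        # DRL input: RSS measurement concatenated with the agent location
        return np.concatenate([rss, self.pos])

    def step(self, action_idx, rss):
        self.pos = np.clip(self.pos + ACTIONS[action_idx], 0, self.grid - 1)
        key = tuple(np.round(rss).astype(int))
        if key in self.landmarks:     # reward only at robust landmark fixes
            reward = -np.linalg.norm(self.pos - self.landmarks[key])
        else:                         # unlabeled step: no reward signal
            reward = 0.0
        return self.state(rss), reward
```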