Considering the importance of building a good Visual Dialog (VD) Questioner, many researchers study the topic under a Q-Bot-A-Bot image-guessing game setting, where the Questioner needs to raise a series of questions to collect information of an undisclosed image. Despite progress has been made in Supervised Learning (SL) and Reinforcement Learning (RL), issues still exist. Firstly, previous methods do not provide explicit and effective guidance for Questioner to generate visually related and informative questions. Secondly, the effect of RL is hampered by an incompetent component, i.e., the Guesser, who makes image predictions based on the generated dialogs and assigns rewards accordingly. To enhance VD Questioner: 1) we propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns entity-based questioning strategy from human dialogs; 2) we propose an Augmented Guesser (AugG) that is strong and is optimized for the VD setting especially. Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-theart performance on both image-guessing task and question diversity. Human study further proves that our model generates more visually related, informative and coherent questions.
Advancing maturity in mobile and legged robotics technologies is changing the landscapes where robots are being deployed and found. This innovation calls for a transformation in simultaneous localization and mapping (SLAM) systems to support this new generation of service and consumer robots. No longer can traditionally robust 2D lidar systems dominate while robots are being deployed in multi-story indoor, outdoor unstructured, and urban domains with increasingly inexpensive stereo and RGB-D cameras. Visual SLAM (VSLAM) systems have been a topic of study for decades and a small number of openly available implementations have stood out: ORB-SLAM3, OpenVSLAM and RTABMap. This paper presents a comparison of these 3 modern, feature rich, and uniquely robust VSLAM techniques that have yet to be benchmarked against each other, using several different datasets spanning multiple domains negotiated by service robots. ORB-SLAM3 and OpenVSLAM each were not compared against at least one of these datasets previously in literature and we provide insight through this lens. This analysis is motivated to find general purpose, feature complete, and multi-domain VSLAM options to support a broad class of robot applications for integration into the new and improved ROS 2 Nav2 System as suitable alternatives to traditional 2D lidar solutions.
Mild cognitive impairment (MCI) conversion prediction, i.e., identifying MCI patients of high risks converting to Alzheimer's disease (AD), is essential for preventing or slowing the progression of AD. Although previous studies have shown that the fusion of multi-modal data can effectively improve the prediction accuracy, their applications are largely restricted by the limited availability or high cost of multi-modal data. Building an effective prediction model using only magnetic resonance imaging (MRI) remains a challenging research topic. In this work, we propose a multi-modal multi-instance distillation scheme, which aims to distill the knowledge learned from multi-modal data to an MRI-based network for MCI conversion prediction. In contrast to existing distillation algorithms, the proposed multi-instance probabilities demonstrate a superior capability of representing the complicated atrophy distributions, and can guide the MRI-based network to better explore the input MRI. To our best knowledge, this is the first study that attempts to improve an MRI-based prediction model by leveraging extra supervision distilled from multi-modal information. Experiments demonstrate the advantage of our framework, suggesting its potentials in the data-limited clinical settings.
3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description. Understanding complex and diverse descriptions, and lifting them directly to a point cloud is a new and challenging topic due to the irregular and sparse nature of point clouds. There are three main challenges in 3D object grounding: to find the main focus in the complex and diverse description; to understand the point cloud scene; and to locate the target object. In this paper, we address all three challenges. Firstly, we propose a language scene graph module to capture the rich structure and long-distance phrase correlations. Secondly, we introduce a multi-level 3D proposal relation graph module to extract the object-object and object-scene co-occurrence relationships, and strengthen the visual features of the initial proposals. Lastly, we develop a description guided 3D visual graph module to encode global contexts of phrases and proposals by a nodes matching strategy. Extensive experiments on challenging benchmark datasets (ScanRefer and Nr3D) show that our algorithm outperforms existing state-of-the-art. Our code is available at https://github.com/PNXD/FFL-3DOG.
Traffic accidents have been a severe issue in metropolises with the development of traffic flow. This paper explores the theory and application of a recently developed machine learning technique, namely Import Vector Machines (IVMs), in real-time crash risk analysis, which is a hot topic to reduce traffic accidents. Historical crash data and corresponding traffic data from Shanghai Urban Expressway System were employed and matched. Traffic conditions are labelled as dangerous (i.e. probably leading to a crash) and safe (i.e. a normal traffic condition) based on 5-minute measurements of average speed, volume and occupancy. The IVM algorithm is trained to build the classifier and its performance is compared to the popular and successfully applied technique of Support Vector Machines (SVMs). The main findings indicate that IVMs could successfully be employed in real-time identification of dangerous pro-active traffic conditions. Furthermore, similar to the "support points" of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM, and its classification rates are similar to those of SVMs. This gives the IVM a computational advantage over the SVM, especially when the size of the training data set is large.
The notion of concept has been studied for centuries, by philosophers, linguists, cognitive scientists, and researchers in artificial intelligence (Margolis & Laurence, 1999). There is a large literature on formal, mathematical models of concepts, including a whole sub-field of AI -- Formal Concept Analysis -- devoted to this topic (Ganter & Obiedkov, 2016). Recently, researchers in machine learning have begun to investigate how methods from representation learning can be used to induce concepts from raw perceptual data (Higgins, Sonnerat, et al., 2018). The goal of this report is to provide a formal account of concepts which is compatible with this latest work in deep learning. The main technical goal of this report is to show how techniques from representation learning can be married with a lattice-theoretic formulation of conceptual spaces. The mathematics of partial orders and lattices is a standard tool for modelling conceptual spaces (Ch.2, Mitchell (1997), Ganter and Obiedkov (2016)); however, there is no formal work that we are aware of which defines a conceptual lattice on top of a representation that is induced using unsupervised deep learning (Goodfellow et al., 2016). The advantages of partially-ordered lattice structures are that these provide natural mechanisms for use in concept discovery algorithms, through the meets and joins of the lattice.
Sentence semantic understanding is a key topic in the field of natural language processing. Recently, contextualized word representations derived from pre-trained language models such as ELMO and BERT have shown significant improvements for a wide range of semantic tasks, e.g. question answering, text classification and sentiment analysis. However, how to add external knowledge to further improve the semantic modeling capability of model is worth probing. In this paper, we propose a novel approach to combining syntax information with a pre-trained language model. In order to evaluate the effect of the pre-training model, first, we introduce RNN-based and Transformer-based pre-trained language models; secondly, to better integrate external knowledge, such as syntactic information integrate with the pre-training model, we propose a dependency syntax expansion (DSE) model. For evaluation, we have selected two subtasks: sentence completion task and biological relation extraction task. The experimental results show that our model achieves 91.2\% accuracy, outperforming the baseline model by 37.8\% on sentence completion task. And it also gets competitive performance by 75.1\% $F_{1}$ score on relation extraction task.
Sensors are being extensively deployed and are expected to expand at significant rates in the coming years. They typically generate a large volume of data on the internet of things (IoT) application areas like smart cities, intelligent traffic systems, smart grid, and e-health. Cloud, edge and fog computing are potential and competitive strategies for collecting, processing, and distributing IoT data. However, cloud, edge, and fog-based solutions need to tackle the distribution of a high volume of IoT data efficiently through constrained and limited resource network infrastructures. This paper addresses the issue of conveying a massive volume of IoT data through a network with limited communications resources (bandwidth) using a cognitive communications resource allocation based on Reinforcement Learning (RL) with SARSA algorithm. The proposed network infrastructure (PSIoTRL) uses a Publish/ Subscribe architecture to access massive and highly distributed IoT data. It is demonstrated that the PSIoTRL bandwidth allocation for buffer flushing based on SARSA enhances the IoT aggregator buffer occupation and network link utilization. The PSIoTRL dynamically adapts the IoT aggregator traffic flushing according to the Pub/Sub topic's priority and network constraint requirements.
Safety is an important topic in autonomous driving since any collision may cause serious damage to people and the environment. Hamilton-Jacobi (HJ) Reachability is a formal method that verifies safety in multi-agent interaction and provides a safety controller for collision avoidance. However, due to the worst-case assumption on the car's future actions, reachability might result in too much conservatism such that the normal operation of the vehicle is largely hindered. In this paper, we leverage the power of trajectory prediction, and propose a prediction-based reachability framework for the safety controller. Instead of always assuming for the worst-case, we first cluster the car's behaviors into multiple driving modes, e.g. left turn or right turn. Under each mode, a reachability-based safety controller is designed based on a less conservative action set. For online purpose, we first utilize the trajectory prediction and our proposed mode classifier to predict the possible modes, and then deploy the corresponding safety controller. Through simulations in a T-intersection and an 8-way roundabout, we demonstrate that our prediction-based reachability method largely avoids collision between two interacting cars and reduces the conservatism that the safety controller brings to the car's original operations.