Deep learning models have been developed for a variety of tasks and are deployed every day to work in real conditions. Some of these tasks are critical and models need to be trusted and safe, e.g. military communications or cancer diagnosis. These models are given real data, simulated data or combination of both and are trained to be highly predictive on them. However, gathering enough real data or simulating them to be representative of all the real conditions is: costly, sometimes impossible due to confidentiality and most of the time impossible. Indeed, real conditions are constantly changing and sometimes are intractable. A solution is to deploy machine learning models that are able to give predictions when they are confident enough otherwise raise a flag or abstain. One issue is that standard models easily fail at detecting out-of-distribution samples where their predictions are unreliable. We present here TrustGAN, a generative adversarial network pipeline targeting trustness. It is a deep learning pipeline which improves a target model estimation of the confidence without impacting its predictive power. The pipeline can accept any given deep learning model which outputs a prediction and a confidence on this prediction. Moreover, the pipeline does not need to modify this target model. It can thus be easily deployed in a MLOps (Machine Learning Operations) setting. The pipeline is applied here to a target classification model trained on MNIST data to recognise numbers based on images. We compare such a model when trained in the standard way and with TrustGAN. We show that on out-of-distribution samples, here FashionMNIST and CIFAR10, the estimated confidence is largely reduced. We observe similar conclusions for a classification model trained on 1D radio signals from AugMod, tested on RML2016.04C. We also publicly release the code.
The provision of paratransit services is one option to meet the transportation needs of Vulnerable Road Users (VRUs). Like any other means of transportation, paratransit has obstacles such as high operational costs and longer trip times. As a result, customers are dissatisfied, and paratransit operators have a low approval rating. Researchers have undertaken various studies over the years to better understand the travel behaviors of paratransit customers and how they are operated. According to the findings of these researches, paratransit operators confront the challenge of determining the optimal route for their trips in order to save travel time. Depending on the nature of the challenge, most research used different optimization techniques to solve these routing problems. As a result, the goal of this study is to use Graph Convolutional Neural Networks (GCNs) to assist paratransit operators in researching various operational scenarios in a strategic setting in order to optimize routing, minimize operating costs and minimize their users' travel time. The study was carried out by using a randomized simulated dataset to help determine the decision to make in terms of fleet composition and capacity under different situations. For the various scenarios investigated, the GCN assisted in determining the minimum optimal gap.
Techniques have been proposed to estimate unknown antenna impedance due to time-varying near-field loading conditions at multiple-input single-output (MISO) receivers. However, it remains unclear when a change occurs and impedance estimation becomes necessary. In this letter, we address this problem by formulating it as a hypothesis test. Our contributions include deriving a generalized likelihood-ratio test (GLRT) detector to decide if the antenna impedance has changed over two groups of packets. This GLRT formulation leads to a novel optimization problem, but we propose a binary search based algorithm to solve it efficiently. Our derived GLRT detector enjoys a better detection and false alarm trade-off when compared with a well-known, reference detector in simulations. As one result, more transmit diversity significantly improves detection accuracy at a given false alarm rate, especially in slow fading channels.
Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is an extremely time- and cost-consuming annotation collection, including video captions in a natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network with only video data without any annotation. Inspired by the recent language-free paradigm, i.e. training without language data, we train the network without compelling the generation of fake (pseudo) text queries into a natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and considering the visual feature selected by our method in the interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the prominence of our language-free training framework, outperforming the existing zero-shot video grounding method and even several weakly-supervised approaches with large margins on two standard datasets.
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
Intelligent agents need to remember salient information to reason in partially-observed environments. For example, agents with a first-person view should remember the positions of relevant objects even if they go out of view. Similarly, to effectively navigate through rooms agents need to remember the floor plan of how rooms are connected. However, most benchmark tasks in reinforcement learning do not test long-term memory in agents, slowing down progress in this important research direction. In this paper, we introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents. Unlike existing benchmarks, Memory Maze measures long-term memory separate from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording a human player establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze.
Lidar odometry has attracted considerable attention as a robust localization method for autonomous robots operating in complex GNSS-denied environments. However, achieving reliable and efficient performance on heterogeneous platforms in large-scale environments remains an open challenge due to the limitations of onboard computation and memory resources needed for autonomous operation. In this work, we present LOCUS 2.0, a robust and computationally-efficient \lidar odometry system for real-time underground 3D mapping. LOCUS 2.0 includes a novel normals-based \morrell{Generalized Iterative Closest Point (GICP)} formulation that reduces the computation time of point cloud alignment, an adaptive voxel grid filter that maintains the desired computation load regardless of the environment's geometry, and a sliding-window map approach that bounds the memory consumption. The proposed approach is shown to be suitable to be deployed on heterogeneous robotic platforms involved in large-scale explorations under severe computation and memory constraints. We demonstrate LOCUS 2.0, a key element of the CoSTAR team's entry in the DARPA Subterranean Challenge, across various underground scenarios. We release LOCUS 2.0 as an open-source library and also release a \lidar-based odometry dataset in challenging and large-scale underground environments. The dataset features legged and wheeled platforms in multiple environments including fog, dust, darkness, and geometrically degenerate surroundings with a total of $11~h$ of operations and $16~km$ of distance traveled.
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This is because partial observability gives rise to hidden confounders in the causal graph. We break down the space of confounded imitation learning problems and identify three settings with different data requirements in which the correct imitation policy can be identified. We then introduce an algorithm for deconfounded imitation learning, which trains an inference model jointly with a latent-conditional policy. At test time, the agent alternates between updating its belief over the latent and acting under the belief. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can under certain assumptions achieve an asymptotically optimal imitation performance.
Traffic light detection is essential for self-driving cars to navigate safely in urban areas. Publicly available traffic light datasets are inadequate for the development of algorithms for detecting distant traffic lights that provide important navigation information. We introduce a novel benchmark traffic light dataset captured using a synchronized pair of narrow-angle and wide-angle cameras covering urban and semi-urban roads. We provide 1032 images for training and 813 synchronized image pairs for testing. Additionally, we provide synchronized video pairs for qualitative analysis. The dataset includes images of resolution 1920$\times$1080 covering 10 different classes. Furthermore, we propose a post-processing algorithm for combining outputs from the two cameras. Results show that our technique can strike a balance between speed and accuracy, compared to the conventional approach of using a single camera frame.
Endoscopic content area refers to the informative area enclosed by the dark, non-informative, border regions present in most endoscopic footage. The estimation of the content area is a common task in endoscopic image processing and computer vision pipelines. Despite the apparent simplicity of the problem, several factors make reliable real-time estimation surprisingly challenging. The lack of rigorous investigation into the topic combined with the lack of a common benchmark dataset for this task has been a long-lasting issue in the field. In this paper, we propose two variants of a lean GPU-based computational pipeline combining edge detection and circle fitting. The two variants differ by relying on handcrafted features, and learned features respectively to extract content area edge point candidates. We also present a first-of-its-kind dataset of manually annotated and pseudo-labelled content areas across a range of surgical indications. To encourage further developments, the curated dataset, and an implementation of both algorithms, has been made public (https://doi.org/10.7303/syn32148000, https://github.com/charliebudd/torch-content-area). We compare our proposed algorithm with a state-of-the-art U-Net-based approach and demonstrate significant improvement in terms of both accuracy (Hausdorff distance: 6.3 px versus 118.1 px) and computational time (Average runtime per frame: 0.13 ms versus 11.2 ms).