Route recommendation is significant in navigation service. Two major challenges for route recommendation are route representation and user representation. Different from items that can be identified by unique IDs in traditional recommendation, routes are combinations of links (i.e., a road segment and its following action like turning left) and the number of combinations could be close to infinite. Besides, the representation of a route changes under different scenarios. These facts result in severe sparsity of routes, which increases the difficulty of route representation. Moreover, link attribute deficiencies and errors affect preciseness of route representation. Because of the sparsity of routes, the interaction data between users and routes are also sparse. This makes it not easy to acquire user representation from historical user-item interactions as traditional recommendations do. To address these issues, we propose a novel learning framework R4. In R4, we design a sparse & dense network to obtain representations of routes. The sparse unit learns link ID embeddings and aggregates them to represent a route, which captures implicit route characteristics and subsequently alleviates problems caused by link attribute deficiencies and errors. The dense unit extracts implicit local features of routes from link attributes. For user representation, we utilize a series of historical navigation to extract user preference. R4 achieves remarkable performance in both offline and online experiments.
Consistent state estimation is challenging, especially under the epistemic uncertainties arising from learned (nonlinear) dynamic and observation models. In this work, we develop a set-based estimation algorithm, that produces zonotopic state estimates that respect the epistemic uncertainties in the learned models, in addition to the aleatoric uncertainties. Our algorithm guarantees probabilistic consistency, in the sense that the true state is always bounded by the zonotopes, with a high probability. We formally relate our set-based approach with the corresponding probabilistic approach (GP-EKF) in the case of learned (nonlinear) models. In particular, when linearization errors and aleatoric uncertainties are omitted, and epistemic uncertainties are simplified, our set-based approach reduces to its probabilistic counterpart. Our method's improved consistency is empirically demonstrated in both a simulated pendulum domain and a real-world robot-assisted dressing domain, where the robot estimates the configuration of the human arm utilizing the force measurements at its end effector.
When observing task demonstrations, human apprentices are able to identify whether a given task is executed correctly long before they gain expertise in actually performing that task. Prior research into learning from demonstrations (LfD) has failed to capture this notion of the acceptability of a task's execution; meanwhile, temporal logics provide a flexible language for expressing task specifications. Inspired by this, we present Bayesian specification inference, a probabilistic model for inferring task specification as a temporal logic formula. We incorporate methods from probabilistic programming to define our priors, along with a domain-independent likelihood function to enable sampling-based inference. We demonstrate the efficacy of our model for inferring specifications, with over 90% similarity observed between the inferred specification and the ground truth, both within a synthetic domain and during a real-world table setting task.
Though network sparsity emerges as a promising direction to overcome the drastically increasing size of neural networks, it remains an open problem to concurrently maintain model accuracy as well as achieve significant speedups on general CPUs. In this paper, we propose one novel concept of $1\times N$ block sparsity pattern (block pruning) to break this limitation. In particular, consecutive $N$ output kernels with the same input channel index are grouped into one block, which serves as a basic pruning granularity of our pruning pattern. Our $1 \times N$ sparsity pattern prunes these blocks considered unimportant. We also provide a workflow of filter rearrangement that first rearranges the weight matrix in the output channel dimension to derive more influential blocks for accuracy improvements, and then applies similar rearrangement to the next-layer weights in the input channel dimension to ensure correct convolutional operations. Moreover, the output computation after our $1 \times N$ block sparsity can be realized via a parallelized block-wise vectorized operation, leading to significant speedups on general CPUs-based platforms. The efficacy of our pruning pattern is proved with experiments on ILSVRC-2012. For example, in the case of 50% sparsity and $N=4$, our pattern obtains about 3.0% improvements over filter pruning in the top-1 accuracy of MobileNet-V2. Meanwhile, it obtains 56.04ms inference savings on Cortex-A7 CPU over weight pruning. Code is available at https://github.com/lmbxmu/1xN.
Despite superior performance on various natural language processing tasks, pre-trained models such as BERT are challenged by deploying on resource-constraint devices. Most existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments. This practically limits the further application of model compression. Moreover, the ineffective training and searching process of existing elastic compression paradigms[4,27] prevents the direct migration to BERT compression. Motivated by the necessity of efficient inference across various constraints on BERT, we propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere. Specifically, we first construct a huge search space with 10^13 architectures, which covers nearly all configurations in BERT model. Then, we propose a novel stochastic nature gradient optimization method to guide the generation of optimal candidate architecture which could keep a balanced trade-off between explorations and exploitation. When a certain resource constraint is given, a lightweight distribution optimization approach is utilized to obtain the optimal network for target deployment without fine-tuning. Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models, yet achieving 2.1%-4.5% average accuracy improvement on the GLUE benchmark. Besides, YOCO-BERT is also more effective, e.g.,the training complexity is O(1)for N different devices. Code is availablehttps://github.com/MAC-AutoML/YOCO-BERT.
The state-of-the-art driving automation system demands extreme computational resources to meet rigorous accuracy and latency requirements. Though emerging driving automation computing platforms are based on ASIC to provide better performance and power guarantee, building such an accelerator-based computing platform for driving automation still present challenges. First, the workloads mix and performance requirements exposed to driving automation system present significant variability. Second, with more cameras/sensors integrated in a future fully autonomous driving vehicle, a heterogeneous multi-accelerator architecture substrate is needed that requires a design space exploration for a new form of parallelism. In this work, we aim to extensively explore the above system design challenges and these challenges motivate us to propose a comprehensive framework that synergistically handles the heterogeneous hardware accelerator design principles, system design criteria, and task scheduling mechanism. Specifically, we propose a novel heterogeneous multi-core AI accelerator (HMAI) to provide the hardware substrate for the driving automation tasks with variability. We also define system design criteria to better utilize hardware resources and achieve increased throughput while satisfying the performance and energy restrictions. Finally, we propose a deep reinforcement learning (RL)-based task scheduling mechanism FlexAI, to resolve task mapping issue. Experimental results show that with FlexAI scheduling, basically 100% tasks in each driving route can be processed by HMAI within their required period to ensure safety, and FlexAI can also maximally reduce the breaking distance up to 96% as compared to typical heuristics and guided random-search-based algorithms.
This paper proposes a novel model, named Continuity-Discrimination Convolutional Neural Network (CD-CNN), for visual object tracking. Existing state-of-the-art tracking methods do not deal with temporal relationship in video sequences, which leads to imperfect feature representations. To address this problem, CD-CNN models temporal appearance continuity based on the idea of temporal slowness. Mathematically, we prove that, by introducing temporal appearance continuity into tracking, the upper bound of target appearance representation error can be sufficiently small with high probability. Further, in order to alleviate inaccurate target localization and drifting, we propose a novel notion, object-centroid, to characterize not only objectness but also the relative position of the target within a given patch. Both temporal appearance continuity and object-centroid are jointly learned during offline training and then transferred for online tracking. We evaluate our tracker through extensive experiments on two challenging benchmarks and show its competitive tracking performance compared with state-of-the-art trackers.
We present a task-and-motion planning (TAMP) algorithm robust against a human operator's cooperative or adversarial interventions. Interventions often invalidate the current plan and require replanning on the fly. Replanning can be computationally expensive and often interrupts seamless task execution. We introduce a dynamically reconfigurable planning methodology with behavior tree-based control strategies toward reactive TAMP, which takes the advantage of previous plans and incremental graph search during temporal logic-based reactive synthesis. Our algorithm also shows efficient recovery functionalities that minimize the number of replanning steps. Finally, our algorithm produces a robust, efficient, and complete TAMP solution. Our experimental results show the algorithm results in superior manipulation performance in both simulated and real-world tasks.
The size of Transformer models is growing at an unprecedented pace. It has only taken less than one year to reach trillion-level parameters after the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated and elastic pipelining and data parallelism for efficient distributed training of Transformer models. PipeTransformer automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training, and instead allocates resources for training of the remaining active layers. More specifically, PipeTransformer dynamically excludes converged layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets. Our results show that PipeTransformer attains a 2.4 fold speedup compared to the state-of-the-art baseline. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. We also develop open-sourced flexible APIs for PipeTransformer, which offer a clean separation among the freeze algorithm, model definitions, and training accelerations, hence allowing it to be applied to other algorithms that require similar freezing strategies.