Breaking safety constraints in control systems can lead to potential risks, resulting in unexpected costs or catastrophic damage. Nevertheless, uncertainty is ubiquitous, even among similar tasks. In this paper, we develop a novel adaptive safe control framework that integrates meta learning, Bayesian models, and control barrier function (CBF) method. Specifically, with the help of CBF method, we learn the inherent and external uncertainties by a unified adaptive Bayesian linear regression (ABLR) model, which consists of a forward neural network (NN) and a Bayesian output layer. Meta learning techniques are leveraged to pre-train the NN weights and priors of the ABLR model using data collected from historical similar tasks. For a new control task, we refine the meta-learned models using a few samples, and introduce pessimistic confidence bounds into CBF constraints to ensure safe control. Moreover, we provide theoretical criteria to guarantee probabilistic safety during the control processes. To validate our approach, we conduct comparative experiments in various obstacle avoidance scenarios. The results demonstrate that our algorithm significantly improves the Bayesian model-based CBF method, and is capable for efficient safe exploration even with multiple uncertain constraints.
Vision Transformers have witnessed prevailing success in a series of vision tasks. However, they often require enormous amount of computations to achieve high performance, which is burdensome to deploy on resource-constrained devices. To address these issues, we draw lessons from depthwise separable convolution and imitate its ideology to design the Separable Vision Transformer, abbreviated as SepViT. SepViT helps to carry out the information interaction within and among the windows via a depthwise separable self-attention. The novel window token embedding and grouped self-attention are employed to model the attention relationship among windows with negligible computational cost and capture a long-range visual dependencies of multiple windows, respectively. Extensive experiments on various benchmark tasks demonstrate SepViT can achieve state-of-the-art results in terms of trade-off between accuracy and latency. Among them, SepViT achieves 84.0% top-1 accuracy on ImageNet-1K classification while decreasing the latency by 40%, compared to the ones with similar accuracy (e.g., CSWin, PVTV2). As for the downstream vision tasks, SepViT with fewer FLOPs can achieve 50.4% mIoU on ADE20K semantic segmentation task, 47.5 AP on the RetinaNet-based COCO detection task, 48.7 box AP and 43.9 mask AP on Mask R-CNN-based COCO detection and segmentation tasks.
The sparse representation of graphs has shown its great potential for accelerating the computation of the graph applications (e.g. Social Networks, Knowledge Graphs) on traditional computing architectures (CPU, GPU, or TPU). But the exploration of the large-scale sparse graph computing on processing-in-memory (PIM) platforms (typically with memristive crossbars) is still in its infancy. As we look to implement the computation or storage of large-scale or batch graphs on memristive crossbars, a natural assumption would be that we need a large-scale crossbar, but with low utilization. Some recent works have questioned this assumption to avoid the waste of the storage and computational resource by "block partition", which is fixed-size, progressively scheduled, or coarse-grained, thus is not effectively sparsity-aware in our view. This work proposes the dynamic sparsity-aware mapping scheme generating method that models the problem as a sequential decision-making problem which is solved by reinforcement learning (RL) algorithm (REINFORCE). Our generating model (LSTM, combined with our dynamic-fill mechanism) generates remarkable mapping performance on a small-scale typical graph/matrix data (43% area of the original matrix with fully mapping), and two large-scale matrix data (22.5% area on qh882, and 17.1% area on qh1484). Moreover, our coding framework of the scheme is intuitive and has promising adaptability with the deployment or compilation system.
Differentiable architecture search has gradually become the mainstream research topic in the field of Neural Architecture Search (NAS) for its capability to improve efficiency compared with the early NAS (EA-based, RL-based) methods. Recent differentiable NAS also aims at further improving search efficiency, reducing the GPU-memory consumption, and addressing the "depth gap" issue. However, these methods are no longer capable of tackling the non-differentiable objectives, let alone multi-objectives, e.g., performance, robustness, efficiency, and other metrics. We propose an end-to-end architecture search framework towards non-differentiable objectives, TND-NAS, with the merits of the high efficiency in differentiable NAS framework and the compatibility among non-differentiable metrics in Multi-objective NAS (MNAS). Under differentiable NAS framework, with the continuous relaxation of the search space, TND-NAS has the architecture parameters ($\alpha$) been optimized in discrete space, while resorting to the search policy of progressively shrinking the supernetwork by $\alpha$. Our representative experiment takes two objectives (Parameters, Accuracy) as an example, we achieve a series of high-performance compact architectures on CIFAR10 (1.09M/3.3%, 2.4M/2.95%, 9.57M/2.54%) and CIFAR100 (2.46M/18.3%, 5.46/16.73%, 12.88/15.20%) datasets. Favorably, under real-world scenarios (resource-constrained, platform-specialized), the Pareto-optimal solutions can be conveniently reached by TND-NAS.
For multimodal tasks, a good feature extraction network should extract information as much as possible and ensure that the extracted feature embedding and other modal feature embedding have an excellent mutual understanding. The latter is often more critical in feature fusion than the former. Therefore, selecting the optimal feature extraction network collocation is a very important subproblem in multimodal tasks. Most of the existing studies ignore this problem or adopt an ergodic approach. This problem is modeled as an optimization problem in this paper. A novel method is proposed to convert the optimization problem into an issue of comparative upper bounds by referring to the general practice of extreme value conversion in mathematics. Compared with the traditional method, it reduces the time cost. Meanwhile, aiming at the common problem that the feature similarity and the feature semantic similarity are not aligned in the multimodal time-series problem, we refer to the idea of contrast learning and propose a multimodal time-series contrastive loss(MTSC). Based on the above issues, We demonstrated the feasibility of our approach in the audio-visual video parsing task. Substantial analyses verify that our methods promote the fusion of different modal features.
Video prediction is a challenging task with wide application prospects in meteorology and robot systems. Existing works fail to trade off short-term and long-term prediction performances and extract robust latent dynamics laws in video frames. We propose a two-branch seq-to-seq deep model to disentangle the Taylor feature and the residual feature in video frames by a novel recurrent prediction module (TaylorCell) and residual module. TaylorCell can expand the video frames' high-dimensional features into the finite Taylor series to describe the latent laws. In TaylorCell, we propose the Taylor prediction unit (TPU) and the memory correction unit (MCU). TPU employs the first input frame's derivative information to predict the future frames, avoiding error accumulation. MCU distills all past frames' information to correct the predicted Taylor feature from TPU. Correspondingly, the residual module extracts the residual feature complementary to the Taylor feature. On three generalist datasets (Moving MNIST, TaxiBJ, Human 3.6), our model outperforms or reaches state-of-the-art models, and ablation experiments demonstrate the effectiveness of our model in long-term prediction.
While convolutional neural networks (CNNs) have significantly boosted the performance of face related algorithms, maintaining accuracy and efficiency simultaneously in practical use remains challenging. Recent study shows that using a cascade of hourglass modules which consist of a number of bottom-up and top-down convolutional layers can extract facial structural information for face alignment to improve accuracy. However, previous studies have shown that features produced by shallow convolutional layers are highly correspond to edges. These features could be directly used to provide the structural information without addition cost. Motivated by this intuition, we propose an efficient multitask face alignment, face tracking and head pose estimation network (ATPN). Specifically, we introduce a shortcut connection between shallow-layer features and deep-layer features to provide the structural information for face alignment and apply the CoordConv to the last few layers to provide coordinate information. The predicted facial landmarks enable us to generate a cheap heatmap which contains both geometric and appearance information for head pose estimation and it also provides attention clues for face tracking. Moreover, the face tracking task saves us the face detection procedure for each frame, which is significant to boost performance for video-based tasks. The proposed framework is evaluated on four benchmark datasets, WFLW, 300VW, WIDER Face and 300W-LP. The experimental results show that the ATPN achieves improved performance compared to previous state-of-the-art methods while having less number of parameters and FLOPS.
In this work, we propose a novel Knowledge Graph Embedding (KGE) strategy, called M\"{o}biusE, in which the entities and relations are embedded to the surface of a M\"{o}bius ring. The proposition of such a strategy is inspired by the classic TorusE, in which the addition of two arbitrary elements is subject to a modulus operation. In this sense, TorusE naturally guarantees the critical boundedness of embedding vectors in KGE. However, the nonlinear property of addition operation on Torus ring is uniquely derived by the modulus operation, which in some extent restricts the expressiveness of TorusE. As a further generalization of TorusE, M\"{o}biusE also uses modulus operation to preserve the closeness of addition operation on it, but the coordinates on M\"{o}bius ring interacts with each other in the following way: {\em \color{red} any vector on the surface of a M\"{o}bius ring moves along its parametric trace will goes to the right opposite direction after a cycle}. Hence, M\"{o}biusE assumes much more nonlinear representativeness than that of TorusE, and in turn it generates much more precise embedding results. In our experiments, M\"{o}biusE outperforms TorusE and other classic embedding strategies in several key indicators.
The task of crowd counting is extremely challenging due to complicated difficulties, especially the huge variation in vision scale. Previous works tend to adopt a naive concatenation of multi-scale information to tackle it, while the scale shifts between the feature maps are ignored. In this paper, we propose a novel Hierarchical Scale Recalibration Network (HSRNet), which addresses the above issues by modeling rich contextual dependencies and recalibrating multiple scale-associated information. Specifically, a Scale Focus Module (SFM) first integrates global context into local features by modeling the semantic inter-dependencies along channel and spatial dimensions sequentially. In order to reallocate channel-wise feature responses, a Scale Recalibration Module (SRM) adopts a step-by-step fusion to generate final density maps. Furthermore, we propose a novel Scale Consistency loss to constrain that the scale-associated outputs are coherent with groundtruth of different scales. With the proposed modules, our approach can ignore various noises selectively and focus on appropriate crowd scales automatically. Extensive experiments on crowd counting datasets (ShanghaiTech, MALL, WorldEXPO'10, and UCSD) show that our HSRNet can deliver superior results over all state-of-the-art approaches. More remarkably, we extend experiments on an extra vehicle dataset, whose results indicate that the proposed model is generalized to other applications.