Time-to-Contact (TTC) estimation is a critical task for assessing collision risk and is widely used in various driver assistance and autonomous driving systems. The past few decades have witnessed development of related theories and algorithms. The prevalent learning-based methods call for a large-scale TTC dataset in real-world scenarios. In this work, we present a large-scale object oriented TTC dataset in the driving scene for promoting the TTC estimation by a monocular camera. To collect valuable samples and make data with different TTC values relatively balanced, we go through thousands of hours of driving data and select over 200K sequences with a preset data distribution. To augment the quantity of small TTC cases, we also generate clips using the latest Neural rendering methods. Additionally, we provide several simple yet effective TTC estimation baselines and evaluate them extensively on the proposed dataset to demonstrate their effectiveness. The proposed dataset is publicly available at https://open-dataset.tusen.ai/TSTTC.
Recently, multiple data-driven models based on machine learning for weather forecasting have emerged. These models are highly competitive in terms of accuracy compared to traditional numerical weather prediction (NWP) systems. In particular, the Pangu-Weather model, which is open source for non-commercial use, has been validated for its forecasting performance by the European Centre for Medium-Range Weather Forecasts (ECMWF) and has recently been published in the journal "Nature". In this paper, we evaluate the compatibility of the Pangu-Weather model with several commonly used NWP operational analyses through case studies. The results indicate that the Pangu-Weather model is compatible with different operational analyses from various NWP systems as the model initial conditions, and it exhibits a relatively stable forecasting capability. Furthermore, we have verified that improving the quality of global or local initial conditions significantly contributes to enhancing the forecasting performance of the Pangu-Weather model.
Conformal prediction (CP) is a framework to quantify uncertainty of machine learning classifiers including deep neural networks. Given a testing example and a trained classifier, CP produces a prediction set of candidate labels with a user-specified coverage (i.e., true class label is contained with high probability). Almost all the existing work on CP assumes clean testing data and there is not much known about the robustness of CP algorithms w.r.t natural/adversarial perturbations to testing examples. This paper studies the problem of probabilistically robust conformal prediction (PRCP) which ensures robustness to most perturbations around clean input examples. PRCP generalizes the standard CP (cannot handle perturbations) and adversarially robust CP (ensures robustness w.r.t worst-case perturbations) to achieve better trade-offs between nominal performance and robustness. We propose a novel adaptive PRCP (aPRCP) algorithm to achieve probabilistically robust coverage. The key idea behind aPRCP is to determine two parallel thresholds, one for data samples and another one for the perturbations on data (aka "quantile-of-quantile" design). We provide theoretical analysis to show that aPRCP algorithm achieves robust coverage. Our experiments on CIFAR-10, CIFAR-100, and ImageNet datasets using deep neural networks demonstrate that aPRCP achieves better trade-offs than state-of-the-art CP and adversarially robust CP algorithms.
With recent advances in computing hardware and surges of deep-learning architectures, learning-based deep image registration methods have surpassed their traditional counterparts, in terms of metric performance and inference time. However, these methods focus on improving performance measurements such as Dice, resulting in less attention given to model behaviors that are equally desirable for registrations, especially for medical imaging. This paper investigates these behaviors for popular learning-based deep registrations under a sanity-checking microscope. We find that most existing registrations suffer from low inverse consistency and nondiscrimination of identical pairs due to overly optimized image similarities. To rectify these behaviors, we propose a novel regularization-based sanity-enforcer method that imposes two sanity checks on the deep model to reduce its inverse consistency errors and increase its discriminative power simultaneously. Moreover, we derive a set of theoretical guarantees for our sanity-checked image registration method, with experimental results supporting our theoretical findings and their effectiveness in increasing the sanity of models without sacrificing any performance. Our code and models are available at https://github.com/tuffr5/Saner-deep-registration.
With the rapid development of global road transportation, countries worldwide have completed the construction of road networks. However, the ensuing challenge lies in the maintenance of existing roads. It is well-known that countries allocate limited budgets to road maintenance projects, and road management departments face difficulties in making scientifically informed maintenance decisions. Therefore, integrating various artificial intelligence decision-making techniques to thoroughly explore historical maintenance data and adapt them to the context of road maintenance scientific decision-making has become an urgent issue. This integration aims to provide road management departments with more scientific tools and evidence for decision-making. The framework proposed in this paper primarily addresses the following four issues: 1) predicting the pavement performance of various routes, 2) determining the prioritization of maintenance routes, 3) making maintenance decisions based on the evaluation of the effects of past maintenance, and considering comprehensive technical and management indicators, and 4) determining the prioritization of maintenance sections based on the maintenance effectiveness and recommended maintenance effectiveness. By tackling these four problems, the framework enables intelligent decision-making for the optimal maintenance plan and maintenance sections, taking into account limited funding and historical maintenance management experience.
Visible-infrared person re-identification (VI-ReID), which aims to search identities across different spectra, is a challenging task due to large cross-modality discrepancy between visible and infrared images. The key to reduce the discrepancy is to filter out identity-irrelevant interference and effectively learn modality-invariant person representations. In this paper, we propose a novel Modality Restitution and Compensation Network (MRCN) to narrow the gap between the two modalities. Specifically, we first reduce the modality discrepancy by using two Instance Normalization (IN) layers. Next, to reduce the influence of IN layers on removing discriminative information and to reduce modality differences, we propose a Modality Restitution Module (MRM) and a Modality Compensation Module (MCM) to respectively distill modality-irrelevant and modality-relevant features from the removed information. Then, the modality-irrelevant features are used to restitute to the normalized visible and infrared features, while the modality-relevant features are used to compensate for the features of the other modality. Furthermore, to better disentangle the modality-relevant features and the modality-irrelevant features, we propose a novel Center-Quadruplet Causal (CQC) loss to encourage the network to effectively learn the modality-relevant features and the modality-irrelevant features. Extensive experiments are conducted to validate the superiority of our method on the challenging SYSU-MM01 and RegDB datasets. More remarkably, our method achieves 95.1% in terms of Rank-1 and 89.2% in terms of mAP on the RegDB dataset.
Safe deployment of deep neural networks in high-stake real-world applications requires theoretically sound uncertainty quantification. Conformal prediction (CP) is a principled framework for uncertainty quantification of deep models in the form of prediction set for classification tasks with a user-specified coverage (i.e., true class label is contained with high probability). This paper proposes a novel algorithm referred to as Neighborhood Conformal Prediction (NCP) to improve the efficiency of uncertainty quantification from CP for deep classifiers (i.e., reduce prediction set size). The key idea behind NCP is to use the learned representation of the neural network to identify k nearest-neighbors calibration examples for a given testing input and assign them importance weights proportional to their distance to create adaptive prediction sets. We theoretically show that if the learned data representation of the neural network satisfies some mild conditions, NCP will produce smaller prediction sets than traditional CP algorithms. Our comprehensive experiments on CIFAR-10, CIFAR-100, and ImageNet datasets using diverse deep neural networks strongly demonstrate that NCP leads to significant reduction in prediction set size over prior CP methods.
Place recognition, an algorithm to recognize the re-visited places, plays the role of back-end optimization trigger in a full SLAM system. Many works equipped with deep learning tools, such as MLP, CNN, and transformer, have achieved great improvements in this research field. Point cloud transformer is one of the excellent frameworks for place recognition applied in robotics, but with large memory consumption and expensive computation, it is adverse to widely deploy the various point cloud transformer networks in mobile or embedded devices. To solve this issue, we propose a binary point cloud transformer for place recognition. As a result, a 32-bit full-precision model can be reduced to a 1-bit model with less memory occupation and faster binarized bitwise operations. To our best knowledge, this is the first binary point cloud transformer that can be deployed on mobile devices for online applications such as place recognition. Experiments on several standard benchmarks demonstrate that the proposed method can get comparable results with the corresponding full-precision transformer model and even outperform some full-precision deep learning methods. For example, the proposed method achieves 93.28% at the top @1% and 85.74% at the top @1% on the Oxford RobotCar dataset in terms of the metric of the average recall rate. Meanwhile, the size and floating point operations of the model with the same transformer structure reduce 56.1% and 34.1% respectively from original precision to binary precision.
Applying powerful generative denoising diffusion models (DDMs) for downstream tasks such as image semantic editing usually requires either fine-tuning pre-trained DDMs or learning auxiliary editing networks. In this work, we achieve SOTA semantic control performance on various application settings by optimizing the denoising trajectory solely via frozen DDMs. As one of the first optimization-based diffusion editing work, we start by seeking a more comprehensive understanding of the intermediate high-dimensional latent spaces by theoretically and empirically analyzing their probabilistic and geometric behaviors in the Markov chain. We then propose to further explore the critical step in the denoising trajectory that characterizes the convergence of a pre-trained DDM. Last but not least, we further present our method to search for the semantic subspaces boundaries for controllable manipulation, by guiding the denoising trajectory towards the targeted boundary at the critical convergent step. We conduct extensive experiments on various DPMs architectures (DDPM, iDDPM) and datasets (CelebA, CelebA-HQ, LSUN-church, LSUN-bedroom, AFHQ-dog) with different resolutions (64, 256) as empirical demonstrations.
Optical flow estimation has been a long-lasting and fundamental problem in the computer vision community. However, despite the advances of optical flow estimation in perspective videos, the 360$^\circ$ videos counterpart remains in its infancy, primarily due to the shortage of benchmark datasets and the failure to accommodate the omnidirectional nature of 360$^\circ$ videos. We propose the first perceptually realistic 360$^\circ$ filed-of-view video benchmark dataset, namely FLOW360, with 40 different videos and 4,000 video frames. We then conduct comprehensive characteristic analysis and extensive comparisons with existing datasets, manifesting FLOW360's perceptual realism, uniqueness, and diversity. Moreover, we present a novel Siamese representation Learning framework for Omnidirectional Flow (SLOF) estimation, which is trained in a contrastive manner via a hybrid loss that combines siamese contrastive and optical flow losses. By training the model on random rotations of the input omnidirectional frames, our proposed contrastive scheme accommodates the omnidirectional nature of optical flow estimation in 360$^\circ$ videos, resulting in significantly reduced prediction errors. The learning scheme is further proven to be efficient by expanding our siamese learning scheme and omnidirectional optical flow estimation to the egocentric activity recognition task, where the classification accuracy is boosted up to $\sim$26%. To summarize, we study the optical flow estimation in 360$^\circ$ videos problem from perspectives of the benchmark dataset, learning model, and also practical application. The FLOW360 dataset and code are available at https://siamlof.github.io.