Video panoptic segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. We believe that the decoupling strategy proposed by DVIS enables more effective utilization of temporal information for both "thing" and "stuff" objects. In this report, we successfully validated the effectiveness of the decoupling strategy in video panoptic segmentation. Finally, our method achieved a VPQ score of 51.4 and 53.7 in the development and test phases, respectively, and ultimately ranked 1st in the VPS track of the 2nd PVUW Challenge. The code is available at https://github.com/zhang-tao-whu/DVIS
Capitalizing on the rapid development of neural networks, recent video frame interpolation (VFI) methods have achieved notable improvements. However, they still fall short for real-world videos containing large motions. Complex deformation and/or occlusion caused by large motions make it an extremely difficult problem in video frame interpolation. In this paper, we propose a simple yet effective solution, H-VFI, to deal with large motions in video frame interpolation. H-VFI contributes a hierarchical video interpolation transformer (HVIT) to learn a deformable kernel in a coarse-to-fine strategy in multiple scales. The learnt deformable kernel is then utilized in convolving the input frames for predicting the interpolated frame. Starting from the smallest scale, H-VFI updates the deformable kernel by a residual in succession based on former predicted kernels, intermediate interpolated results and hierarchical features from transformer. Bias and masks to refine the final outputs are then predicted by a transformer block based on interpolated results. The advantage of such a progressive approximation is that the large motion frame interpolation problem can be decomposed into several relatively simpler sub-tasks, which enables a very accurate prediction in the final results. Another noteworthy contribution of our paper consists of a large-scale high-quality dataset, YouTube200K, which contains videos depicting a great variety of scenarios captured at high resolution and high frame rate. Extensive experiments on multiple frame interpolation benchmarks validate that H-VFI outperforms existing state-of-the-art methods especially for videos with large motions.
This paper proposes a novel video inpainting method. We make three main contributions: First, we extended previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH), which improves patch-level feature alignments without additional supervision and benefits challenging scenes with various deformation. Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wised feature matching by pruning out less essential features and using saliency map. MPPA enhances matching accuracy between warped tokens with invalid pixels. Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens under the guidance of the Deformation Factor learned from DePtH, especially for videos with agile motions. Experimental results demonstrate that our method outperforms recent methods qualitatively and quantitatively and achieves a new state-of-the-art.
With the development of vehicular technologies on automation, electrification, and digitalization, vehicles are becoming more intelligent while being exposed to more complex, uncertain, and frequently occurring faults. In this paper, we look into the maintenance planning of an operating vehicle under fault condition and formulate it as a multi-criteria decision-making problem. The maintenance decisions are generated by route searching in road networks and evaluated based on risk assessment considering the uncertainty of vehicle breakdowns. Particularly, we consider two criteria, namely the risk of public time loss and the risk of mission delay, representing the concerns of the public sector and the private sector, respectively. A public time loss model is developed to evaluate the traffic congestion caused by a vehicle breakdown and the corresponding towing process. The Pareto optimal set of non-dominated decisions is derived by evaluating the risk of the decisions. We demonstrate the relevance of the problem and the effectiveness of the proposed method by numerical experiments derived from real-world scenarios. The experiments show that neglecting the risk of vehicle breakdown on public roads can cause a high risk of public time loss in dense traffic flow. With the proposed method, alternate decisions can be derived to reduce the risks of public time loss significantly with a low increase in the risk of mission delay. This study aims at catalyzing public-private partnership through collaborative decision-making between the private sector and the public sector, thus archiving a more sustainable transportation system in the future.
Temporal modeling is crucial for video super-resolution. Most of the video super-resolution methods adopt the optical flow or deformable convolution for explicitly motion compensation. However, such temporal modeling techniques increase the model complexity and might fail in case of occlusion or complex motion, resulting in serious distortion and artifacts. In this paper, we propose to explore the role of explicit temporal difference modeling in both LR and HR space. Instead of directly feeding consecutive frames into a VSR model, we propose to compute the temporal difference between frames and divide those pixels into two subsets according to the level of difference. They are separately processed with two branches of different receptive fields in order to better extract complementary information. To further enhance the super-resolution result, not only spatial residual features are extracted, but the difference between consecutive frames in high-frequency domain is also computed. It allows the model to exploit intermediate SR results in both future and past to refine the current SR output. The difference at different time steps could be cached such that information from further distance in time could be propagated to the current frame for refinement. Experiments on several video super-resolution benchmark datasets demonstrate the effectiveness of the proposed method and its favorable performance against state-of-the-art methods.
Scenario-based approaches have been receiving a huge amount of attention in research and engineering of automated driving systems. Due to the complexity and uncertainty of the driving environment, and the complexity of the driving task itself, the number of possible driving scenarios that an ADS or ADAS may encounter is virtually infinite. Therefore it is essential to be able to reason about the identification of scenarios and in particular critical ones that may impose unacceptable risk if not considered. Critical scenarios are particularly important to support design, verification and validation efforts, and as a basis for a safety case. In this paper, we present the results of a systematic literature review in the context of autonomous driving. The main contributions are: (i) introducing a comprehensive taxonomy for critical scenario identification methods; (ii) giving an overview of the state-of-the-art research based on the taxonomy encompassing 86 papers between 2017 and 2020; and (iii) identifying open issues and directions for further research. The provided taxonomy comprises three main perspectives encompassing the problem definition (the why), the solution (the methods to derive scenarios), and the assessment of the established scenarios. In addition, we discuss open research issues considering the perspectives of coverage, practicability, and scenario space explosion.
Reference-based image super-resolution (RefSR) has shown promising success in recovering high-frequency details by utilizing an external reference image (Ref). In this task, texture details are transferred from the Ref image to the low-resolution (LR) image according to their point- or patch-wise correspondence. Therefore, high-quality correspondence matching is critical. It is also desired to be computationally efficient. Besides, existing RefSR methods tend to ignore the potential large disparity in distributions between the LR and Ref images, which hurts the effectiveness of the information utilization. In this paper, we propose the MASA network for RefSR, where two novel modules are designed to address these problems. The proposed Match & Extraction Module significantly reduces the computational cost by a coarse-to-fine correspondence matching scheme. The Spatial Adaptation Module learns the difference of distribution between the LR and Ref images, and remaps the distribution of Ref features to that of LR features in a spatially adaptive way. This scheme makes the network robust to handle different reference images. Extensive quantitative and qualitative experiments validate the effectiveness of our proposed model.
New autonomous driving technologies are emerging every day and some of them have been commercially applied in the real world. While benefiting from these technologies, autonomous trucks are facing new challenges in short-term maintenance planning, which directly influences the truck operator's profit. In this paper, we implement a vehicle health management system by addressing the maintenance planning issues of autonomous trucks on a transport mission. We also present a maintenance planning model using a risk-based decision-making method, which identifies the maintenance decision with minimal economic risk of the truck company. Both availability losses and maintenance costs are considered when evaluating the economic risk. We demonstrate the proposed model by numerical experiments illustrating real-world scenarios. In the experiments, compared to three baseline methods, the expected economic risk of the proposed method is reduced by up to $47\%$. We also conduct sensitivity analyses of different model parameters. The analyses show that the economic risk significantly decreases when the estimation accuracy of remaining useful life, the maximal allowed time of delivery delay before order cancellation, or the number of workshops increases. The experiment results contribute to identifying future research and development attentions of autonomous trucks from an economic perspective.
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame. In this process, inter- and intra-frames are the key sources for exploiting temporal and spatial information. However, there are a couple of limitations for existing VSR methods. First, optical flow is often used to establish temporal correspondence. But flow estimation itself is error-prone and affects recovery results. Second, similar patterns existing in natural images are rarely exploited for the VSR task. Motivated by these findings, we propose a temporal multi-correspondence aggregation strategy to leverage similar patches across frames, and a cross-scale nonlocal-correspondence aggregation scheme to explore self-similarity of images across scales. Based on these two new modules, we build an effective multi-correspondence aggregation network (MuCAN) for VSR. Our method achieves state-of-the-art results on multiple benchmark datasets. Extensive experiments justify the effectiveness of our method.
Blind inpainting is a task to automatically complete visual contents without specifying masks for missing areas in an image. Previous works assume missing region patterns are known, limiting its application scope. In this paper, we relax the assumption by defining a new blind inpainting setting, making training a blind inpainting neural system robust against various unknown missing region patterns. Specifically, we propose a two-stage visual consistency network (VCN), meant to estimate where to fill (via masks) and generate what to fill. In this procedure, the unavoidable potential mask prediction errors lead to severe artifacts in the subsequent repairing. To address it, our VCN predicts semantically inconsistent regions first, making mask prediction more tractable. Then it repairs these estimated missing regions using a new spatial normalization, enabling VCN to be robust to the mask prediction errors. In this way, semantically convincing and visually compelling content is thus generated. Extensive experiments are conducted, showing our method is effective and robust in blind image inpainting. And our VCN allows for a wide spectrum of applications.