Abstract: Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. In contrast to static image restoration, video restoration performance depends significantly on the efficient exploitation of temporal correlations among successive video frames. Numerous techniques exploit temporal information via flow-based strategies or recurrent architectures; however, these methods often struggle to preserve temporal consistency because they rely on degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). JFFRA is built on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration: previously enhanced features are used to refine the flow, and the refined flow in turn enables efficient feature enhancement using temporal information. This interplay between flow and restoration is carried out at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to strengthen the network's ability to eliminate flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across restoration tasks such as denoising, deblurring, and super-resolution, where it outperforms state-of-the-art approaches by up to 1.62 dB.
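To make the occlusion-aware temporal loss concrete, below is a minimal PyTorch sketch of one plausible formulation. The abstract does not specify the exact construction, so the forward-backward flow-consistency occlusion mask, the function names (`warp`, `occlusion_mask`, `temporal_loss`), and the threshold values are all illustrative assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of an occlusion-aware temporal loss (assumed form).
# The occlusion mask uses a standard forward-backward flow-consistency
# heuristic; the thresholds alpha/beta are illustrative assumptions.
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp tensor x (B,C,H,W) with optical flow (B,2,H,W)."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=torch.float32),
        torch.arange(w, device=x.device, dtype=torch.float32),
        indexing="ij",
    )
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B,2,H,W)
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (B,H,W,2)
    return F.grid_sample(x, grid, align_corners=True)

def occlusion_mask(flow_fwd, flow_bwd, alpha=0.01, beta=0.5):
    """1 where a pixel is visible in both frames, 0 where likely occluded."""
    flow_bwd_warped = warp(flow_bwd, flow_fwd)
    sq_diff = ((flow_fwd + flow_bwd_warped) ** 2).sum(1, keepdim=True)
    sq_mag = (flow_fwd ** 2).sum(1, keepdim=True) \
           + (flow_bwd_warped ** 2).sum(1, keepdim=True)
    return (sq_diff <= alpha * sq_mag + beta).float()

def temporal_loss(restored_t, restored_tm1, flow_fwd, flow_bwd):
    """L1 consistency between frame t and the warped frame t-1, masked so
    that occluded pixels do not contribute (suppressing false flicker)."""
    warped = warp(restored_tm1, flow_fwd)
    mask = occlusion_mask(flow_fwd, flow_bwd).detach()
    return (mask * (restored_t - warped).abs()).sum() / (mask.sum() + 1e-8)
```

The key design point this sketch illustrates: penalizing temporal differences only where pixels are visible in both frames avoids forcing the network to hallucinate consistency across occlusion boundaries, which is what typically causes ghosting when a naive warping loss is used.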
Abstract: One of the central tasks of multi-object tracking is learning a distance metric that is consistent with the semantic similarities of objects. Designing a loss function that encourages discriminative feature learning is among the most crucial challenges in deep neural network-based metric learning. Despite significant progress, the slow convergence and poor local optima of existing contrastive- and triplet-loss-based deep metric learning methods necessitate a better solution. In this paper, we propose the cosine-margin-contrastive (CMC) and cosine-margin-triplet (CMT) losses by reformulating the contrastive and triplet loss functions from the perspective of cosine distance. The reformulation as a cosine loss is achieved through feature normalization, which distributes the learned features on a hypersphere. We then propose the MOTS R-CNN framework for joint multi-object tracking and segmentation, particularly targeted at improving tracking performance. Specifically, the tracking problem is addressed through deep metric learning based on the proposed loss functions. We further achieve scale-invariant tracking with a multi-layer feature aggregation scheme that makes the model robust to object scale variations and occlusions. MOTS R-CNN achieves state-of-the-art tracking performance on the KITTI MOTS dataset, reducing identity switches by $62\%$ and $61\%$ on cars and pedestrians, respectively, compared to Track R-CNN.
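As a concrete illustration of the cosine reformulation, here is a minimal PyTorch sketch of what CMC- and CMT-style losses could look like. The abstract only states that features are L2-normalized onto a hypersphere and compared by cosine distance; the exact margin placement, the function names, and the default margin value below are assumptions for illustration.

```python
# Hypothetical sketch of cosine-margin-contrastive (CMC) and
# cosine-margin-triplet (CMT) losses (assumed forms, not the paper's code).
import torch
import torch.nn.functional as F

def cmc_loss(feat_a, feat_b, same_identity, margin=0.4):
    """feat_a, feat_b: (B,D) embeddings; same_identity: (B,) bool labels."""
    a = F.normalize(feat_a, dim=1)        # project onto the unit hypersphere
    b = F.normalize(feat_b, dim=1)
    cos = (a * b).sum(dim=1)              # cosine similarity per pair
    pos = same_identity.float()
    # Pull matching pairs together; push non-matching pairs below the margin.
    loss = pos * (1.0 - cos) + (1.0 - pos) * F.relu(cos - margin)
    return loss.mean()

def cmt_loss(anchor, positive, negative, margin=0.4):
    """Triplet analogue: the anchor should be closer to the positive than to
    the negative by at least `margin` in cosine similarity."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    return F.relu((a * n).sum(1) - (a * p).sum(1) + margin).mean()
```

Because both losses operate on unit-norm embeddings, the margin is directly interpretable as a fixed angular separation on the hypersphere, which is the property the cosine reformulation targets over raw Euclidean contrastive and triplet losses.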