Skiing is a popular winter sport discipline with a long history of competitive events. In this domain, computer vision has the potential to enhance the understanding of athletes' performance, but its application lags behind other sports due to limited studies and datasets. This paper takes a step toward filling these gaps. A thorough investigation is performed on the task of tracking a skier in a video capturing their complete performance. Obtaining continuous and accurate skier localization is a prerequisite for further higher-level performance analyses. To enable the study, SkiTB, the largest and most extensively annotated dataset for computer vision in skiing, is introduced. Several visual object tracking algorithms, including both established methodologies and a newly introduced skier-optimized baseline algorithm, are tested on the dataset. The results provide valuable insights into the applicability of different tracking methods for vision-based skiing analysis. SkiTB, code, and results are available at https://machinelearning.uniud.it/datasets/skitb.
Trajectories are fundamental to winning in alpine skiing. Tools enabling the analysis of such curves can enhance training and enrich broadcasting content. In this paper, we propose SkiTraVis, an algorithm to visualize the sequence of points traversed by a skier during their performance. SkiTraVis works on monocular videos and consists of a pipeline that combines a visual tracker, which models the skier's motion, with a frame correspondence module, which estimates the camera's motion. Separating the two motions enables the visualization of the trajectory from the moving camera's perspective. We performed experiments on videos of real-world professional competitions to quantify the visualization error, the computational efficiency, and the applicability. Overall, the results demonstrate the potential of our solution for broadcasting media enhancement and coach assistance.
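The separation of skier motion and camera motion described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the frame correspondence module yields a 3x3 homography mapping the previous frame to the current one and that the tracker yields one image point per frame; the function names are hypothetical and are not taken from SkiTraVis.

```python
def apply_homography(H, point):
    """Map a 2D image point through a 3x3 homography (nested lists)."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def update_trajectory(trajectory, H_prev_to_curr, skier_point):
    """Warp all past trajectory points into the current frame (camera motion),
    then append the skier position reported by the tracker (skier motion)."""
    warped = [apply_homography(H_prev_to_curr, p) for p in trajectory]
    warped.append(skier_point)
    return warped


# Assumed toy input: the camera pans so that image content shifts by (+5, -2).
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pan = [[1, 0, 5], [0, 1, -2], [0, 0, 1]]

trajectory = update_trajectory([], identity, (10, 10))
trajectory = update_trajectory(trajectory, pan, (20, 20))
print(trajectory)
```

Because every stored point is re-projected at each frame, the drawn curve stays attached to the scene even while the camera moves.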
The degradation of underwater images is due to wavelength-dependent light attenuation, scattering, and the diversity of the water types in which they are captured. Deep neural networks have advanced this field by providing autonomous models able to enhance underwater images. We introduce Underwater Capsules Vectors GAN (UWCVGAN), based on the discrete feature quantization paradigm of VQGAN, for this task. The proposed UWCVGAN combines an encoding network, which compresses the image into its latent representation, with a decoding network, which reconstructs the enhanced image from the latent representation alone. In contrast with VQGAN, UWCVGAN achieves feature quantization by exploiting the clustering ability of capsule layers, making the model completely trainable and easier to manage. The model produces enhanced underwater images with high quality and fine details. Moreover, the trained encoder is independent of the decoder, so it can be embedded on the collecting device as a compression algorithm that reduces the memory required for the images by a factor of $3\times$. UWCVGAN is validated with quantitative and qualitative analysis on benchmark datasets, and we present metric results compared with the state of the art.
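For context, the discrete feature quantization that VQGAN-style models rely on can be sketched as a nearest-neighbour lookup in a codebook. This is a minimal illustration of that baseline paradigm only; it is not the capsule-based quantization of UWCVGAN, and the codebook values and the function name are made up for the example.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`
    in squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Assumed toy codebook with three 2-D entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, entry = quantize([0.9, 1.2], codebook)
print(index, entry)
```

Storing only the integer index per latent position, rather than the continuous vector, is what makes such a latent representation compact.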
Existing deep image super-resolution (SR) methods usually assume that a Low Resolution (LR) image is a bicubically downscaled version of a High Resolution (HR) image. However, such an ideal bicubic downsampling process differs from real LR degradations, which usually arise from complicated combinations of different degradation processes, such as camera blur, sensor noise, sharpening artifacts, JPEG compression, further image editing, repeated image transmission over the internet, and unpredictable noise. This makes the inverse upscaling problem highly ill-posed. To address these issues, we propose a GAN-based SR approach with learnable adaptive sinusoidal nonlinearities incorporated in the LR and SR models, which directly learns the degradation distributions and then synthesizes paired LR/HR training data to train an SR model that generalizes to real image degradations. We demonstrate the effectiveness of our proposed approach in quantitative and qualitative experiments.
The understanding of human-object interactions is fundamental in First Person Vision (FPV). Visual tracking algorithms that follow the objects manipulated by the camera wearer can provide useful information to effectively model such interactions. In recent years, the computer vision community has significantly improved the performance of tracking algorithms for a large variety of target objects and scenarios. Despite a few previous attempts to exploit trackers in the FPV domain, a methodical analysis of the performance of state-of-the-art trackers is still missing. This research gap raises the question of whether current solutions can be used ``off-the-shelf'' or whether more domain-specific investigations should be carried out. This paper aims to provide answers to such questions. We present the first systematic investigation of single object tracking in FPV. Our study extensively analyses the performance of 42 algorithms, including generic object trackers and baseline FPV-specific trackers. The analysis is carried out by focusing on different aspects of the FPV setting, by introducing new performance measures, and in relation to FPV-specific tasks. The study is made possible through the introduction of TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences. Our results show that object tracking in FPV poses new challenges to current visual trackers. We highlight the factors causing such behavior and point out possible research directions. Despite these difficulties, we prove that trackers bring benefits to FPV downstream tasks requiring short-term object tracking. We expect that generic object tracking will gain popularity in FPV as new and FPV-specific methodologies are investigated.
How to combine the complementary capabilities of an ensemble of different algorithms has been of central interest in visual object tracking. Significant progress has been made on this problem, but only in short-term tracking scenarios. Long-term tracking settings, in contrast, have been largely ignored by existing solutions. In this paper, we explicitly consider long-term tracking scenarios and provide a framework, named CoCoLoT, that combines the characteristics of complementary visual trackers to achieve enhanced long-term tracking performance. CoCoLoT perceives whether the trackers are following the target object through an online learned deep verification model, and accordingly activates a decision policy that selects the best performing tracker and corrects the performance of the failing one. The proposed methodology is evaluated extensively, and the comparison with several other solutions reveals that it competes favourably with the state of the art on the most popular long-term visual tracking benchmarks.
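The select-and-correct behaviour described above can be sketched as a small decision function. This is only an illustrative reading of the abstract: the verification scores are assumed to be given as numbers in [0, 1], and the failure threshold, the function name, and the idea of returning the indices of trackers to correct are assumptions, not details of CoCoLoT.

```python
def decide(boxes, scores, fail_threshold=0.5):
    """Pick the box of the tracker with the highest verification score and
    list the other trackers whose score falls below `fail_threshold`,
    i.e. those that would need to be corrected with the selected box."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    to_correct = [i for i, s in enumerate(scores)
                  if i != best and s < fail_threshold]
    return boxes[best], to_correct


# Assumed toy input: boxes as (x, y, width, height), one per tracker.
boxes = [(0, 0, 10, 10), (50, 50, 10, 10)]
output, failing = decide(boxes, [0.9, 0.2])
print(output, failing)
```

In this sketch, a tracker flagged as failing would be re-initialized with the selected box before the next frame, which is one simple way to realize the correction step.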
Trajectories are fundamental in different skiing disciplines. Tools enabling the analysis of such curves can enhance training and enrich broadcasting content. However, the solutions currently available are based on geo-localized sensors and surface models. In this short paper, we propose a video-based approach to reconstruct the sequence of points traversed by an athlete during their performance. Our prototype consists of a pipeline of deep learning-based algorithms that reconstruct the athlete's motion and visualize it according to the camera perspective. This is achieved for different skiing disciplines in the wild without any camera calibration. We tested our solution on broadcast and smartphone-captured videos of professional alpine skiing and ski jumping competitions. The qualitative results show the potential of our solution.
Modern digital cameras and smartphones mostly rely on image signal processing (ISP) pipelines to produce realistic colored RGB images. However, compared to DSLR cameras, many portable mobile devices with compact camera sensors produce low-quality images due to their physical limitations. These low-quality images suffer from multiple degradations, i.e., sub-pixel shifts due to camera motion, mosaic patterns due to the camera color filter array, low resolution due to smaller camera sensors, and corruption of the remaining information by noise. Such degradations limit the performance of current Single Image Super-resolution (SISR) methods in recovering high-resolution (HR) image details from a single low-resolution (LR) image. In this work, we propose a Raw Burst Super-Resolution Iterative Convolutional Neural Network (RBSRICNN) that models the burst photography pipeline as a whole through a forward (physical) model. In contrast to existing black-box data-driven methods, the proposed Burst SR scheme solves the problem with classical image regularization, convex optimization, and deep learning techniques. The proposed network produces the final output by iteratively refining the intermediate SR estimates. We demonstrate the effectiveness of our proposed approach in quantitative and qualitative experiments, showing that it generalizes robustly to real LR burst inputs with only synthetic burst data available for training.
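The idea of iteratively refining an SR estimate through a forward model can be sketched on a toy 1-D problem. The sketch below is a heavy simplification under stated assumptions: the forward model is plain 2x averaging (no sub-pixel shift, mosaicking, or noise model), every burst frame shares that model, and the update is a plain gradient step on the data-fidelity term with no learned regularizer. It is not the RBSRICNN network, and all names are hypothetical.

```python
def downsample(x):
    """Assumed forward model: average adjacent pairs (2x downsampling)."""
    return [(x[2 * i] + x[2 * i + 1]) / 2.0 for i in range(len(x) // 2)]


def downsample_adjoint(r):
    """Adjoint of `downsample`: spread each value equally over two samples."""
    out = []
    for v in r:
        out.extend([v / 2.0, v / 2.0])
    return out


def refine(burst, iterations=30, step=1.0):
    """Iteratively refine an HR estimate so that its downsampled version
    matches the LR burst frames (gradient descent on the data term)."""
    n = 2 * len(burst[0])
    x = [0.0] * n  # initial HR estimate
    for _ in range(iterations):
        grad = [0.0] * n
        for y in burst:
            residual = [a - b for a, b in zip(downsample(x), y)]
            for i, g in enumerate(downsample_adjoint(residual)):
                grad[i] += g / len(burst)
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


# Assumed toy burst: two noisy LR observations of the same 1-D signal.
burst = [[1.0, 3.0], [3.0, 5.0]]
estimate = refine(burst)
print([round(v, 4) for v in estimate])
```

Each pass moves the estimate toward agreement with all burst frames at once; a learned regularizer, as in the abstract, would add a prior term to this update.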
Person re-identification (re-id) aims to retrieve images of the same identities across different camera views. Resolution mismatch occurs due to varying distances between the person of interest and the cameras, and it significantly degrades the performance of re-id in real-world scenarios. Most existing approaches treat the re-id task as a low-resolution problem in which a low-resolution query image is searched for in a gallery of high-resolution images. Several approaches apply image super-resolution techniques to produce high-resolution images but ignore the multiple resolutions of gallery images, which is a more realistic scenario. In this paper, we introduce channel correlations to improve the learning of features from the degraded data. In addition, to overcome the problem of multiple resolutions, we propose a Resolution based Feature Distillation (RFD) approach. This approach learns resolution-invariant features by filtering the resolution-related features out of the final feature vectors that are used to compute the distance matrix. We tested the proposed approach on two synthetically created datasets and on one original multi-resolution dataset with real degradation. Our approach improves performance when multiple resolutions occur in the gallery and achieves comparable results in the single-resolution case (low-resolution re-id).
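The effect of removing resolution-related features before computing the distance matrix can be illustrated with a toy sketch. It assumes, purely for illustration, that the resolution-related dimensions are known in advance and given as a fixed boolean mask; in the abstract this filtering is learned, and the mask, the feature values, and the function names here are hypothetical.

```python
import math


def filter_features(feature, keep_mask):
    """Keep only the dimensions flagged True in `keep_mask`."""
    return [f for f, keep in zip(feature, keep_mask) if keep]


def distance_matrix(queries, gallery, keep_mask):
    """Euclidean distances between filtered query and gallery features;
    rows index queries, columns index gallery items."""
    q = [filter_features(v, keep_mask) for v in queries]
    g = [filter_features(v, keep_mask) for v in gallery]
    return [[math.dist(a, b) for b in g] for a in q]


# Assumed toy features: the last dimension encodes resolution, not identity.
queries = [[1.0, 0.0, 9.0]]
gallery = [[1.0, 0.0, 0.0],   # same identity, different resolution
           [0.0, 1.0, 9.0]]   # different identity, same resolution
mask = [True, True, False]
print(distance_matrix(queries, gallery, mask))
```

With the resolution dimension kept, the second gallery item would be the closer match; once it is filtered out, the first item, which shares the identity, ranks first.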