In recent years, resolution adaptation based on deep neural networks has enabled significant performance gains for conventional (2D) video codecs. This paper investigates the effectiveness of spatial resolution resampling in the context of immersive content. The proposed approach reduces the spatial resolution of input multi-view videos before encoding, and reconstructs their original resolution after decoding. During the up-sampling process, an advanced CNN model is used to reduce potential re-sampling, compression, and synthesis artifacts. This work has been fully tested with the TMIV coding standard using a Versatile Video Coding (VVC) codec. The results demonstrate that the proposed method achieves a significant rate-quality performance improvement for the majority of the test sequences, with an average BD-VMAF improvement of 3.07 overall sequences.
We consider the trade-off problem between exploration and exploitation under finite discounted Markov Decision Process, where the state transition matrix of the underlying environment stays unknown. We propose a double Thompson sampling reinforcement learning algorithm(DTS) to solve this kind of problem. This algorithm achieves a total regret bound of $\tilde{\mathcal{O}}(D\sqrt{SAT})$\footnote{The symbol $\tilde{\mathcal{O}}$ means $\mathcal{O}$ with log factors ignored} in time horizon $T$ with $S$ states, $A$ actions and diameter $D$. DTS consists of two parts, the first part is the traditional part where we apply the posterior sampling method on transition matrix based on prior distribution. In the second part, we employ a count-based posterior update method to balance between the local optimal action and the long-term optimal action in order to find the global optimal game value. We established a regret bound of $\tilde{\mathcal{O}}(\sqrt{T}/S^{2})$. Which is by far the best regret bound for finite discounted Markov Decision Process to our knowledge. Numerical results proves the efficiency and superiority of our approach.
In recent years, deep learning techniques have been widely applied to video quality assessment (VQA), showing significant potential to achieve higher correlation performance with subjective opinions compared to conventional approaches. However, these methods are often developed based on limited training materials and evaluated through cross validation, due to the lack of large scale subjective databases. In this context, this paper proposes a new hybrid training methodology, which generates large volumes of training data by using quality indices from an existing perceptual quality metric, VMAF, as training targets, instead of actual subjective opinion scores. An additional shallow CNN is also employed for temporal pooling, which was trained based on a small subjective video database. The resulting Deep Video Quality Metric (based on Hybrid Training), DVQM-HT, has been fully tested on eight HD subjective video databases, and consistently exhibits higher correlation with perceptual quality compared to other deep quality assessment methods, with an average SROCC value of 0.8263.
This paper presents a new deformable convolution-based video frame interpolation (VFI) method, using a coarse to fine 3D CNN to enhance the multi-flow prediction. This model first extracts spatio-temporal features at multiple scales using a 3D CNN, and estimates multi-flows using these features in a coarse-to-fine manner. The estimated multi-flows are then used to warp the original input frames as well as context maps, and the warped results are fused by a synthesis network to produce the final output. This VFI approach has been fully evaluated against 12 state-of-the-art VFI methods on three commonly used test databases. The results evidently show the effectiveness of the proposed method, which offers superior interpolation performance over other state of the art algorithms, with PSNR gains up to 0.19dB.
Video frame interpolation (VFI) is one of the fundamental research areas in video processing and there has been extensive research on novel and enhanced interpolation algorithms. The same is not true for quality assessment of the interpolated content. In this paper, we describe a subjective quality study for VFI based on a newly developed video database, BVI-VFI. BVI-VFI contains 36 reference sequences at three different frame rates and 180 distorted videos generated using five conventional and learning based VFI algorithms. Subjective opinion scores have been collected from 60 human participants, and then employed to evaluate eight popular quality metrics, including PSNR, SSIM and LPIPS which are all commonly used for assessing VFI methods. The results indicate that none of these metrics provide acceptable correlation with the perceived quality on interpolated content, with the best-performing metric, LPIPS, offering a SROCC value below 0.6. Our findings show that there is an urgent need to develop a bespoke perceptual quality metric for VFI. The BVI-VFI dataset is publicly available and can be accessed at https://danielism97.github.io/BVI-VFI/.
The human brain's white matter (WM) structure is of immense interest to the scientific community. Diffusion MRI gives a powerful tool to describe the brain WM structure noninvasively. To potentially enable monitoring of age-related changes and investigation of sex-related brain structure differences on the mapping between the brain connectome and healthy subjects' age and sex, we extract fiber-cluster-based diffusion features and predict sex and age with a novel ensembled neural network classifier. We conduct experiments on the Human Connectome Project (HCP) young adult dataset and show that our model achieves 94.82% accuracy in sex prediction and 2.51 years MAE in age prediction. We also show that the fractional anisotropy (FA) is the most predictive of sex, while the number of fibers is the most predictive of age and the combination of different features can improve the model performance.
White matter parcellation classifies tractography streamlines into clusters or anatomically meaningful tracts to enable quantification and visualization. Most parcellation methods focus on the deep white matter (DWM), while fewer methods address the superficial white matter (SWM) due to its complexity. We propose a deep-learning-based framework, Superficial White Matter Analysis (SupWMA), that performs an efficient and consistent parcellation of 198 SWM clusters from whole-brain tractography. A point-cloud-based network is modified for our SWM parcellation task, and supervised contrastive learning enables more discriminative representations between plausible streamlines and outliers. We perform evaluation on a large tractography dataset with ground truth labels and on three independently acquired testing datasets from individuals across ages and health conditions. Compared to several state-of-the-art methods, SupWMA obtains a highly consistent and accurate SWM parcellation result. In addition, the computational speed of SupWMA is much faster than other methods.
This paper presents a deep learning-based video compression framework (ViSTRA3). The proposed framework intelligently adapts video format parameters of the input video before encoding, subsequently employing a CNN at the decoder to restore their original format and enhance reconstruction quality. ViSTRA3 has been integrated with the H.266/VVC Test Model VTM 14.0, and evaluated under the Joint Video Exploration Team Common Test Conditions. Bj{\o}negaard Delta (BD) measurement results show that the proposed framework consistently outperforms the original VVC VTM, with average BD-rate savings of 1.8% and 3.7% based on the assessment of PSNR and VMAF.
Video frame interpolation (VFI) is currently a very active research topic, with applications spanning computer vision, post production and video encoding. VFI can be extremely challenging, particularly in sequences containing large motions, occlusions or dynamic textures, where existing approaches fail to offer perceptually robust interpolation performance. In this context, we present a novel deep learning based VFI method, ST-MFNet, based on a Spatio-Temporal Multi-Flow architecture. ST-MFNet employs a new multi-scale multi-flow predictor to estimate many-to-one intermediate flows, which are combined with conventional one-to-one optical flows to capture both large and complex motions. In order to enhance interpolation performance for various textures, a 3D CNN is also employed to model the content dynamics over an extended temporal window. Moreover, ST-MFNet has been trained within an ST-GAN framework, which was originally developed for texture synthesis, with the aim of further improving perceptual interpolation quality. Our approach has been comprehensively evaluated -- compared with fourteen state-of-the-art VFI algorithms -- clearly demonstrating that ST-MFNet consistently outperforms these benchmarks on varied and representative test datasets, with significant gains up to 1.09dB in PSNR for cases including large motions and dynamic textures. Project page: https://danielism97.github.io/ST-MFNet.
With the rapid development of deep learning techniques, various recent work has tried to apply graph neural networks (GNNs) to solve NP-hard problems such as Boolean Satisfiability (SAT), which shows the potential in bridging the gap between machine learning and symbolic reasoning. However, the quality of solutions predicted by GNNs has not been well investigated in the literature. In this paper, we study the capability of GNNs in learning to solve Maximum Satisfiability (MaxSAT) problem, both from theoretical and practical perspectives. We build two kinds of GNN models to learn the solution of MaxSAT instances from benchmarks, and show that GNNs have attractive potential to solve MaxSAT problem through experimental evaluation. We also present a theoretical explanation of the effect that GNNs can learn to solve MaxSAT problem to some extent for the first time, based on the algorithmic alignment theory.