To improve the viewer's Quality of Experience (QoE) and optimize computer graphics applications, 3D model quality assessment (3D-QA) has become an important task in the multimedia area. Point cloud and mesh are the two most widely used digital representation formats of 3D models, the visual quality of which is quite sensitive to lossy operations like simplification and compression. Therefore, many related studies such as point cloud quality assessment (PCQA) and mesh quality assessment (MQA) have been carried out to measure the caused visual quality degradations. However, a large part of previous studies utilizes full-reference (FR) metrics, which means they may fail to predict the quality level with the absence of the reference 3D model. Furthermore, few 3D-QA metrics are carried out to consider color information, which significantly restricts the effectiveness and scope of application. In this paper, we propose a no-reference (NR) quality assessment metric for colored 3D models represented by both point cloud and mesh. First, we project the 3D models from 3D space into quality-related geometry and color feature domains. Then, the natural scene statistics (NSS) and entropy are utilized to extract quality-aware features. Finally, the Support Vector Regressor (SVR) is employed to regress the quality-aware features into quality scores. Our method is mainly validated on the colored point cloud quality assessment database (SJTU-PCQA) and the colored mesh quality assessment database (CMDM). The experimental results show that the proposed method outperforms all the state-of-art NR 3D-QA metrics and obtains an acceptable gap with the state-of-art FR 3D-QA metrics.
To improve the viewer's quality of experience and optimize processing systems in computer graphics applications, the 3D quality assessment (3D-QA) has become an important task in the multimedia area. Point cloud and mesh are the two most widely used electronic representation formats of 3D models, the quality of which is quite sensitive to operations like simplification and compression. Therefore, many studies concerning point cloud quality assessment (PCQA) and mesh quality assessment (MQA) have been carried out to measure the visual quality degradations caused by lossy operations. However, a large part of previous studies utilizes full-reference (FR) metrics, which means they may fail to predict the accurate quality level of 3D models when the reference 3D model is not available. Furthermore, limited numbers of 3D-QA metrics are carried out to take color features into consideration, which significantly restricts the effectiveness and scope of application. In many quality assessment studies, natural scene statistics (NSS) have shown a good ability to quantify the distortion of natural scenes to statistical parameters. Therefore, we propose an NSS-based no-reference quality assessment metric for colored 3D models. In this paper, quality-aware features are extracted from the aspects of color and geometry directly from the 3D models. Then the statistic parameters are estimated using different distribution models to describe the characteristic of the 3D models. Our method is mainly validated on the colored point cloud quality assessment database (SJTU-PCQA) and the colored mesh quality assessment database (CMDM). The experimental results show that the proposed method outperforms all the state-of-art NR 3D-QA metrics and obtains an acceptable gap with the state-of-art FR 3D-QA metrics.
Recent studies reveal that Convolutional Neural Networks (CNNs) are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications. Many adversarial defense methods improve robustness at the cost of accuracy, raising the contradiction between standard and adversarial accuracies. In this paper, we observe an interesting phenomenon that feature statistics change monotonically and smoothly w.r.t the rising of attacking strength. Based on this observation, we propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths. Our method is trained to automatically align features of arbitrary attacking strength. This is done by predicting a fusing weight in a dual-BN architecture. Unlike previous works that need to either retrain the model or manually tune a hyper-parameters for different attacking strengths, our method can deal with arbitrary attacking strengths with a single model without introducing any hyper-parameter. Importantly, our method improves the model robustness against adversarial samples without incurring much loss in standard accuracy. Experiments on CIFAR-10, SVHN, and tiny-ImageNet datasets demonstrate that our method outperforms the state-of-the-art under a wide range of attacking strengths.
With an increasing demand for robots, robotic grasping will has a more important role in future applications. This paper takes grasp stability prediction as the key technology for grasping and tries to solve the problem with time series data inputs including the force and pressure data. Widely applied to more fields to predict unstable grasping with time series data, algorithms can significantly promote the application of artificial intelligence in traditional industries. This research investigates models that combine short-time Fourier transform (STFT) and long short-term memory (LSTM) and then tested generalizability with dexterous hand and suction cup gripper. The experiments suggest good results for grasp stability prediction with the force data and the generalized results in the pressure data. Among the 4 models, (Data + STFT) & LSTM delivers the best performance. We plan to perform more work on grasp stability prediction, generalize the findings to different types of sensors, and apply the grasp stability prediction in more grasping use cases in real life.
In this paper, we propose a deep learning based video quality assessment (VQA) framework to evaluate the quality of the compressed user's generated content (UGC) videos. The proposed VQA framework consists of three modules, the feature extraction module, the quality regression module, and the quality pooling module. For the feature extraction module, we fuse the features from intermediate layers of the convolutional neural network (CNN) network into final quality-aware feature representation, which enables the model to make full use of visual information from low-level to high-level. Specifically, the structure and texture similarities of feature maps extracted from all intermediate layers are calculated as the feature representation for the full reference (FR) VQA model, and the global mean and standard deviation of the final feature maps fused by intermediate feature maps are calculated as the feature representation for the no reference (NR) VQA model. For the quality regression module, we use the fully connected (FC) layer to regress the quality-aware features into frame-level scores. Finally, a subjectively-inspired temporal pooling strategy is adopted to pool frame-level scores into the video-level score. The proposed model achieves the best performance among the state-of-the-art FR and NR VQA models on the Compressed UGC VQA database and also achieves pretty good performance on the in-the-wild UGC VQA databases.
Recent advances in Named Entity Recognition (NER) show that document-level contexts can significantly improve model performance. In many application scenarios, however, such contexts are not available. In this paper, we propose to find external contexts of a sentence by retrieving and selecting a set of semantically relevant texts through a search engine, with the original sentence as the query. We find empirically that the contextual representations computed on the retrieval-based input view, constructed through the concatenation of a sentence and its external contexts, can achieve significantly improved performance compared to the original input view based only on the sentence. Furthermore, we can improve the model performance of both input views by Cooperative Learning, a training method that encourages the two input views to produce similar contextual representations or output label distributions. Experiments show that our approach can achieve new state-of-the-art performance on 8 NER data sets across 5 domains.
As the quality of mobile cameras starts to play a crucial role in modern smartphones, more and more attention is now being paid to ISP algorithms used to improve various perceptual aspects of mobile photos. In this Mobile AI challenge, the target was to develop an end-to-end deep learning-based image signal processing (ISP) pipeline that can replace classical hand-crafted ISPs and achieve nearly real-time performance on smartphone NPUs. For this, the participants were provided with a novel learned ISP dataset consisting of RAW-RGB image pairs captured with the Sony IMX586 Quad Bayer mobile sensor and a professional 102-megapixel medium format camera. The runtime of all models was evaluated on the MediaTek Dimensity 1000+ platform with a dedicated AI processing unit capable of accelerating both floating-point and quantized neural networks. The proposed solutions are fully compatible with the above NPU and are capable of processing Full HD photos under 60-100 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model achieves 8.1 BLEU improvements over the benchmark on the MuST-C test set and is even approaching the results of a strong cascade solution. For text-to-text simultaneous translation, we explore the best practice to optimize the wait-k model. As a result, our final submitted systems exceed the benchmark at around 7 BLEU on the same latency regime. We will publish our code and model to facilitate both future research works and industrial applications.