With the advancement of deep learning, artificial intelligence (AI) has made many breakthroughs in recent years and achieved superhuman performance in various tasks such as object detection, reading comprehension, and video games. Generative Modeling, such as various Generative Adversarial Networks (GAN) models, has been applied to generate paintings and music. Research in Natural Language Processing (NLP) also had a leap forward in 2018 since the release of the pre-trained contextual neural language models such as BERT and recently released GPT3. Despite the exciting AI applications aforementioned, AI is still significantly lagging behind humans in creativity, which is often considered the ultimate moonshot for AI. Our work is inspired by Chinese calligraphy, which is a unique form of visual art where the character itself is an aesthetic painting. We also draw inspirations from paintings of the Abstract Expressionist movement in the 1940s and 1950s, such as the work by American painter Franz Kline. In this paper, we present a creative framework based on Conditional Generative Adversarial Networks and Contextual Neural Language Model to generate abstract artworks that have intrinsic meaning and aesthetic value, which is different from the existing work, such as image captioning and text-to-image generation, where the texts are the descriptions of the images. In addition, we have publicly released a Chinese calligraphy image dataset and demonstrate our framework using a prototype system and a user study.
Generative models, and Generative Adversarial Networks (GAN) in particular, are being studied as possible alternatives to Monte Carlo simulations. It has been proposed that, in certain circumstances, simulation using GANs can be sped-up by using quantum GANs (qGANs). We present a new design of qGAN, the dual-Parameterized Quantum Circuit(PQC) GAN, which consists of a classical discriminator and two quantum generators which take the form of PQCs. The first PQC learns a probability distribution over N-pixel images, while the second generates normalized pixel intensities of an individual image for each PQC input. With a view to HEP applications, we evaluated the dual-PQC architecture on the task of imitating calorimeter outputs, translated into pixelated images. The results demonstrate that the model can reproduce a fixed number of images with a reduced size as well as their probability distribution and we anticipate it should allow us to scale up to real calorimeter outputs.
There are currently many challenges related to three-dimensional (3D) surface reconstruction using capsule endoscopy (CE) images. There are also challenges related to viewing the content of reconstructed 3D surfaces. In this preliminary investigation, the author focuses on the latter and evaluates their effects on the content of reconstructed 3D surfaces using CE images. The evaluation of such challenges is preliminarily conducted into two parts. The first part focuses on the comparison of the content of 3D surfaces reconstructed using both preprocessed and non-preprocessed CE images. The second part focuses on the comparison of the content of 3D surfaces viewed at the same azimuth angles and different elevation angles of the line-of-sight. The experiments demonstrated the need for generalizable line-of-sight and advanced CE image preprocessing means as well as further research in 3D surface reconstruction.
Face super-resolution, also known as face hallucination, which is aimed at enhancing the resolution of low-resolution (LR) one or a sequence of face images to generate the corresponding high-resolution (HR) face images, is a domain-specific image super-resolution problem. Recently, face super-resolution has received considerable attention, and witnessed dazzling advances with deep learning techniques. To date, few summaries of the studies on the deep learning-based face super-resolution are available. In this survey, we present a comprehensive review of deep learning techniques in face super-resolution in a systematic manner. First, we summarize the problem formulation of face super-resolution. Second, we compare the differences between generic image super-resolution and face super-resolution. Third, datasets and performance metrics commonly used in facial hallucination are presented. Fourth, we roughly categorize existing methods according to the utilization of face-specific information. In each category, we start with a general description of design principles, present an overview of representative approaches, and compare the similarities and differences among various methods. Finally, we envision prospects for further technical advancement in this field.
Convolutional neural networks have recently demonstrated high-quality reconstruction for single image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this paper, we propose the deep Laplacian Pyramid Super-Resolution Network for fast and accurate image super-resolution. The proposed network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. In contrast to existing methods that involve the bicubic interpolation for pre-processing (which results in large feature maps), the proposed method directly extracts features from the low-resolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss functions and achieve high-quality image reconstruction. Furthermore, we utilize the recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of run-time and image quality.
Face detector frequently confronts extreme scale variance challenge. The famous solutions are Multi-scale training, Data-anchor-sampling and Random crop strategy. In this paper, we indicate 2 significant elements to resolve extreme scale variance problem by investigating the difference among the previous solutions, including the fore-ground and back-ground information of an image and the scale information. However, current excellent solutions can only utilize the former information while neglecting to absorb the latter one effectively. In order to help the detector utilize the scale information efficiently, we analyze the relationship between the detector performance and the scale distribution of the training data. Based on this analysis, we propose a Selective Scale Enhancement (SSE) strategy which can assimilate these two information efficiently and simultaneously. Finally, our method achieves state-of-the-art detection performance on all common face detection benchmarks, including AFW, PASCAL face, FDDB and Wider Face datasets. Note that our result achieves six champions on the Wider Face dataset.
In the recent time deep learning has achieved huge popularity due to its performance in various machine learning algorithms. Deep learning as hierarchical or structured learning attempts to model high level abstractions in data by using a group of processing layers. The foundation of deep learning architectures is inspired by the understanding of information processing and neural responses in human brain. The architectures are created by stacking multiple linear or non-linear operations. The article mainly focuses on the state-of-art deep learning models and various real world applications specific training methods. Selecting optimal architecture for specific problem is a challenging task, at a closing stage of the article we proposed optimal approach to deep convolutional architecture for the application of image recognition.
Curved refractive objects are common in the human environment, and have a complex visual appearance that can cause robotic vision algorithms to fail. Light-field cameras allow us to address this challenge by capturing the view-dependent appearance of such objects in a single exposure. We propose a novel image feature for light fields that detects and describes the patterns of light refracted through curved transparent objects. We derive characteristic points based on these features allowing them to be used in place of conventional 2D features. Using our features, we demonstrate improved structure-from-motion performance in challenging scenes containing refractive objects, including quantitative evaluations that show improved camera pose estimates and 3D reconstructions. Additionally, our methods converge 15-35% more frequently than the state-of-the-art. Our method is a critical step towards allowing robots to operate around refractive objects, with applications in manufacturing, quality assurance, pick-and-place, and domestic robots working with acrylic, glass and other transparent materials.
Although significant progress in automatic learning of steganographic cost has been achieved recently, existing methods designed for spatial images are not well applicable to JPEG images which are more common media in daily life. The difficulties of migration mostly lie in the unique and complicated JPEG characteristics caused by 8x8 DCT mode structure. To address the issue, in this paper we extend an existing automatic cost learning scheme to JPEG, where the proposed scheme called JEC-RL (JPEG Embedding Cost with Reinforcement Learning) is explicitly designed to tailor the JPEG DCT structure. It works with the embedding action sampling mechanism under reinforcement learning, where a policy network learns the optimal embedding policies via maximizing the rewards provided by an environment network. The policy network is constructed following a domain-transition design paradigm, where three modules including pixel-level texture complexity evaluation, DCT feature extraction, and mode-wise rearrangement, are proposed. These modules operate in serial, gradually extracting useful features from a decompressed JPEG image and converting them into embedding policies for DCT elements, while considering JPEG characteristics including inter-block and intra-block correlations simultaneously. The environment network is designed in a gradient-oriented way to provide stable reward values by using a wide architecture equipped with a fixed preprocessing layer with 8x8 DCT basis filters. Extensive experiments and ablation studies demonstrate that the proposed method can achieve good security performance for JPEG images against both advanced feature based and modern CNN based steganalyzers.
Visual Question Answering (VQA) is challenging due to the complex cross-modal relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. This answer will then be checked against the question and image again for the final confirmation. In this paper, we mimic this process and propose a fully attention based VQA architecture. Moreover, an answer-checking module is proposed to perform a unified attention on the jointly answer, question and image representation to update the answer. This mimics the human answer checking process to consider the answer in the context. With answer-checking modules and transferred BERT layers, our model achieves the state-of-the-art accuracy 71.57\% using fewer parameters on VQA-v2.0 test-standard split.