Directly regressing all 6 degrees-of-freedom (6DoF) for the object pose (i.e. the 3D rotation and translation) in a cluttered environment from a single RGB image is a challenging problem. While end-to-end methods have recently demonstrated promising results at high efficiency, they are still inferior to elaborate P$n$P/RANSAC-based approaches in terms of pose accuracy. In this work, we address this shortcoming by means of a novel reasoning about self-occlusion, in order to establish a two-layer representation for 3D objects which considerably enhances the accuracy of end-to-end 6D pose estimation. Our framework, named SO-Pose, takes a single RGB image as input and generates 2D-3D correspondences as well as self-occlusion information using a shared encoder and two separate decoders. Both outputs are then fused to directly regress the 6DoF pose parameters. By incorporating cross-layer consistencies that align correspondences, self-occlusion and 6D pose, we can further improve accuracy and robustness, surpassing or rivaling all other state-of-the-art approaches on various challenging datasets.
Following the pioneering works of Rudin, Osher and Fatemi on total variation (TV) and of Buades, Coll and Morel on non-local means (NL-means), the last decade has seen a large number of denoising methods mixing these two approaches, starting with the nonlocal total variation (NLTV) model. The present article proposes an analysis of the NLTV model for image denoising as well as a number of improvements, the most important of which is to apply the denoising both in the space domain and in the Fourier domain, in order to exploit the complementarity of the representations of image data in the two domains. A local version obtained by a regionwise implementation followed by an aggregation process, called the Local Spatial-Frequency NLTV (L-SFNLTV) model, is finally proposed as a new reference algorithm for image denoising among the family of approaches mixing TV and NL operators. The experiments show the strong performance of L-SFNLTV, in terms of both image quality and computational speed, compared with other recently proposed NLTV-related methods.
In this paper, a color texture image retrieval framework is proposed based on Shearlet-domain modeling using a Copula multivariate model. In the proposed framework, a Gaussian Copula is used to model the dependencies between different sub-bands of the Non-Subsampled Shearlet Transform (NSST), and non-Gaussian models are used for marginal modeling of the coefficients. Six different schemes are proposed for modeling NSST coefficients based on the four defined types of neighborhood; moreover, the closed form of the Kullback-Leibler divergence (KLD) is derived in different situations for the Gaussian Copula and the non-Gaussian functions. The Jeffreys divergence (JD) criterion, which is a symmetric version of the KLD, is used to measure similarity in the proposed retrieval framework. We ran experiments on four texture image retrieval benchmark datasets, and the results show the superiority of the proposed framework over the existing state-of-the-art methods. In addition, the retrieval time of the proposed framework is analyzed in its two steps, feature extraction and similarity matching, which shows that the proposed framework has a reasonable retrieval time.
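To make the closed-form divergence idea concrete, here is a minimal sketch, not the paper's derivation. It assumes the dependence structure is summarized by a correlation matrix and uses the standard closed-form KLD between two zero-mean Gaussians; the symmetric divergence is taken here as the sum of the two directed KLDs, which is one common convention. The function names are hypothetical, and the non-Gaussian marginal terms mentioned in the abstract are not modeled.

```python
import numpy as np

def kl_zero_mean_gaussian(cov_p, cov_q):
    """Closed-form KL(p || q) for zero-mean Gaussians with covariances cov_p, cov_q."""
    cov_p = np.asarray(cov_p, dtype=float)
    cov_q = np.asarray(cov_q, dtype=float)
    d = cov_p.shape[0]
    # tr(cov_q^{-1} cov_p), computed with a linear solve instead of an explicit inverse
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term - d + logdet_q - logdet_p)

def symmetric_divergence(cov_p, cov_q):
    """Symmetrized KLD (assumed convention: sum of both directions)."""
    return kl_zero_mean_gaussian(cov_p, cov_q) + kl_zero_mean_gaussian(cov_q, cov_p)

# Hypothetical correlation matrices standing in for two texture models.
R_query = np.array([[1.0, 0.3], [0.3, 1.0]])
R_candidate = np.array([[1.0, 0.7], [0.7, 1.0]])
score = symmetric_divergence(R_query, R_candidate)  # smaller means more similar
```

In a retrieval setting, a score of this kind would be computed between the query model and every database model, and the database images ranked by increasing divergence.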
Handwriting recognition has been one of the most fascinating and challenging research areas in the field of image processing and pattern recognition. It contributes enormously to the improvement of automation processes. In this paper, a system for the recognition of unconstrained handwritten Malayalam characters is proposed. A database of 10,000 character samples of 44 basic Malayalam characters is used in this work. A discriminative feature set of 64 local and 4 global features is used to train and test an SVM classifier, which achieves 92.24% accuracy.
Due to the limited size and imperfections of the optical components in a spectrometer, aberration is inevitably introduced into the two-dimensional multi-fiber spectrum images of LAMOST, which leads to obvious spatial variation of the point spread functions (PSFs). Consequently, if the spatially variant PSFs are estimated directly, the huge storage and intensive computation requirements make the deconvolution-based spectral extraction method intractable. In this paper, we propose a novel method to solve the problem of spatially varying PSFs through image aberration correction. Once the CCD image aberration is corrected, the PSF, i.e. the convolution kernel, can be approximated by a single spatially invariant PSF. Specifically, machine learning techniques are adopted to calibrate the distorted spectral image, including the Total Least Squares (TLS) algorithm, an intelligent sampling method, and multi-layer feed-forward neural networks. The calibration experiments on LAMOST CCD images show that the proposed calibration method is effective. In addition, the spectrum extraction results before and after calibration are compared; the results show that the characteristics of the extracted one-dimensional waveform are closer to those of an ideal optical system, and that the PSF of the corrected object spectrum image estimated by the blind deconvolution method is nearly centrally symmetric, which indicates that our proposed method can significantly reduce the complexity of spectrum extraction and improve extraction accuracy.
Existing deep learning-based approaches for monocular 3D object detection in autonomous driving often model the object as a rotated 3D cuboid while ignoring the object's geometric shape. In this work, we propose an approach for incorporating shape-aware 2D/3D constraints into the 3D detection framework. Specifically, we employ a deep neural network to learn distinctive 2D keypoints in the 2D image domain and first regress their corresponding 3D coordinates in the local 3D object coordinate system. Then the 2D/3D geometric constraints are built from these correspondences for each object to boost the detection performance. To generate the ground truth of 2D/3D keypoints, an automatic model-fitting approach is proposed that fits the deformed 3D object model to the object mask in the 2D image. The proposed framework has been verified on the public KITTI dataset, and the experimental results demonstrate that the additional geometric constraints significantly improve the detection performance compared to the baseline method. More importantly, the proposed framework achieves state-of-the-art performance in real time. Data and code will be available at https://github.com/zongdai/AutoShape.
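As an illustration of what a 2D/3D keypoint constraint can look like, the following is a generic pinhole reprojection residual, a sketch under stated assumptions rather than the formulation used in the paper. The function names, the camera intrinsics and the pose values are all hypothetical; the only idea taken from the abstract is that 2D keypoints and their 3D coordinates in the object frame are tied together geometrically.

```python
import numpy as np

def project_points(points_obj, rotation, translation, intrinsics):
    """Project 3D points in the object frame into the image with a pinhole camera."""
    points_obj = np.asarray(points_obj, dtype=float)
    # Object frame -> camera frame: X_cam = R @ X_obj + t (written row-wise)
    points_cam = points_obj @ np.asarray(rotation, dtype=float).T + np.asarray(translation, dtype=float)
    # Camera frame -> homogeneous pixel coordinates
    pixels_h = points_cam @ np.asarray(intrinsics, dtype=float).T
    # Perspective division by depth
    return pixels_h[:, :2] / pixels_h[:, 2:3]

def reprojection_residual(keypoints_2d, points_obj, rotation, translation, intrinsics):
    """Per-keypoint pixel distance between predicted 2D keypoints and projected 3D keypoints."""
    projected = project_points(points_obj, rotation, translation, intrinsics)
    return np.linalg.norm(projected - np.asarray(keypoints_2d, dtype=float), axis=1)

# Hypothetical camera and pose: focal length 100 px, principal point (50, 50),
# object 5 units in front of the camera with no rotation.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
keypoints_3d = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
keypoints_2d = np.array([[50.0, 50.0], [70.0, 53.0]])
residuals = reprojection_residual(keypoints_2d, keypoints_3d, R, t, K)
```

A detector could penalize residuals of this kind, or solve for the pose that minimizes them, so that the estimated 3D box agrees with the predicted keypoints.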
The image blurring process is generally modelled as the convolution of a blur kernel with a latent image. Therefore, the estimation of the blur kernel is essential for blind image deblurring. Unlike existing approaches, which focus on enforcing various priors on the blur kernel and the latent image, we aim to obtain a high-quality blur kernel directly by studying the problem in the frequency domain. We show that the auto-correlation of the absolute phase-only image can provide faithful information about the motion that caused the blur (e.g. the motion direction and magnitude, which we call the motion pattern in this paper), leading to a new and efficient blur kernel estimation approach. The blur kernel is then refined and the sharp image is estimated by solving an optimization problem that enforces regularization on the blur kernel and the latent image. We further extend our approach to handle non-uniform blur, which involves spatially varying blur kernels. Our approach is evaluated extensively on synthetic and real data and shows good results compared to the state-of-the-art deblurring approaches.
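The central quantity named above, the auto-correlation of the absolute phase-only image, can be sketched with FFTs as follows. This is one plausible reading, assuming that the phase-only image is the inverse transform of the spectrum normalized to unit magnitude and that "absolute" means its pixelwise absolute value. The function names and the `eps` stabilizer are hypothetical, and the kernel estimation and refinement steps of the paper are not reproduced.

```python
import numpy as np

def phase_only_image(image, eps=1e-8):
    """Keep only the Fourier phase: normalize every frequency to unit magnitude."""
    spectrum = np.fft.fft2(image)
    unit_spectrum = spectrum / (np.abs(spectrum) + eps)  # eps avoids division by zero
    return np.real(np.fft.ifft2(unit_spectrum))

def autocorrelation(image):
    """Circular auto-correlation via the Wiener-Khinchin relation, zero shift at the center."""
    spectrum = np.fft.fft2(image)
    acf = np.real(np.fft.ifft2(np.abs(spectrum) ** 2))
    return np.fft.fftshift(acf)

def motion_pattern(blurred):
    """Auto-correlation of the absolute phase-only image of a blurred input."""
    return autocorrelation(np.abs(phase_only_image(blurred)))

# Synthetic example: a random image blurred by a 5-pixel horizontal box kernel
# (circular convolution, so the blur model holds exactly).
rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
kernel = np.zeros((32, 32))
kernel[0, :5] = 1.0 / 5.0
blurred = np.real(np.fft.ifft2(np.fft.fft2(sharp) * np.fft.fft2(kernel)))
pattern = motion_pattern(blurred)
```

The structure of `pattern` away from its central peak is what would be inspected for cues about the direction and extent of the motion; how those cues are turned into a kernel is specific to the paper and not shown here.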
Medical ultrasound provides images which are the spatial map of the tissue echogenicity. Unfortunately, an ultrasound image is a low-quality version of the expected Tissue Reflectivity Function (TRF), mainly due to the non-ideal Point Spread Function (PSF) of the imaging system. This paper presents a novel beamforming approach based on deep learning to get closer to the ideal PSF in Plane-Wave Imaging (PWI). The proposed approach is designed to reconstruct the desired TRF from echo traces acquired by transducer elements using only a single plane-wave transmission. In this approach, first, an ideal model for the TRF is introduced by setting the imaging PSF as a sharp Gaussian function. Then, a mapping function between the pre-beamformed Radio-Frequency (RF) channel data and the proposed TRF is constructed using deep learning. The network architecture contains multi-resolution decomposition and reconstruction using the wavelet transform for effective recovery of the high-frequency content of the desired TRF. Inspired by curriculum learning, we exploit step-by-step training from coarse (mean square error) to fine ($\ell_{0.2}$) loss functions. The proposed method is trained on a large amount of simulated ultrasound data with the ground-truth echogenicity map extracted from real photographic images. The performance of the trained network is evaluated on publicly available simulation and \textit{in vivo} test data without any further fine-tuning. Simulation test results confirm that the proposed method reconstructs images with high quality in terms of resolution and contrast, which are also visually similar to the proposed ground-truth image. Furthermore, \textit{in vivo} results show that the trained mapping function preserves its performance in the new domain. Therefore, the proposed approach maintains high resolution, contrast, and framerate simultaneously.
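The coarse-to-fine loss schedule can be pictured with the short sketch below. It assumes the losses belong to the family mean(|error|^p), with p = 2 giving the mean square error and p = 0.2 standing for the $\ell_{0.2}$ loss; the intermediate exponents, the epoch boundaries and the function names are hypothetical, since the abstract only states the two endpoints.

```python
import numpy as np

def lp_loss(pred, target, p, eps=1e-8):
    """Mean of |pred - target|^p; eps keeps the p < 1 case well behaved near zero error."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.mean((diff + eps) ** p))

def curriculum_exponent(epoch, schedule=((0, 2.0), (10, 1.0), (20, 0.5), (30, 0.2))):
    """Return the exponent p to use at a given epoch (hypothetical schedule).

    Each (start_epoch, p) pair takes effect from start_epoch onward, so training
    starts with the smooth p = 2 loss and ends with the sharper p = 0.2 loss.
    """
    p = schedule[0][1]
    for start_epoch, value in schedule:
        if epoch >= start_epoch:
            p = value
    return p

# Example: the same prediction error scored at the start and end of the curriculum.
pred = np.array([1.0, 2.0])
target = np.array([1.0, 4.0])
early = lp_loss(pred, target, curriculum_exponent(0))   # p = 2.0
late = lp_loss(pred, target, curriculum_exponent(35))   # p = 0.2
```

Exponents below 1 weigh small residuals relatively more than large ones, which is one way to read "fine" in this context; the smooth quadratic loss gives a better-conditioned start.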
Timely handgun detection is crucial to improving public safety; nevertheless, the effectiveness of many surveillance systems still depends on finite human attention. Much of the previous research on handgun detection is based on static image detectors, leaving aside valuable temporal information that could be used to improve object detection in videos. To improve the performance of surveillance systems, a real-time temporal handgun detection system should be built. Using Temporal Yolov5, an architecture based on Quasi-Recurrent Neural Networks, temporal information is extracted from video to improve the results of handgun detection. Moreover, two publicly available datasets are proposed, labeled with hands, guns, and phones: one containing 2199 static images to train static detectors, and another with 5960 video frames to train temporal modules. Additionally, we explore two temporal data augmentation techniques based on Mosaic and Mixup. The resulting systems are three temporal architectures: one focused on reducing inference time, with a mAP$_{50:95}$ of 56.1; another offering a good balance between inference time and accuracy, with a mAP$_{50:95}$ of 59.4; and a last one specialized in accuracy, with a mAP$_{50:95}$ of 60.2. Temporal Yolov5 achieves real-time detection in the small and medium architectures. Moreover, it takes advantage of temporal features contained in videos to perform better than Yolov5 on our temporal dataset, making TYolov5 suitable for real-world applications. The source code is publicly available at https://github.com/MarioDuran/TYolov5.
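For readers unfamiliar with the 50:95 subscript, the sketch below shows the intersection-over-union (IoU) sweep behind it: average precision is computed at IoU thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged. The code only illustrates the IoU test for a single predicted box against a single ground-truth box; it is not a full mAP implementation, and the function names and box values are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten thresholds 0.50, 0.55, ..., 0.95 used by the 50:95 metric.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred_box, gt_box):
    """Number of IoU thresholds at which the prediction would count as a match."""
    iou = box_iou(pred_box, gt_box)
    return sum(iou >= threshold for threshold in IOU_THRESHOLDS)

# A loosely localized box passes only the most lenient threshold,
# which is why mAP 50:95 rewards tight localization.
ground_truth = (0.0, 0.0, 10.0, 10.0)
loose_prediction = (0.0, 0.0, 10.0, 5.0)
n_loose = thresholds_passed(loose_prediction, ground_truth)
n_exact = thresholds_passed(ground_truth, ground_truth)
```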
Unpaired Image-to-Image Translation (UIT) focuses on translating images among different domains by using unpaired data, and has received increasing research attention due to its practical usage. However, existing UIT schemes suffer from the need for supervised training, as well as the lack of encoded domain information. In this paper, we propose an Attribute Guided UIT model termed AGUIT to tackle these two challenges. AGUIT considers multi-modal and multi-domain tasks of UIT jointly with a novel semi-supervised setting, which also benefits representation disentanglement and fine control of outputs. Specifically, the benefits of AGUIT are two-fold: (1) It adopts a novel semi-supervised learning process by translating attributes of labeled data to unlabeled data, and then reconstructing the unlabeled data by a cycle consistency operation. (2) It decomposes image representation into a domain-invariant content code and a domain-specific style code. The redesigned style code embeds image style into two variables drawn from a standard Gaussian distribution and the distribution of the domain label, which facilitates the fine control of translation due to the continuity of both variables. Finally, we introduce a new challenge for UIT models, i.e., disentangled transfer, which adopts the disentangled representation to translate data less related to the training set. Extensive experiments demonstrate the superiority of AGUIT over existing state-of-the-art models.