High-resolution images are widely adopted for high-performance object detection in videos. However, processing high-resolution inputs comes with high computation costs, and naive down-sampling of the input to reduce the computation costs quickly degrades the detection performance. In this paper, we propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection that allows for heavy down-sampling of unimportant background regions while preserving the fine-grained details of a high-resolution image. The resulting image is spatially smaller, leading to reduced computational costs, while enabling a performance comparable to that of a high-resolution input. To achieve this, we propose a differentiable resampling module based on a thin plate spline spatial transformer network (TPS-STN). This module is regularized by a novel loss that provides an explicit supervision signal for learning to "magnify" salient regions. We report state-of-the-art results in the low-compute regime on the ImageNet-VID and UA-DETRAC video object detection datasets. We demonstrate that on both datasets, the mAP of an EfficientDet-D1 (EfficientDet-D2) detector is on par with that of EfficientDet-D2 (EfficientDet-D3) at a much lower computational cost. We also show that SALISA significantly improves the detection of small objects. In particular, SALISA with an EfficientDet-D1 detector improves the detection of small objects by $77\%$ and, remarkably, also outperforms the EfficientDet-D3 baseline.
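The non-uniform, saliency-driven resampling described above can be illustrated with a much simpler stand-in than the paper's TPS-STN: a minimal sketch that allocates output pixels per axis via inverse-CDF sampling of the saliency marginals, so salient regions are sampled more densely. The function names and the inverse-CDF scheme are our own illustration, not the paper's differentiable module.

```python
import numpy as np

def saliency_sample_coords(saliency_1d, n_out):
    # Inverse-CDF sampling: output pixels are allocated in proportion to the
    # (padded) saliency marginal, so salient regions get finer sampling.
    w = np.asarray(saliency_1d, dtype=float) + 1e-3  # keep some background density
    cdf = np.cumsum(w) / np.sum(w)
    levels = (np.arange(n_out) + 0.5) / n_out
    return np.minimum(np.searchsorted(cdf, levels), len(cdf) - 1)

def saliency_downsample(image, saliency, out_h, out_w):
    # Sample rows and columns independently from the saliency marginals.
    rows = saliency_sample_coords(saliency.sum(axis=1), out_h)
    cols = saliency_sample_coords(saliency.sum(axis=0), out_w)
    return image[np.ix_(rows, cols)]
```

With saliency concentrated on a small region, most of the output rows and columns land inside that region, i.e. the region is "magnified" in the smaller output image.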
There has been increasing interest in utilizing machine learning methods in inverse problems and imaging. Most of this work has, however, concentrated on image reconstruction problems, and the number of studies addressing the full solution of the inverse problem is limited. In this work, we study a machine learning-based approach to the Bayesian inverse problem of photoacoustic tomography. We develop a method for estimating the posterior distribution in photoacoustic tomography based on a variational autoencoder. The approach is evaluated with numerical simulations and compared to the solution of the inverse problem obtained with a Bayesian approach.
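One concrete building block of such a VAE-based posterior estimate is drawing samples from the encoder's approximate posterior, which is typically a diagonal Gaussian sampled via the reparameterization trick. The sketch below is a generic illustration under that diagonal-Gaussian assumption; the variable names are ours, not details from the paper.

```python
import numpy as np

def sample_posterior(mu, log_var, n_samples, rng):
    # Draw samples from the approximate posterior q(x | y) = N(mu, diag(exp(log_var)))
    # via the reparameterization trick: x = mu + std * eps, eps ~ N(0, I).
    std = np.exp(0.5 * np.asarray(log_var))
    eps = rng.standard_normal((n_samples, np.asarray(mu).size))
    return np.asarray(mu) + eps * std
```

Posterior statistics (mean estimates, credible intervals) are then computed from the drawn samples.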
Visualizing information inside objects is a long-standing need for bridging the worlds of physics, chemistry, and biology with computation. Among tomographic techniques, terahertz (THz) computational imaging has demonstrated unique sensing features for digitalizing multi-dimensional object information in a non-destructive, non-ionizing, and non-invasive way. By applying modern signal processing and physics-guided modalities, THz computational imaging systems have been launched in various application fields, including industrial inspection, security screening, chemical inspection, and non-destructive evaluation. In this article, we review recent advances in THz computational imaging modalities with respect to system configurations, wave propagation and interaction models, and physics-guided algorithms for digitalizing the interior information of imaged objects. Several image restoration and reconstruction issues based on multi-dimensional THz signals are further discussed, providing a crosslink between material digitalization, functional property extraction, and multi-dimensional imager utilization from a signal processing perspective.
Polyps in the colon can turn into cancerous cells if not removed with early intervention. Deep learning models are used to minimize the number of polyps that go unnoticed by experts and to accurately segment detected polyps during these interventions. Although such models perform well on these tasks, they require too many parameters, which can pose a problem for real-time applications. To address this problem, we propose a novel segmentation model called PlutoNet, which requires only 2,626,337 parameters while outperforming state-of-the-art models on multiple medical image segmentation tasks. We use the EfficientNetB0 architecture as a backbone and propose a novel \emph{modified partial decoder}, a combination of the partial decoder and full-scale connections that further reduces the number of parameters required while capturing semantic details. We use asymmetric convolutions to handle varying polyp sizes. Finally, we weight each feature map with a squeeze-and-excitation block to improve segmentation. In addition to polyp segmentation in colonoscopy, we tested our model on the segmentation of nuclei and surgical instruments to demonstrate its generalizability to different medical image segmentation tasks. Our model outperformed state-of-the-art models with a Dice score of 92.3\% on the CVC-ClinicDB dataset, 89.3\% on the EndoScene dataset, 91.93\% on the 2018 Data Science Bowl Challenge dataset, and 94.8\% on the Kvasir-Instrument dataset. Our experiments and ablation studies show that our model is superior in terms of accuracy and is able to generalize well to multiple medical image segmentation tasks.
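The squeeze-and-excitation re-weighting mentioned above follows the standard SE pattern: global average pooling per channel, a small bottleneck MLP, and sigmoid gates that rescale each feature map. The NumPy sketch below uses hypothetical weight shapes and is a generic SE block, not PlutoNet's actual implementation.

```python
import numpy as np

def squeeze_excitation(feats, w1, w2):
    # feats: (C, H, W) feature maps; w1: (C//r, C) and w2: (C, C//r)
    # are the bottleneck weights (reduction ratio r).
    z = feats.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)                 # excitation: FC + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # FC + sigmoid channel gates
    return feats * gate[:, None, None]          # re-weight each feature map
```

In a trained network the gates learn to emphasize informative channels and suppress less useful ones.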
The task of Outside Knowledge Visual Question Answering (OKVQA) requires an automatic system to answer natural language questions about images using external knowledge. We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions that existing text-based question answering systems can answer. This allows for the reuse of existing text-based Open Domain Question Answering (QA) systems for visual question answering. In this work, we propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions. Given an image and a question pertaining to that image (a visual question), we first extract the entities present in the image using pre-trained object and scene classifiers. Using these detected entities, the visual question can be rewritten so as to be answerable by open-domain QA systems. We explore two rewriting strategies: (1) an unsupervised method using BERT for masking and rewriting, and (2) a weakly supervised approach that combines adaptive rewriting and reinforcement learning techniques to exploit the implicit feedback from the QA system. We test our strategies on the publicly available OKVQA dataset and obtain competitive performance with state-of-the-art models while using only 10% of the training data.
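To make the rewriting idea concrete, here is a toy rule-based stand-in: substitute a detected entity label for a deictic phrase so the question becomes answerable by a text-only QA system. The actual paper uses BERT-based masking and RL-based rewriting; the phrase list and substitution rule below are purely illustrative.

```python
def rewrite_visual_question(question, detected_entities):
    # Replace the first matching deictic phrase with the top detected entity
    # label, turning a grounded visual question into a "non-grounded" one.
    deictic = ["this animal", "this object", "this", "the picture", "the image"]
    label = detected_entities[0]
    for phrase in deictic:
        if phrase in question:
            return question.replace(phrase, label)
    return question  # no deictic phrase found; leave the question unchanged
```

For example, "What is the lifespan of this animal?" with a detected "an elephant" becomes "What is the lifespan of an elephant?", which an open-domain QA system can answer directly.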
This paper presents the application of autonomous crater detection on Ceres using the U-Net, a fully convolutional neural network. The U-Net is trained on optical images from the Moon Global Morphology Mosaic, based on data collected by the LRO, together with manual crater catalogues. The Moon-trained network is then tested on Dawn optical images of Ceres through a Transfer Learning (TL) approach. The trained model is fine-tuned using 100, 500, and 1000 additional images of Ceres. Test performance is measured on 350 never-before-seen images, reaching testing accuracies of 96.24%, 96.95%, and 97.19%, respectively. This shows that, despite the intrinsic differences between the Moon and Ceres, TL works with encouraging results. The output of the U-Net contains predicted craters; it is post-processed by applying global thresholding for image binarization and a template matching algorithm to extract crater positions and radii in pixel space. Post-processed craters are counted and compared to the ground truth data in order to compute image segmentation metrics: precision, recall, and F1 score. These indices are computed, and their implications are discussed for tasks such as automated crater cataloguing and optical navigation.
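The precision/recall/F1 computation over extracted craters can be sketched as follows. The greedy one-to-one matching and the position/radius tolerances are our assumptions for illustration; the abstract does not specify the matching rule used in the paper.

```python
import numpy as np

def crater_metrics(pred, truth, pos_tol=2.0, rad_tol=0.25):
    # pred/truth: sequences of (x, y, r) craters in pixel space.
    # A prediction is a true positive if it lies within pos_tol pixels of an
    # unmatched ground-truth crater and its radius agrees within rad_tol.
    remaining = [tuple(t) for t in truth]
    tp = 0
    for (x, y, r) in pred:
        for t in remaining:
            tx, ty, tr = t
            if np.hypot(x - tx, y - ty) <= pos_tol and abs(r - tr) <= rad_tol * tr:
                tp += 1
                remaining.remove(t)  # greedy one-to-one matching
                break
    fp = len(pred) - tp
    fn = len(remaining)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
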
Rate-distortion optimization (RDO) of codecs, where distortion is quantified by the mean-square error, has been standard practice in image/video compression over the years. RDO serves codec optimization well when results are evaluated in terms of PSNR. However, it is well known that PSNR does not correlate well with perceptual evaluation of images; hence, RDO is not well suited for perceptual optimization of codecs. Recently, the rate-distortion-perception trade-off has been formalized by taking the Kullback-Leibler (KL) divergence between the distributions of the original and reconstructed images as a perception measure. Learned image compression methods that simultaneously optimize rate, a mean-square loss, a VGG loss, and an adversarial loss have been proposed. Yet, there exists no easy way to fix the rate, distortion, or perception at a desired level in a practical learned image compression solution in order to analyze the trade-off between the rate, distortion, and perception measures. In this paper, we propose a practical approach that fixes the rate in order to carry out perception-distortion analysis at a fixed rate and perform perceptual evaluation of image compression results in a principled manner. Experimental results provide several insights for practical rate-distortion-perception analysis in learned image compression.
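The perception measure above is a KL divergence between image distributions. As a loose, purely illustrative proxy (our simplification, not how the rate-distortion-perception literature estimates the divergence), one can compare smoothed pixel-intensity histograms of the original and reconstructed images:

```python
import numpy as np

def kl_perception(orig, recon, bins=32, eps=1e-8):
    # Histogram-based proxy for D_KL(p_orig || p_recon) over pixel
    # intensities in [0, 1]; eps-smoothing avoids division by zero.
    p, _ = np.histogram(orig, bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(recon, bins=bins, range=(0.0, 1.0))
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```

The proxy is zero when the two intensity distributions coincide and grows as the reconstruction's statistics drift from the original's.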
Gesture recognition is essential for the interaction of autonomous vehicles with humans. While current approaches focus on combining several modalities, such as image features, keypoints, and bone vectors, we present a neural network architecture that delivers state-of-the-art results using only body-skeleton input data. We propose a spatio-temporal multilayer perceptron for gesture recognition in the context of autonomous vehicles. Given 3D body poses over time, we define temporal and spatial mixing operations to extract features in both domains. Additionally, the importance of each time step is re-weighted with Squeeze-and-Excitation layers. An extensive evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach. Furthermore, we deploy our model on our autonomous vehicle to demonstrate its real-time capability and stable execution.
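The temporal/spatial mixing and SE-style time-step re-weighting can be sketched with plain matrix multiplications: one MLP mixes along the time axis, another along the joint-feature axis, and a sigmoid gate rescales each time step. The shapes and the single-block structure here are our illustration, not the paper's exact architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def st_mlp_block(poses, w_t, w_s):
    # poses: (T, J) flattened joint features over T time steps.
    mixed_t = relu(w_t @ poses)       # temporal mixing across time, w_t: (T, T)
    mixed_s = relu(mixed_t @ w_s)     # spatial mixing across joints, w_s: (J, J)
    z = mixed_s.mean(axis=1)          # squeeze each time step to a scalar -> (T,)
    gate = 1.0 / (1.0 + np.exp(-z))  # per-time-step importance gates
    return mixed_s * gate[:, None]    # re-weight each time step
```
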
This paper examines a combined supervised-unsupervised framework involving dictionary-based blind learning and deep supervised learning for MR image reconstruction from under-sampled k-space data. A major focus of the work is to investigate the possible synergy between learned features in traditional shallow reconstruction using adaptive sparsity-based priors and deep prior-based reconstruction. Specifically, we propose a framework that uses an unrolled network to refine a blind dictionary learning-based reconstruction. We compare the proposed method with strictly supervised deep learning-based reconstruction approaches on several datasets of varying sizes and anatomies. We also compare the proposed method to alternative approaches for combining dictionary-based methods with supervised learning in MR image reconstruction. The improvements yielded by the proposed framework suggest that the blind dictionary-based approach preserves fine image details that the supervised approach can iteratively refine, indicating that the features learned by the two methods are complementary.
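The "unrolled network refining an initial reconstruction" pattern can be sketched generically as alternating data-consistency gradient steps with learned per-iteration refinement modules. In this sketch the refinement modules are stand-in callables and `A` is a toy forward operator; all names and the fixed step size are our assumptions, not the paper's network.

```python
import numpy as np

def unrolled_refine(x_init, y, A, steps, step_size, refiners):
    # x_init: initial (e.g. dictionary learning-based) reconstruction;
    # y: measurements; A: forward operator; refiners: one learned
    # refinement module per unrolled iteration (stand-ins here).
    x = x_init.copy()
    for k in range(steps):
        x = x - step_size * A.T @ (A @ x - y)  # data-consistency gradient step
        x = refiners[k](x)                      # learned per-iteration refinement
    return x
```

With an identity forward operator and identity refiners, the iterates simply converge to the measurements, which makes the data-consistency step easy to verify.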
Multi-shot interleaved echo planar imaging can obtain diffusion-weighted images (DWI) with high spatial resolution and low distortion, but it suffers from ghost artifacts introduced by phase variations between shots. In this work, we aim to solve the challenging reconstruction problem under severe inter-shot motion and a low signal-to-noise ratio. An explicit phase model with paired phase and magnitude priors is proposed to regularize the reconstruction (PAIR). The former prior is derived from the smoothness of the shot phase and is enforced through low-rankness in the k-space domain. The latter exploits the similar edges among multi-b-value and multi-direction DWI through weighted total variation in the image domain. Extensive simulation and in vivo results show that PAIR removes ghost artifacts very well even with a high number of shots (8) and significantly suppresses noise at an ultra-high b-value (4000 s/mm$^2$). The explicit phase model PAIR with complementary priors performs well on challenging reconstructions under severe inter-shot motion and low signal-to-noise ratio. PAIR has great potential for advanced clinical DWI applications and brain function research.
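The magnitude prior above is a weighted total variation. A minimal anisotropic sketch is shown below; the per-pixel weight map stands in for edge information shared across b-values and directions (down-weighting the penalty at common edges so they are preserved), but this weighting form is our illustration, since the abstract does not give PAIR's exact formulation.

```python
import numpy as np

def weighted_tv(image, weights):
    # Anisotropic weighted TV: sum of weighted horizontal and vertical
    # finite differences; small weights at known edges preserve them.
    dx = np.abs(np.diff(image, axis=1))
    dy = np.abs(np.diff(image, axis=0))
    return float(np.sum(weights[:, 1:] * dx) + np.sum(weights[1:, :] * dy))
```

Setting the weights to zero along a true edge removes that edge's contribution to the penalty entirely, which is the mechanism that keeps shared edges sharp.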