"Image": models, code, and papers

Gaze Embeddings for Zero-Shot Image Classification

Apr 12, 2017
Nour Karessli, Zeynep Akata, Bernt Schiele, Andreas Bulling

Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.
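
The idea of a compatibility function between image features and class-level gaze embeddings can be illustrated with a minimal sketch. It assumes a bilinear form F(x, y) = x^T W y, which the abstract does not specify; the function name, the matrix W, and the toy vectors are all hypothetical.

```python
import numpy as np

def predict_zero_shot(image_feat, W, class_embeddings):
    """Score every class against one image and return the best class index.

    Assumption: compatibility is bilinear, F(x, y) = x^T W y. The abstract only
    states that a compatibility function between image and gaze is learned.
    """
    scores = image_feat @ W @ class_embeddings.T  # one score per class
    return int(np.argmax(scores)), scores

# Toy data: a 3-d image feature and a 2-d gaze embedding for each of 3 unseen classes.
x = np.array([1.0, 0.0, 2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
gaze_embeddings = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])

best_class, scores = predict_zero_shot(x, W, gaze_embeddings)
print(best_class, scores)
```

At test time only the class embeddings of unseen classes are needed, which is what makes this kind of model usable in the zero-shot setting.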

Spectral Machine Learning for Pancreatic Mass Imaging Classification

May 03, 2021
Yiming Liu, Ying Chen, Guangming Pan, Weichung Wang, Wei-Chih Liao, Yee Liang Thian, Cheng E. Chee, Constantinos P. Anastassiades

We present a novel spectral machine learning (SML) method for pancreatic mass screening using CT imaging. Our algorithm is trained with approximately 30,000 images from 250 patients (50 patients with normal pancreas and 200 patients with abnormal pancreas findings) based on public data sources. A test accuracy of 94.6 percent was achieved in the out-of-sample diagnosis classification based on a total of approximately 15,000 images from 113 patients, whereby 26 out of 32 patients with normal pancreas and all 81 patients with abnormal pancreas findings were correctly diagnosed. SML is able to automatically choose fundamental images (on average 5 or 9 images for each patient) in the diagnosis classification and achieve the above-mentioned accuracy. The computational time is 75 seconds for diagnosing 113 patients on a laptop with a standard CPU running environment. Factors that contributed to the high performance of this integration of spectral learning and machine learning included: 1) use of eigenvectors corresponding to several of the largest eigenvalues of the sample covariance matrix (spike eigenvectors) to choose input attributes in classification training, taking into account only the fundamental information of the raw images with less noise; 2) removal of irrelevant pixels based on a mean-level spectral test to lower the challenges of memory capacity and enhance computational efficiency while maintaining superior classification accuracy; 3) adoption of state-of-the-art machine learning classifiers, gradient boosting and random forest. Our methodology showcases practical utility and improved accuracy of image diagnosis in pancreatic mass screening in the era of AI.

* 17 pages, 3 figures
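
The first factor listed in the abstract, using eigenvectors of the largest eigenvalues of the sample covariance matrix as classifier inputs, might look roughly like the sketch below. It is not the authors' pipeline: the function name, the choice of k, and the random toy data are assumptions made for illustration.

```python
import numpy as np

def spike_projection(X, k):
    """Project samples onto the k leading eigenvectors of the sample covariance.

    X has one sample per row and one attribute (e.g. pixel) per column.
    Returns the projected features and the k largest eigenvalues.
    """
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    return Xc @ eigvecs[:, order], eigvals[order]

# Toy data (hypothetical): 50 samples, 10 attributes, one dominant direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:, 0] *= 10.0

features, spike_values = spike_projection(X, k=2)
print(features.shape, spike_values)
```

The projected features would then be passed to a classifier such as gradient boosting or a random forest, as the abstract describes.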

Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency

Mar 23, 2021
Qing Liu, Vignesh Ramanathan, Dhruv Mahajan, Alan Yuille, Zhenheng Yang

Weakly supervised instance segmentation reduces the cost of annotations required to train models. However, existing approaches that rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and temporal consistency of predictions across frames provide complementary signals which can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt the inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training. We demonstrate that both approaches together improve the instance segmentation metric AP50 on video frames of two datasets, Youtube-VIS and Cityscapes, by 5% and 3% respectively.

* 14 pages, 8 figures, accepted by CVPR 2021
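
The AP50 metric counts a predicted mask as a true positive when its intersection-over-union (IoU) with a ground-truth mask is at least 0.5. The sketch below shows that matching criterion only, not the full average-precision computation, and uses hypothetical toy masks to show how a partial segmentation fails the threshold.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: treat as no overlap
    return float(np.logical_and(pred, gt).sum() / union)

def is_true_positive_at_50(pred, gt):
    """Matching rule behind AP50: IoU of at least 0.5."""
    return mask_iou(pred, gt) >= 0.5

# Toy masks: the object covers the top three rows of a 4x4 frame (12 pixels).
gt = np.zeros((4, 4), dtype=bool)
gt[0:3, :] = True
mostly_right = np.zeros((4, 4), dtype=bool)
mostly_right[0:2, :] = True   # 8 of 12 pixels, IoU = 8/12
partial = np.zeros((4, 4), dtype=bool)
partial[0:1, :] = True        # 4 of 12 pixels, IoU = 4/12

print(mask_iou(mostly_right, gt), mask_iou(partial, gt))
```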

OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding

Mar 13, 2021
Ke-Jyun Wang, Yun-Hsuan Liu, Hung-Ting Su, Jen-Wei Wang, Yu-Siang Wang, Winston H. Hsu, Wen-Chin Chen

To effectively apply robots in working environments and assist humans, it is essential to develop and evaluate how visual grounding (VG) can affect machine performance on occluded objects. However, current VG works are limited in working environments, such as offices and warehouses, where objects are usually occluded due to space utilization issues. In our work, we propose a novel OCID-Ref dataset featuring a referring expression segmentation task with referring expressions of occluded objects. OCID-Ref consists of 305,694 referring expressions from 2,300 scenes and provides RGB image and point cloud inputs. We argue that it is crucial to take advantage of both 2D and 3D signals to resolve challenging occlusion issues. Our experimental results demonstrate the effectiveness of aggregating 2D and 3D signals, but referring to occluded objects remains challenging for modern visual grounding systems. OCID-Ref is publicly available at https://github.com/lluma/OCID-Ref

* NAACL 2021

Im2Avatar: Colorful 3D Reconstruction from a Single Image

Apr 17, 2018
Yongbin Sun, Ziwei Liu, Yue Wang, Sanjay E. Sarma

Existing works on single-image 3D reconstruction mainly focus on shape recovery. In this work, we study a new problem, that is, simultaneously recovering 3D shape and surface color from a single image, namely "colorful 3D reconstruction". This problem is both challenging and intriguing because the ability to infer a textured 3D model from a single image is at the core of visual understanding. Here, we propose an end-to-end trainable framework, Colorful Voxel Network (CVN), to tackle this problem. Conditioned on a single 2D input, CVN learns to decompose shape and surface color information of a 3D object into a 3D shape branch and a surface color branch, respectively. Specifically, for the shape recovery, we generate a shape volume with the state of its voxels indicating occupancy. For the surface color recovery, we combine the strength of appearance hallucination and geometric projection by concurrently learning a regressed color volume and a 2D-to-3D flow volume, which are then fused into a blended color volume. The final textured 3D model is obtained by sampling color from the blended color volume at the positions of occupied voxels in the shape volume. To handle the severely sparse volume representations, a novel loss function, Mean Squared False Cross-Entropy Loss (MSFCEL), is designed. Extensive experiments demonstrate that our approach achieves significant improvement over baselines and generalizes well across diverse object categories and arbitrary viewpoints.

* 10 pages
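
The final step described in the abstract, sampling color from the blended color volume at occupied voxels, can be sketched as follows. The per-voxel blend weight, the 0.5 occupancy threshold, and all names are assumptions for illustration; the abstract does not say how the two color volumes are fused.

```python
import numpy as np

def textured_voxels(occupancy, regressed_color, flow_color, blend_weight, threshold=0.5):
    """Return coordinates and RGB colors of occupied voxels.

    Assumption: the blended volume is a per-voxel convex combination of the
    regressed color volume and the flow-sampled color volume.
    """
    w = blend_weight[..., None]                       # broadcast over RGB channels
    blended = w * regressed_color + (1.0 - w) * flow_color
    occupied = occupancy > threshold                  # boolean shape volume
    return np.argwhere(occupied), blended[occupied]

# Toy 2x2x2 grid with two occupied voxels (hypothetical values).
shape = (2, 2, 2)
occupancy = np.zeros(shape)
occupancy[0, 0, 0] = 0.9
occupancy[1, 1, 1] = 0.8
regressed = np.zeros(shape + (3,))
regressed[..., 0] = 1.0                               # pure red everywhere
flow = np.zeros(shape + (3,))
flow[..., 2] = 1.0                                    # pure blue everywhere
weight = np.full(shape, 0.25)

coords, colors = textured_voxels(occupancy, regressed, flow, weight)
print(coords, colors)
```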

Contrastive Learning Improves Model Robustness Under Label Noise

Apr 19, 2021
Aritra Ghosh, Andrew Lan

Deep neural network-based classifiers trained with the categorical cross-entropy (CCE) loss are sensitive to label noise in the training data. One common type of method that can mitigate the impact of label noise can be viewed as supervised robust methods; one can simply replace the CCE loss with a loss that is robust to label noise, or re-weight training samples and down-weight those with higher loss values. Recently, another type of method using semi-supervised learning (SSL) has been proposed, which augments these supervised robust methods to exploit (possibly) noisy samples more effectively. Although supervised robust methods perform well across different data types, they have been shown to be inferior to the SSL methods on image classification tasks under label noise. Therefore, it remains to be seen whether these supervised robust methods can also perform well if they can utilize the unlabeled samples more effectively. In this paper, we show that initializing supervised robust methods using representations learned through contrastive learning leads to significantly improved performance under label noise. Surprisingly, even the simplest method (training a classifier with the CCE loss) can outperform the state-of-the-art SSL method by more than 50% under high label noise when initialized with contrastive learning. Our implementation will be publicly available at https://github.com/arghosh/noisy_label_pretrain.

* Learning from Limited or Imperfect Data (L^2ID) Workshop @ CVPR 2021
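
To see why replacing the CCE loss can help under label noise, compare it with a bounded loss on a confidently classified sample whose label is wrong. The sketch below uses mean absolute error (MAE) as one well-known example of a noise-robust loss; it is chosen here for illustration and is not necessarily among the losses evaluated in the paper.

```python
import math

def cce_loss(probs, label):
    """Categorical cross-entropy for one sample: unbounded as probs[label] -> 0."""
    return -math.log(probs[label])

def mae_loss(probs, label):
    """Mean absolute error against the one-hot target, summed over classes.

    For a probability vector this equals 2 * (1 - probs[label]), so it never
    exceeds 2 regardless of how wrong the label is.
    """
    return sum(abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs))

# A confident prediction for class 1, but the (noisy) label says class 0.
probs = [0.01, 0.99]
noisy_label = 0

cce = cce_loss(probs, noisy_label)
mae = mae_loss(probs, noisy_label)
print(cce, mae)
```

The mislabeled sample contributes a loss of about 4.6 under CCE but less than 2 under MAE, so it pulls less strongly on the gradient.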

Image Quality Assessment Guided Deep Neural Networks Training

Aug 13, 2017
Zhuo Chen, Weisi Lin, Shiqi Wang, Long Xu, Leida Li

For many computer vision problems, deep neural networks are trained and validated based on the assumption that the input images are pristine (i.e., artifact-free). However, digital images are subject to a wide range of distortions in real application scenarios, while the practical issues regarding image quality in high-level visual information understanding have been largely ignored. In this paper, in view of the fact that most widely deployed deep learning models are susceptible to various image distortions, distorted images are used for data augmentation in the deep neural network training process to learn a reliable model for practical applications. In particular, an image quality assessment based label smoothing method, which aims at regularizing the label distribution of training images, is further proposed to tune the objective functions in learning the neural network. Experimental results show that the proposed method is effective in dealing with both low and high quality images in the typical image classification task.
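
One way to read "image quality assessment based label smoothing" is that lower-quality images receive softer targets. The sketch below assumes a quality score in [0, 1] and a linear mapping from quality to smoothing strength; both are hypothetical choices, since the abstract does not give the formula.

```python
def smoothed_label(true_class, num_classes, quality, max_smoothing=0.3):
    """Build a soft target whose smoothing grows as image quality drops.

    Assumptions: quality is in [0, 1] (1 = pristine), and the smoothing
    strength is max_smoothing * (1 - quality). Standard label smoothing is
    then applied: (1 - eps) * one_hot + eps / num_classes.
    """
    eps = max_smoothing * (1.0 - quality)
    target = [eps / num_classes] * num_classes
    target[true_class] += 1.0 - eps
    return target

pristine = smoothed_label(true_class=1, num_classes=3, quality=1.0)
distorted = smoothed_label(true_class=1, num_classes=3, quality=0.0)
print(pristine, distorted)
```

A pristine image keeps its one-hot label, while a heavily distorted one is trained against a target that spreads some probability over the other classes.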

RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving

Dec 30, 2020
Peixuan Li, Shun Su, Huaici Zhao

Although recent image-based 3D object detection methods using the Pseudo-LiDAR representation have shown great capabilities, a notable gap in efficiency and accuracy still exists compared with LiDAR-based methods. Besides, over-reliance on the stand-alone depth estimator, which requires a large number of pixel-wise annotations in the training stage and more computation in the inference stage, limits scalability in real-world applications. In this paper, we propose an efficient and accurate 3D object detection method from stereo images, named RTS3D. Different from the 3D occupancy space in Pseudo-LiDAR-like methods, we design a novel 4D feature-consistent embedding (FCE) space as the intermediate representation of the 3D scene without depth supervision. The FCE space encodes the object's structural and semantic information by exploring the multi-scale feature consistency warped from the stereo pair. Furthermore, a semantic-guided RBF (Radial Basis Function) and a structure-aware attention module are devised to reduce the influence of FCE space noise without instance mask supervision. Experiments on the KITTI benchmark show that RTS3D is the first true real-time system (FPS > 24) for stereo image 3D detection, while achieving a 10% improvement in average precision compared with the previous state-of-the-art method. The code will be available at https://github.com/Banconxuan/RTS3D

* 9 pages, 6 figures

Continuous control of an underground loader using deep reinforcement learning

Mar 01, 2021
Sofi Backman, Daniel Lindmark, Kenneth Bodin, Martin Servin, Joakim Mörk, Håkan Löfgren

Reinforcement learning control of an underground loader is investigated in a simulated environment, using a multi-agent deep neural network approach. At the start of each loading cycle, one agent selects the dig position from a depth camera image of the pile of fragmented rock. A second agent is responsible for continuous control of the vehicle, with the goal of filling the bucket at the selected loading point, while avoiding collisions, getting stuck, or losing ground traction. It relies on motion and force sensors, as well as on camera and lidar. Using a soft actor-critic algorithm, the agents learn policies for efficient bucket filling over many subsequent loading cycles, with a clear ability to adapt to the changing environment. The best results, on average 75% of the max capacity, are obtained when including a penalty for energy usage in the reward.

* 9 pages, 7 figures
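
A reward that trades bucket filling against energy usage could take a shape like the sketch below. The linear form, the weight, and the units are assumptions for illustration only; the abstract states that an energy penalty is included but not how it is defined.

```python
def loading_reward(fill_fraction, energy_used, energy_weight=0.1):
    """Reward for one loading cycle: fill level minus a weighted energy penalty.

    fill_fraction: bucket fill as a fraction of max capacity (0.0 to 1.0).
    energy_used: energy spent during the cycle, in arbitrary (hypothetical) units.
    """
    return fill_fraction - energy_weight * energy_used

# Two cycles with the same fill level but different energy consumption.
efficient = loading_reward(fill_fraction=0.75, energy_used=2.0)
wasteful = loading_reward(fill_fraction=0.75, energy_used=5.0)
print(efficient, wasteful)
```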

Deep Learning Segmentation of Complex Features in Atomic-Resolution Phase Contrast Transmission Electron Microscopy Images

Dec 09, 2020
Robbie Sadre, Colin Ophus, Anstasiia Butko, Gunther H Weber

Phase contrast transmission electron microscopy (TEM) is a powerful tool for imaging the local atomic structure of materials. TEM has been used heavily in studies of defect structures of 2D materials such as monolayer graphene due to its high dose efficiency. However, phase contrast imaging can produce complex nonlinear contrast, even for weakly-scattering samples. It is therefore difficult to develop fully-automated analysis routines for phase contrast TEM studies using conventional image processing tools. For automated analysis of large sample regions of graphene, one of the key problems is segmentation between the structure of interest and unwanted structures such as surface contaminant layers. In this study, we compare the performance of a conventional Bragg filtering method to a deep learning routine based on the U-Net architecture. We show that the deep learning method is more general, simpler to apply in practice, and produces more accurate and robust results than the conventional algorithm. We provide easily-adaptable source code for all results in this paper, and discuss potential applications for deep learning in fully-automated TEM image analysis.

* 12 pages, 6 figures
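
For context on the conventional baseline, Bragg filtering isolates the periodic lattice signal by keeping only the Fourier components near the Bragg peaks and transforming back. The sketch below is a generic version of that idea, not the paper's code; the peak positions, the radius, and the synthetic image are hypothetical.

```python
import numpy as np

def bragg_filter(image, peaks, radius):
    """Keep only Fourier components within `radius` pixels of the given peaks.

    `peaks` are (row, col) positions in the fftshift-ed spectrum, where the
    zero frequency sits at the center of the array.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    rr, cc = np.indices(image.shape)
    mask = np.zeros(image.shape, dtype=bool)
    for r, c in peaks:
        mask |= (rr - r) ** 2 + (cc - c) ** 2 <= radius ** 2
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

# Synthetic image: a 1D lattice (4 periods across 32 pixels) plus a constant background.
n = 32
lattice = np.tile(np.cos(2 * np.pi * 4 * np.arange(n) / n), (n, 1))
image = lattice + 5.0

center = n // 2
filtered = bragg_filter(image, peaks=[(center, center + 4), (center, center - 4)], radius=1)
print(np.abs(filtered - lattice).max())
```

In this toy case the filter recovers the lattice and discards the background. In real micrographs the peaks have to be located first and the mask radius tuned, which is part of what makes the conventional approach harder to automate.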
