"Image": models, code, and papers

Robust Trust Region for Weakly Supervised Segmentation

Apr 05, 2021
Dmitrii Marin, Yuri Boykov

Acquiring training data for standard semantic segmentation is expensive when every pixel must be labeled. Yet current methods deteriorate significantly in weakly supervised settings, e.g., where only a fraction of pixels is labeled or only image-level tags are available. It has been shown that regularized losses - originally developed for unsupervised low-level segmentation and representing geometric priors on pixel labels - can considerably improve the quality of weakly supervised training. However, many common priors require optimization stronger than gradient descent, so such regularizers have limited applicability in deep learning. We propose a new robust trust region approach for regularized losses that improves the state-of-the-art results. Our approach can be seen as a higher-order generalization of the classic chain rule. It allows neural network optimization to use strong low-level solvers for the corresponding regularizers, including discrete ones.

ILP-M Conv: Optimize Convolution Algorithm for Single-Image Convolution Neural Network Inference on Mobile GPUs

Oct 03, 2019
Zhuoran Ji

Convolutional neural networks are widely used in mobile applications. However, GPU convolution algorithms are designed for mini-batch neural network training, and single-image convolutional neural network inference on mobile GPUs is not well studied. After discussing the difference in usage and examining the existing convolution algorithms, we propose the HNTMP convolution algorithm. It achieves a $14.6\times$ speedup over the most popular im2col convolution algorithm and a $2.30\times$ speedup over direct convolution, the fastest existing convolution algorithm as far as we know.
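
For readers unfamiliar with the two baselines named in this abstract, the following is a minimal pure-Python sketch of im2col convolution next to direct convolution on a single-channel image. It does not reproduce the HNTMP algorithm or any GPU kernel. The function names `direct_conv2d`, `im2col`, and `im2col_conv2d` are illustrative assumptions, not names from the paper.

```python
def direct_conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out


def im2col(image, kh, kw):
    """Copy every kh x kw patch into one row of a patch matrix."""
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[image[i + a][j + b] for a in range(kh) for b in range(kw)]
            for i in range(oh) for j in range(ow)]


def im2col_conv2d(image, kernel):
    """Same result as direct_conv2d, computed as a matrix-vector product."""
    kh, kw = len(kernel), len(kernel[0])
    ow = len(image[0]) - kw + 1
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, kh, kw)
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in cols]
    # Fold the flat result back into rows of width ow.
    return [flat_out[r * ow:(r + 1) * ow] for r in range(len(flat_out) // ow)]


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
direct_result = direct_conv2d(image, kernel)
im2col_result = im2col_conv2d(image, kernel)
```

The patch matrix built by `im2col` stores each input pixel up to `kh * kw` times. That extra memory traffic is one commonly cited cost of the im2col approach. Direct convolution reads the image in place instead.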

High-Resolution Complex Scene Synthesis with Transformers

May 13, 2021
Manuel Jahn, Robin Rombach, Björn Ommer

The use of coarse-grained layouts for controllable synthesis of complex scene images via deep generative models has recently gained popularity. However, results of current approaches still fall short of their promise of high-resolution synthesis. We hypothesize that this is mostly due to the highly engineered nature of these approaches which often rely on auxiliary losses and intermediate steps such as mask generators. In this note, we present an orthogonal approach to this task, where the generative model is based on pure likelihood training without additional objectives. To do so, we first optimize a powerful compression model with adversarial training which learns to reconstruct its inputs via a discrete latent bottleneck and thereby effectively strips the latent representation of high-frequency details such as texture. Subsequently, we train an autoregressive transformer model to learn the distribution of the discrete image representations conditioned on a tokenized version of the layouts. Our experiments show that the resulting system is able to synthesize high-quality images consistent with the given layouts. In particular, we improve the state-of-the-art FID score on COCO-Stuff and on Visual Genome by up to 19% and 53% and demonstrate the synthesis of images up to 512 x 512 px on COCO and Open Images.

* AI for Content Creation Workshop, CVPR 2021

OmniFlow: Human Omnidirectional Optical Flow

Apr 16, 2021
Roman Seidel, André Apitzsch, Gangolf Hirtz

Optical flow is the motion of a pixel between at least two consecutive video frames and can be estimated with an end-to-end trainable convolutional neural network. Large training datasets are required to improve the accuracy of optical flow estimation. Our paper presents OmniFlow: a new synthetic omnidirectional human optical flow dataset. Using a rendering engine, we create a naturalistic 3D indoor environment with textured rooms, characters, actions, objects, illumination and motion blur, where all components of the environment are shuffled during the data capturing process. The simulation outputs rendered images of household activities and the corresponding forward and backward optical flow. To verify the data for training volumetric correspondence networks for optical flow estimation, we train on different subsets of the data and test on OmniFlow with and without test-time augmentation. In total, we generated 23,653 image pairs with corresponding forward and backward optical flow. Our dataset can be downloaded from: https://mytuc.org/byfs

* CVPRW 2021: The Second OmniCV Workshop: Omnidirectional Computer Vision in Research and Industry

A Perceptual Model for Eccentricity-dependent Spatio-temporal Flicker Fusion and its Applications to Foveated Graphics

May 05, 2021
Brooke Krajancich, Petr Kellnhofer, Gordon Wetzstein

Virtual and augmented reality (VR/AR) displays strive to provide a resolution, framerate and field of view that matches the perceptual capabilities of the human visual system, all while constrained by limited compute budgets and transmission bandwidths of wearable computing systems. Foveated graphics techniques have emerged that could achieve these goals by exploiting the falloff of spatial acuity in the periphery of the visual field. However, considerably less attention has been given to temporal aspects of human vision, which also vary across the retina. This is in part due to limitations of current eccentricity-dependent models of the visual system. We introduce a new model, experimentally measuring and computationally fitting eccentricity-dependent critical flicker fusion thresholds jointly for both space and time. In this way, our model is unique in enabling the prediction of temporal information that is imperceptible for a certain spatial frequency, eccentricity, and range of luminance levels. We validate our model with an image quality user study, and use it to predict potential bandwidth savings 7x higher than those afforded by current spatial-only foveated models. As such, this work forms the enabling foundation for new temporally foveated graphics techniques.

* 11 pages, 6 figures

Multimodal Image Captioning for Marketing Analysis

Feb 06, 2018
Philipp Harzig, Stephan Brehm, Rainer Lienhart, Carolin Kaiser, René Schallner

Automatically captioning images with natural language sentences is an important research topic. State-of-the-art models are able to produce human-like sentences. These models typically describe the depicted scene as a whole and do not target specific objects of interest or emotional relationships between these objects in the image. However, marketing companies need descriptions of exactly these attributes of a given scene. In our case, objects of interest are consumer goods, which are usually identifiable by a product logo and are associated with certain brands. From a marketing point of view, it is desirable to also evaluate the emotional context of a trademarked product, i.e., whether it appears with a positive or a negative connotation. We address the problem of finding brands in images and deriving corresponding captions by introducing a modified image captioning network. We also add a third output modality, which simultaneously produces real-valued image ratings. Our network is trained using a classification-aware loss function in order to stimulate the generation of sentences with an emphasis on words identifying the brand of a product. We evaluate our model on a dataset of images depicting interactions between humans and branded products. The introduced network improves mean class accuracy by 24.5 percent. Thanks to the third output modality, it also considerably improves the quality of generated captions for images depicting branded products.

* 4 pages, 1 figure, accepted at MIPR2018

Why do deep convolutional networks generalize so poorly to small image transformations?

May 30, 2018
Aharon Azulay, Yair Weiss

Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, the deeper the network, the more we see these failures to generalize. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem, so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets make it unlikely that CNNs will learn to be invariant to these transformations. Taken together, our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.
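
The link to the sampling theorem can be illustrated with a toy 1D example. This is only a sketch of aliasing under strided subsampling, not the paper's experiments on VGG16, ResNet50, or InceptionResNetV2. The helper names `subsample` and `shift`, and the choice of a circular shift, are assumptions made for illustration.

```python
def subsample(signal, stride=2):
    """Keep every stride-th sample, as a strided layer does, with no low-pass filter."""
    return signal[::stride]


def shift(signal, k):
    """Circularly shift a list to the right by k samples."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


# A single impulse at an odd index.
signal = [0, 1, 0, 0, 0, 0, 0, 0]

out_original = subsample(signal)            # impulse falls between kept samples
out_shifted = subsample(shift(signal, 1))   # impulse now lands on a kept sample

print(out_original)  # [0, 0, 0, 0]
print(out_shifted)   # [0, 1, 0, 0]
```

A one-sample shift of the input makes the impulse appear in or vanish from the subsampled output. The output therefore does not simply translate with the input, which is the kind of behaviour the abstract attributes to ignoring the sampling theorem.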

Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention

Nov 23, 2020
Varnith Chordia, Vijay Kumar BG

Accurate and efficient product classification is important for e-commerce applications, as it enables various downstream tasks such as recommendation, retrieval, and pricing. Items often contain textual and visual information, and using both modalities usually outperforms classification using either modality alone. In this paper we describe our methodology and results for the SIGIR eCom Rakuten Data Challenge. We employ a dual attention technique to model image-text relationships using pretrained language and image embeddings. While dual attention has been widely used for Visual Question Answering (VQA) tasks, ours is the first attempt to apply the concept to multimodal classification.

Polynomial Networks in Deep Classifiers

Apr 16, 2021
Grigorios G Chrysos, Markos Georgopoulos, Jiankang Deng, Yannis Panagakis

Deep neural networks have been the driving force behind the success in classification tasks, e.g., object and audio recognition. Impressive results and generalization have been achieved by a variety of recently proposed architectures, the majority of which are seemingly disconnected. In this work, we cast the study of deep classifiers under a unifying framework. In particular, we express state-of-the-art architectures (e.g., residual and non-local networks) in the form of different degree polynomials of the input. Our framework provides insights on the inductive biases of each model and enables natural extensions building upon their polynomial nature. The efficacy of the proposed models is evaluated on standard image and audio classification benchmarks. The expressivity of the proposed models is highlighted both in terms of increased model performance as well as model compression. Lastly, the extensions allowed by this taxonomy showcase benefits in the presence of limited data and long-tailed data distributions. We expect this taxonomy to provide links between existing domain-specific architectures.

* Under review

Siamese Basis Function Networks for Defect Classification

Dec 09, 2020
Tobias Schlagenhauf, Faruk Yildirim, Benedikt Brückner, Jürgen Fleischer

Defect classification on metallic surfaces is considered a critical issue since substantial quantities of steel and other metals are processed by the manufacturing industry on a daily basis. The authors propose a new approach that uses so-called Siamese Kernels in a Basis Function Network to create the Siamese Basis Function Network (SBF-Network). The underlying idea is to classify by comparison using similarity scores. This classification is reinforced through efficient deep-learning-based feature extraction methods. First, a center image is assigned to each Siamese Kernel. The Kernels are then trained to generate encodings in a way that enables them to distinguish their center from other images in the dataset. In this way the authors create a form of class-awareness inside the Siamese Kernels. To classify a given image, each Siamese Kernel generates a feature vector for its center as well as for the given image. These vectors represent encodings of the respective images in a lower-dimensional space. The distance between each pair of encodings is then computed using the cosine distance together with radial basis functions. The distances are fed into a multilayer neural network to perform the classification. With this approach the authors achieved outstanding results on the state-of-the-art NEU surface defect dataset.
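
The distance step described in this abstract can be sketched as follows. The sketch makes several assumptions: the encodings are given as plain lists rather than produced by trained Siamese Kernels, the radial basis function is a Gaussian with a hypothetical width parameter `gamma`, and a single shared encoding of the input is compared against every center. The function names are illustrative and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def gaussian_rbf(distance, gamma=1.0):
    """Gaussian radial basis function; gamma is an assumed width parameter."""
    return math.exp(-gamma * distance ** 2)


def kernel_activations(encoding, center_encodings, gamma=1.0):
    """One activation per center: close to 1 when the input resembles that center."""
    return [gaussian_rbf(cosine_distance(encoding, c), gamma)
            for c in center_encodings]


# Toy 2D encodings standing in for learned feature vectors.
centers = [[1.0, 0.0], [0.0, 1.0]]
activations = kernel_activations([1.0, 0.0], centers)
```

In the approach described above, a vector like `activations` would then be passed to a multilayer network that outputs the class. That final step is omitted here.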
