Brian Price

DeepStrip: High Resolution Boundary Refinement

Mar 25, 2020

Peng Zhou, Brian Price, Scott Cohen, Gregg Wilensky, Larry S. Davis

Abstract: In this paper, we target refining the boundaries in high-resolution images given low-resolution masks. For memory and computation efficiency, we propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. To detect the target boundary, we present a framework with two prediction layers. First, all potential boundaries are predicted as an initial prediction, and then a selection layer is used to pick the target boundary and smooth the result. To encourage accurate prediction, a loss that measures the boundary distance in the strip domain is introduced. In addition, we enforce matching consistency and C0 continuity regularization on the network to reduce false alarms. Extensive experiments on both public datasets and a newly created high-resolution dataset strongly validate our approach.

* CVPR 2020

Getting to 99% Accuracy in Interactive Segmentation

Mar 17, 2020

Marco Forte, Brian Price, Scott Cohen, Ning Xu, François Pitié

Abstract: Interactive object cutout tools are the cornerstone of the image editing workflow. Recent deep-learning-based interactive segmentation algorithms have made significant progress in handling complex images, and rough binary selections can typically be obtained with just a few clicks. Yet, deep learning techniques tend to plateau once this rough selection has been reached. In this work, we interpret this plateau as the inability of current algorithms to sufficiently leverage each user interaction and also as a limitation of current training/testing datasets. We propose a novel interactive architecture and a novel training scheme that are both tailored to better exploit the user workflow. We also show that significant improvements can be further gained by introducing a synthetic training dataset that is specifically designed for complex object boundaries. Comprehensive experiments support our approach, and our network achieves state-of-the-art performance.

* Submitted for review to Signal Processing: Image Communication

Deep Visual Template-Free Form Parsing

Sep 18, 2019

Brian Davis, Bryan Morse, Scott Cohen, Brian Price, Chris Tensmeyer

Abstract: Automatic, template-free extraction of information from form images is challenging due to the variety of form layouts. This is even more challenging for historical forms due to noise and degradation. A crucial part of the extraction process is associating input text with pre-printed labels. We present a learned, template-free solution to detecting pre-printed text and input text/handwriting and predicting pair-wise relationships between them. While previous approaches to this problem have been focused on clean images and clear layouts, we show our approach is effective in the domain of noisy, degraded, and varied form images. We introduce a new dataset of historical form images (late 1800s, early 1900s) for training and validating our approach. Our method uses a convolutional network to detect pre-printed text and input text lines. We pool features from the detection network to classify possible relationships in a language-agnostic way. We show that our proposed pairing method outperforms heuristic rules and that visual features are critical to obtaining high accuracy.

* Accepted at ICDAR 2019. Updated results with average of repeated experiments

Unconstrained Foreground Object Search

Aug 10, 2019

Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari

Abstract: Many people search for foreground objects to use when editing images. While existing methods can retrieve candidates to aid in this, they are constrained to returning objects that belong to a pre-specified semantic class. We instead propose a novel problem of unconstrained foreground object (UFO) search and introduce a solution that supports efficient search by encoding the background image in the same latent space as the candidate foreground objects. A key contribution of our work is a cost-free, scalable approach for creating a large-scale training dataset with a variety of foreground objects of differing semantic categories per image location. Quantitative and human-perception experiments with two diverse datasets demonstrate the advantage of our UFO search solution over related baselines.

* To appear in ICCV 2019

Answering Questions about Data Visualizations using Efficient Bimodal Fusion

Aug 05, 2019

Kushal Kafle, Robik Shrestha, Brian Price, Scott Cohen, Christopher Kanan

Abstract: Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g., bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Without modifications, state-of-the-art VQA algorithms perform poorly on this task. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the-art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart.

Measuring Human Perception to Improve Handwritten Document Transcription

Apr 10, 2019

Samuel Grieggs, Bingyu Shen, Pei Li, Cana Short, Jiaqi Ma, Mihow McKenny, Melody Wauke, Brian Price, Walter Scheirer

Abstract: The subtleties of human perception, as measured by vision scientists through the use of psychophysics, are important clues to the internal workings of visual recognition. For instance, measured reaction time can indicate whether a visual stimulus is easy for a subject to recognize, or whether it is hard. In this paper, we consider how to incorporate psychophysical measurements of visual perception into the loss function of a deep neural network being trained for a recognition task, under the assumption that such information can enforce consistency with human behavior. As a case study to assess the viability of this approach, we look at the problem of handwritten document transcription. While good progress has been made towards automatically transcribing modern handwriting, significant challenges remain in transcribing historical documents. Here we work towards a comprehensive transcription solution for Medieval manuscripts that combines networks trained using our novel loss formulation with natural language processing elements. In a baseline assessment, reliable performance is demonstrated for the standard IAM and RIMES datasets. Further, we go on to show feasibility for our approach on a previously published dataset and a new dataset of digitized Latin manuscripts, originally produced by scribes in the Cloister of St. Gall around the middle of the 9th century.

YouTube-VOS: Sequence-to-Sequence Video Object Segmentation

Sep 03, 2018

Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, Thomas Huang

Abstract: Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods capturing temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset only contains 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 3,252 YouTube video clips and 78 categories including common objects and human activities. This is by far the largest video object segmentation dataset to our knowledge, and we have released it at https://youtube-vos.org. Based on this dataset, we propose a novel sequence-to-sequence network to fully exploit long-term spatial-temporal information in videos for segmentation. We demonstrate that our method is able to achieve the best results on our YouTube-VOS test set and comparable results on DAVIS 2016 compared to the current state-of-the-art methods. Experiments show that the large-scale dataset is indeed a key factor in the success of our model.

* ECCV 2018 accepted paper

Discriminability objective for training descriptive captions

Jun 08, 2018

Ruotian Luo, Brian Price, Scott Cohen, Gregory Shakhnarovich

Abstract: One property that remains lacking in image captions generated by contemporary methods is discriminability: being able to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly related to the ability (of a machine) to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach leads to improvement in other aspects of generated captions, reflected by a battery of standard scores such as BLEU, SPICE, etc. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.

* CVPR 2018

DVQA: Understanding Data Visualizations via Question Answering

Mar 29, 2018

Kushal Kafle, Brian Price, Scott Cohen, Christopher Kanan

Abstract: Bar charts are an effective way to convey numeric information, but today's algorithms cannot parse them. Existing methods fail when faced with even minor variations in appearance. Here, we present DVQA, a dataset that tests many aspects of bar chart understanding in a question answering framework. Unlike visual question answering (VQA), DVQA requires processing words and answers that are unique to a particular bar chart. State-of-the-art VQA algorithms perform poorly on DVQA, and we propose two strong baselines that perform considerably better. Our work will enable algorithms to automatically extract numeric and semantic information from vast quantities of bar charts found in scientific publications, Internet articles, business reports, and many other areas.

* CVPR 2018 Camera Ready Version

Guided Image Inpainting: Replacing an Image Region by Pulling Content from Another Image

Mar 22, 2018

Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari

Abstract: Deep generative models have shown success in automatically synthesizing missing image regions using surrounding context. However, users cannot directly decide what content to synthesize with such approaches. We propose an end-to-end network for image inpainting that uses a different image to guide the synthesis of new content to fill the hole. A key challenge addressed by our approach is synthesizing new content in regions where the guidance image and the context of the original image are inconsistent. We conduct four studies that demonstrate our approach yields more realistic image inpainting results than seven baselines.
