Yash Patel

Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA

Neural Network-based Acoustic Vehicle Counting

Oct 22, 2020

Slobodan Djukanović, Yash Patel, Jiři Matas, Tuomas Virtanen

Abstract: This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of a vehicle-to-microphone distance predicted from audio. The distance is predicted via a two-stage (coarse-fine) regression, with both stages realised using neural networks (NNs). Experiments show that the NN-based distance regression far outperforms the previously proposed support vector regression. The $95\%$ confidence interval for the mean of vehicle counting error is within $[0.28\%, -0.55\%]$. Besides the minima-based counting, we propose a deep learning counting which operates on the predicted distance without detecting local minima. Results also show that removing low frequencies in features improves the counting performance.
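
The minima-based counting described in this abstract can be illustrated with a minimal sketch. It assumes the predicted distance is already available as a list of samples; the function name `count_pass_bys` and the `max_distance` threshold are hypothetical choices for illustration, not the authors' implementation.

```python
def count_pass_bys(distance, max_distance=0.5):
    """Return indices of local minima of a sampled distance curve.

    A sample is treated as a pass-by instant when it is lower than its
    left neighbour, not higher than its right neighbour, and does not
    exceed max_distance (a hypothetical threshold that rejects shallow dips).
    """
    instants = []
    for i in range(1, len(distance) - 1):
        is_minimum = distance[i] < distance[i - 1] and distance[i] <= distance[i + 1]
        if is_minimum and distance[i] <= max_distance:
            instants.append(i)
    return instants


# Made-up distance curve with two clear dips and one shallow dip.
predicted = [1.0, 0.6, 0.1, 0.5, 0.9, 1.0, 0.7, 0.2, 0.6, 1.0, 0.8, 0.9]
instants = count_pass_bys(predicted)
print(instants)       # [2, 7]
print(len(instants))  # 2 vehicles counted
```

The strict comparison on the left side means a flat-bottomed dip is counted once rather than once per sample.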

A Mobile App for Wound Localization using Deep Learning

Sep 15, 2020

D. M. Anisuzzaman, Yash Patel, Jeffrey Niezgoda, Sandeep Gopalakrishnan, Zeyun Yu

Abstract: We present an automated wound localizer for 2D wound and ulcer images using a deep neural network, as the first step towards building an automated and complete wound diagnostic system. The wound localizer was developed using the YOLOv3 model and then turned into an iOS mobile application. The developed localizer can detect the wound and its surrounding tissues and isolate the localized wounded region from images, which helps future processing such as wound segmentation and classification by removing unnecessary regions from wound images. For mobile app development with video processing, a lighter version of YOLOv3 named tiny-YOLOv3 was used. The model is trained and tested on our own image dataset in collaboration with AZH Wound and Vascular Center, Milwaukee, Wisconsin. The YOLOv3 model is compared with the SSD model: YOLOv3 gives a mAP value of 93.9%, which is much better than the SSD model (86.4%). The robustness and reliability of these models are also tested on a publicly available dataset named Medetec, where they also perform very well.

* 8 pages, 5 figures, 1 table

Learning Surrogates via Deep Embedding

Jul 17, 2020

Yash Patel, Tomas Hodan, Jiri Matas

Abstract: This paper proposes a technique for training a neural network by minimizing a surrogate loss that approximates the target evaluation metric, which may be non-differentiable. The surrogate is learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned using the learned surrogate. Without a significant computational overhead and any bells and whistles, improvements are demonstrated on challenging and practical tasks of scene-text recognition and detection. In the recognition task, the model is tuned using a surrogate approximating the edit distance metric and achieves up to $39\%$ relative improvement in the total edit distance. In the detection task, the surrogate approximates the intersection over union metric for rotated bounding boxes and yields up to $4.25\%$ relative improvement in the $F_{1}$ score.

* ECCV 2020 camera-ready version
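
For readers unfamiliar with the edit distance metric named in the abstract above, the following minimal sketch computes the standard Levenshtein distance with dynamic programming. It shows the non-differentiable target metric only, not the learned surrogate; the names `edit_distance` and `total_edit_distance` and the sample strings are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def total_edit_distance(predictions, truths):
    """Sum of edit distances over paired predictions and ground truths."""
    return sum(edit_distance(p, t) for p, t in zip(predictions, truths))


print(edit_distance("kitten", "sitting"))                            # 3
print(total_edit_distance(["hello", "w0rld"], ["hello", "world"]))   # 1
```

Because the result changes in integer steps as characters change, it provides no useful gradient, which is why a differentiable approximation is attractive for training.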

Hierarchical Auto-Regressive Model for Image Compression Incorporating Object Saliency and a Deep Perceptual Loss

Feb 12, 2020

Yash Patel, Srikar Appalaraju, R. Manmatha

Abstract: We propose a new end-to-end trainable model for lossy image compression which includes a number of novel components. This approach 1) incorporates a hierarchical auto-regressive model and 2) incorporates saliency in the images, focusing on reconstructing the salient regions better. In addition, 3) we empirically demonstrate that the popularly used evaluation metrics such as MS-SSIM and PSNR are inadequate for judging the performance of deep learned image compression techniques, as they do not align well with human perceptual similarity. We therefore propose an alternative metric, which is learned on perceptual similarity data specific to image compression. Our experiments show that this new metric aligns significantly better with human judgments when compared to other hand-crafted or learned metrics. The proposed compression model not only generates images that are visually better but also gives superior performance for subsequent computer vision tasks such as object detection and segmentation when compared to other engineered or learned codecs.

Human Perceptual Evaluations for Image Compression

Aug 09, 2019

Yash Patel, Srikar Appalaraju, R. Manmatha

Abstract: Recently, there has been much interest in deep learning techniques for image compression, and there have been claims that several of these produce better results than engineered compression schemes (such as JPEG, JPEG2000 or BPG). A standard way of comparing image compression schemes today is to use perceptual similarity metrics such as PSNR or MS-SSIM (multi-scale structural similarity). This has led to some deep learning techniques which directly optimize for MS-SSIM by choosing it as a loss function. While this leads to a higher MS-SSIM for such techniques, we demonstrate using user studies that the resulting improvement may be misleading. Deep learning techniques for image compression with a higher MS-SSIM may actually be perceptually worse than engineered compression schemes with a lower MS-SSIM.

* arXiv admin note: text overlap with arXiv:1907.08310
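
As background for the PSNR metric named in the abstract above, here is a minimal sketch of the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, applied to flat lists of pixel values. The function name `psnr` and the 8-bit default peak value are assumptions for illustration; MS-SSIM is considerably more involved and is not shown.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length
    sequences of pixel values (8-bit peak value assumed by default)."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up pixel rows: a small distortion scores higher than a large one.
original = [52, 55, 61, 66, 70, 61, 64, 73]
slightly_off = [53, 55, 60, 66, 71, 61, 64, 72]
badly_off = [40, 70, 50, 80, 60, 75, 50, 90]
print(psnr(original, slightly_off) > psnr(original, badly_off))  # True
```

The score depends only on per-pixel squared error, so two distortions with the same mean squared error receive the same PSNR regardless of how they look to a viewer.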

Deep Perceptual Compression

Jul 31, 2019

Yash Patel, Srikar Appalaraju, R. Manmatha

Abstract: Several deep learned lossy compression techniques have been proposed in the recent literature. Most of these are optimized by using either MS-SSIM (multi-scale structural similarity) or MSE (mean squared error) as a loss function. Unfortunately, neither of these correlates well with human perception, and this is clearly visible from the resulting compressed images. In several cases, the MS-SSIM for deep learned techniques is higher than that of, say, a conventional, non-deep learned codec such as JPEG-2000 or BPG. However, the images produced by these deep learned techniques are in many cases clearly worse to human eyes than those produced by JPEG-2000 or BPG. We propose the use of an alternative, deep perceptual metric, which has been shown to align better with human perceptual similarity. We then propose Deep Perceptual Compression (DPC), which makes use of an encoder-decoder based image compression model to jointly optimize on the deep perceptual metric and MS-SSIM. Via extensive human evaluations, we show that the proposed method generates visually better results than previous learning based compression methods and JPEG-2000, and is comparable to BPG. Furthermore, we demonstrate that for tasks like object detection, images compressed with DPC give better accuracy.

ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019

Jul 01, 2019

Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu(+1 more)

Abstract: With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal of systematically benchmarking and pushing the state of the art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large-scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.

* ICDAR'19 camera-ready version. Competition available at https://rrc.cvc.uab.es/?ch=15. The first two authors contributed equally

Self-Supervised Visual Representations for Cross-Modal Retrieval

Jan 31, 2019

Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

Abstract: Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found in large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists of training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like image classification and object detection, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.

* arXiv admin note: text overlap with arXiv:1807.02110

TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces

Jul 04, 2018

Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar

Abstract: The immense success of deep learning based methods in computer vision heavily relies on large-scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort, and annotations are limited to a popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community. In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more probable to appear as an illustration. More specifically, we use popular text embedding techniques to provide the self-supervision for the training of a deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.

* arXiv admin note: text overlap with arXiv:1705.08631

Learning Sampling Policies for Domain Adaptation

May 19, 2018

Yash Patel, Kashyap Chitta, Bhavan Jasani

Abstract: We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that learned sampling policies construct labeled sets that improve accuracies of visual classifiers over baselines.
