Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jinjun Xiong

Tensor Recovery from Noisy and Multi-Level Quantized Measurements

Dec 05, 2019

Ren Wang, Meng Wang, Jinjun Xiong

Figure 1 for Tensor Recovery from Noisy and Multi-Level Quantized Measurements

Figure 2 for Tensor Recovery from Noisy and Multi-Level Quantized Measurements

Figure 3 for Tensor Recovery from Noisy and Multi-Level Quantized Measurements

Figure 4 for Tensor Recovery from Noisy and Multi-Level Quantized Measurements

Abstract:Higher-order tensors can represent scores in a rating system, frames in a video, and images of the same subject. In practice, the measurements are often highly quantized due to the sampling strategies or the quality of devices. Existing works on tensor recovery have focused on data losses and random noises. Only a few works consider tensor recovery from quantized measurements but are restricted to binary measurements. This paper, for the first time, addresses the problem of tensor recovery from multi-level quantized measurements. Leveraging the low-rank property of the tensor, this paper proposes a nonconvex optimization problem for tensor recovery. We provide a theoretical upper bound of the recovery error, which diminishes to zero when the sizes of dimensions increase to infinity. Our error bound significantly improves over the existing results in one-bit tensor recovery and quantized matrix recovery. A tensor-based alternating proximal gradient descent algorithm with a convergence guarantee is proposed to solve the nonconvex problem. Our recovery method can handle data losses and do not need the information of the quantization rule. The method is validated on synthetic data, image datasets, and music recommender datasets.

Via

Access Paper or Ask Questions

Enabling real-time multi-messenger astrophysics discoveries with deep learning

Nov 26, 2019

E. A. Huerta, Gabrielle Allen, Igor Andreoni, Javier M. Antelis, Etienne Bachelet, Bruce Berriman, Federica Bianco, Rahul Biswas, Matias Carrasco, Kyle Chard(+50 more)

Figure 1 for Enabling real-time multi-messenger astrophysics discoveries with deep learning

Figure 2 for Enabling real-time multi-messenger astrophysics discoveries with deep learning

Abstract:Multi-messenger astrophysics is a fast-growing, interdisciplinary field that combines data, which vary in volume and speed of data processing, from many different instruments that probe the Universe using different cosmic messengers: electromagnetic waves, cosmic rays, gravitational waves and neutrinos. In this Expert Recommendation, we review the key challenges of real-time observations of gravitational wave sources and their electromagnetic and astroparticle counterparts, and make a number of recommendations to maximize their potential for scientific discovery. These recommendations refer to the design of scalable and computationally efficient machine learning algorithms; the cyber-infrastructure to numerically simulate astrophysical sources, and to process and interpret multi-messenger astrophysics data; the management of gravitational wave detections to trigger real-time alerts for electromagnetic and astroparticle follow-ups; a vision to harness future developments of machine learning and cyber-infrastructure resources to cope with the big-data requirements; and the need to build a community of experts to realize the goals of multi-messenger astrophysics.

* Nature Reviews Physics volume 1, pages 600-608 (2019)
* Invited Expert Recommendation for Nature Reviews Physics. The art work produced by E. A. Huerta and Shawn Rosofsky for this article was used by Carl Conway to design the cover of the October 2019 issue of Nature Reviews Physics

Via

Access Paper or Ask Questions

DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

Nov 20, 2019

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

Figure 1 for DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

Figure 2 for DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

Figure 3 for DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

Figure 4 for DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

Abstract:The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, object detection, machine translation, etc. While DL models provide an opportunity to solve otherwise intractable tasks, their adoption relies on them being optimized to meet latency and resource requirements. Benchmarking is a key step in this process but has been hampered in part due to the lack of representative and up-to-date benchmarking suites. This is exacerbated by the fast-evolving pace of DL models. This paper proposes DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks on CPUs. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model's performance using the performance of the generated benchmarks. DLBricks leverages two key observations: DL layers are the performance building blocks of DL models and layers are extensively repeated within and across DL models. Since benchmarks are generated automatically and the benchmarking time is minimized, DLBricks can keep up-to-date with the latest proposed models, relieving the pressure of selecting representative DL models. Moreover, DLBricks allows users to represent proprietary models within benchmark suites. We evaluate DLBricks using $50$ MXNet models spanning $5$ DL tasks on $4$ representative CPU systems. We show that DLBricks provides an accurate performance estimate for the DL models and reduces the benchmarking time across systems (e.g. within $95\%$ accuracy and up to $4.4\times$ benchmarking time speedup on Amazon EC2 c5.xlarge).

Via

Access Paper or Ask Questions

The Design and Implementation of a Scalable DL Benchmarking Platform

Nov 19, 2019

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

Figure 1 for The Design and Implementation of a Scalable DL Benchmarking Platform

Figure 2 for The Design and Implementation of a Scalable DL Benchmarking Platform

Figure 3 for The Design and Implementation of a Scalable DL Benchmarking Platform

Figure 4 for The Design and Implementation of a Scalable DL Benchmarking Platform

Abstract:The current Deep Learning (DL) landscape is fast-paced and is rife with non-uniform models, hardware/software (HW/SW) stacks, but lacks a DL benchmarking platform to facilitate evaluation and comparison of DL innovations, be it models, frameworks, libraries, or hardware. Due to the lack of a benchmarking platform, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone - stifling the adoption of the innovations. In this work, we first identify $10$ design features which are desirable within a DL benchmarking platform. These features include: performing the evaluation in a consistent, reproducible, and scalable manner, being framework and hardware agnostic, supporting real-world benchmarking workloads, providing in-depth model execution inspection across the HW/SW stack levels, etc. We then propose MLModelScope, a DL benchmarking platform design that realizes the $10$ objectives. MLModelScope proposes a specification to define DL model evaluations and techniques to provision the evaluation workflow using the user-specified HW/SW stack. MLModelScope defines abstractions for frameworks and supports board range of DL models and evaluation scenarios. We implement MLModelScope as an open-source project with support for all major frameworks and hardware architectures. Through MLModelScope's evaluation and automated analysis workflows, we performed case-study analyses of $37$ models across $4$ systems and show how model, hardware, and framework selection affects model accuracy and performance under different benchmarking scenarios. We further demonstrated how MLModelScope's tracing capability gives a holistic view of model execution and helps pinpoint bottlenecks.

Via

Access Paper or Ask Questions

Benanza: Automatic $μ$Benchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Nov 19, 2019

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wen-mei Hwu

Figure 1 for Benanza: Automatic $μ$Benchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Figure 2 for Benanza: Automatic $μ$Benchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Figure 3 for Benanza: Automatic $μ$Benchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Figure 4 for Benanza: Automatic $μ$Benchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Abstract:As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important venue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, the current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the "lower-bound" latency of DL models using the benchmark data and informs optimizations of model execution. The "lower-bound" latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.

Via

Access Paper or Ask Questions

NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving

Nov 18, 2019

Cong Hao, Yao Chen, Xinheng Liu, Atif Sarwari, Daryl Sew, Ashutosh Dhar, Bryan Wu, Dongdong Fu, Jinjun Xiong, Wen-mei Hwu(+2 more)

Figure 1 for NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving

Figure 2 for NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving

Figure 3 for NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving

Figure 4 for NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving

Abstract:The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations. In this position paper, we argue that a simultaneous DNN/implementation co-design methodology, named Neural Architecture and Implementation Search (NAIS), deserves more research attention to boost the development productivity and efficiency of both DNN models and implementation optimization. We propose a stylized design methodology that can drastically cut down the search cost while preserving the quality of the end solution.As an illustration, we discuss this DNN/implementation methodology in the context of both FPGAs and GPUs. We take autonomous driving as a key use case as it is one of the most demanding areas for high quality AI algorithms and accelerators. We discuss how such a co-design methodology can impact the autonomous driving industry significantly. We identify several research opportunities in this exciting domain.

* 8 pages, ICCAD 2019

Via

Access Paper or Ask Questions

PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space

Sep 25, 2019

Omer Anjum, Hongyu Gong, Suma Bhat, Wen-Mei Hwu, Jinjun Xiong

Figure 1 for PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space

Figure 2 for PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space

Figure 3 for PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space

Figure 4 for PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space

Abstract:Finding the right reviewers to assess the quality of conference submissions is a time consuming process for conference organizers. Given the importance of this step, various automated reviewer-paper matching solutions have been proposed to alleviate the burden. Prior approaches, including bag-of-words models and probabilistic topic models have been inadequate to deal with the vocabulary mismatch and partial topic overlap between a paper submission and the reviewer's expertise. Our approach, the common topic model, jointly models the topics common to the submission and the reviewer's profile while relying on abstract topic vectors. Experiments and insightful evaluations on two datasets demonstrate that the proposed method achieves consistent improvements compared to available state-of-the-art implementations of paper-reviewer matching.

Via

Access Paper or Ask Questions

SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems

Sep 20, 2019

Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, Jinjun Xiong, Thomas Huang, Honghui Shi(+2 more)

Figure 1 for SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems

Figure 2 for SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems

Figure 3 for SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems

Figure 4 for SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems

Abstract:Developing object detection and tracking on resource-constrained embedded systems is challenging. While object detection is one of the most compute-intensive tasks from the artificial intelligence domain, it is only allowed to use limited computation and memory resources on embedded devices. In the meanwhile, such resource-constrained implementations are often required to satisfy additional demanding requirements such as real-time response, high-throughput performance, and reliable inference accuracy. To overcome these challenges, we propose SkyNet, a hardware-efficient method to deliver the state-of-the-art detection accuracy and speed for embedded systems. Instead of following the common top-down flow for compact DNN design, SkyNet provides a bottom-up DNN design approach with comprehensive understanding of the hardware constraints at the very beginning to deliver hardware-efficient DNNs. The effectiveness of SkyNet is demonstrated by winning the extremely competitive System Design Contest for low power object detection in the 56th IEEE/ACM Design Automation Conference (DAC-SDC), where our SkyNet significantly outperforms all other 100+ competitors: it delivers 0.731 Intersection over Union (IoU) and 67.33 frames per second (FPS) on a TX2 embedded GPU; and 0.716 IoU and 25.05 FPS on an Ultra96 embedded FPGA. The evaluation of SkyNet is also extended to GOT-10K, a recent large-scale high-diversity benchmark for generic object tracking in the wild. For state-of-the-art object trackers SiamRPN++ and SiamMask, where ResNet-50 is employed as the backbone, implementations using our SkyNet as the backbone DNN are 1.60X and 1.73X faster with better or similar accuracy when running on a 1080Ti GPU, and 37.20X smaller in terms of parameter size for significantly better memory and storage footprint.

Via

Access Paper or Ask Questions

MSU-Net: Multiscale Statistical U-Net for Real-time 3D Cardiac MRI Video Segmentation

Sep 15, 2019

Tianchen Wang, Jinjun Xiong, Xiaowei Xu, Meng Jiang, Yiyu Shi, Haiyun Yuan, Meiping Huang, Jian Zhuang

Figure 1 for MSU-Net: Multiscale Statistical U-Net for Real-time 3D Cardiac MRI Video Segmentation

Figure 2 for MSU-Net: Multiscale Statistical U-Net for Real-time 3D Cardiac MRI Video Segmentation

Figure 3 for MSU-Net: Multiscale Statistical U-Net for Real-time 3D Cardiac MRI Video Segmentation

Abstract:Cardiac magnetic resonance imaging (MRI) is an essential tool for MRI-guided surgery and real-time intervention. The MRI videos are expected to be segmented on-the-fly in real practice. However, existing segmentation methods would suffer from drastic accuracy loss when modified for speedup. In this work, we propose Multiscale Statistical U-Net (MSU-Net) for real-time 3D MRI video segmentation in cardiac surgical guidance. Our idea is to model the input samples as multiscale canonical form distributions for speedup, while the spatio-temporal correlation is still fully utilized. A parallel statistical U-Net is then designed to efficiently process these distributions. The fast data sampling and efficient parallel structure of MSU-Net endorse the fast and accurate inference. Compared with vanilla U-Net and a modified state-of-the-art method GridNet, our method achieves up to 268% and 237% speedup with 1.6% and 3.6% increased Dice scores.

* MICCAI19

Via

Access Paper or Ask Questions

SPGNet: Semantic Prediction Guidance for Scene Parsing

Aug 26, 2019

Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, Honghui Shi

Figure 1 for SPGNet: Semantic Prediction Guidance for Scene Parsing

Figure 2 for SPGNet: Semantic Prediction Guidance for Scene Parsing

Figure 3 for SPGNet: Semantic Prediction Guidance for Scene Parsing

Figure 4 for SPGNet: Semantic Prediction Guidance for Scene Parsing

Abstract:Multi-scale context module and single-stage encoder-decoder structure are commonly employed for semantic segmentation. The multi-scale context module refers to the operations to aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance than their single-stage counterpart. However, few efforts have been attempted to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through the guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, in which our SPGNet attains 81.1% on the test set using only 'fine' annotations.

* ICCV 2019

Via

Access Paper or Ask Questions