Emerged as one of the best performing techniques for extractive summarization, determinantal point processes select the most probable set of sentences to form a summary according to a probability measure defined by modeling sentence prominence and pairwise repulsion. Traditionally, these aspects are modelled using shallow and linguistically informed features, but the rise of deep contextualized representations raises an interesting question of whether, and to what extent, contextualized representations can be used to improve DPP modeling. Our findings suggest that, despite the success of deep representations, it remains necessary to combine them with surface indicators for effective identification of summary sentences.
We present an axiomatic way of assigning probabilities to black box models. In particular, we quantify an upper bound for probability of a model or in terms of information theory, a lower bound for amount of information that is stored in a model. In our setup, maximizing probabilities of models is equivalent to removing assumptions or information stored in the model. Furthermore, we represent the problem of learning from an alternative view where the underlying probability space is considered directly. In this perspective both the true underlying model and the model at hand are events. Consequently, the problem of learning is represented as minimizing the probability of the symmetric difference of the model and the true underlying model.
We introduce a computationally-efficient CNN micro-architecture Slim Module to design a lightweight deep neural network Slim-Net for face attribute prediction. Slim Modules are constructed by assembling depthwise separable convolutions with pointwise convolution to produce a computationally efficient module. The problem of facial attribute prediction is challenging because of the large variations in pose, background, illumination, and dataset imbalance. We stack these Slim Modules to devise a compact CNN which still maintains very high accuracy. Additionally, the neural network has a very low memory footprint which makes it suitable for mobile and embedded applications. Experiments on the CelebA dataset show that Slim-Net achieves an accuracy of 91.24% with at least 25 times fewer parameters than comparably performing methods, which reduces the memory storage requirement of Slim-net by at least 87%.
The video based CNN works have focused on effective ways to fuse appearance and motion networks, but they typically lack utilizing temporal information over video frames. In this work, we present a novel spatio-temporal fusion network (STFN) that integrates temporal dynamics of appearance and motion information from entire videos. The captured temporal dynamic information is then aggregated for a better video level representation and learned via end-to-end training. The spatio-temporal fusion network consists of two set of Residual Inception blocks that extract temporal dynamics and a fusion connection for appearance and motion features. The benefits of STFN are: (a) it captures local and global temporal dynamics of complementary data to learn video-wide information; and (b) it is applicable to any network for video classification to boost performance. We explore a variety of design choices for STFN and verify how the network performance is varied with the ablation studies. We perform experiments on two challenging human activity datasets, UCF101 and HMDB51, and achieve the state-of-the-art results with the best network.
In this work\footnote {This work was supported in part by the National Science Foundation under grant IIS-1212948.}, we present a method to represent a video with a sequence of words, and learn the temporal sequencing of such words as the key information for predicting and recognizing human actions. We leverage core concepts from the Natural Language Processing (NLP) literature used in sentence classification to solve the problems of action prediction and action recognition. Each frame is converted into a word that is represented as a vector using the Bag of Visual Words (BoW) encoding method. The words are then combined into a sentence to represent the video, as a sentence. The sequence of words in different actions are learned with a simple but effective Temporal Convolutional Neural Network (T-CNN) that captures the temporal sequencing of information in a video sentence. We demonstrate that a key characteristic of the proposed method is its low-latency, i.e. its ability to predict an action accurately with a partial sequence (sentence). Experiments on two datasets, \textit{UCF101} and \textit{HMDB51} show that the method on average reaches 95\% of its accuracy within half the video frames. Results, also demonstrate that our method achieves compatible state-of-the-art performance in action recognition (i.e. at the completion of the sentence) in addition to action prediction.
The most important obstacles facing multi-document summarization include excessive redundancy in source descriptions and the looming shortage of training data. These obstacles prevent encoder-decoder models from being used directly, but optimization-based methods such as determinantal point processes (DPPs) are known to handle them well. In this paper we seek to strengthen a DPP-based method for extractive multi-document summarization by presenting a novel similarity measure inspired by capsule networks. The approach measures redundancy between a pair of sentences based on surface form and semantic information. We show that our DPP system with improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets. Our findings are particularly meaningful for summarizing documents created by multiple authors containing redundant yet lexically diverse expressions.
During the last half decade, convolutional neural networks (CNNs) have triumphed over semantic segmentation, which is one of the core tasks in many applications such as autonomous driving and augmented reality. However, to train CNNs requires a considerable amount of data, which is difficult to collect and laborious to annotate. Recent advances in computer graphics make it possible to train CNNs on photo-realistic synthetic imagery with computer-generated annotations. Despite this, the domain mismatch between the real images and the synthetic data hinders the models' performance. Hence, we propose a curriculum-style learning approach to minimizing the domain gap in urban scene semantic segmentation. The curriculum domain adaptation solves easy tasks first to infer necessary properties about the target domain; in particular, the first task is to learn global label distributions over images and local distributions over landmark superpixels. These are easy to estimate because images of urban scenes have strong idiosyncrasies (e.g., the size and spatial relations of buildings, streets, cars, etc.). We then train a segmentation network, while regularizing its predictions in the target domain to follow those inferred properties. In experiments, our method outperforms the baselines on two datasets and two backbone networks. We also report extensive ablation studies about our approach.
Estimating structures in "big data" and clustering them are among the most fundamental problems in computer vision, pattern recognition, data mining, and many other other research fields. Over the past few decades, many studies have been conducted focusing on different aspects of these problems. One of the main approaches that is explored in the literature to tackle the problems of size and dimensionality is sampling subsets of the data in order to estimate the characteristics of the whole population, e.g. estimating the underlying clusters or structures in the data. In this paper, we propose a `one-time-grab' sampling algorithm\cite{jaberi2015swift,jaberi2018sparse}. This method can be used as the front end to any supervised or unsupervised clustering method. Rather than focusing on the strategy of maximizing the probability of sampling inliers, our goal is to minimize the number of samples needed to instantiate all underlying model instances. More specifically, our goal is to answer the following question: {\em `Given a very large population of points with $C$ embedded structures and gross outliers, what is the minimum number of points $r$ to be selected randomly in one grab in order to make sure with probability $P$ that at least $\varepsilon$ points are selected on each structure, where $\varepsilon$ is the number of degrees of freedom of each structure.'}
Deep neural networks (DNNs) have been demonstrated to be vulnerable to adversarial examples. Specifically, adding imperceptible perturbations to clean images can fool the well trained deep neural networks. In this paper, we propose an end-to-end image compression model to defend adversarial examples: \textbf{ComDefend}. The proposed model consists of a compression convolutional neural network (ComCNN) and a reconstruction convolutional neural network (ResCNN). The ComCNN is used to maintain the structure information of the original image and purify adversarial perturbations. And the ResCNN is used to reconstruct the original image with high quality. In other words, ComDefend can transform the adversarial image to its clean version, which is then fed to the trained classifier. Our method is a pre-processing module, and does not modify the classifier's structure during the whole process. Therefore, it can be combined with other model-specific defense models to jointly improve the classifier's robustness. A series of experiments conducted on MNIST, CIFAR10 and ImageNet show that the proposed method outperforms the state-of-the-art defense methods, and is consistently effective to protect classifiers against adversarial attacks.
During the last half decade, convolutional neural networks (CNNs) have triumphed over semantic segmentation, which is one of the core tasks in many applications such as autonomous driving. However, to train CNNs requires a considerable amount of data, which is difficult to collect and laborious to annotate. Recent advances in computer graphics make it possible to train CNNs on photo-realistic synthetic imagery with computer-generated annotations. Despite this, the domain mismatch between the real images and the synthetic data cripples the models' performance. Hence, we propose a curriculum-style learning approach to minimize the domain gap in urban scenery semantic segmentation. The curriculum domain adaptation solves easy tasks first to infer necessary properties about the target domain; in particular, the first task is to learn global label distributions over images and local distributions over landmark superpixels. These are easy to estimate because images of urban scenes have strong idiosyncrasies (e.g., the size and spatial relations of buildings, streets, cars, etc.). We then train a segmentation network while regularizing its predictions in the target domain to follow those inferred properties. In experiments, our method outperforms the baselines on two datasets and two backbone networks. We also report extensive ablation studies about our approach.