This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled-a newly developed public dataset and the task of multiple property extraction. It uses the same data as WikiReading but does not inherit its predecessor's identified disadvantages. In addition, we provide a human-annotated test set with diagnostic subsets for a detailed analysis of model performance.
3D Point clouds are a rich source of information that enjoy growing popularity in the vision community. However, due to the sparsity of their representation, learning models based on large point clouds is still a challenge. In this work, we introduce Graphite, a GRAPH-Induced feaTure Extraction pipeline, a simple yet powerful feature transform and keypoint detector. Graphite enables intensive down-sampling of point clouds with keypoint detection accompanied by a descriptor. We construct a generic graph-based learning scheme to describe point cloud regions and extract salient points. To this end, we take advantage of 6D pose information and metric learning to learn robust descriptions and keypoints across different scans. We Reformulate the 3D keypoint pipeline with graph neural networks which allow efficient processing of the point set while boosting its descriptive power which ultimately results in more accurate 3D registrations. We demonstrate our lightweight descriptor on common 3D descriptor matching and point cloud registration benchmarks and achieve comparable results with the state of the art. Describing 100 patches of a point cloud and detecting their keypoints takes only ~0.018 seconds with our proposed network.
Grounding language in contextual information is crucial for fine-grained natural language understanding. One important task that involves grounding contextual modifiers is color generation. Given a reference color "green", and a modifier "bluey", how does one generate a color that could represent "bluey green"? We propose a computational pragmatics model that formulates this color generation task as a recursive game between speakers and listeners. In our model, a pragmatic speaker reasons about the inferences that a listener would make, and thus generates a modified color that is maximally informative to help the listener recover the original referents. In this paper, we show that incorporating pragmatic information provides significant improvements in performance compared with other state-of-the-art deep learning models where pragmatic inference and flexibility in representing colors from a large continuous space are lacking. Our model has an absolute 98% increase in performance for the test cases where the reference colors are unseen during training, and an absolute 40% increase in performance for the test cases where both the reference colors and the modifiers are unseen during training.
To extend the capabilities of spectral imaging, hyperspectral and depth imaging have been combined to capture the higher-dimensional visual information. However, the form factor of the combined imaging systems increases, limiting the applicability of this new technology. In this work, we propose a monocular imaging system for simultaneously capturing hyperspectral-depth (HS-D) scene information with an optimized diffractive optical element (DOE). In the training phase, this DOE is optimized jointly with a convolutional neural network to estimate HS-D data from a snapshot input. To study natural image statistics of this high-dimensional visual data and to enable such a machine learning-based DOE training procedure, we record two HS-D datasets. One is used for end-to-end optimization in deep optical HS-D imaging, and the other is used for enhancing reconstruction performance with a real-DOE prototype. The optimized DOE is fabricated with a grayscale lithography process and inserted into a portable HS-D camera prototype, which is shown to robustly capture HS-D information. In extensive evaluations, we demonstrate that our deep optical imaging system achieves state-of-the-art results for HS-D imaging and that the optimized DOE outperforms alternative optical designs.
Protein interactions are important in a broad range of biological processes. Traditionally, computational methods have been developed to automatically predict protein interface from hand-crafted features. Recent approaches employ deep neural networks and predict the interaction of each amino acid pair independently. However, these methods do not incorporate the important sequential information from amino acid chains and the high-order pairwise interactions. Intuitively, the prediction of an amino acid pair should depend on both their features and the information of other amino acid pairs. In this work, we propose to formulate the protein interface prediction as a 2D dense prediction problem. In addition, we propose a novel deep model to incorporate the sequential information and high-order pairwise interactions to perform interface predictions. We represent proteins as graphs and employ graph neural networks to learn node features. Then we propose the sequential modeling method to incorporate the sequential information and reorder the feature matrix. Next, we incorporate high-order pairwise interactions to generate a 3D tensor containing different pairwise interactions. Finally, we employ convolutional neural networks to perform 2D dense predictions. Experimental results on multiple benchmarks demonstrate that our proposed method can consistently improve the protein interface prediction performance.
In this paper, we propose a novel deep learning architecture to improving word-level lip-reading. On the one hand, we first introduce the multi-scale processing into the spatial feature extraction for lip-reading. Specially, we proposed hierarchical pyramidal convolution (HPConv) to replace the standard convolution in original module, leading to improvements over the model's ability to discover fine-grained lip movements. On the other hand, we merge information in all time steps of the sequence by utilizing self-attention, to make the model pay more attention to the relevant frames. These two advantages are combined together to further enhance the model's classification power. Experiments on the Lip Reading in the Wild (LRW) dataset show that our proposed model has achieved 86.83% accuracy, yielding 1.53% absolute improvement over the current state-of-the-art. We also conducted extensive experiments to better understand the behavior of the proposed model.
We experimentally demonstrate a camera whose primary optic is a cannula (diameter=0.22mm and length=12.5mm) that acts a lightpipe transporting light intensity from an object plane (35cm away) to its opposite end. Deep neural networks (DNNs) are used to reconstruct color and grayscale images with field of view of 180 and angular resolution of ~0.40. When trained on images with depth information, the DNN can create depth maps. Finally, we show DNN-based classification of the EMNIST dataset without and with image reconstructions. The former could be useful for imaging with enhanced privacy.
One of the key components for video deblurring is how to exploit neighboring frames. Recent state-of-the-art methods either used aligned adjacent frames to the center frame or propagated the information on past frames to the current frame recurrently. Here we propose multi-blur-to-deblur (MB2D), a novel concept to exploit neighboring frames for efficient video deblurring. Firstly, inspired by unsharp masking, we argue that using more blurred images with long exposures as additional inputs significantly improves performance. Secondly, we propose multi-blurring recurrent neural network (MBRNN) that can synthesize more blurred images from neighboring frames, yielding substantially improved performance with existing video deblurring methods. Lastly, we propose multi-scale deblurring with connecting recurrent feature map from MBRNN (MSDR) to achieve state-of-the-art performance on the popular GoPro and Su datasets in fast and memory efficient ways.
Machine translation from polysynthetic to fusional languages is a challenging task, which gets further complicated by the limited amount of parallel text available. Thus, translation performance is far from the state of the art for high-resource and more intensively studied language pairs. To shed light on the phenomena which hamper automatic translation to and from polysynthetic languages, we study translations from three low-resource, polysynthetic languages (Nahuatl, Wixarika and Yorem Nokki) into Spanish and vice versa. Doing so, we find that in a morpheme-to-morpheme alignment an important amount of information contained in polysynthetic morphemes has no Spanish counterpart, and its translation is often omitted. We further conduct a qualitative analysis and, thus, identify morpheme types that are commonly hard to align or ignored in the translation process.
Autonomous robot-assisted feeding requires the ability to acquire a wide variety of food items. However, it is impossible for such a system to be trained on all types of food in existence. Therefore, a key challenge is choosing a manipulation strategy for a previously unseen food item. Previous work showed that the problem can be represented as a linear contextual bandit on visual information. However, food has a wide variety of multi-modal properties relevant to manipulation that can be hard to distinguish visually. Our key insight is that we can leverage the haptic information we collect during manipulation to learn some of these properties and more quickly adapt our visual model to previously unseen food. In general, we propose a modified linear contextual bandit framework augmented with post hoc context observed after action selection to empirically increase learning speed (as measured by cross-validation mean square error) and reduce cumulative regret. Experiments on synthetic data demonstrate that this effect is more pronounced when the dimensionality of the context is large relative to the post hoc context or when the post hoc context model is particularly easy to learn. Finally, we apply this framework to the bite acquisition problem and demonstrate the acquisition of 8 previously unseen types of food with 21% fewer failures across 64 attempts.