We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Specifically, under both Gaussian and sub-Gaussian models, we study the sample size conditions under which the well-known Chow-Liu algorithm exactly recovers the equivalence class of the polytree, which is uniquely represented by a CPDAG. We also study the error rate for the estimation of the inverse correlation matrix under such models. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of the method when the ground-truth graphical structure can only be approximated by a polytree.
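As a rough illustration of the recovery step discussed above (a minimal sketch, not the paper's analysis): for tree-structured models, the Chow-Liu algorithm reduces to finding a maximum-weight spanning tree over pairwise mutual informations, which in the Gaussian case are determined by correlations via I(X_i; X_j) = -0.5 log(1 - rho_ij^2). The correlation matrix and variable names below are hypothetical.

```python
import math

def chow_liu_tree(corr, names):
    """Chow-Liu skeleton: maximum-weight spanning tree (Kruskal) over
    pairwise Gaussian mutual informations I(i,j) = -0.5*log(1 - rho_ij^2)."""
    n = len(names)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            mi = -0.5 * math.log(1.0 - corr[i][j] ** 2)
            edges.append((mi, i, j))
    edges.sort(reverse=True)  # strongest pairwise dependencies first

    parent = list(range(n))   # union-find forest for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for mi, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding this edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((names[i], names[j]))
    return tree

# Hypothetical 3-variable example: A-B and B-C strongly correlated.
corr = [[1.0, 0.8, 0.1],
        [0.8, 1.0, 0.5],
        [0.1, 0.5, 1.0]]
skeleton = chow_liu_tree(corr, ["A", "B", "C"])
```

With sample correlations plugged in place of the population matrix, this yields the tree skeleton; orienting it into a CPDAG is a separate step not sketched here.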
Synchrotron radiation sources are widely used in many fields, among which computed tomography (CT) is one of the most important. The effort required of the operator varies with the subject being imaged. If the number of sampling angles can be greatly reduced while maintaining comparable imaging quality, the working time and workload of the experimentalists will be greatly reduced. However, decreasing the number of sampling angles introduces serious artifacts and blurs fine details. We propose ResAttUnet, a deep learning model that reconstructs high-quality images from sparse-angle sampled data. ResAttUnet is a roughly symmetric U-shaped network that incorporates mechanisms similar to ResNet and attention. In addition, mixed-precision training is adopted to reduce the model's GPU memory requirements and training time.
Deep neural networks (DNNs) have demonstrated their great potential in recent years, exceeding the performance of human experts in a wide range of applications. Due to their large sizes, however, compression techniques such as weight quantization and pruning are usually applied before they can be accommodated on the edge. It is generally believed that quantization leads to performance degradation, and plenty of existing works have explored quantization strategies aiming at minimum accuracy loss. In this paper, we argue that quantization, which essentially imposes regularization on weight representations, can sometimes help to improve accuracy. We conduct comprehensive experiments on three widely used applications: fully connected network (FCN) for biomedical image segmentation, convolutional neural network (CNN) for image classification on ImageNet, and recurrent neural network (RNN) for automatic speech recognition, and experimental results show that quantization can improve the accuracy by 1%, 1.95%, and 4.23% on the three applications respectively, with 3.5x-6.4x memory reduction.
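To make the regularization intuition concrete, here is a minimal sketch of uniform weight quantization (a generic scheme, not necessarily the paper's exact one): each weight is snapped to the nearest of 2^b evenly spaced levels, so the network can only represent a coarse, constrained set of weight values.

```python
def quantize(weights, bits):
    """Uniform quantization: snap each weight to the nearest of
    2**bits evenly spaced levels spanning [-m, m], where m = max |w|."""
    levels = 2 ** bits
    m = max(abs(w) for w in weights)
    if m == 0.0:
        return list(weights)
    step = 2 * m / (levels - 1)          # distance between adjacent levels
    return [round((w + m) / step) * step - m for w in weights]

# With bits=2 and weights in [-1, 1], the four levels are -1, -1/3, 1/3, 1.
q = quantize([-1.0, -0.3, 0.2, 1.0], bits=2)
```

Storing b-bit level indices (plus one scale per tensor) instead of 32-bit floats is where the memory reduction comes from.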
For current object detectors, the scale of the receptive field of feature extraction operators usually increases layer by layer. Such operators, e.g., the convolution layer in CNNs and the set abstraction layer in PointNet++, are called scale-oriented operators in this paper. Scale-oriented operators are appropriate for 2D images with multi-scale objects, but not natural for 3D point clouds, whose objects are multi-density but scale-invariant. In this paper, we put forward a novel density-oriented PointNet (DPointNet) for 3D object detection in point clouds, in which the density of points increases layer by layer. In object detection experiments, DPointNet is applied to PointRCNN, and the results show that the model with the new operator achieves better performance and higher speed than the baseline PointRCNN, verifying the effectiveness of the proposed DPointNet.
In this paper, we propose a novel deep learning architecture to improve word-level lip-reading. On the one hand, we introduce multi-scale processing into spatial feature extraction for lip-reading. Specifically, we propose hierarchical pyramidal convolution (HPConv) to replace the standard convolution in the original module, improving the model's ability to discover fine-grained lip movements. On the other hand, we merge information across all time steps of the sequence by utilizing self-attention, making the model pay more attention to the relevant frames. These two advantages are combined to further enhance the model's classification power. Experiments on the Lip Reading in the Wild (LRW) dataset show that our proposed model achieves 86.83% accuracy, a 1.53% absolute improvement over the current state-of-the-art. We also conducted extensive experiments to better understand the behavior of the proposed model.
The task of multi-turn text-to-SQL semantic parsing aims to translate the natural language utterances in an interaction into SQL queries, in order to answer them using a database that normally contains multiple table schemas. Previous studies on this task usually utilized contextual information to enrich utterance representations and to further influence the decoding process, but they neglected to describe and track the interaction states, which are determined by the history of SQL queries and are related to the intent of the current utterance. In this paper, two kinds of interaction states are defined, based on schema items and SQL keywords respectively. A relational graph neural network and a non-linear layer are designed to update the representations of these two states. The dynamic schema-state and SQL-state representations are then utilized to decode the SQL query corresponding to the current utterance. Experimental results on the challenging CoSQL dataset demonstrate the effectiveness of our proposed method, which achieves better performance than other published methods on the task leaderboard.
In this paper, we propose a visual embedding approach to improve embedding-aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place-of-articulation levels. We first extract visual embeddings from lip frames using a pre-trained phone or articulation-place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embeddings from noisy speech and lip videos in an information-intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE). Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show that our proposed subword-based VEASE approach is more effective than conventional embedding at the word level. Moreover, visual embedding at the articulation-place level, leveraging the high correlation between place of articulation and lip shapes, performs even better than that at the phone level. Finally, the proposed MEASE framework, incorporating both audio and visual embeddings, yields significantly better speech quality and intelligibility than the best visual-only and audio-only EASE systems.
We present an improved version of PointRCNN for 3D object detection, in which a multi-branch backbone network is adopted to handle the non-uniform density of point clouds. An uncertainty-based sampling policy is proposed to deal with the distribution differences among point clouds. The new model achieves about 0.8 AP higher performance than the baseline PointRCNN on the KITTI val set. In addition, a simplified model using single-scale grouping for each set-abstraction layer achieves competitive performance at lower computational cost.
Video prediction is a pixel-wise dense prediction task that infers future frames from past frames. Missing appearance details and motion blur are still two major problems for current predictive models, leading to image distortion and temporal inconsistency. In this paper, we point out the necessity of exploring multi-frequency analysis to deal with these two problems. Inspired by the frequency-band decomposition characteristic of the Human Vision System (HVS), we propose a video prediction network based on multi-level wavelet analysis to deal with spatial and temporal information in a unified manner. Specifically, the multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and preserve fine details. Meanwhile, the multi-level temporal discrete wavelet transform, which operates on the time axis, decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements in fidelity and temporal consistency over state-of-the-art works.
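As a toy illustration of the sub-band decomposition idea (one level of a 1-D Haar DWT in pure Python, far simpler than the multi-level 2-D spatial and temporal transforms described above): a signal splits into half-length low-frequency (average) and high-frequency (detail) bands, and the split is exactly invertible.

```python
def haar_dwt(x):
    """One level of the 1-D Haar discrete wavelet transform: split a signal
    of even length into a low-frequency (average) sub-band and a
    high-frequency (detail) sub-band, each half the original length."""
    s = 2 ** 0.5
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse transform: perfectly reconstruct the signal from its bands."""
    s = 2 ** 0.5
    out = []
    for a, d in zip(low, high):
        out += [(a + d) / s, (a - d) / s]
    return out

# Recursing haar_dwt on the low band gives a multi-level decomposition;
# applying it along the time axis of a frame sequence gives temporal bands.
signal = [1.0, 2.0, 3.0, 4.0]
low, high = haar_dwt(signal)
reconstructed = haar_idwt(low, high)
```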
We introduce a novel single-shot object detector that eases the foreground-background class imbalance by suppressing easy negatives while increasing positives. To achieve this, we propose an Anchor Promotion Module (APM), which predicts the probability of each anchor being positive and adjusts its initial location and shape to promote both the quality and quantity of positive anchors. In addition, we design an efficient Feature Alignment Module (FAM) to extract features aligned with the promoted anchors, with the help of the location and shape transformation information from the APM. We assemble the two proposed modules into the backbones of VGG-16 and ResNet-101 networks with an encoder-decoder architecture. Extensive experiments on MS COCO demonstrate that our model performs competitively with alternative methods (40.0\% mAP on the \textit{test-dev} set) while running faster (28.6 \textit{fps}).