Tie-Yan Liu

Double Path Networks for Sequence to Sequence Learning

Jul 04, 2018
Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, Tie-Yan Liu

Encoder-decoder based Sequence to Sequence learning (S2S) has made remarkable progress in recent years. Different network architectures have been used in the encoder/decoder. Among them, Convolutional Neural Networks (CNN) and Self Attention Networks (SAN) are the prominent ones. The two architectures achieve similar performance but use very different ways to encode and decode context: CNNs use convolutional layers to focus on the local connectivity of the sequence, while SANs use self-attention layers to focus on global semantics. In this work we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models by using double path information fusion. During the encoding step, we develop a double path architecture to maintain the information coming from different paths with convolutional layers and self-attention layers separately. To effectively use the encoded context, we develop a cross attention module with gating and use it to automatically pick up the information needed during the decoding step. By deeply integrating the two paths with cross attention, both types of information are combined and well exploited. Experiments show that our proposed method can significantly improve the performance of sequence to sequence learning over state-of-the-art systems.

* 11 pages, to appear in COLING 2018
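
The gated cross attention described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: it assumes dot-product attention over each encoder path and a single scalar sigmoid gate, and the names `attend`, `gated_cross_attention`, and `gate_weights` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    """Dot-product attention: a weighted average of the memory vectors."""
    weights = softmax([dot(query, m) for m in memory])
    dim = len(memory[0])
    return [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]

def gated_cross_attention(query, conv_memory, attn_memory, gate_weights):
    """Mix the contexts read from the two encoder paths with a sigmoid gate.

    gate_weights is a hypothetical parameter vector (assumption); a real
    model would learn it and would likely use a richer gate.
    """
    conv_ctx = attend(query, conv_memory)   # context from the convolutional path
    attn_ctx = attend(query, attn_memory)   # context from the self-attention path
    gate = 1.0 / (1.0 + math.exp(-dot(gate_weights, query)))
    return [gate * c + (1.0 - gate) * a for c, a in zip(conv_ctx, attn_ctx)]
```

With all-zero `gate_weights` the gate is 0.5 and the two contexts are averaged; a learned gate would instead let the decoder lean on whichever path is more useful at a given step.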

Dense Information Flow for Neural Machine Translation

Jul 02, 2018
Yanyao Shen, Xu Tan, Di He, Tao Qin, Tie-Yan Liu

Recently, neural machine translation has achieved remarkable progress by introducing well-designed deep neural networks into its encoder-decoder framework. From the optimization perspective, residual connections are adopted to improve learning performance for both encoder and decoder in most of these deep architectures, and advanced attention connections are applied as well. Inspired by the success of the DenseNet model in computer vision problems, in this paper, we propose a densely connected NMT architecture (DenseNMT) that is able to train more efficiently for NMT. The proposed DenseNMT not only allows dense connections in creating new features for both encoder and decoder, but also uses a dense attention structure to improve attention quality. Our experiments on multiple datasets show that the DenseNMT structure is more competitive and efficient.

* in Proceedings of NAACL-HLT 2018
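
As a rough picture of DenseNet-style connectivity, the sketch below feeds each layer the concatenation of the input and every earlier layer's output. It is a toy under stated assumptions, not the DenseNMT architecture itself: `dense_stack` and `make_sum_layer` are hypothetical names, and the "layers" are trivial functions on lists of floats.

```python
def dense_stack(x, layers):
    """Apply layers so that each one sees the concatenation of the input
    and all earlier layers' outputs (DenseNet-style connectivity)."""
    features = [x]
    for layer in layers:
        concatenated = [v for feat in features for v in feat]
        features.append(layer(concatenated))
    # Return every feature produced so far, concatenated.
    return [v for feat in features for v in feat]

def make_sum_layer(scale):
    # Hypothetical toy layer: emits one feature, the scaled sum of its input.
    return lambda inputs: [scale * sum(inputs)]

result = dense_stack([1.0, 2.0], [make_sum_layer(1.0), make_sum_layer(0.5)])
```

In contrast to a residual connection, which adds a layer's input to its output, the dense pattern keeps earlier features intact and lets later layers read them directly.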

Achieving Human Parity on Automatic Chinese to English News Translation

Jun 29, 2018
Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, Ming Zhou

Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft's machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations. We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.

Towards Binary-Valued Gates for Robust LSTM Training

Jun 08, 2018
Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu

Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It aims to use gates to control information flow (e.g., whether to skip some information or not) in the recurrent computations, although its practical implementation based on soft gates only partially achieves this goal. In this paper, we propose a new way for LSTM training, which pushes the output values of the gates towards 0 or 1. By doing so, we can better control the information flow: the gates are mostly open or closed, instead of in a middle state, which makes the results more interpretable. Empirical studies show that (1) Although it seems that we restrict the model capacity, there is no performance drop: we achieve better or comparable performances due to its better generalization ability; (2) The outputs of gates are not sensitive to their inputs: we can easily compress the LSTM unit in multiple ways, e.g., low-rank approximation and low-precision approximation. The compressed models are even better than the baseline models without compression.

* ICML 2018
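
The claim that near-binary gates are insensitive to their inputs follows from the shape of the sigmoid, and a few lines make it concrete. This sketch only illustrates that property; it does not reproduce the training method of the paper, and `gate_change` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_change(pre_activation, perturbation):
    """Absolute change in gate output when the pre-activation is perturbed."""
    return abs(sigmoid(pre_activation + perturbation) - sigmoid(pre_activation))

# Same perturbation applied to a gate in a middle state and to a saturated gate.
soft_gate_change = gate_change(0.2, 0.5)   # gate output near 0.5
sharp_gate_change = gate_change(6.0, 0.5)  # gate output near 1
```

Because the sigmoid is flat near 0 and 1, the saturated gate moves far less than the middle-state gate under the same perturbation, which is consistent with the abstract's point that such gates tolerate low-rank or low-precision approximation of their inputs.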

Learning to Teach

May 09, 2018
Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, Tie-Yan Liu

Teaching plays a very important role in our society, by spreading human knowledge and educating our next generations. A good teacher will select appropriate teaching materials, impart suitable methodologies, and set up targeted examinations, according to the learning behaviors of the students. In the field of artificial intelligence, however, one has not fully explored the role of teaching, and pays most attention to machine learning. In this paper, we argue that equal attention, if not more, should be paid to teaching, and furthermore, an optimization framework (instead of heuristics) should be used to obtain good teaching strategies. We call this approach 'learning to teach'. In the approach, two intelligent agents interact with each other: a student model (which corresponds to the learner in traditional machine learning algorithms), and a teacher model (which determines the appropriate data, loss function, and hypothesis space to facilitate the training of the student model). The teacher model leverages the feedback from the student model to optimize its own teaching strategies by means of reinforcement learning, so as to achieve teacher-student co-evolution. To demonstrate the practical value of our proposed approach, we take the training of deep neural networks (DNN) as an example, and show that by using the learning to teach techniques, we are able to use much less training data and fewer iterations to achieve almost the same accuracy for different kinds of DNN models (e.g., multi-layer perceptron, convolutional neural networks and recurrent neural networks) under various machine learning tasks (e.g., image classification and text understanding).

* ICLR 2018

Differential Equations for Modeling Asynchronous Algorithms

May 08, 2018
Li He, Qi Meng, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu

Asynchronous stochastic gradient descent (ASGD) is a popular parallel optimization algorithm in machine learning. Most theoretical analyses of ASGD take a discrete view and prove upper bounds for its convergence rate. However, the discrete view has its intrinsic limitations: there is no characterization of the optimization path, and the proof techniques are induction-based and thus usually complicated. Inspired by the recent successful adoption of stochastic differential equations (SDE) in the theoretical analysis of SGD, in this paper, we study the continuous approximation of ASGD by using stochastic differential delay equations (SDDE). We introduce the approximation method and study the approximation error. Then we conduct theoretical analysis on the convergence rates of the ASGD algorithm based on the continuous approximation. Two methods, moment estimation and energy function minimization, can be used to analyze the convergence rates. Moment estimation depends on the specific form of the loss function, while energy function minimization only leverages the convex property of the loss function, and does not depend on its specific form. In addition to the convergence analysis, the continuous view also helps us derive better convergence rates. All of this clearly shows the advantage of taking the continuous view in gradient descent algorithms.
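
To make the role of the delay concrete, here is a minimal sketch of the discrete recursion that a delay equation would approximate. It rests on simplifying assumptions that are not from the paper: a fixed delay, no gradient noise, and the one-dimensional loss f(w) = 0.5 * w**2; the function name `delayed_gradient_descent` is hypothetical.

```python
def delayed_gradient_descent(w0, lr, delay, steps):
    """Iterate w[t+1] = w[t] - lr * f'(w[t - delay]) for f(w) = 0.5 * w**2.

    The gradient is evaluated at a stale iterate, mimicking an asynchronous
    worker whose update arrives `delay` steps late (deterministic toy model).
    """
    history = [w0]
    for t in range(steps):
        stale_w = history[max(0, t - delay)]    # iterate the worker last read
        history.append(history[-1] - lr * stale_w)  # f'(w) = w for this loss
    return history

path = delayed_gradient_descent(w0=1.0, lr=0.1, delay=2, steps=200)
```

With a small step size the stale gradients still drive the iterate toward the minimum at 0, while a large step size combined with the same delay makes the iterates oscillate with growing amplitude. Keeping the whole `history` also gives a view of the optimization path, which the abstract notes the discrete convergence bounds do not characterize.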

Towards Better Text Understanding and Retrieval through Kernel Entity Salience Modeling

May 03, 2018
Chenyan Xiong, Zhengzhong Liu, Jamie Callan, Tie-Yan Liu

This paper presents a Kernel Entity Salience Model (KESM) that improves text understanding and retrieval by better estimating entity salience (importance) in documents. KESM represents entities by knowledge enriched distributed representations, models the interactions between entities and words by kernels, and combines the kernel scores to estimate entity salience. The whole model is learned end-to-end using entity salience labels. The salience model also improves ad hoc search accuracy, providing effective ranking features by modeling the salience of query entities in candidate documents. Our experiments on two entity salience corpora and two TREC ad hoc search datasets demonstrate the effectiveness of KESM over frequency-based and feature-based methods. We also provide examples showing how KESM conveys its text understanding ability learned from entity salience to search.

* In proceedings of SIGIR 2018

Conditional Image-to-Image Translation

May 01, 2018
Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, Tie-Yan Liu

Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs) and dual learning. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity in the sense that a fixed image usually leads to an (almost) deterministic translation result. In this paper, we study a new problem, conditional image-to-image translation, which is to translate an image from the source domain to the target domain conditioned on a given image in the target domain. It requires that the generated image should inherit some domain-specific features of the conditional image from the target domain. Therefore, changing the conditional image in the target domain will lead to diverse translation results for a fixed input image from the source domain, and the conditional input image thus helps to control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one translating from domain A to domain B, and the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on translation between men's faces and women's faces, and on translation from edges to shoes and bags. The results demonstrate the effectiveness of our proposed method.

* 9 pages, 9 figures, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend Prediction

Mar 16, 2018
Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, Tie-Yan Liu

Stock trend prediction plays a critical role in seeking maximized profit from stock investment. However, precise trend prediction is very difficult because of the highly volatile and non-stationary nature of the stock market. The explosion of information on the Internet, together with advances in natural language processing and text mining techniques, has enabled investors to unveil market trends and volatility from online content. Unfortunately, the quality, trustworthiness and comprehensiveness of online content related to the stock market vary drastically, and a large portion consists of low-quality news, comments, or even rumors. To address this challenge, we imitate the learning process of human beings facing such chaotic online news, driven by three principles: sequential content dependency, diverse influence, and effective and efficient learning. In this paper, to capture the first two principles, we design Hybrid Attention Networks to predict the stock trend based on the sequence of recent related news. Moreover, we apply the self-paced learning mechanism to imitate the third principle. Extensive experiments on real-world stock market data demonstrate the effectiveness of our approach.

* (1) MSRA (the authors' organization) plans to apply for a patent on this technology, and this paper did not include the corresponding acknowledgment, so we need to withdraw it for a while. (2) The experiment details are incomplete and may confuse readers; we need to refine them to avoid unnecessary trouble for readers.

Train Feedfoward Neural Network with Layer-wise Adaptive Rate via Approximating Back-matching Propagation

Feb 27, 2018
Huishuai Zhang, Wei Chen, Tie-Yan Liu

Stochastic gradient descent (SGD) has achieved great success in training deep neural networks, where the gradient is computed through back-propagation. However, the back-propagated values of different layers vary dramatically. This inconsistency of gradient magnitude across different layers renders optimization of deep neural networks with a single learning rate problematic. We introduce back-matching propagation, which computes the backward values on the layer's parameters and input by matching backward values on the layer's output. This leads to solving a set of least-squares problems, which requires high computational cost. We then reduce the back-matching propagation with approximations and propose an algorithm that turns out to be regular SGD with a layer-wise adaptive learning rate strategy. This allows an easy implementation of our algorithm in current machine learning frameworks equipped with auto-differentiation. We apply our algorithm in training modern deep neural networks and achieve favorable results over SGD.

* 12 pages, 3 figures
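
The general shape of SGD with a layer-wise learning rate can be sketched as follows. The abstract does not give the rate derived from back-matching propagation, so this sketch swaps in a plain stand-in heuristic (each layer's rate is `base_lr` divided by that layer's gradient norm); the function name and the rule are assumptions, not the paper's algorithm.

```python
import math

def layerwise_sgd_step(params, grads, base_lr, eps=1e-8):
    """One SGD step in which each layer gets its own learning rate.

    params and grads are lists of per-layer lists of floats. The rate
    base_lr / ||grad|| is a stand-in heuristic for illustration only,
    not the rule derived in the paper.
    """
    new_params = []
    for layer_params, layer_grads in zip(params, grads):
        grad_norm = math.sqrt(sum(g * g for g in layer_grads))
        layer_lr = base_lr / (grad_norm + eps)
        new_params.append(
            [p - layer_lr * g for p, g in zip(layer_params, layer_grads)]
        )
    return new_params

# Two layers whose gradient magnitudes differ by several orders of magnitude.
updated = layerwise_sgd_step(
    params=[[1.0, 1.0], [1.0]],
    grads=[[300.0, 400.0], [0.003]],
    base_lr=0.1,
)
```

Under this stand-in rule both layers move by a step of roughly the same length despite their very different gradient magnitudes, which is the kind of imbalance a single global learning rate cannot correct.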
