Interactive Fiction (IF) games are complex, text-based decision-making problems. This paper introduces NAIL, an autonomous agent for general parser-based IF games. NAIL won the 2018 Text Adventure AI Competition, where it was evaluated on twenty unseen games. We describe the architecture, development, and insights underpinning NAIL's performance.
Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian processes to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline \emph{tensor program} that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the \emph{gradient independence assumption} -- that weights in backpropagation can be assumed to be independent of weights in the forward pass -- leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results on neural network Jacobian singular values. We hope our work opens a way toward the design of even stronger Gaussian processes, initialization schemes that avoid gradient explosion and vanishing, and a deeper understanding of SGD dynamics in modern architectures.
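As a concrete illustration of the Gaussian-process limit described above, the following is a minimal numerical sketch, not the paper's tensor-program machinery: it compares the empirical output covariance of wide, randomly initialized ReLU MLPs against the analytic NNGP kernel. The width, depth, and He-style variance scaling are illustrative assumptions.

```python
# Minimal sketch: a wide random ReLU MLP behaves like a Gaussian process
# over a fixed set of inputs. Not the paper's framework; the MLP is the
# simplest special case of its result (1).
import numpy as np

rng = np.random.default_rng(0)
width, depth, n_draws = 2000, 3, 500
X = rng.standard_normal((4, 10))  # 4 inputs in R^10

def random_mlp_outputs(X, width, depth, rng):
    """One scalar output per input; He-style variance scaling sqrt(2/fan_in)."""
    h = X
    for _ in range(depth):
        W = rng.standard_normal((h.shape[1], width))
        h = np.maximum(np.sqrt(2.0 / h.shape[1]) * h @ W, 0.0)  # ReLU layer
    v = rng.standard_normal((width, 1))
    return (h @ v / np.sqrt(width)).ravel()

# Empirical covariance of the outputs across many random networks.
samples = np.stack([random_mlp_outputs(X, width, depth, rng)
                    for _ in range(n_draws)])
emp_K = np.cov(samples, rowvar=False)

# Analytic NNGP kernel for ReLU (arc-cosine kernel recursion).
K = X @ X.T / X.shape[1]
for _ in range(depth):
    d = np.sqrt(np.outer(np.diag(K), np.diag(K)))
    theta = np.arccos(np.clip(K / d, -1.0, 1.0))
    K = d * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

# The gap shrinks as the width and the number of draws grow.
print(np.max(np.abs(emp_K - K)))
```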
Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and the GRU, do exhibit modest improvements over vanilla RNN cells, but they still suffer from instabilities when trained on very long sequences. In this work, we develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians. By optimizing these quantities in terms of the initialization hyperparameters, we derive a novel initialization scheme that eliminates or reduces training instabilities. We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower. We also observe a beneficial effect on generalization performance using this new initialization.
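To make concrete the kind of quantity the mean-field analysis tracks, here is a minimal sketch that simulates forward signal propagation in a randomly initialized LSTM and measures how the cell-state norm evolves over time. The forget-gate bias is the one hyperparameter varied; its values here are illustrative, not the scheme derived in the paper.

```python
# Minimal sketch of forward signal propagation in a random LSTM under
# i.i.d. inputs, tracking the (normalized) cell-state norm over time.
import numpy as np

def lstm_state_norms(n=256, T=500, b_f=0.0, seed=0):
    rng = np.random.default_rng(seed)
    s = 1.0 / np.sqrt(n)  # standard 1/sqrt(n) weight scaling
    Wx = {g: s * rng.standard_normal((n, n)) for g in "ifgo"}
    Wh = {g: s * rng.standard_normal((n, n)) for g in "ifgo"}
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, c = np.zeros(n), np.zeros(n)
    norms = []
    for _ in range(T):
        x = rng.standard_normal(n)
        i = sigmoid(Wx["i"] @ x + Wh["i"] @ h)
        f = sigmoid(Wx["f"] @ x + Wh["f"] @ h + b_f)  # forget-gate bias b_f
        g = np.tanh(Wx["g"] @ x + Wh["g"] @ h)
        o = sigmoid(Wx["o"] @ x + Wh["o"] @ h)
        c = f * c + i * g
        h = o * np.tanh(c)
        norms.append(np.linalg.norm(c) / np.sqrt(n))
    return norms

# A larger forget-gate bias lengthens the time scale over which the cell
# state retains signal; the paper *derives* such hyperparameters from the
# mean-field time scales and the Jacobian spectra rather than guessing them.
print(lstm_state_norms(b_f=0.0)[-1], lstm_state_norms(b_f=2.0)[-1])
```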
We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapse of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network adopts, depending on the nonlinearity, subexponential forward and backward dynamics, which in many cases are in fact polynomial. The exponents of these polynomials are obtained by analytic methods and verified empirically. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the flow of gradient information. In our experiments, for each activation function studied here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization-time theory accurately predicts the test-time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we make a mathematical contribution by deriving several new identities for the kernels of powers of ReLU functions, relating them to the zeroth Bessel function of the second kind.
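The contrast between exponential and subexponential forward dynamics is easy to observe numerically. The sketch below propagates two inputs through a deep tanh network with and without skip connections and tracks the squared distance between their images; the width, depth, and weight variance are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: forward dynamics of input-space geometry, feedforward
# vs. residual, measured by the squared distance between two inputs'
# images as depth grows.
import numpy as np

rng = np.random.default_rng(0)
n, depth, sigma_w = 512, 100, 0.9  # sigma_w < 1: ordered phase for tanh
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

def propagate(residual):
    a, b = x1.copy(), x2.copy()
    dists = []
    for _ in range(depth):
        W = sigma_w / np.sqrt(n) * rng.standard_normal((n, n))
        fa, fb = np.tanh(W @ a), np.tanh(W @ b)
        a, b = (a + fa, b + fb) if residual else (fa, fb)
        dists.append(np.sum((a - b) ** 2) / n)
    return dists

# Feedforward tanh: the two images collapse together exponentially fast in
# depth; with skip connections the distance changes only polynomially
# (cf. the exact rates derived in the paper).
print(propagate(residual=False)[-1], propagate(residual=True)[-1])
```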
In computational complexity, a complexity class is given by a set of problems or functions, and a basic challenge is to show separations of complexity classes $A \not= B$, especially when $A$ is known to be a subset of $B$. In this paper we introduce a homological theory of functions that can be used to establish complexity separations, while also providing other interesting consequences. We propose to associate a topological space $S_A$ to each class of functions $A$, such that, to separate complexity classes $A \subseteq B'$, it suffices to observe a change in "the number of holes", i.e., the homology, of $S_A$ as a subclass $B$ of $B'$ is added to $A$. In other words, if the homologies of $S_A$ and $S_{A \cup B}$ differ, then $A \not= B'$. We develop the underlying theory of functions based on combinatorial and homological commutative algebra and Stanley-Reisner theory, and recover Minsky and Papert's 1969 result that parity cannot be computed by polynomial threshold functions of nonmaximal degree. In the process, we derive a "maximal principle" for polynomial threshold functions that is used to extend this result to arbitrary symmetric functions. A surprising coincidence is demonstrated, where the maximal dimension of "holes" in $S_A$ upper bounds the VC dimension of $A$, with equality for common computational cases, such as the class of polynomial threshold functions or the class of linear functionals over $\mathbb F_2$, and for common algebraic cases, such as when the Stanley-Reisner ring of $S_A$ is Cohen-Macaulay. As another interesting application of our theory, we prove a result that a priori has nothing to do with complexity separation: it characterizes when a vector subspace intersects the positive cone, in terms of homological conditions. By analogy with Farkas' lemma, which does the same with \emph{linear conditions}, we call our theorem the Homological Farkas Lemma.
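Schematically, the separation criterion rests on one observation: if $A = B'$, then, since $B \subseteq B'$, we would have $A \cup B = A$ and hence $S_{A \cup B} = S_A$. Contrapositively,
\[
H_*(S_A) \not\cong H_*(S_{A \cup B}) \;\Longrightarrow\; A \neq B',
\]
so any difference in homology between the two spaces witnesses the separation.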
External neural memory structures have recently become a popular tool for algorithmic deep learning (Graves et al. 2014, Weston et al. 2014). These models generally utilize differentiable versions of traditional discrete memory-access structures (random access, stacks, tapes) to provide the storage necessary for computational tasks. In this work, we argue that these neural memory systems lack specific structure important for relative indexing, and propose an alternative model, Lie-access memory, that is explicitly designed for the neural setting. In this paradigm, memory is accessed using a continuous head in a key-space manifold. The head is moved via Lie group actions, such as shifts or rotations, generated by a controller, and memory access is performed by linear smoothing in key space. We argue that Lie groups provide a natural generalization of discrete memory structures, such as Turing machines, as they provide inverse and identity operators while maintaining differentiability. To experiment with this approach, we implement a simplified Lie-access neural Turing machine (LANTM) with different Lie groups. We find that this approach is able to perform well on a range of algorithmic tasks.
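As a minimal sketch of this access pattern, assume a 2D key space acted on by the rotation group $SO(2)$ and an illustrative inverse-square-distance smoothing kernel (all shapes and constants below are illustrative choices, not the paper's implementation):

```python
# Minimal sketch of Lie-access memory: the head lives in a key-space
# manifold, moves by Lie group actions (here, rotations), and reads by
# linear smoothing over stored (key, value) pairs.
import numpy as np

keys = np.array([[np.cos(t), np.sin(t)]  # 8 memories on the unit circle
                 for t in np.linspace(0, 2 * np.pi, 8, endpoint=False)])
vals = np.arange(8.0)[:, None]           # one scalar value per memory

def rotate(head, theta):
    """Lie group action: move the head by the rotation R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ head

def read(head, eps=1e-3):
    """Linear smoothing: inverse-square-distance weights over memories."""
    w = 1.0 / (np.sum((keys - head) ** 2, axis=1) + eps)
    w /= w.sum()
    return w @ vals

head = np.array([1.0, 0.0])
print(read(head))                        # reads mostly memory 0
head = rotate(head, 2 * np.pi / 8)       # one "step" of the group action
print(read(head))                        # now reads mostly memory 1
print(read(rotate(head, -2 * np.pi / 8)))  # inverse action moves the head back
```

Note how the group structure supplies exactly the properties argued for above: every move has an inverse, the identity element leaves the head in place, and both the action and the smoothing are differentiable, so the whole access path can be trained end to end.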
Following the recent trend in explicit neural memory structures, we present a new design of an external memory, wherein memories are stored in a Euclidean key space $\mathbb R^n$. An LSTM controller performs reads and writes via specialized read and write heads. It can move a head either by providing a new address in the key space (random access) or by moving the head from its previous position via a Lie group action (Lie access). In this way, the "L" and "R" instructions of a traditional Turing machine are generalized to arbitrary elements of a fixed Lie group action. For this reason, we name this new model the Lie Access Neural Turing Machine, or LANTM. We tested two configurations of LANTM against an LSTM baseline in several basic experiments, and found the right configuration of LANTM to outperform the baseline in all of them. In particular, we trained LANTM on the addition of $k$-digit numbers for $2 \le k \le 16$, and it generalized almost perfectly to $17 \le k \le 32$, all while using two orders of magnitude fewer parameters than the LSTM baseline.
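For instance, under the additive group acting on a 1D key space, the classical "L"/"R" instructions become the special case of steps $\pm 1$. The sketch below (hypothetical names, not the paper's implementation) makes this generalization explicit:

```python
# Minimal sketch: a Turing machine's head moves are the +/-1 special case
# of acting on a 1D key space by the additive group of the reals.
def move(head, step):
    """Lie access: act on the head by an arbitrary group element (a real step)."""
    return head + step

def move_turing(head, instr):
    """Classical special case: only the generators 'L' and 'R'."""
    return head + {"L": -1.0, "R": +1.0}[instr]

head = 0.0
head = move_turing(head, "R")  # discrete: head at 1.0
head = move(head, 0.37)        # continuous: head at 1.37
head = move(head, -0.37)       # inverse element undoes the move
print(head)                    # back to 1.0
```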
The $\epsilon$-logic (called $\epsilon$E-logic in this paper) of Kuyper and Terwijn is a variant of first order logic with the same syntax, in which the models are equipped with probability measures and the $\forall x$ quantifier is interpreted as "there exists a set $A$ of measure $\ge 1 - \epsilon$ such that for each $x \in A$, ...." Previously, Kuyper and Terwijn proved that the general satisfiability and validity problems for this logic are, i) for rational $\epsilon \in (0, 1)$, respectively $\Sigma^1_1$-complete and $\Pi^1_1$-hard, and ii) for $\epsilon = 0$, respectively decidable and $\Sigma^0_1$-complete. The adjective "general" here means "uniformly over all languages." We extend these results to the scenario of finite models. In particular, we show that the problems of satisfiability by and validity over finite models in $\epsilon$E-logic are, i) for rational $\epsilon \in (0, 1)$, respectively $\Sigma^0_1$- and $\Pi^0_1$-complete, and ii) for $\epsilon = 0$, respectively decidable and $\Pi^0_1$-complete. Although we also obtain partial results toward the countable case, the computability of $\epsilon$E-logic over countable models remains largely open. In addition, most of the results, of this paper and of Kuyper and Terwijn, do not apply to individual languages with a finite number of unary predicates; reducing this requirement remains a major open problem. On the positive side, we derive the decidability of the corresponding problems for monadic relational languages --- equality- and function-free languages with finitely many unary and zero other predicates. This result holds for all three of the unrestricted, countable, and finite model cases. Applications in computational learning theory, weighted graphs, and neural networks are discussed in the context of these decidability and undecidability results.
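Explicitly, writing $(\mathcal M, \mu)$ for a model equipped with its probability measure, the quantifier clause described above can be rendered as
\[
(\mathcal M, \mu) \models_\epsilon \forall x\, \varphi(x)
\iff
\exists A \text{ with } \mu(A) \ge 1 - \epsilon
\text{ such that } (\mathcal M, \mu) \models_\epsilon \varphi[a] \text{ for all } a \in A,
\]
so that $\epsilon = 0$ demands a set of full measure ("for almost every $x$") rather than classical universal quantification.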