Humans (and many vertebrates) face the problem of fusing together multiple fixations of a scene in order to obtain a representation of the whole, where each fixation uses a high-resolution fovea and decreasing resolution in the periphery. In this paper we explicitly represent the retinal transformation of a fixation as a linear downsampling of a high-resolution latent image of the scene, exploiting the known geometry. This linear transformation allows us to carry out exact inference for the latent variables in factor analysis (FA) and mixtures of FA models of the scene. Further, this allows us to formulate and solve the choice of "where to look next" as a Bayesian experimental design problem using the Expected Information Gain criterion. Experiments on the Frey faces and MNIST datasets demonstrate the effectiveness of our models.