William T. Freeman

Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Aug 09, 2018
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).

* ACM Trans. Graph. 37(4): 112:1-112:11 (2018)
* Accepted to SIGGRAPH 2018. Project webpage: https://looking-to-listen.github.io

3D Shape Perception from Monocular Vision, Touch, and Shape Priors

Aug 09, 2018
Shaoxiong Wang, Jiajun Wu, Xingyuan Sun, Wenzhen Yuan, William T. Freeman, Joshua B. Tenenbaum, Edward H. Adelson

Perceiving accurate 3D object shape is important for robots to interact with the physical world. Current research in this direction has relied primarily on visual observations. Vision, however useful, has inherent limitations due to occlusions and 2D-3D ambiguities, especially for perception with a monocular camera. In contrast, touch provides precise local shape information, though its efficiency for reconstructing the entire shape can be low. In this paper, we propose a novel paradigm that efficiently perceives accurate 3D object shape by incorporating visual and tactile observations, as well as prior knowledge of common object shapes learned from large-scale shape repositories. We use vision first, applying neural networks with learned shape priors to predict an object's 3D shape from a single-view color image. We then use tactile sensing to refine the shape; the robot actively touches the object regions where the visual prediction has high uncertainty. Our method efficiently builds the 3D shape of common objects from a color image and a small number of tactile explorations (around 10). Our setup is easy to apply and has the potential to help robots better perform grasping or manipulation tasks on real-world objects.

* IROS 2018. The first two authors contributed equally to this work

Learning-based Video Motion Magnification

Aug 01, 2018
Tae-Hyun Oh, Ronnachai Jaroensri, Changil Kim, Mohamed Elgharib, Frédo Durand, William T. Freeman, Wojciech Matusik

Video motion magnification techniques allow us to see small motions previously invisible to the naked eye, such as those of vibrating airplane wings or buildings swaying under the influence of the wind. Because the motion is small, the magnification results are prone to noise or excessive blurring. The state of the art relies on hand-designed filters to extract representations that may not be optimal. In this paper, we seek to learn the filters directly from examples using deep convolutional neural networks. To make training tractable, we carefully design a synthetic dataset that captures small motion well, and use two-frame input for training. We show that the learned filters achieve high-quality results on real videos, with fewer ringing artifacts and better noise characteristics than previous methods. While our model is not trained with temporal filters, we found that temporal filters can be used with our extracted representations up to a moderate magnification, enabling frequency-based motion selection. Finally, we analyze the learned filters and show that they behave similarly to the derivative filters used in previous works. Our code, trained model, and datasets will be available online.

* Accepted as ECCV 2018 Oral. The first two authors contributed equally. Some bibliography information was fixed.
* Video result: https://youtu.be/GrMLeEcSNzY
* Project page: http://people.csail.mit.edu/tiam/deepmag/

Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks

Jul 25, 2018
Tianfan Xue, Jiajun Wu, Katherine L. Bouman, William T. Freeman

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods that have tackled this problem in a deterministic or non-parametric way, we propose to model future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, namely a Cross Convolutional Network; this network encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, and on real-world video frames. We present analyses of the learned network representations, showing that the network implicitly learns a compact encoding of object appearance and motion. We also demonstrate a few of its applications, including visual analogy-making and video extrapolation.

* IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018
* Journal preprint of arXiv:1607.02586 (IEEE TPAMI, in press). The first two authors contributed equally to this work
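
The abstract's idea of encoding the image as feature maps and the motion as convolutional kernels can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `cross_convolve`, the array shapes, the zero padding, and the one-kernel-per-channel layout are all assumptions made for illustration.

```python
import numpy as np

def cross_convolve(feature_maps, kernels):
    """Filter each feature map with its own, sample-specific kernel.

    feature_maps: (C, H, W) array, assumed to come from an image encoder.
    kernels:      (C, k, k) array with odd k, assumed to come from a motion encoder.
    Returns a (C, H, W) array (zero padding, same spatial size as the input).
    """
    C, H, W = feature_maps.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(feature_maps, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C, H, W), dtype=float)
    for dy in range(k):
        for dx in range(k):
            # Cross-correlation form, as used by most deep learning libraries:
            # out[c, y, x] += kernels[c, dy, dx] * padded[c, y + dy, x + dx]
            weight = kernels[:, dy, dx, None, None]
            out += weight * padded[:, dy:dy + H, dx:dx + W]
    return out
```

In this sketch a kernel with a single off-center one simply translates its feature map by one pixel, which hints at how a kernel predicted from motion could move image content; sampling different kernels would then give different plausible futures.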

Unsupervised Training for 3D Morphable Model Regression

Jun 15, 2018
Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, William T. Freeman

We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch distribution loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the network can correctly reinterpret its own output, and a multi-view identity loss that compares the features of the predicted 3D face and the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.

* Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8377-8386
* CVPR 2018 version with supplemental material (http://openaccess.thecvf.com/content_cvpr_2018/html/Genova_Unsupervised_Training_for_CVPR_2018_paper.html)
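
As a rough illustration of the batch distribution loss described in the abstract, the sketch below penalizes a batch of predicted coordinates whose per-dimension mean and variance drift from a target distribution. It assumes the morphable model coordinates follow a zero-mean, unit-variance prior; that assumption, the function name `batch_distribution_loss`, and the exact form of the penalty are illustrative and not taken from the paper.

```python
import numpy as np

def batch_distribution_loss(params):
    """Penalty on a batch of predicted morphable model coordinates.

    params: (B, D) array, one row of D coordinates per image in the batch.
    Assumption: the target distribution is a standard normal per dimension.
    """
    mean = params.mean(axis=0)   # per-dimension batch mean
    var = params.var(axis=0)     # per-dimension batch variance
    return float(np.sum(mean ** 2) + np.sum((var - 1.0) ** 2))
```

A network that collapses to predicting the same coordinates for every image has zero batch variance, so this kind of term would penalize it even when the individual predictions look plausible.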

Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling

Apr 12, 2018
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, William T. Freeman

We study 3D shape modeling from a single image and make contributions to it in three aspects. First, we present Pix3D, a large-scale benchmark of diverse image-shape pairs with pixel-level 2D-3D alignment. Pix3D has wide applications in shape-related tasks including reconstruction, retrieval, viewpoint estimation, etc. Building such a large-scale dataset, however, is highly challenging; existing datasets either contain only synthetic data, or lack precise alignment between 2D images and 3D shapes, or only have a small number of images. Second, we calibrate the evaluation criteria for 3D shape reconstruction through behavioral studies, and use them to objectively and systematically benchmark cutting-edge reconstruction algorithms on Pix3D. Third, we design a novel model that simultaneously performs 3D reconstruction and pose estimation; our multi-task learning approach achieves state-of-the-art performance on both tasks.

* CVPR 2018. The first two authors contributed equally to this work. Project page: http://pix3d.csail.mit.edu

Smart, Sparse Contours to Represent and Edit Images

Apr 09, 2018
Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, William T. Freeman

We study the problem of reconstructing an image from information stored at contour locations. We show that high-quality reconstructions with high fidelity to the source image can be obtained from sparse input, e.g., comprising less than 6% of image pixels. This is a significant improvement over existing contour-based reconstruction methods that require much denser input to capture subtle texture information and to ensure image quality. Our model, based on generative adversarial networks, synthesizes texture and details in regions where no input information is provided. The semantic knowledge encoded into our model and the sparsity of the input allow us to use contours as an intuitive interface for semantically-aware image manipulation: local edits in the contour domain translate to long-range and coherent changes in pixel space. We can perform complex structural changes such as changing facial expression by simple edits of contours. Our experiments demonstrate that humans as well as a face recognition system mostly cannot distinguish between our reconstructions and the source images.

* Accepted to CVPR'18; Project page: contour2im.github.io

3D Interpreter Networks for Viewer-Centered Wireframe Modeling

Apr 03, 2018
Jiajun Wu, Tianfan Xue, Joseph J. Lim, Yuandong Tian, Joshua B. Tenenbaum, Antonio Torralba, William T. Freeman

Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations for real images. Previous research tackled this problem by either searching for a 3D shape that best explains 2D annotations, or training purely on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation to connect real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects, without suffering from the domain difference between real and synthesized images, often due to imperfect rendering. Second, we propose a Projection Layer, mapping estimated 3D structure back to 2D. During training, it ensures that 3D-INN predicts 3D structure whose projection is consistent with the 2D annotations of real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide vision applications, such as image retrieval.

* International Journal of Computer Vision, 2018
* Journal preprint of arXiv:1604.08685 (IJCV, in press). The first two authors contributed equally to this work
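
The role of a projection layer, mapping estimated 3D structure back to 2D so it can be compared with 2D annotations, can be sketched as follows. This is a minimal illustration under assumptions that are not from the abstract: a pinhole perspective camera, a mean squared reprojection error, and the hypothetical names `project_keypoints` and `reprojection_loss`.

```python
import numpy as np

def project_keypoints(points_3d, rotation, translation, focal_length):
    """Perspective projection of 3D keypoints onto the image plane.

    points_3d:    (N, 3) keypoints in object coordinates.
    rotation:     (3, 3) rotation matrix (object to camera).
    translation:  (3,) translation; camera-space z must stay positive.
    focal_length: scalar focal length of the assumed pinhole camera.
    """
    cam = points_3d @ rotation.T + translation    # object -> camera space
    return focal_length * cam[:, :2] / cam[:, 2:3]  # divide by depth

def reprojection_loss(points_3d, rotation, translation, focal_length, keypoints_2d):
    """Mean squared distance between projected and annotated 2D keypoints."""
    projected = project_keypoints(points_3d, rotation, translation, focal_length)
    return float(np.mean(np.sum((projected - keypoints_2d) ** 2, axis=1)))
```

Because every step is differentiable, a loss of this kind could in principle supervise 3D predictions using only 2D keypoint labels, which is the consistency the abstract describes.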

Reconstructing Video from Interferometric Measurements of Time-Varying Sources

Feb 01, 2018
Katherine L. Bouman, Michael D. Johnson, Adrian V. Dalca, Andrew A. Chael, Freek Roelofs, Sheperd S. Doeleman, William T. Freeman

Very long baseline interferometry (VLBI) makes it possible to recover images of astronomical sources with extremely high angular resolution. Most recently, the Event Horizon Telescope (EHT) has extended VLBI to short millimeter wavelengths with a goal of achieving angular resolution sufficient for imaging the event horizons of nearby supermassive black holes. VLBI provides measurements related to the underlying source image through a sparse set of spatial frequencies. An image can then be recovered from these measurements by making assumptions about the underlying image. One of the most important assumptions made by conventional imaging methods is that over the course of a night's observation the image is static. However, for quickly evolving sources, such as the galactic center's supermassive black hole (Sgr A*) targeted by the EHT, this assumption is violated and these conventional imaging approaches fail. In this work we propose a new way to model VLBI measurements that allows us to recover both the appearance and dynamics of an evolving source by reconstructing a video rather than a static image. By modeling VLBI measurements using a Gaussian Markov Model, we are able to propagate information across observations in time to reconstruct a video, while simultaneously learning about the dynamics of the source's emission region. We demonstrate our proposed Expectation-Maximization (EM) algorithm, StarWarps, on realistic synthetic observations of black holes, and show how it substantially improves results compared to conventional imaging algorithms. Additionally, we demonstrate StarWarps on real VLBI data of the M87 Jet from the VLBA.

* Submitted to Transactions on Computational Imaging
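
To make "propagating information across observations in time" concrete, here is a generic forward (filtering) pass for a linear Gaussian Markov model, i.e. a standard Kalman filter. It is a simplified stand-in, not StarWarps: the linear measurement matrices, the fixed noise covariances, and the names `kalman_filter` and `measure_mats` are assumptions, and the EM step that learns the dynamics is not reproduced.

```python
import numpy as np

def kalman_filter(measurements, measure_mats, A, Q, R, mean0, cov0):
    """Forward pass of a linear Gaussian Markov model.

    State:        x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
    Measurement:  y_t = F_t x_t + v_t,    v_t ~ N(0, R)
    mean0, cov0 describe the prior on the first state.
    Returns the posterior means of p(x_t | y_1..y_t) for each t.
    """
    mean, cov = mean0, cov0
    means = []
    for t, (y, F) in enumerate(zip(measurements, measure_mats)):
        if t > 0:
            # Predict: push the previous estimate through the dynamics.
            mean = A @ mean
            cov = A @ cov @ A.T + Q
        # Update: correct the prediction with the new measurement.
        S = F @ cov @ F.T + R
        K = cov @ F.T @ np.linalg.inv(S)
        mean = mean + K @ (y - F @ mean)
        cov = cov - K @ F @ cov
        means.append(mean)
    return means

# Each time step observes only one component of a static two-component state.
ys = [np.array([1.0]), np.array([2.0])]
Fs = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
estimates = kalman_filter(ys, Fs, A=np.eye(2), Q=np.zeros((2, 2)),
                          R=1e-6 * np.eye(1), mean0=np.zeros(2),
                          cov0=1e6 * np.eye(2))
```

In the usage example, no single time step constrains both components, yet the final estimate recovers both because earlier measurements are carried forward. This mirrors, in a toy setting, how sparse measurements at different times can jointly constrain an evolving source.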

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

Dec 20, 2017
Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. 2016, with additional experiments and discussion.

* Journal preprint of arXiv:1608.07017 (unpublished submission to IJCV)
