Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andrew Owens

Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

Sep 12, 2024

Samanta Rodriguez, Yiming Dou, Miquel Oller, Andrew Owens, Nima Fazeli

Figure 1 for Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

Figure 2 for Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

Figure 3 for Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

Figure 4 for Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

Abstract:Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific methods to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. The dataset, the code, and additional details can be found at https://www.mmintlab.com/research/touch2touch/.

Via

Access Paper or Ask Questions

Images that Sound: Composing Images and Sounds on a Single Canvas

May 20, 2024

Ziyang Chen, Daniel Geng, Andrew Owens

Figure 1 for Images that Sound: Composing Images and Sounds on a Single Canvas

Figure 2 for Images that Sound: Composing Images and Sounds on a Single Canvas

Figure 3 for Images that Sound: Composing Images and Sounds on a Single Canvas

Figure 4 for Images that Sound: Composing Images and Sounds on a Single Canvas

Abstract:Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/

* Project site: https://ificl.github.io/images-that-sound/

Via

Access Paper or Ask Questions

Efficient Vision-Language Pre-training by Cluster Masking

May 14, 2024

Zihao Wei, Zixuan Pan, Andrew Owens

Figure 1 for Efficient Vision-Language Pre-training by Cluster Masking

Figure 2 for Efficient Vision-Language Pre-training by Cluster Masking

Figure 3 for Efficient Vision-Language Pre-training by Cluster Masking

Figure 4 for Efficient Vision-Language Pre-training by Cluster Masking

Abstract:We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.

* CVPR 2024, Project page: https://zxp46.github.io/cluster-masking/ , Code: https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking

Via

Access Paper or Ask Questions

Tactile-Augmented Radiance Fields

May 07, 2024

Yiming Dou, Fengyu Yang, Yi Liu, Antonio Loquercio, Andrew Owens

Figure 1 for Tactile-Augmented Radiance Fields

Figure 2 for Tactile-Augmented Radiance Fields

Figure 3 for Tactile-Augmented Radiance Fields

Figure 4 for Tactile-Augmented Radiance Fields

Abstract:We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given 3D position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common vision-based touch sensors are built on ordinary cameras and thus can be registered to images using methods from multi-view geometry, and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to register touch signals to a captured visual scene, and to train a conditional diffusion model that, provided with an RGB-D image rendered from a neural radiance field, generates its corresponding tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This dataset contains more touch samples than previous real-world datasets, and it provides spatially aligned visual signals for each captured touch signal. We demonstrate the accuracy of our cross-modal generative model and the utility of the captured visual-tactile data on several downstream tasks. Project page: https://dou-yiming.github.io/TaRF

* CVPR 2024, Project page: https://dou-yiming.github.io/TaRF, Code: https://github.com/Dou-Yiming/TaRF/

Via

Access Paper or Ask Questions

Factorized Diffusion: Perceptual Illusions by Noise Decomposition

Apr 17, 2024

Daniel Geng, Inbum Park, Andrew Owens

Abstract:Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. For example, we can decompose an image into low and high spatial frequencies and condition these components on different text prompts. This produces hybrid images, which change appearance depending on viewing distance. By decomposing an image into three frequency subbands, we can generate hybrid images with three prompts. We also use a decomposition into grayscale and color components to produce images whose appearance changes when they are viewed in grayscale, a phenomena that naturally occurs under dim lighting. And we explore a decomposition by a motion blur kernel, which produces images that change appearance under motion blurring. Our method works by denoising with a composite noise estimate, built from the components of noise estimates conditioned on different prompts. We also show that for certain decompositions, our method recovers prior approaches to compositional generation and spatial control. Finally, we show that we can extend our approach to generate hybrid images from real images. We do this by holding one component fixed and generating the remaining components, effectively solving an inverse problem.

Via

Access Paper or Ask Questions

Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

Mar 27, 2024

Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard

Abstract:We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in the few-shot learning approach. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques. Demos and datasets are available on our project page: https://facebookresearch.github.io/real-acoustic-fields/

* Accepted to CVPR 2024. Project site: https://facebookresearch.github.io/real-acoustic-fields/

Via

Access Paper or Ask Questions

Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Jan 31, 2024

Daniel Geng, Andrew Owens

Figure 1 for Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Figure 2 for Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Figure 3 for Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Figure 4 for Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Abstract:Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, the ability to precisely edit the layout, position, pose, and shape of objects in images with diffusion models is still difficult. To this end, we propose motion guidance, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with the gradients through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample to have low guidance loss, we can obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high quality edits of real and generated images.

Via

Access Paper or Ask Questions

Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Jan 31, 2024

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens(+1 more)

Figure 1 for Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Figure 2 for Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Figure 3 for Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Figure 4 for Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Abstract:The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens, allowing the model to learn from a set of heterogeneous tactile sensors, all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting, from robot grasping prediction to touch image question answering. To the best of our knowledge, UniTouch is the first to demonstrate such capabilities. Project page: https://cfeng16.github.io/UniTouch/

Via

Access Paper or Ask Questions

Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

Nov 29, 2023

Daniel Geng, Inbum Park, Andrew Owens

Abstract:We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image. We then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram--an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/

Via

Access Paper or Ask Questions

Self-Supervised Motion Magnification by Backpropagating Through Optical Flow

Nov 28, 2023

Zhaoying Pan, Daniel Geng, Andrew Owens

Abstract:This paper presents a simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, we manipulate the video such that its new optical flow is scaled by the desired amount. To train our model, we propose a loss function that estimates the optical flow of the generated video and penalizes how far if deviates from the given magnification factor. Thus, training involves differentiating through a pretrained optical flow network. Since our model is self-supervised, we can further improve its performance through test-time adaptation, by finetuning it on the input video. It can also be easily extended to magnify the motions of only user-selected objects. Our approach avoids the need for synthetic magnification datasets that have been used to train prior learning-based approaches. Instead, it leverages the existing capabilities of off-the-shelf motion estimators. We demonstrate the effectiveness of our method through evaluations of both visual quality and quantitative metrics on a range of real-world and synthetic videos, and we show our method works for both supervised and unsupervised optical flow methods.

* Thirty-seventh Conference on Neural Information Processing Systems (2023)

Via

Access Paper or Ask Questions