Yale Song

Contrastive Learning of Global and Local Audio-Visual Representations

Apr 07, 2021
Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive learning has delivered impressive results in many audio-visual representation learning scenarios. However, existing approaches optimize for learning either global representations useful for tasks such as classification, or local representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.
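
The abstract does not spell out its two objectives, so the following is only a minimal sketch of a generic cross-modal contrastive (InfoNCE-style) term, assuming one embedding per audio clip and one per video clip. The function names, the temperature value, and the use of cosine similarity are illustrative assumptions, not the authors' formulation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(audio, video, temperature=0.1):
    """Audio-to-video InfoNCE loss (hypothetical helper).

    The i-th audio embedding is treated as the positive match of the i-th
    video embedding; every other video in the batch acts as a negative.
    """
    audio = [normalize(a) for a in audio]
    video = [normalize(v) for v in video]
    losses = []
    for i, a in enumerate(audio):
        logits = [dot(a, v) / temperature for v in video]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Under these assumptions the loss is close to zero when each audio clip is most similar to its own video and grows when the pairing is wrong. A global-local variant would presumably apply a term of this kind at more than one level of granularity, but that detail is not given in the abstract.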

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

Feb 14, 2021
Tsu-Jui Fu, William Yang Wang, Daniel McDuff, Yale Song

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, and slide structure and layout prediction to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.

Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

Jan 26, 2021
Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song

Large-scale datasets are the cornerstone of self-supervised representation learning. Existing algorithms extract learning signals by making certain assumptions about the data, e.g., spatio-temporal continuity and multimodal correspondence. Unfortunately, finding a large amount of data that satisfies such assumptions is sometimes not straightforward. This forces the community to rely on datasets that require laborious annotation and/or manual filtering processes. In this paper, we describe a subset optimization approach for automatic dataset curation. Focusing on the scenario of audio-visual representation learning, we pose the problem as finding a subset that maximizes the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite it being automatically constructed, achieve downstream performance similar to that of models trained on existing video datasets of similar scale. The most significant benefit of our approach is scalability. We release the largest video dataset for audio-visual research collected automatically using our approach.

Learning to Transfer Visual Effects from Videos to Images

Dec 17, 2020
Christopher Thomas, Yale Song, Adriana Kovashka

We study the problem of animating images by transferring spatio-temporal visual effects (such as melting) from a collection of videos. We tackle two primary challenges in visual effect transfer: 1) how to capture the effect we wish to distill; and 2) how to ensure that only the effect, rather than content or artistic style, is transferred from the source videos to the input image. To address the first challenge, we evaluate five loss functions; the most promising one encourages the generated animations to have optical flow and texture motions similar to those of the source videos. To address the second challenge, we only allow our model to move existing image pixels from the previous frame, rather than predicting unconstrained pixel values. This forces any visual effects to occur using the input image's pixels, preventing unwanted artistic style or content from the source video from appearing in the output. We evaluate our method in objective and subjective settings, and show interesting qualitative results which demonstrate objects undergoing atypical transformations, such as making a face melt or a deer bloom.
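
The constraint of only moving existing pixels can be illustrated with a simple warp. This is a sketch under stated assumptions: frames are small 2D lists, the displacement field holds integer `(dy, dx)` offsets, and out-of-range coordinates are clamped to the border. The function name and the integer, nearest-pixel sampling are hypothetical simplifications; the abstract does not describe how the model actually parameterizes the motion.

```python
def warp_previous_frame(frame, flow):
    """Build the next frame by copying pixels from the previous one.

    frame: 2D list of pixel values (the previous frame).
    flow:  2D list of (dy, dx) integer offsets; output pixel (y, x) is read
           from frame[y + dy][x + dx], clamped to the image border.
    """
    height, width = len(frame), len(frame[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = flow[y][x]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            # Every output value is taken from the previous frame, so no new
            # colors or textures can be introduced.
            row.append(frame[src_y][src_x])
        output.append(row)
    return output
```

Because each output value is copied from the input, style or content from a source video cannot leak into the result in this sketch; only the displacement field carries the effect.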

Parameter Efficient Multimodal Transformers for Video Representation Learning

Dec 08, 2020
Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the weights of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
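
To make the idea of low-rank parameter sharing concrete, here is a rough sketch. It assumes each layer's square weight matrix is written as the product of a shared factor `U` and a layer-specific factor `V`, both of shape `(dim, rank)`. This particular factorization, the helper names, and the sizes used in the usage note are assumptions chosen for illustration; the abstract does not give the exact scheme.

```python
def matmul_transposed(U, V):
    """Return U @ V^T for U and V given as lists of rows of shape (dim, rank)."""
    return [[sum(u * v for u, v in zip(u_row, v_row)) for v_row in V]
            for u_row in U]


def full_param_count(num_layers, dim):
    # One independent dim x dim matrix per layer.
    return num_layers * dim * dim


def low_rank_shared_param_count(num_layers, dim, rank):
    # One shared factor U (dim x rank) plus one private factor V (dim x rank)
    # per layer; each layer's weight is reconstructed as U @ V^T.
    return dim * rank + num_layers * dim * rank
```

For example, `full_param_count(12, 512)` can be compared against `low_rank_shared_param_count(12, 512, 32)` to see how quickly the count drops as the rank shrinks. The savings in this toy setting are not meant to reproduce the figure reported in the abstract.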

Learning Audio-Visual Representations with Active Contrastive Coding

Aug 31, 2020
Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive coding has achieved promising results in self-supervised representation learning. However, there are practical challenges given that obtaining a tight lower bound on mutual information (MI) requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that 'random negative sampling' leads to a highly redundant dictionary, which could result in representations that are suboptimal for downstream tasks. In this paper, we propose an active contrastive coding approach that builds an 'actively sampled' dictionary with diverse and informative items, which improves the quality of negative samples and achieves substantially improved results on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on multiple challenging audio and visual downstream benchmarks including UCF101, HMDB51 and ESC50.
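
A queue-based dictionary of negatives can be sketched as a fixed-size first-in, first-out buffer. The second helper below uses greedy farthest-point selection purely as a stand-in for choosing diverse items; it is not the active sampling procedure of the paper, which the abstract does not describe. All names are hypothetical.

```python
from collections import deque


class NegativeQueue:
    """Fixed-size FIFO dictionary of negative embeddings."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest items when full.
        self.items = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        self.items.extend(embeddings)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pick_diverse(candidates, k):
    """Greedy farthest-point selection (illustrative stand-in only).

    Starts from the first candidate and repeatedly adds the candidate that is
    farthest from everything chosen so far, which avoids near-duplicates.
    """
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda c: min(squared_distance(c, s) for s in chosen))
        chosen.append(best)
    return chosen
```

Random sampling from a pool with many near-identical items tends to return redundant negatives, whereas a diversity-seeking rule such as the one above spreads the selection across the embedding space.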

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency

Oct 25, 2019
Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song

Current multi-reference style transfer models for Text-to-Speech (TTS) perform sub-optimally on disjoint datasets, where one dataset contains only a single style class for one of the style dimensions. These models generally fail to produce style transfer for the dimension that is underrepresented in the dataset. In this paper, we propose an adversarial cycle consistency training scheme with paired and unpaired triplets to ensure the use of information from all style dimensions. During training, we incorporate unpaired triplets with randomly selected reference audio samples and encourage the synthesized speech to preserve the appropriate styles using adversarial cycle consistency. We use this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion. This results in a 78% improvement in style transfer (based on emotion classification) with minimal reduction in fidelity and naturalness. In subjective evaluations, our method was consistently rated as closer to the reference style than the baseline. Synthesized speech samples are available at: https://sites.google.com/view/adv-cycle-consistent-tts

Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck

Aug 19, 2019
Shuang Ma, Daniel McDuff, Yale Song

Deep generative models have led to significant advances in cross-modal generation such as text-to-image synthesis. Training these models typically requires paired data with direct correspondence between modalities. We introduce the novel problem of translating instances from one modality to another without paired data by leveraging an intermediate modality shared by the two other modalities. To demonstrate this, we take the problem of translating images to speech. In this case, one could leverage disjoint datasets with one shared modality, e.g., image-text pairs and text-speech pairs, with text as the shared modality. We call this problem "skip-modal generation" because the shared modality is skipped during the generation process. We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text). We address fundamental challenges of skip-modal generation: 1) learning multimodal representations using a single model, 2) bridging the domain gap between two unrelated datasets, and 3) learning the correspondence between modalities from unpaired data. We show qualitative results on image-to-speech synthesis; this is the first time such results have been reported in the literature. We also show that our approach improves performance on traditional cross-modal generation, suggesting that it improves data efficiency in solving individual tasks.

* ICCV 2019

Image to Video Domain Adaptation Using Web Supervision

Aug 05, 2019
Andrew Kae, Yale Song

Training deep neural networks typically requires large amounts of labeled data, which may be scarce or expensive to obtain for a particular target domain. As an alternative, we can leverage webly-supervised data (i.e., results from a public search engine), which are relatively plentiful but may contain noisy results. In this work, we propose a novel two-stage approach to learn a video classifier using webly-supervised data. We argue that learning appearance features and then temporal features sequentially, rather than simultaneously, is an easier optimization for this task. We show this by first learning an image model from web images, which is used to initialize and train a video model. Our model applies domain adaptation to account for potential domain shift between the source domain (webly-supervised data) and the target domain, and also accounts for noise by adding a novel attention component. We report results competitive with the state of the art for webly-supervised approaches on UCF-101 (while simplifying the training process) and also evaluate on Kinetics for comparison.

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Jul 17, 2019
Yale Song, Mohammad Soleymani

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text data. Here, we also tackle a more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.
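
One way to read the multiple instance learning setup is that an image and a sentence match if at least one pair of their candidate embeddings is close. The sketch below scores a pair by its best cosine similarity over all embedding combinations. The function names and the choice of a max over cosine similarities are assumptions for illustration, not a statement of the authors' exact objective.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_pair_similarity(visual_embeddings, text_embeddings):
    """Score two instances by their closest pair of embeddings.

    Each instance is represented by several embeddings (one per possible
    meaning); only the best-matching pair determines the score.
    """
    return max(cosine(v, t)
               for v in visual_embeddings
               for t in text_embeddings)
```

With a single averaged embedding per instance, two meanings that point in different directions would blur together; scoring by the best pair lets one of several meanings carry the match.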

* CVPR 2019. Includes supplementary material. Have updated results on TGIF and MRW
