Michael S. Ryoo

Evolving Losses for Unlabeled Video Representation Learning

Jun 07, 2019

AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

Abstract: We present a new method to learn video representations from unlabeled data. Given large-scale unlabeled video data, the objective is to benefit from such data by learning a generic and transferable representation space that can be directly used for a new task such as zero/few-shot learning. We formulate our unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are also shared across different modalities via distillation. Further, we introduce the concept of finding a better loss function to train such a multi-task, multi-modal representation space using an evolutionary algorithm; our method automatically searches over different combinations of loss functions capturing multiple (self-supervised) tasks and modalities. Our formulation allows for the distillation of audio, optical flow and temporal information into a single, RGB-based convolutional neural network. We also compare the effects of using additional unlabeled video data and evaluate our representation learning on standard public video datasets.

* Non-archival abstract for CVPR Workshop on Learning from Unlabeled Videos

AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

May 30, 2019

Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova

Abstract: Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to a third dimension (using a limited number of space-time modules such as 3D convolutions) or by introducing a handcrafted two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream space-time convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin.

Early Detection of Injuries in MLB Pitchers from Video

Apr 18, 2019

AJ Piergiovanni, Michael S. Ryoo

Abstract: Injuries are a major cost in sports. Teams spend millions of dollars every year on players who are hurt and unable to play, resulting in lost games, decreased fan interest and additional wages for replacement players. Modern convolutional neural networks have been successfully applied to many video recognition tasks. In this paper, we introduce the problem of injury detection/prediction in MLB pitchers and experimentally evaluate the ability of such convolutional models to detect and predict injuries in pitches only from video data. We conduct experiments on a large dataset of TV broadcast MLB videos of 20 different pitchers who were injured during the 2017 season. We experimentally evaluate the model's performance on each individual pitcher, how well it generalizes to new pitchers, how it performs for various injuries, and how early it can predict or detect an injury.

* CVPR Workshop on Computer Vision in Sports 2019

Learning Differentiable Grammars for Continuous Data

Feb 01, 2019

AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

Abstract: This paper proposes a novel algorithm that learns a formal regular grammar from real-world continuous data, such as videos or other streaming data. Learning latent terminals, non-terminals, and production rules directly from streaming data allows the construction of a generative model capturing sequential structures with multiple possibilities. Our model is fully differentiable and provides easily interpretable results, which are important for understanding the learned structures. It outperforms the state of the art on several challenging datasets and is more accurate for forecasting future activities in videos. We plan to open-source the code.

Evolving Space-Time Neural Architectures for Videos

Nov 26, 2018

AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

Abstract: In this paper, we present a new method for evolving video CNN models to find architectures that better capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutional layers, obtained promising results by manually designing CNN architectures for videos. Here we develop an evolutionary algorithm that automatically explores models with different types and combinations of space-time convolutional layers to jointly capture various spatial and temporal aspects of video representations. We further propose a new key component in video model evolution, the iTGM layer, which more efficiently utilizes its parameters to allow learning of space-time interactions over longer time horizons. The experiments confirm the advantages of our video CNN architecture evolution, with results outperforming previous state-of-the-art models. Our algorithm discovers new and interesting video architecture structures.

Representation Flow for Action Recognition

Oct 02, 2018

AJ Piergiovanni, Michael S. Ryoo

Abstract: In this paper, we propose a convolutional layer inspired by optical flow algorithms to learn motion representations. Our representation flow layer is a fully-differentiable layer designed to optimally capture the 'flow' of any representation channel within a convolutional neural network. Its parameters for iterative flow optimization are learned in an end-to-end fashion together with the other model parameters, maximizing the action recognition performance. Furthermore, we introduce the concept of learning 'flow of flow' representations by stacking multiple representation flow layers. We conducted extensive experimental evaluations, confirming its advantages over previous recognition models using traditional optical flows in both computational speed and performance.

Unseen Action Recognition with Multimodal Learning

Oct 01, 2018

AJ Piergiovanni, Michael S. Ryoo

Abstract: In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos. We compare the effect of placing various constraints on the embedding space using paired text and video data. Additionally, we propose a method to improve the joint embedding space using an adversarial formulation with unpaired text and video data. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.
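
As a rough illustration of how a shared text/video embedding space can support zero-shot activity classification, the sketch below assigns a video to the label whose text embedding is closest by cosine similarity. This is a generic nearest-neighbor scheme, not necessarily the authors' procedure; the function names, the label names, and the three-dimensional vectors are all hypothetical placeholders standing in for the outputs of trained encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(video_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the video.

    No video of these labels needs to have been seen in training, as long
    as text and video are mapped into the same space.
    """
    return max(
        label_embeddings,
        key=lambda name: cosine(video_embedding, label_embeddings[name]),
    )

# Hypothetical embeddings; a real system would obtain them from encoders.
label_embeddings = {
    "juggling": [0.9, 0.1, 0.0],
    "surfing": [0.0, 0.2, 0.9],
}
video_embedding = [0.1, 0.3, 0.8]

predicted = zero_shot_classify(video_embedding, label_embeddings)
print(predicted)
```

New activity labels can be added by inserting their text embeddings into the dictionary, without retraining anything in this sketch.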

Temporal Gaussian Mixture Layer for Videos

Oct 01, 2018

AJ Piergiovanni, Michael S. Ryoo

Abstract: We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos. The TGM layer is a temporal convolutional layer governed by a much smaller set of parameters (e.g., location/variance of Gaussians) that are fully differentiable. We present our fully convolutional video models with multiple TGM layers for activity detection. The experiments on multiple datasets including Charades and MultiTHUMOS confirm the effectiveness of TGM layers, outperforming the state of the art.
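
To make the idea of a temporal kernel "governed by a much smaller set of parameters" concrete, here is a minimal sketch in plain Python. It builds an 11-tap temporal kernel from two Gaussians (a center and a width each) plus two mixing logits, then slides it over a per-frame score sequence. The parameterization, the function names, and the numbers are assumptions made for illustration; the paper's exact formulation may differ, and a real layer would learn these parameters by gradient descent.

```python
import math

def gaussian_kernel(center, width, length):
    """Sample a Gaussian at integer time steps and normalize it to sum to 1."""
    raw = [math.exp(-((t - center) ** 2) / (2.0 * width ** 2)) for t in range(length)]
    total = sum(raw)
    return [v / total for v in raw]

def softmax(logits):
    """Numerically stable softmax, used here for the mixing weights."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_kernel(centers, widths, mix_logits, length):
    """Combine several Gaussian kernels into one temporal kernel."""
    weights = softmax(mix_logits)
    kernels = [gaussian_kernel(c, w, length) for c, w in zip(centers, widths)]
    return [sum(wt * k[t] for wt, k in zip(weights, kernels)) for t in range(length)]

def temporal_conv(sequence, kernel):
    """Slide the kernel over the sequence (no padding, stride 1)."""
    span = len(kernel)
    return [
        sum(kernel[j] * sequence[i + j] for j in range(span))
        for i in range(len(sequence) - span + 1)
    ]

# Hypothetical parameters: six scalars define all 11 kernel taps.
kernel = mixture_kernel(
    centers=[2.0, 8.0], widths=[1.0, 2.0], mix_logits=[0.0, 0.0], length=11
)

# Hypothetical per-frame activity scores for a 20-frame clip.
scores = [0.0] * 5 + [1.0] * 10 + [0.0] * 5
smoothed = temporal_conv(scores, kernel)
```

Because the kernel length is independent of the number of parameters, the temporal window can be widened without adding weights, which is the property the abstract highlights.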

Learning Real-World Robot Policies by Dreaming

Sep 11, 2018

AJ Piergiovanni, Alan Wu, Michael S. Ryoo

Abstract: Learning to control robots directly based on images is a primary challenge in robotics. However, many existing reinforcement learning approaches require iteratively obtaining millions of samples to learn a policy, which can take significant time. In this paper, we focus on the problem of learning real-world robot action policies solely based on a few random off-policy initial samples. We learn a realistic dreaming model that can emulate samples equivalent to a sequence of images from the actual environment, and make the agent learn action policies by interacting with the dreaming model rather than the real world. We experimentally confirm that our dreaming model can learn realistic policies that transfer to the real world.

Forecasting Hands and Objects in Future Frames

Aug 23, 2018

Chenyou Fan, Jangwon Lee, Michael S. Ryoo

Abstract: This paper presents an approach to forecast the future presence and location of human hands and objects. Given an image frame, the goal is to predict what objects will appear in a future frame (e.g., 5 seconds later) and where they will be located, even when they are not visible in the current frame. The key idea is that (1) an intermediate representation of a convolutional object recognition model abstracts scene information in its frame and that (2) we can predict (i.e., regress) such representations corresponding to the future frames based on that of the current frame. We design a new two-stream convolutional neural network (CNN) architecture for videos by extending the state-of-the-art convolutional object detection network, and present a new fully convolutional regression network for predicting future scene representations. Our experiments confirm that combining the regressed future representation with our detection network allows reliable estimation of future hands and objects in videos. We obtain much higher accuracy compared to the state-of-the-art future object presence forecast method on a public dataset.
