Michael S. Ryoo

Recognizing Actions in Videos from Unseen Viewpoints

Mar 30, 2021
AJ Piergiovanni, Michael S. Ryoo

Standard methods for video recognition use large CNNs designed to capture spatio-temporal data. However, training these models requires a large amount of labeled training data, containing a wide variety of actions, scenes, settings and camera viewpoints. In this paper, we show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in their training data (i.e., unseen view action recognition). To address this, we develop approaches based on 3D representations and introduce a new geometric convolutional layer that can learn viewpoint-invariant representations. Further, we introduce a new, challenging dataset for unseen view recognition and show the approaches' ability to learn viewpoint-invariant representations.

* CVPR 2021

Visionary: Vision architecture discovery for robot learning

Mar 26, 2021
Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo

We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs. Our approach automatically designs architectures while training on the task, discovering novel ways of combining and attending to image feature representations with actions as well as features from previous layers. The resulting architectures demonstrate better task success rates, in some cases by a large margin, compared to a recent high-performing baseline. Our real-robot experiments also confirm that the approach improves grasping performance by 6%. This is the first approach to demonstrate a successful neural architecture search and attention connectivity search for a real-robot task.

* ICRA 2021

Reducing Inference Latency with Concurrent Architectures for Image Recognition

Nov 13, 2020
Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim

Satisfying the high computation demand of modern deep learning architectures while achieving low inference latency is challenging. Current approaches to decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with a higher concurrency (i.e., simultaneous execution of one inference among devices). Such single-chain dependencies are so widespread that they even implicitly bias recent neural architecture search (NAS) studies. In this visionary paper, we draw attention to an entirely new space of NAS that relaxes the single-chain dependency to provide higher concurrency and distribution opportunities. To quantitatively compare these architectures, we propose a score that encapsulates crucial metrics such as communication, concurrency, and load balancing. Additionally, we propose a new generator and transformation block that consistently deliver superior architectures compared to current state-of-the-art methods. Finally, our preliminary results show that these new architectures reduce the inference latency and deserve more attention.

AssembleNet++: Assembling Modality Representations via Attention Connections

Aug 18, 2020
Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality. Even without pre-training, our models outperform the previous work on standard public activity recognition datasets with continuous videos, establishing a new state of the art. We also confirm that our findings of having neural connections from the object modality and the use of peer-attention are generally applicable to different existing architectures, improving their performance. We name our model explicitly as AssembleNet++. The code will be available at: https://sites.google.com/corp/view/assemblenet/

* ECCV 2020
* ECCV 2020 camera-ready version
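
The peer-attention idea described in the abstract above (attention weights for one block computed from another block or modality) can be illustrated with a minimal sketch. This is not the authors' implementation: the pooling step, the single fully connected layer, the sigmoid gate, and all names (`peer_attention`, `weights`, `bias`) are assumptions made for illustration, and features are plain Python lists rather than tensors.

```python
import math


def global_average_pool(features):
    """Average each channel over all of its positions."""
    return [sum(channel) / len(channel) for channel in features]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def peer_attention(x, peer, weights, bias):
    """Scale each channel of `x` by a gate computed from `peer`.

    x:       list of C channels, each a list of activations
    peer:    list of P channels from another block or modality
    weights: C x P matrix of a hypothetical fully connected layer
    bias:    list of C biases
    """
    pooled = global_average_pool(peer)
    attention = [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]
    # Channel-wise gating: every position in a channel shares one weight.
    return [[a * v for v in channel] for a, channel in zip(attention, x)]


# Toy inputs: 2 channels in the attended block, 3 channels in the peer.
x = [[1.0, 2.0], [3.0, 4.0]]
peer = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
weights = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
out = peer_attention(x, peer, weights, bias)
```

In this sketch the gate for each channel of `x` depends only on `peer`, which is the property that distinguishes it from self-attention, where the gate would be computed from `x` itself.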

Adversarial Generative Grammars for Human Activity Prediction

Aug 14, 2020
AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

In this paper, we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, predicting much more accurately and further into the future than prior work.

* ECCV 2020 (Oral)
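
To make concrete how selecting different production rules yields several distinct futures, here is a toy sketch. It is only an illustration under strong assumptions: the grammar below is hand-written with made-up probabilities and activity names, whereas the paper learns its rules and non-terminal representations from data with adversarial training. The function name `expand_futures` is hypothetical.

```python
def expand_futures(rules, start, steps):
    """Enumerate every output sequence reachable in `steps` rule applications.

    rules maps a non-terminal to a list of
    (probability, terminal, next_non_terminal) production rules.
    Returns (probability, sequence) pairs, most probable first.
    """
    frontier = [(1.0, [], start)]
    for _ in range(steps):
        next_frontier = []
        for prob, seq, symbol in frontier:
            # Branch on every rule available for the current non-terminal.
            for rule_prob, terminal, next_symbol in rules[symbol]:
                next_frontier.append(
                    (prob * rule_prob, seq + [terminal], next_symbol)
                )
        frontier = next_frontier
    return sorted(
        ((p, seq) for p, seq, _ in frontier), key=lambda item: -item[0]
    )


# Hypothetical two-symbol grammar over activity labels.
rules = {
    "A": [(0.7, "walk", "A"), (0.3, "sit", "B")],
    "B": [(0.6, "read", "B"), (0.4, "stand", "A")],
}
futures = expand_futures(rules, "A", 2)
```

Each entry of `futures` is one plausible continuation with its probability, so a single start state produces a ranked set of alternatives instead of one deterministic forecast.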

AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

Jul 31, 2020
Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

Convolutional operations have two limitations: (1) they do not explicitly model where to focus, as the same filter is applied to all positions, and (2) they are unsuitable for modeling long-range dependencies, as they only operate on a small neighborhood. While both limitations can be alleviated by attention operations, many design choices remain to be determined to use attention, especially when applying attention to videos. Towards a principled way of applying attention to videos, we address the task of spatiotemporal attention cell search. We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets. The discovered attention cells outperform non-local blocks on both datasets, and demonstrate strong generalization across different modalities, backbones, and datasets. Inserting our attention cells into I3D-R50 yields state-of-the-art performance on both datasets.

* ECCV 2020

AViD Dataset: Anonymized Videos from Diverse Countries

Jul 10, 2020
AJ Piergiovanni, Michael S. Ryoo

We introduce a new public video dataset for action recognition: Anonymized Videos from Diverse countries (AViD). Unlike existing public video datasets, AViD is a collection of action videos from many different countries. The motivation is to create a public dataset that would benefit training and pretraining of action recognition models for everybody, rather than one that is useful only for a limited number of countries. Further, all the face identities in the AViD videos are properly anonymized to protect privacy. It is also a static dataset where each video is licensed under a Creative Commons license. We confirm that most of the existing video datasets are statistically biased to only capture action videos from a limited number of countries. We experimentally illustrate that models trained with such biased datasets do not transfer perfectly to action videos from other countries, and show that AViD addresses this problem. We also confirm that the new AViD dataset could serve as a good dataset for pretraining the models, performing comparably or better than prior datasets.

* https://github.com/piergiaj/AViD

Edge-Tailored Perception: Fast Inferencing in-the-Edge with Efficient Model Distribution

Mar 13, 2020
Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Hyojong Kim, Michael S. Ryoo, Hyesoon Kim

The rise of deep neural networks (DNNs) is inspiring new studies in a myriad of edge use cases with robots, autonomous agents, and Internet-of-things (IoT) devices. However, in-the-edge inferencing of DNNs is still a severe challenge, mainly because of the contradiction between the inherently intensive resource requirements and the tight resource availability in several edge domains. Further, as communication is costly, taking advantage of other available edge devices is not an effective solution in edge domains. Therefore, to benefit from available compute resources with low communication overhead, we propose new edge-tailored perception (ETP) models that consist of several almost-independent and narrow branches. ETP models offer close-to-minimum communication overheads with better distribution opportunities while significantly reducing memory and computation footprints, all with a trivial accuracy loss for tasks that are not accuracy-critical. To show the benefits, we deploy ETP models on two real systems, Raspberry Pis and edge-level PYNQ FPGAs. Additionally, we share our insights about tailoring a systolic-based architecture for edge computing with FPGA implementations. ETP models created based on LeNet, CifarNet, VGG-S/16, AlexNet, and ResNets and trained on MNIST, CIFAR10/100, Flower102, and ImageNet achieve maximum and average speedups of 56x and 7x, respectively, compared to the originals. ETP complements existing single-device optimizations for embedded devices by enabling the exploitation of multiple devices. As an example, we show that applying pruning and quantization to ETP models improves the average speedup to 33x.

Evolving Losses for Unsupervised Video Representation Learning

Feb 26, 2020
AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero- or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution by using an evolutionary search algorithm to automatically find an optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Thirdly, we propose an unsupervised representation evaluation metric using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces similar results to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets.

* CVPR 2020
* arXiv admin note: text overlap with arXiv:1906.03248
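
The Zipf-based evaluation metric mentioned in the abstract above can be sketched as a distribution-matching score. The following is a minimal illustration, not the paper's code: it assumes that representations have already been clustered into `k` groups (for example with k-means), and that the score is the KL divergence from the sorted cluster-size distribution to a Zipf prior. The direction of the divergence, the exponent `s`, and all function names are assumptions.

```python
import math
from collections import Counter


def zipf_prior(k, s=1.0):
    """Zipf distribution over ranks 1..k: p(rank) proportional to 1 / rank**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, k + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def cluster_distribution(assignments, k, eps=1e-8):
    """Empirical cluster frequencies, sorted from largest to smallest."""
    counts = Counter(assignments)
    sizes = sorted((counts.get(c, 0) for c in range(k)), reverse=True)
    total = len(assignments)
    # eps keeps the logarithm finite when a cluster is empty.
    return [max(size / total, eps) for size in sizes]


def zipf_mismatch(assignments, k, s=1.0):
    """KL(empirical || Zipf); lower means a closer match to the prior."""
    p = cluster_distribution(assignments, k)
    q = zipf_prior(k, s)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Cluster sizes 6:3:2 follow 1 : 1/2 : 1/3, i.e. exactly Zipf with s = 1.
zipf_like = [0] * 6 + [1] * 3 + [2] * 2
uniform = [0, 1, 2] * 4
score_zipf_like = zipf_mismatch(zipf_like, k=3)
score_uniform = zipf_mismatch(uniform, k=3)
```

Because the score needs only cluster assignments and no labels, it could serve as a fitness signal for an evolutionary search over loss combinations, which is the role the abstract describes for the unsupervised constraint.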

Password-conditioned Anonymization and Deanonymization with Face Identity Transformers

Nov 29, 2019
Xiuye Gu, Weixin Luo, Michael S. Ryoo, Yong Jae Lee

Cameras are prevalent in our daily lives, and enable many useful systems built upon computer vision technologies such as smart cameras and home robots for service applications. However, there is also an increasing societal concern as the captured images/videos may contain privacy-sensitive information (e.g., face identity). We propose a novel face identity transformer which enables automated photo-realistic password-based anonymization as well as deanonymization of human faces appearing in visual data. Our face identity transformer is trained to (1) remove face identity information after anonymization, (2) make the recovery of the original face possible when given the correct password, and (3) return a wrong--but photo-realistic--face given a wrong password. Extensive experiments show that our approach enables multimodal password-conditioned face anonymizations and deanonymizations, without sacrificing privacy compared to existing anonymization approaches.
