Jianbo Shi

Object Detection in Video with Spatiotemporal Sampling Networks

Jul 24, 2018

Gedas Bertasius, Lorenzo Torresani, Jianbo Shi

Abstract: We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state of the art on the ImageNet VID dataset. Compared to prior video object detection methods, it uses a simpler design and does not require optical flow data for training.
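
The idea of spatially sampling features from an adjacent frame can be illustrated with a small sketch. This is not the STSN implementation: it only shows bilinear sampling of a feature map at offset locations, which is the basic operation behind deformable sampling. The function names and the hand-set offsets are assumptions made for illustration; in the paper the sampling locations are learned.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Read a 2D feature map at a fractional (y, x); out-of-bounds cells count as 0."""
    height, width = len(feature_map), len(feature_map[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer grid cells surrounding (y, x).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature_map[yy][xx]
    return value


def sample_adjacent_frame(adjacent_features, offsets):
    """Build a map aligned to the reference frame by sampling the adjacent frame.

    offsets[y][x] is a hypothetical (dy, dx) pair. It is hand-set here;
    a trained model would predict it.
    """
    return [
        [
            bilinear_sample(adjacent_features, y + dy, x + dx)
            for x, (dy, dx) in enumerate(row)
        ]
        for y, row in enumerate(offsets)
    ]


# Toy example: the content sits one pixel further right in the adjacent frame.
adjacent = [[10.0, 20.0, 30.0]]
shift_right = [[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]]
aligned = sample_adjacent_frame(adjacent, shift_right)
```

Because the interpolation weights vary smoothly with the offsets, the sampling locations can in principle be trained by gradient descent on the detection loss, which is consistent with the abstract's statement that no extra supervision is needed.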

Adversarial Structure Matching Loss for Image Segmentation

May 18, 2018

Jyh-Jing Hwang, Tsung-Wei Ke, Jianbo Shi, Stella X. Yu

Abstract: The per-pixel cross-entropy loss (CEL) has been widely used in structured output prediction tasks as a spatial extension of generic image classification. However, its i.i.d. assumption neglects the structural regularity present in natural images. Various attempts have been made to incorporate structural reasoning, mostly through structure priors in a cooperative way, where co-occurring patterns are encouraged. We, on the other hand, approach this problem from an opposing angle and propose a new framework for training such structured prediction networks via an adversarial process, in which we train a structure analyzer that provides the supervisory signals, the adversarial structure matching loss (ASML). The structure analyzer is trained to maximize ASML, or to exaggerate recurring structural mistakes, usually among co-occurring patterns. In contrast, the structured output prediction network is trained to reduce those mistakes and is thus enabled to distinguish fine-grained structures. As a result, training structured output prediction networks using ASML reduces contextual confusion among objects and improves boundary localization. We demonstrate that ASML outperforms its counterpart CEL, especially in context and boundary aspects, on figure-ground segmentation and semantic segmentation tasks with various base architectures, such as FCN, U-Net, DeepLab, and PSPNet.

Egocentric Basketball Motion Planning from a Single First-Person Image

Mar 04, 2018

Gedas Bertasius, Aaron Chan, Jianbo Shi

Abstract: We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence. To do this, we first introduce a future convolutional neural network (CNN) that predicts an initial sequence of 12D camera configurations, aiming to capture how real players move during a one-on-one basketball game. We also introduce a goal verifier network, which is trained to verify that a given camera configuration is consistent with the final goals of real one-on-one basketball players. Next, we propose an inverse synthesis procedure to synthesize a refined sequence of 12D camera configurations that (1) sufficiently matches the initial configurations predicted by the future CNN, while (2) maximizing the output of the goal verifier network. Finally, by following the trajectory resulting from the refined camera configuration sequence, we obtain the complete 12D motion sequence. Our model generates realistic basketball motion sequences that capture the goals of real players, outperforming standard deep learning approaches such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and generative adversarial networks (GANs).

* CVPR 2018

Using Cross-Model EgoSupervision to Learn Cooperative Basketball Intention

Sep 05, 2017

Gedas Bertasius, Jianbo Shi

Abstract: We present a first-person method for cooperative basketball intention prediction: we predict with whom the camera wearer will cooperate in the near future from unlabeled first-person images. This is a challenging task that requires inferring the camera wearer's visual attention and decoding the social cues of other players. Our key observation is that a first-person view provides strong cues to infer the camera wearer's momentary visual attention and his/her intentions. We exploit this observation by proposing a new cross-model EgoSupervision learning scheme that allows us to predict with whom the camera wearer will cooperate in the near future, without using manually labeled intention labels. Our cross-model EgoSupervision operates by transforming the outputs of a pretrained pose-estimation network into pseudo ground truth labels, which are then used as a supervisory signal to train a new network for a cooperative intention task. We evaluate our method and show that it achieves accuracy similar to or even better than that of fully supervised methods.

Am I a Baller? Basketball Performance Assessment from First-Person Videos

Aug 02, 2017

Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Abstract: This paper presents a method to assess a basketball player's performance from his/her first-person video. A key challenge lies in the fact that the evaluation metric is highly subjective and specific to a particular evaluator. We leverage the first-person camera to address this challenge. The spatiotemporal visual semantics provided by a first-person view allow us to reason about the camera wearer's actions while he/she is participating in an unscripted basketball game. Our method takes a player's first-person video and provides a performance measure for that player that is specific to an evaluator's preference. To achieve this goal, we first use a convolutional LSTM network to detect atomic basketball events from first-person videos. Our network's ability to zoom in on the salient regions addresses the issue of the camera wearer's severe head movement in first-person videos. The detected atomic events are then passed through Gaussian mixtures to construct a highly non-linear visual spatiotemporal basketball assessment feature. Finally, we use this feature to learn a basketball assessment model from pairs of labeled first-person basketball videos, for which a basketball expert indicates which of the two players is better. We demonstrate that despite not knowing the basketball evaluator's criterion, our model learns to accurately assess the players in real-world games. Furthermore, our model can also discover basketball events that contribute positively and negatively to a player's performance.
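
Learning from pairs in which an expert says which player is better is a pairwise ranking problem. The sketch below shows one simple way to fit such a model: a linear scorer trained with a hinge-style update on the difference between the two feature vectors. It is an illustrative assumption, not the model from the paper; the feature vectors, function names, and hyperparameters are all hypothetical.

```python
def score(weights, features):
    """Linear performance score for one player's feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, num_features, epochs=20, learning_rate=0.1):
    """Fit weights so the preferred player in each pair scores higher.

    pairs is a list of (better_features, worse_features) tuples, where the
    ordering stands in for the expert's judgment.
    """
    weights = [0.0] * num_features
    for _ in range(epochs):
        for better, worse in pairs:
            margin = score(weights, better) - score(weights, worse)
            # Update only when the pair is not yet separated by a margin of 1.
            if margin < 1.0:
                for i in range(num_features):
                    weights[i] += learning_rate * (better[i] - worse[i])
    return weights


# Toy data (hypothetical): feature 0 helps a player, feature 1 hurts.
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([2.0, 1.0], [1.0, 2.0]),
]
learned = train_pairwise(toy_pairs, num_features=2)
```

In this toy setup the learned weights end up positive for the first feature and negative for the second, which loosely mirrors the abstract's point that such a model can reveal which events contribute positively or negatively to the assessment.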

Unsupervised Learning of Important Objects from First-Person Videos

Aug 02, 2017

Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Abstract: A first-person camera, placed at a person's head, captures which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state, such as his intentions and attention, and thus only the person wearing the camera can provide the importance labels. Such a constraint makes the annotation process costly and limited in scalability. In this work, we show that we can detect important objects in first-person images without supervision by the camera wearer or even third-person labelers. We formulate the important object detection problem as an interplay between 1) segmentation and 2) recognition agents. The segmentation agent first proposes a possible important object segmentation mask for each image, and then feeds it to the recognition agent, which learns to predict an important object mask using visual semantics and spatial features. We implement such an interplay between both agents via an alternating cross-pathway supervision scheme inside our proposed Visual-Spatial Network (VSN). Our VSN consists of spatial ("where") and visual ("what") pathways, one of which learns common visual semantics while the other focuses on the spatial location cues. Our unsupervised learning is accomplished via cross-pathway supervision, where one pathway feeds its predictions to a segmentation agent, which proposes a candidate important object segmentation mask that is then used by the other pathway as a supervisory signal. We show our method's success on two different important object datasets, where our method achieves results similar to or better than those of the supervised methods.

First Person Action-Object Detection with EgoNet

Jun 10, 2017

Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Abstract: Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person's visual sensorimotor object interactions from up close. In this paper, we study the tight interplay between our momentary visual attention and motor action with objects from a first-person camera. We propose a concept of action-objects: the objects that capture a person's conscious visual (watching a TV) or tactile (taking a cup) interactions. Action-objects may be task-dependent, but since many tasks share common person-object spatial configurations, action-objects exhibit a characteristic 3D spatial distance and orientation with respect to the person. We design a predictive model that detects action-objects using EgoNet, a joint two-stream network that holistically integrates visual appearance (RGB) and 3D spatial layout (depth and height) cues to predict the per-pixel likelihood of action-objects. Our network also incorporates a first-person coordinate embedding, which is designed to learn a spatial distribution of the action-objects in the first-person data. We demonstrate EgoNet's predictive power by showing that it consistently outperforms previous baseline approaches. Furthermore, EgoNet also exhibits a strong generalization ability, i.e., it predicts semantically meaningful objects in novel first-person datasets. Our method's ability to effectively detect action-objects could be used to improve robots' understanding of human-object interactions.

Convolutional Random Walk Networks for Semantic Image Segmentation

May 08, 2017

Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi

Abstract: Most current semantic segmentation methods rely on fully convolutional networks (FCNs). However, their use of large receptive fields and many pooling layers causes low spatial resolution inside the deep layers. This leads to predictions with poor localization around the boundaries. Prior work has attempted to address this issue by post-processing predictions with CRFs or MRFs. But such models often fail to capture semantic relationships between objects, which causes spatially disjoint predictions. To overcome these problems, recent methods integrated CRFs or MRFs into an FCN framework. The downside of these new models is that they have much higher complexity than traditional FCNs, which renders training and testing more challenging. In this work we introduce a simple yet effective Convolutional Random Walk Network (RWN) that addresses the issues of poor boundary localization and spatially fragmented predictions with very little increase in model complexity. Our proposed RWN jointly optimizes the objectives of pixelwise affinity and semantic segmentation. It combines these two objectives via a novel random walk layer that enforces consistent spatial grouping in the deep layers of the network. Our RWN is implemented using standard convolution and matrix multiplication. This allows an easy integration into existing FCN frameworks and enables end-to-end training of the whole network via standard back-propagation. Our implementation of RWN requires just 131 additional parameters compared to traditional FCNs, and yet it consistently produces an improvement over the FCNs on semantic segmentation and scene labeling.
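
A random walk over pixel affinities can be written as one matrix multiplication, which is presumably why the abstract describes the layer as easy to integrate. The sketch below is a minimal illustration under that assumption, not the paper's layer: it row-normalizes a hand-written affinity matrix into transition probabilities and multiplies it with per-pixel class scores. The function names and toy values are hypothetical.

```python
def row_normalize(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic transition matrix."""
    normalized = []
    for row in affinity:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized


def random_walk_step(transition, scores):
    """Diffuse per-pixel class scores: result = transition @ scores."""
    num_classes = len(scores[0])
    return [
        [
            sum(t * scores[j][c] for j, t in enumerate(row))
            for c in range(num_classes)
        ]
        for row in transition
    ]


# Toy example with 3 pixels and 2 classes (hypothetical values).
# Pixels 0 and 1 are strongly related; pixel 2 is related only to itself.
affinity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
scores = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]
smoothed = random_walk_step(row_normalize(affinity), scores)
```

After one step the two related pixels share the same scores while the unrelated pixel is unchanged, which is the kind of consistent spatial grouping the abstract refers to.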

Customizing First Person Image Through Desired Actions

Apr 01, 2017

Shan Su, Jianbo Shi, Hyun Soo Park

Abstract: This paper studies the problem of inverse visual path planning: creating a visual scene from a first person action. Our conjecture is that the spatial arrangement of a first person visual scene is deployed to afford an action, and therefore the action can be inversely used to synthesize a new scene such that the action is feasible. As a proof of concept, we focus on linking visual experiences induced by walking. A key innovation of this paper is the concept of ActionTunnel: a 3D virtual tunnel along the future trajectory encoding what the wearer will visually experience while moving into the scene. This connects two distinctive first person images through similar walking paths. Our method takes a first person image with a user defined future trajectory and outputs a new image that can afford the future motion. The image is created by combining present and future ActionTunnels in 3D, where the missing pixels in the adjoining area are computed by a generative adversarial network. Our work enables travel across different first person experiences in diverse real world scenes.

Social Behavior Prediction from First Person Videos

Nov 29, 2016

Shan Su, Jung Pyo Hong, Jianbo Shi, Hyun Soo Park

Abstract: This paper presents a method to predict the future movements (location and gaze direction) of basketball players as a whole from their first person videos. The predicted behaviors reflect an individual physical space that affords taking the next actions while conforming to social behaviors by engaging in joint attention. Our key innovation is to use the 3D reconstruction of multiple first person cameras to automatically annotate each other's visual semantics of social configurations. We leverage two learning signals uniquely embedded in first person videos. Individually, a first person video records the visual semantics of a spatial and social layout around a person that allows associating with past similar situations. Collectively, first person videos follow joint attention that can link the individuals to a group. We learn the egocentric visual semantics of group movements using a Siamese neural network to retrieve future trajectories. We consolidate the retrieved trajectories from all players by maximizing a measure of social compatibility: the gaze alignment towards joint attention predicted by their social formation, where the dynamics of joint attention are learned by a long-term recurrent convolutional network. This allows us to characterize which social configuration is more plausible and predict future group trajectories.
