Classically, visual object tracking involves following a target object throughout a given video, yielding the motion trajectory of the object. However, for many practical applications this output is insufficient, since additional semantic information is required to act on the video material. Examples include surveillance and target-specific video summarization, where the target needs to be monitored with respect to certain predefined constraints, e.g., 'when standing near a yellow car'. This paper explores tracking visual objects subject to additional lingual constraints. Unlike Li et al., whose goal is to improve and extend tracking itself, we impose additional lingual constraints upon tracking, which enables new applications of tracking. To perform benchmarks and experiments, we contribute two datasets, c-MOT16 and c-LaSOT, curated by appending additional constraints to the frames of the original MOT16 and LaSOT datasets. We also experiment with two deep models, SiamCT-DFG and SiamCT-CA, obtained by extending a recent state-of-the-art Siamese tracking method with modules inspired by the fields of natural language processing and visual question answering. Through experimental results, we show that the proposed SiamCT-CA model can significantly outperform its counterparts. Furthermore, our method enables the selective compression of videos based on the validity of the constraint.
Human-object interaction recognition aims to identify the relationship between a human subject and an object. Prior work incorporates global scene context into the early layers of deep convolutional neural networks and reports a significant increase in performance, since interactions are generally correlated with the scene (e.g., riding a bicycle on a city street). However, this approach has three problems. It increases the network size in the early layers and is therefore inefficient. It leads to noisy filter responses when the scene is irrelevant and is therefore inaccurate. It leverages only scene context, whereas human-object interactions offer a multitude of contexts, and is therefore incomplete. To circumvent these issues, we propose Self-Selective Context (SSC). SSC operates on the joint appearance of human-objects and context to bring the most discriminative context(s) into play for recognition. We devise novel contextual features that model the locality of human-object interactions and show that SSC can seamlessly integrate with state-of-the-art interaction recognition models. Our experiments show that SSC leads to an important increase in interaction recognition performance while using far fewer parameters.
This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has been shown to be a simple and effective approach in the image domain. Such a mixup is, however, not directly transferable to point clouds, as there is no one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation between point clouds as a shortest-path linear interpolation. To that end, we introduce PointMixup, an interpolation method that generates new examples through an optimal assignment of the path function between two point clouds. We prove that PointMixup finds the shortest path between two point clouds and that the interpolation is assignment invariant and linear. With this definition of interpolation, PointMixup makes it possible to introduce strong interpolation-based regularizers such as mixup and manifold mixup to the point cloud domain. Experimentally, we show the potential of PointMixup for point cloud classification, especially when examples are scarce, as well as increased robustness to noise and geometric transformations of the points. The code for PointMixup and the experimental details are publicly available.
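The idea of interpolating along an optimal assignment can be illustrated with a minimal sketch. It is not the released PointMixup code: the function names `optimal_assignment` and `point_mixup` are hypothetical, the two clouds are assumed to have the same number of points, the cost is assumed to be the Euclidean distance, and the assignment is found by brute force, which is only feasible for a handful of points.

```python
import math
from itertools import permutations


def optimal_assignment(a, b):
    """Return the one-to-one matching from a to b with minimal total distance.

    Brute force over all permutations: an assumption made for readability,
    usable only for very small point clouds of equal size.
    """
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(len(a))):
        cost = sum(math.dist(a[i], b[perm[i]]) for i in range(len(a)))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


def point_mixup(a, b, lam):
    """Move every point of a a fraction lam of the way to its assigned point of b."""
    perm = optimal_assignment(a, b)
    return [
        tuple((1 - lam) * p + lam * q for p, q in zip(a[i], b[perm[i]]))
        for i in range(len(a))
    ]


cloud_a = [(0.0, 0.0), (1.0, 0.0)]
cloud_b = [(1.0, 1.0), (0.0, 1.0)]
mixed = point_mixup(cloud_a, cloud_b, 0.5)
```

In this toy case the index order of `cloud_b` does not affect the set of interpolated points, because the matching is recomputed from the geometry. For realistic cloud sizes a polynomial-time solver, for instance `scipy.optimize.linear_sum_assignment`, would replace the brute-force search.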
Occlusion is one of the most difficult challenges to model in object tracking. Unlike other challenges, where data augmentation can help, occlusion is hard to simulate because the occluding object can be anything, in any shape. In this paper, we propose a simple solution that simulates the effects of occlusion in the latent space. Specifically, we present structured dropout to mimic the change in latent codes under occlusion. We present three forms of dropout (channel dropout, segment dropout and slice dropout) with the various forms of occlusion in mind. To demonstrate their effectiveness, the dropouts are incorporated into two modern Siamese trackers (SiamFC and SiamRPN++). The outputs from multiple dropouts are combined using an encoder network to obtain the final prediction. Experiments on several tracking benchmarks show the benefits of structured dropouts, which, owing to their simplicity, require only small changes to existing tracker models.
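As a rough illustration of what dropout with structure can look like, the sketch below zeroes entire channels of a latent feature map. It covers only a channel-style variant; the abstract does not define segment or slice dropout, so they are not attempted here. The function name `channel_dropout`, the nested-list `C x H x W` layout, and the absence of any rescaling of the surviving channels are assumptions made for this example, not details taken from the paper.

```python
import random


def channel_dropout(features, drop_prob, rng):
    """Zero whole channels of a C x H x W feature map stored as nested lists.

    Each channel is dropped independently with probability drop_prob.
    Surviving channels are copied unchanged (no rescaling, by assumption).
    """
    out = []
    for channel in features:
        if rng.random() < drop_prob:
            out.append([[0.0] * len(row) for row in channel])
        else:
            out.append([list(row) for row in channel])
    return out


# Two channels of size 2 x 2; a seeded generator keeps the run reproducible.
latent = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
occluded = channel_dropout(latent, drop_prob=0.5, rng=random.Random(0))
```

Dropping a full channel removes one learned feature everywhere in the map, in contrast to element-wise dropout, which removes isolated activations.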
Despite their popularity, the application of normalizing flows to categorical data remains limited. The current practice of using dequantization to map discrete data to a continuous space is inapplicable, as categorical data has no intrinsic order. Instead, categorical data has complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is, normalizing flows for categorical data. By casting the encoding of categorical data in continuous space as a variational inference problem, we jointly optimize the continuous representation and the model likelihood. To maintain unique decoding, we learn a partitioning of the latent space by factorizing the posterior. Meanwhile, the complex relations between the categorical variables are learned by the ensuing normalizing flow, thus maintaining a close-to-exact likelihood estimate and making it possible to scale up to a large number of categories. Based on Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant generative model on graphs that outperforms both one-shot and autoregressive flow-based state-of-the-art methods on molecule generation.
Neural operations such as convolutions, self-attention, and vector aggregation are the go-to choices for recognizing short-range actions. However, they have three limitations in modeling long-range activities. This paper presents PIC, Permutation Invariant Convolution, a novel neural layer to model the temporal structure of long-range activities. It has three desirable properties. i. Unlike standard convolution, PIC is invariant to temporal permutations of the features within its receptive field, which qualifies it to model weak temporal structures. ii. Unlike vector aggregation, PIC respects local connectivity, enabling it to learn long-range temporal abstractions using cascaded layers. iii. In contrast to self-attention, PIC uses shared weights, making it more capable of detecting the most discriminant visual evidence across long and noisy videos. We study the three properties of PIC and demonstrate its effectiveness in recognizing the long-range activities of Charades, Breakfast, and MultiThumos.
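The first two properties can be illustrated with a toy operator that scores every timestep with one shared kernel and then max-pools the scores inside a local temporal window. This is only a sketch of the invariance property, not the PIC layer itself; the function name `window_max_response` and the choice of a dot product followed by max-pooling are assumptions made for this example.

```python
def window_max_response(features, kernel, window):
    """Shared-kernel scoring followed by max-pooling over local windows.

    features: list of T feature vectors, one per timestep.
    kernel:   a single weight vector shared by all timesteps.
    window:   temporal receptive field; the output has T - window + 1 entries.
    """
    scores = [sum(w * x for w, x in zip(kernel, f)) for f in features]
    return [max(scores[t:t + window]) for t in range(len(scores) - window + 1)]


clip = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
kernel = [1.0, -1.0]

# With a window spanning the whole clip, reordering timesteps changes nothing.
original = window_max_response(clip, kernel, window=4)
shuffled = window_max_response(clip[::-1], kernel, window=4)
```

Because the same kernel is applied at every position and the maximum ignores order, permuting the timesteps inside one window leaves that window's output unchanged. A standard temporal convolution assigns a different weight to each position in the window and therefore does not have this property.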
U-Net and its variants have been demonstrated to work sufficiently well in biological cell tracking and segmentation. However, these methods still struggle in the presence of complex processes such as cell collision, mitosis and apoptosis. In this paper, we augment U-Net with Siamese matching-based tracking and propose to track individual nuclei over time. By modelling the behavioural pattern of the cells, we achieve improved segmentation and tracking performance through a re-segmentation procedure. Our preliminary investigations on the Fluo-N2DH-SIM+ and Fluo-N2DH-GOWT1 datasets demonstrate that absolute improvements of up to 3.8% and 3.4% can be obtained in segmentation and tracking accuracy, respectively.
Navigating complex urban environments safely is key to realizing fully autonomous systems. Predicting the future locations of vulnerable road users, such as pedestrians and cyclists, has therefore received a lot of attention in recent years. While previous works have addressed modeling interactions with the static (obstacles) and dynamic (humans) environment agents, we address an important gap in trajectory prediction. We propose SafeCritic, a model that synergizes generative adversarial networks for generating multiple "real" trajectories with reinforcement learning for generating "safe" trajectories. The Discriminator evaluates whether the generated candidates are consistent with the observed inputs. The Critic network is environmentally aware and prunes trajectories that are in collision or in violation of the environment. The auto-encoding loss stabilizes training and prevents mode collapse. We demonstrate results on two large-scale datasets, with a considerable improvement over the state of the art. We also show that the Critic is able to classify the safety of trajectories.
Learning suitable latent representations for observed, high-dimensional data is an important research topic underlying many recent advances in machine learning. While the Gaussian distribution has traditionally been the go-to latent parameterization, a variety of recent works have successfully proposed the use of manifold-valued latents. In one such work (Davidson et al., 2018), the authors empirically show the potential benefits of using a hyperspherical von Mises-Fisher (vMF) distribution in low dimensionality. However, due to the unique distributional form of the vMF, expressivity in higher-dimensional space is limited by its scalar concentration parameter, leading to a 'hyperspherical bottleneck'. In this work, we propose to extend the usability of hyperspherical parameterizations to higher dimensions by using a product space instead, showing improved results on a selection of image datasets.
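To make the geometry of a product space concrete, the sketch below maps one vector onto a product of lower-dimensional unit hyperspheres by splitting it into equal chunks and normalizing each chunk separately. It shows only where the latent code lives; it does not sample from a vMF distribution or train a model. The function name `project_to_sphere_product` and the equal-size split are assumptions made for this example.

```python
import math


def project_to_sphere_product(vector, parts):
    """Split vector into `parts` equal chunks and L2-normalize each chunk.

    The result is one point on a product of `parts` unit hyperspheres,
    instead of a single point on one high-dimensional hypersphere.
    """
    if len(vector) % parts != 0:
        raise ValueError("vector length must be divisible by parts")
    size = len(vector) // parts
    components = []
    for i in range(parts):
        chunk = vector[i * size:(i + 1) * size]
        norm = math.sqrt(sum(x * x for x in chunk))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero chunk")
        components.append([x / norm for x in chunk])
    return components


# A 4-dimensional vector becomes two points, each on a unit circle.
latent = project_to_sphere_product([3.0, 4.0, 0.0, 5.0], parts=2)
```

Under this reading, each component could carry its own concentration parameter, so the spread is no longer governed by a single scalar for the whole latent space.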
A key challenge for RGB-D segmentation is how to effectively incorporate 3D geometric information from the depth channel into 2D appearance features. We propose to model the effective receptive field of 2D convolution based on the scale and locality of the 3D neighborhood. Standard convolutions are local in the image space ($u, v$), often with a fixed receptive field of 3x3 pixels. We propose to define convolutions that are local with respect to the corresponding point in the 3D real-world space ($x, y, z$), where the depth channel is used to adapt the receptive field of the convolution. This makes the resulting filters invariant to scale and focuses them on a certain range of depth. We introduce 3D Neighborhood Convolution (3DN-Conv), a convolutional operator around 3D neighborhoods. Further, by using estimated depth, our RGB-D based semantic segmentation model can be applied to RGB input alone. Experimental results validate that our proposed 3DN-Conv operator improves semantic segmentation, using either ground-truth depth (RGB-D) or estimated depth (RGB).
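How depth can adapt a receptive field is sketched below under a pinhole camera assumption. A 3D neighborhood of fixed metric radius projects to fewer pixels the farther away it is, and a pixel belongs to the neighborhood only if its back-projected 3D point lies close to that of the center pixel. This is not the 3DN-Conv operator; the function names, the single image row, and the parameters `focal_px`, `cx` and `radius_m` are assumptions made for this example.

```python
import math


def receptive_field_radius_px(focal_px, radius_m, depth_m):
    """Pixel radius covered by a 3D neighborhood of radius_m at depth depth_m."""
    return focal_px * radius_m / depth_m


def neighbors_in_3d(depth_row, center, focal_px, cx, radius_m):
    """Indices in one image row whose 3D points lie within radius_m of the center.

    Each pixel u with depth z is back-projected to (x, z), x = (u - cx) * z / focal_px.
    """
    def backproject(u):
        z = depth_row[u]
        return (u - cx) * z / focal_px, z

    x0, z0 = backproject(center)
    keep = []
    for u in range(len(depth_row)):
        x, z = backproject(u)
        if math.hypot(x - x0, z - z0) <= radius_m:
            keep.append(u)
    return keep


# The same 0.5 m neighborhood spans half as many pixels at twice the depth.
near = receptive_field_radius_px(focal_px=500.0, radius_m=0.5, depth_m=2.0)
far = receptive_field_radius_px(focal_px=500.0, radius_m=0.5, depth_m=4.0)

# The last pixel is adjacent in the image but far away in depth, so it is excluded.
row = neighbors_in_3d([1.0, 1.0, 1.0, 5.0], center=1, focal_px=100.0, cx=0.0, radius_m=0.05)
```

The first function reflects scale adaptation, the second reflects the focus on a depth range: a background pixel next to a foreground object in the image is left out of the neighborhood.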