Human-centric video frame interpolation has great potential for improving people's entertainment experiences and finding commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although there are multiple benchmark datasets available in the community, none of them is dedicated for human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. It highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve the accuracy, we introduce two loss terms considering the human-aware priors, where we add auxiliary supervision to panoptic segmentation and human keypoints detection, respectively. The loss terms are model agnostic and can be easily plugged into any video frame interpolation approaches. Experimental results validate the effectiveness of our proposed loss terms, leading to consistent performance improvement over 5 existing models, which establish strong baseline models on our benchmark. The dataset and code can be found at: https://neu-vi.github.io/SportsSlomo/.
Although we have witnessed significant progress in human-object interaction (HOI) detection with increasingly high mAP (mean Average Precision), a single mAP score is too concise to obtain an informative summary of a model's performance and to understand why one approach is better than another. In this paper, we introduce a diagnosis toolbox for analyzing the error sources of the existing HOI detection models. We first conduct holistic investigations in the pipeline of HOI detection, consisting of human-object pair detection and then interaction classification. We define a set of errors and the oracles to fix each of them. By measuring the mAP improvement obtained from fixing an error using its oracle, we can have a detailed analysis of the significance of different errors. We then delve into the human-object detection and interaction classification, respectively, and check the model's behavior. For the first detection task, we investigate both recall and precision, measuring the coverage of ground-truth human-object pairs as well as the noisiness level in the detections. For the second classification task, we compute mAP for interaction classification only, without considering the detection scores. We also measure the performance of the models in differentiating human-object pairs with and without actual interactions using the AP (Average Precision) score. Our toolbox is applicable for different methods across different datasets and available at https://github.com/neu-vi/Diag-HOI.
Although deep neural networks endow the downsampled superpoints with discriminative feature representations, directly matching them is usually not used alone in state-of-the-art methods, mainly for two reasons. First, the correspondences are inevitably noisy, so RANSAC-like refinement is usually adopted. Such ad hoc postprocessing, however, is slow and not differentiable, which can not be jointly optimized with feature learning. Second, superpoints are sparse and thus more RANSAC iterations are needed. Existing approaches use the coarse-to-fine strategy to propagate the superpoints correspondences to the point level, which are not discriminative enough and further necessitates the postprocessing refinement. In this paper, we present a simple yet effective approach to extract correspondences by directly matching superpoints using a global softmax layer in an end-to-end manner, which are used to determine the rigid transformation between the source and target point cloud. Compared with methods that directly predict corresponding points, by leveraging the rich information from the superpoints matchings, we can obtain more accurate estimation of the transformation and effectively filter out outliers without any postprocessing refinement. As a result, our approach is not only fast, but also achieves state-of-the-art results on the challenging ModelNet and 3DMatch benchmarks. Our code and model weights will be publicly released.
Obstacle detection is a safety-critical problem in robot navigation, where stereo matching is a popular vision-based approach. While deep neural networks have shown impressive results in computer vision, most of the previous obstacle detection works only leverage traditional stereo matching techniques to meet the computational constraints for real-time feedback. This paper proposes a computationally efficient method that leverages a deep neural network to detect occupancy from stereo images directly. Instead of learning the point cloud correspondence from the stereo data, our approach extracts the compact obstacle distribution based on volumetric representations. In addition, we prune the computation of safety irrelevant spaces in a coarse-to-fine manner based on octrees generated by the decoder. As a result, we achieve real-time performance on the onboard computer (NVIDIA Jetson TX2). Our approach detects obstacles accurately in the range of 32 meters and achieves better IoU (Intersection over Union) and CD (Chamfer Distance) scores with only 2% of the computation cost of the state-of-the-art stereo model. Furthermore, we validate our method's robustness and real-world feasibility through autonomous navigation experiments with a real robot. Hence, our work contributes toward closing the gap between the stereo-based system in robot perception and state-of-the-art stereo models in computer vision. To counter the scarcity of high-quality real-world indoor stereo datasets, we collect a 1.36 hours stereo dataset with a Jackal robot which is used to fine-tune our model. The dataset, the code, and more visualizations are available at https://lhy.xyz/stereovoxelnet/
We present PlanarRecon -- a novel framework for globally coherent detection and reconstruction of 3D planes from a posed monocular video. Unlike previous works that detect planes in 2D from a single image, PlanarRecon incrementally detects planes in 3D for each video fragment, which consists of a set of key frames, from a volumetric representation of the scene using neural networks. A learning-based tracking and fusion module is designed to merge planes from previous fragments to form a coherent global plane reconstruction. Such design allows PlanarRecon to integrate observations from multiple views within each fragment and temporal information across different ones, resulting in an accurate and coherent reconstruction of the scene abstraction with low-polygonal geometry. Experiments show that the proposed approach achieves state-of-the-art performances on the ScanNet dataset while being real-time.
A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts. We introduce Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images. It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. We carefully curate the few-shot instances with hard negatives, where positive and negative images only disagree on action labels, making mere recognition of object categories insufficient to complete our benchmarks. We also design multiple test sets to systematically study the generalization of visual learning models, where we vary the overlap of the HOI concepts between the training and test sets of few-shot instances, from partial to no overlaps. Bongard-HOI presents a substantial challenge to today's visual recognition models. The state-of-the-art HOI detection model achieves only 62% accuracy on few-shot binary prediction while even amateur human testers on MTurk have 91% accuracy. With the Bongard-HOI benchmark, we hope to further advance research efforts in visual reasoning, especially in holistic perception-reasoning systems and better representation learning.
Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization. In this work, we use vision transformers (ViTs) as our base model for visual reasoning and make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs. Specifically, we introduce a novel concept-feature dictionary to allow flexible image feature retrieval at training time with concept keys. This dictionary enables two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show the resulting model, Concept-guided Vision Transformer (or RelViT for short) significantly outperforms prior approaches on HICO and GQA by 16% and 13% in the original split, and by 43% and 18% in the systematic split. Our ablation analyses also reveal our model's compatibility with multiple ViT variants and robustness to hyper-parameters.
The cost volume, capturing the similarity of possible correspondences across two input images, is a key ingredient in state-of-the-art optical flow approaches. When sampling for correspondences to build the cost volume, a large neighborhood radius is required to deal with large displacements, introducing a significant computational burden. To address this, a sequential strategy is usually adopted, where correspondence sampling in a local neighborhood with a small radius suffices. However, such sequential approaches, instantiated by either a pyramid structure over a deep neural network's feature hierarchy or by a recurrent neural network, are slow due to the inherent need for sequential processing of cost volumes. In this paper, we propose dilated cost volumes to capture small and large displacements simultaneously, allowing optical flow estimation without the need for the sequential estimation strategy. To process the cost volume to get pixel-wise optical flow, existing approaches employ 2D or separable 4D convolutions, which we show either suffer from high GPU memory consumption, inferior accuracy, or large model size. Therefore, we propose using 3D convolutions for cost volume filtering to address these issues. By combining the dilated cost volumes and 3D convolutions, our proposed model DCVNet not only exhibits real-time inference (71 fps on a mid-end 1080ti GPU) but is also compact and obtains comparable accuracy to existing approaches.
Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA and find they can work surprisingly well-running more than an order of magnitude faster with the same accuracy. Through extensive experiments, we verify that this observation holds true across different VQA models, datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.
We introduce a compact network for holistic scene flow estimation, called SENSE, which shares common encoder features among four closely-related tasks: optical flow estimation, disparity estimation from stereo, occlusion estimation, and semantic segmentation. Our key insight is that sharing features makes the network more compact, induces better feature representations, and can better exploit interactions among these tasks to handle partially labeled data. With a shared encoder, we can flexibly add decoders for different tasks during training. This modular design leads to a compact and efficient model at inference time. Exploiting the interactions among these tasks allows us to introduce distillation and self-supervised losses in addition to supervised losses, which can better handle partially labeled real-world data. SENSE achieves state-of-the-art results on several optical flow benchmarks and runs as fast as networks specifically designed for optical flow. It also compares favorably against the state of the art on stereo and scene flow, while consuming much less memory.