A comprehensive understanding of interested human-to-human interactions in video streams, such as queuing, handshaking, fighting and chasing, is of immense importance to the surveillance of public security in regions like campuses, squares and parks. Different from conventional human interaction recognition, which uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages, we introduce a new task named human-to-human interaction detection (HID). HID devotes to detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, in one model. First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding annotations on interactive relations in a frame-by-frame manner. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present a novel baseline approach SaMFormer for HID, containing a visual feature extractor, a split stage which leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage which reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods. The dataset and code will be made public to encourage more follow-up studies.
This paper presents a novel approach for estimating human body shape and pose from monocular images that effectively addresses the challenges of occlusions and depth ambiguity. Our proposed method BoPR, the Body-aware Part Regressor, first extracts features of both the body and part regions using an attention-guided mechanism. We then utilize these features to encode extra part-body dependency for per-part regression, with part features as queries and body feature as a reference. This allows our network to infer the spatial relationship of occluded parts with the body by leveraging visible parts and body reference information. Our method outperforms existing state-of-the-art methods on two benchmark datasets, and our experiments show that it significantly surpasses existing methods in terms of depth ambiguity and occlusion handling. These results provide strong evidence of the effectiveness of our approach.The code and data are available for research purposes at https://github.com/cyk990422/BoPR.
Low-rankness is important in the hyperspectral image (HSI) denoising tasks. The tensor nuclear norm (TNN), defined based on the tensor singular value decomposition, is a state-of-the-art method to describe the low-rankness of HSI. However, TNN ignores some of the physical meanings of HSI in tackling the denoising tasks, leading to suboptimal denoising performance. In this paper, we propose the multi-modal and frequency-weighted tensor nuclear norm (MFWTNN) and the non-convex MFWTNN for HSI denoising tasks. Firstly, we investigate the physical meaning of frequency components and reconsider their weights to improve the low-rank representation ability of TNN. Meanwhile, we also consider the correlation among two spatial dimensions and the spectral dimension of HSI and combine the above improvements to TNN to propose MFWTNN. Secondly, we use non-convex functions to approximate the rank function of the frequency tensor and propose the NonMFWTNN to relax the MFWTNN better. Besides, we adaptively choose bigger weights for slices mainly containing noise information and smaller weights for slices containing profile information. Finally, we develop the efficient alternating direction method of multiplier (ADMM) based algorithm to solve the proposed models, and the effectiveness of our models are substantiated in simulated and real HSI datasets.
Low-rank tensor completion has been widely used in computer vision and machine learning. This paper develops a tensor low-rank decomposition method together with a tensor low-rankness measure (MCTF) and a better nonconvex relaxation form of it (NonMCTF). This is the first method that can accurately restore the clean data of intrinsic low-rank structure based on few known inputs. This metric encodes low-rank insights for general tensors provided by Tucker and T-SVD. Furthermore, we studied the MCTF and NonMCTF regularization minimization problem, and designed an effective BSUM algorithm to solve the problem. This efficient solver can extend MCTF to various tasks, such as tensor completion and tensor robust principal component analysis. A series of experiments, including hyperspectral image (HSI) denoising, video completion and MRI restoration, confirmed the superior performance of the proposed method
Higher-order low-rank tensor arises in many data processing applications and has attracted great interests. Inspired by low-rank approximation theory, researchers have proposed a series of effective tensor completion methods. However, most of these methods directly consider the global low-rankness of underlying tensors, which is not sufficient for a low sampling rate; in addition, the single nuclear norm or its relaxation is usually adopted to approximate the rank function, which would lead to suboptimal solution deviated from the original one. To alleviate the above problems, in this paper, we propose a novel low-rank approximation of tensor multi-modes (LRATM), in which a double nonconvex $L_{\gamma}$ norm is designed to represent the underlying joint-manifold drawn from the modal factorization factors of the underlying tensor. A block successive upper-bound minimization method-based algorithm is designed to efficiently solve the proposed model, and it can be demonstrated that our numerical scheme converges to the coordinatewise minimizers. Numerical results on three types of public multi-dimensional datasets have tested and shown that our algorithm can recover a variety of low-rank tensors with significantly fewer samples than the compared methods.
Hyperspectral image (HSI) denoising aims to restore clean HSI from the noise-contaminated one. Noise contamination can often be caused during data acquisition and conversion. In this paper, we propose a novel spatial-spectral total variation (SSTV) regularized nonconvex local low-rank (LR) tensor approximation method to remove mixed noise in HSIs. From one aspect, the clean HSI data have its underlying local LR tensor property, even though the real HSI data may not be globally low-rank due to out-liers and non-Gaussian noise. According to this fact, we propose a novel tensor $L_{\gamma}$-norm to formulate the local LR prior. From another aspect, HSIs are assumed to be piecewisely smooth in the global spatial and spectral domains. Instead of traditional bandwise total variation, we use the SSTV regularization to simultaneously consider global spatial structure and spectral correlation of neighboring bands. Results on simulated and real HSI datasets indicate that the use of local LR tensor penalty and global SSTV can boost the preserving of local details and overall structural information in HSIs.
Higher-order low-rank tensor arises in many data processing applications and has attracted great interests. In this paper, we propose a new low rank model for higher-order tensor completion task based on the double nonconvex $L_{\gamma}$ norm, which is used to better approximate the rank minimization of tensor mode-matrix. An block successive upper-bound minimization method-based algorithm is designed to efficiently solve the proposed model, and it can be demonstrated that our numerical scheme converge to the coordinatewise minimizers. Numerical results on three types of public multi-dimensional datasets have tested and shown that our algorithms can recover a variety of low-rank tensors with significantly fewer samples than the compared methods.
Several bandwise total variation (TV) regularized low-rank (LR)-based models have been proposed to remove mixed noise in hyperspectral images (HSIs). Conventionally, the rank of LR matrix is approximated using nuclear norm (NN). The NN is defined by adding all singular values together, which is essentially a $L_1$-norm of the singular values. It results in non-negligible approximation errors and thus the resulting matrix estimator can be significantly biased. Moreover, these bandwise TV-based methods exploit the spatial information in a separate manner. To cope with these problems, we propose a spatial-spectral TV (SSTV) regularized non-convex local LR matrix approximation (NonLLRTV) method to remove mixed noise in HSIs. From one aspect, local LR of HSIs is formulated using a non-convex $L_{\gamma}$-norm, which provides a closer approximation to the matrix rank than the traditional NN. From another aspect, HSIs are assumed to be piecewisely smooth in the global spatial domain. The TV regularization is effective in preserving the smoothness and removing Gaussian noise. These facts inspire the integration of the NonLLR with TV regularization. To address the limitations of bandwise TV, we use the SSTV regularization to simultaneously consider global spatial structure and spectral correlation of neighboring bands. Experiment results indicate that the use of local non-convex penalty and global SSTV can boost the preserving of spatial piecewise smoothness and overall structural information.
In this paper, we propose a novel model to recover a low-rank tensor by simultaneously performing double nuclear norm regularized low-rank matrix factorizations to the all-mode matricizations of the underlying tensor. An block successive upper-bound minimization algorithm is applied to solve the model. Subsequence convergence of our algorithm can be established, and our algorithm converges to the coordinate-wise minimizers in some mild conditions. Several experiments on three types of public data sets show that our algorithm can recover a variety of low-rank tensors from significantly fewer samples than the other testing tensor completion methods.