The objective of action quality assessment is to score sports videos. However, most existing works focus only on video dynamic information (i.e., motion information) and ignore the specific postures that an athlete performs in a video, which are important for action assessment in long videos. In this work, we present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. To learn more discriminative representations for videos, we learn not only the video dynamic information but also the static postures of the detected athletes in specific frames, which represent the action quality at certain moments, through the proposed hybrid dynamic-static architecture. Moreover, we leverage a context-aware attention module, consisting of a temporal instance-wise graph convolutional network unit and an attention unit, for both streams to extract more robust stream features, where the former explores the relations between instances and the latter assigns a proper weight to each instance. Finally, we combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts. Additionally, we have collected and annotated the new Rhythmic Gymnastics dataset, which contains videos of four different types of gymnastics routines, for evaluation of action quality assessment in long videos. Extensive experimental results validate the efficacy of our proposed method, which outperforms related approaches. The code and dataset are available at \url{https://github.com/lingan1996/ACTION-NET}.
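A minimal sketch of the context-aware attention module described above, assuming instance features of shape (batch, T, D); the similarity-based relation graph and single-layer GCN here are illustrative simplifications, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)   # temporal instance-wise GCN unit (one layer)
        self.attn = nn.Linear(dim, 1)    # attention unit: one weight per instance

    def forward(self, x):                # x: (B, T, D) instance features of one stream
        # Relation graph over instances via feature similarity (hypothetical choice).
        adj = F.softmax(torch.bmm(x, x.transpose(1, 2)), dim=-1)  # (B, T, T)
        h = F.relu(self.gcn(torch.bmm(adj, x)))                   # propagate context
        w = F.softmax(self.attn(h), dim=1)                        # (B, T, 1) weights
        return (w * h).sum(dim=1)                                 # pooled stream feature

feat = ContextAwareAttention()(torch.randn(2, 8, 256))  # -> (2, 256)
```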
We address the weakly supervised video highlight detection problem: learning to detect the more attractive segments in training videos given only their video event labels, without the expensive supervision of manually annotated highlight segments. While it avoids manual localization of highlight segments, weakly supervised modeling is challenging, as a video in our daily life could contain highlight segments of multiple event types, e.g., skiing and surfing. In this work, we propose to cast weakly supervised video highlight detection for a given specific event as learning a multiple instance ranking network (MINI-Net). We consider each video as a bag of segments, and the proposed MINI-Net learns to enforce a higher highlight score for a positive bag that contains highlight segments of a specific event than for irrelevant negative bags. In particular, we form a max-max ranking loss to acquire a reliable relative comparison between the most likely positive segment instance and the hardest negative segment instance. With this max-max ranking loss, our MINI-Net effectively leverages all segment information to acquire a more distinct video feature representation for localizing the highlight segments of a specific event in a video. Extensive experimental results on three challenging public benchmarks clearly validate the efficacy of our multiple instance ranking approach.
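A minimal sketch of the max-max ranking loss: the most likely positive segment in a positive bag should score higher, by a margin, than the hardest negative segment in a negative bag. The margin value and score shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def max_max_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0):
    # pos_bag_scores, neg_bag_scores: (num_segments,) highlight scores per bag
    s_pos = pos_bag_scores.max()   # most likely positive segment instance
    s_neg = neg_bag_scores.max()   # hardest negative segment instance
    return F.relu(margin - s_pos + s_neg)  # hinge on the relative comparison

loss = max_max_ranking_loss(torch.randn(16), torch.randn(16))
```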
Interaction modeling is important for video action analysis. Recently, several works have designed specific structures to model interactions in videos. However, their structures are manually designed and non-adaptive, which requires structure design effort and, more importantly, cannot model interactions adaptively. In this paper, we automate the process of structure design to learn adaptive structures for interaction modeling. We propose to search the network structures with a differentiable architecture search mechanism, which learns to construct adaptive structures for different videos to facilitate adaptive interaction modeling. To this end, we first design the search space with several basic graph operations that explicitly capture different relations in videos. We experimentally demonstrate that our architecture search framework learns to construct adaptive interaction modeling structures, which provides more understanding of the relations between the structures and some interaction characteristics, and also removes the need for manual structure design. Additionally, we show that the designed basic graph operations in the search space are able to model different interactions in videos. Experiments on two interaction datasets show that our method achieves performance competitive with the state of the art.
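A minimal DARTS-style sketch of the differentiable search: candidate graph operations are mixed with softmax-weighted architecture parameters learned by gradient descent. The two candidate relations shown are illustrative stand-ins for the paper's basic graph operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedGraphOp(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(dim, dim), nn.Linear(dim, dim)])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x, adj_sim, adj_spatial):   # x: (N, D); two relation graphs
        outs = [adj_sim @ self.ops[0](x),         # e.g., appearance-similarity relation
                adj_spatial @ self.ops[1](x)]     # e.g., spatial-proximity relation
        w = F.softmax(self.alpha, dim=0)          # differentiable op selection
        return sum(wi * oi for wi, oi in zip(w, outs))

x = torch.randn(5, 128)
adj = torch.softmax(torch.randn(5, 5), dim=-1)
y = MixedGraphOp()(x, adj, adj)  # (5, 128); in DARTS, alpha is updated on validation data
```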
Important people detection aims to automatically detect the individuals who play the most important roles in a social event image, which requires the model to understand a high-level pattern. However, existing methods rely heavily on supervised learning with large quantities of annotated image samples, which are more costly to collect for important people detection than for individual entity recognition (e.g., object recognition). To overcome this problem, we propose learning important people detection on partially annotated images. Our approach iteratively learns to assign pseudo-labels to individuals in un-annotated images and to update the important people detection model based on data with both labels and pseudo-labels. To alleviate the pseudo-labelling imbalance problem, we introduce a ranking strategy for pseudo-label estimation together with two weighting strategies: one weights the confidence that individuals are important people to strengthen the learning on important people, and the other neglects noisy unlabelled images (i.e., images without any important people). We have collected two large-scale datasets for evaluation. Extensive experimental results clearly confirm the efficacy of our method, which leverages unlabelled images to improve the performance of important people detection.
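A minimal sketch of one pseudo-labelling step under the scheme above: score every individual in an unlabelled image, convert scores to pseudo-labels with a ranking rule (the top-ranked person is "important"), and derive a confidence weight plus an image-level weight that drops likely noisy images. The threshold and function names are illustrative, not the paper's exact formulation.

```python
import torch

def pseudo_label_image(person_scores, noise_threshold=0.5):
    # person_scores: (num_people,) importance scores for one un-annotated image
    probs = torch.softmax(person_scores, dim=0)
    pseudo = torch.zeros_like(probs)
    pseudo[probs.argmax()] = 1.0          # ranking strategy: top person is important
    confidence = probs.max().item()       # weight for the importance-learning term
    image_weight = 0.0 if confidence < noise_threshold else 1.0  # neglect noisy images
    return pseudo, confidence, image_weight

pseudo, conf, w = pseudo_label_image(torch.randn(6))
```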
Unsupervised learning of identity-discriminative visual features is appealing in real-world tasks where manual labelling is costly. However, the images of an identity can be visually discrepant when they are taken under different states, e.g., different camera views and poses. This visual discrepancy leads to great difficulty in unsupervised discriminative learning. Fortunately, in real-world tasks we can often know the states without human annotation; e.g., we can easily obtain camera view labels in person re-identification and facial pose labels in face recognition. In this work, we propose utilizing the state information as weak supervision to address the visual discrepancy caused by different states. We formulate a simple pseudo-label model and utilize the state information to refine the assigned pseudo labels via weakly supervised decision boundary rectification and weakly supervised feature drift regularization. We evaluate our model on unsupervised person re-identification and pose-invariant face recognition. Despite its simplicity, our method outperforms the state-of-the-art results on the Duke-reID, MultiPIE, and CFP datasets with a standard ResNet-50 backbone. We also find that our model performs comparably with standard supervised fine-tuning on the three datasets. Code is available at https://github.com/KovenYu/state-information
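A minimal sketch of using state information (e.g., camera-view labels) to refine pseudo labels: distances to cluster centroids are normalized within each state before assignment, so the decision boundary is not biased by view-specific appearance. This per-state rectification rule is an illustrative simplification of the paper's method.

```python
import torch

def rectified_pseudo_labels(features, centroids, states):
    # features: (N, D), centroids: (K, D), states: (N,) integer state labels
    dist = torch.cdist(features, centroids)  # (N, K) distances to cluster centroids
    for s in states.unique():
        m = states == s
        dist[m] = (dist[m] - dist[m].mean()) / (dist[m].std() + 1e-6)  # per-state z-score
    return dist.argmin(dim=1)                # refined pseudo labels

labels = rectified_pseudo_labels(torch.randn(100, 64), torch.randn(10, 64),
                                 torch.randint(0, 4, (100,)))
```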
Person re-identification (re-id), the process of matching pedestrian images across different camera views, is an important task in visual surveillance. Substantial development of re-id has recently been observed, and the majority of existing models largely depend on color appearance and assume that pedestrians do not change their clothes across camera views. This limitation, however, becomes an issue when tracking a person at different places and at different times, where that person (e.g., a criminal suspect) may change his/her clothes; most existing methods then fail, since they rely heavily on color appearance and are thus inclined to match a person to another person wearing similar clothes. In this work, we call person re-id under clothing change "cross-clothes person re-id". As a first attempt at solving this problem based on visible light images, we consider the case in which a person changes his/her clothes only moderately; that is, we assume that a person wears clothes of a similar thickness, so that the shape of the person would not change significantly when the weather does not change substantially within a short period of time. We perform cross-clothes person re-id based on a contour sketch of the person image to exploit the shape of the human body, instead of color information, for extracting features that are robust to moderate clothing change. Due to the lack of a large-scale dataset for cross-clothes person re-id, we contribute a new dataset consisting of 33,698 images from 221 identities. Our experiments illustrate the challenges of cross-clothes person re-id and demonstrate the effectiveness of our proposed method.
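A minimal sketch of producing a contour-sketch input from an RGB image with a Sobel edge filter, so the downstream network sees body shape rather than clothing color; the paper's actual contour sketch extraction may differ.

```python
import torch
import torch.nn.functional as F

def contour_sketch(img):                      # img: (B, 3, H, W) in [0, 1]
    gray = img.mean(dim=1, keepdim=True)      # discard color information
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                   # Sobel kernels for x and y gradients
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)  # edge magnitude as the contour sketch

sketch = contour_sketch(torch.rand(2, 3, 128, 64))  # (2, 1, 128, 64)
```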
The scalability problem caused by the difficulty of annotating Person Re-identification (Re-ID) datasets has become a crucial bottleneck in the development of Re-ID. To address this problem, many unsupervised Re-ID methods have recently been proposed. Nevertheless, most of these models require transfer from another auxiliary fully supervised dataset, which is still expensive to obtain. In this work, we propose a Re-ID model based on Weakly Supervised Tracklets (WST) data from various camera views, which can be inexpensively acquired by combining the fragmented tracklets of the same person in the same camera view over a period of time. We formulate our weakly supervised tracklets Re-ID model via a novel method named deep feature-wise mutual learning (DFML), which consists of Mutual Learning on Feature Extractors (MLFE) and Mutual Learning on Feature Classifiers (MLFC). We propose MLFE by leveraging two feature extractors that learn from each other to extract more robust and discriminative features. On the other hand, we propose MLFC by adapting discriminative features from various camera views to each classifier. Extensive experiments demonstrate the superiority of our proposed DFML over state-of-the-art unsupervised models, and even over some supervised models, on three Re-ID benchmark datasets.
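A minimal sketch of mutual learning between two feature extractors in the spirit of MLFE: each branch's predictions are pulled toward the other's (detached) predictions with a KL term, alongside the usual tracklet-label loss. The MLFC classifier adaptation and the loss weighting are omitted or assumed here.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_a, logits_b, targets, kl_weight=1.0):
    # logits_a, logits_b: (B, C) outputs of the two feature extractors' classifiers
    ce = F.cross_entropy(logits_a, targets) + F.cross_entropy(logits_b, targets)
    kl_ab = F.kl_div(F.log_softmax(logits_a, dim=1),
                     F.softmax(logits_b, dim=1).detach(), reduction='batchmean')
    kl_ba = F.kl_div(F.log_softmax(logits_b, dim=1),
                     F.softmax(logits_a, dim=1).detach(), reduction='batchmean')
    return ce + kl_weight * (kl_ab + kl_ba)   # each branch learns from the other

loss = mutual_learning_loss(torch.randn(8, 100), torch.randn(8, 100),
                            torch.randint(0, 100, (8,)))
```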
Exposure correction is one of the fundamental tasks in image processing and computational photography. While various methods have been proposed, they either fail to produce visually pleasing results or only work well for limited types of images (e.g., underexposed images). In this paper, we present a novel automatic exposure correction method that robustly produces high-quality results for images under various exposure conditions (e.g., underexposed, overexposed, and partially under- and over-exposed). At the core of our approach is the proposed dual illumination estimation, where we separately cast under- and over-exposure correction as trivial illumination estimation of the input image and of the inverted input image. By performing dual illumination estimation, we obtain two intermediate exposure correction results for the input image, where one fixes the underexposed regions and the other restores the overexposed regions. A multi-exposure image fusion technique is then employed to adaptively blend the visually best exposed parts of the two intermediate correction results and the input image into a globally well-exposed image. Experiments on a number of challenging images demonstrate the effectiveness of the proposed approach and its superiority over state-of-the-art methods and popular automatic exposure correction tools.
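A minimal sketch of dual illumination estimation: estimate an illumination map for the input (fixing underexposure) and for the inverted input (fixing overexposure), then blend the two corrections. The box-filter smoothing and the simple average stand in for the paper's edge-preserving illumination refinement and multi-exposure fusion.

```python
import torch
import torch.nn.functional as F

def correct(img, eps=1e-3):                        # Retinex-style: img = result * illum
    illum = img.max(dim=1, keepdim=True).values    # per-pixel max over RGB channels
    illum = F.avg_pool2d(illum, 5, stride=1, padding=2)  # crude illumination smoothing
    return (img / illum.clamp(min=eps)).clamp(0, 1)

def dual_exposure_correction(img):                 # img: (B, 3, H, W) in [0, 1]
    under_fixed = correct(img)                     # brightens underexposed regions
    over_fixed = 1.0 - correct(1.0 - img)          # darkens overexposed regions
    return 0.5 * (under_fixed + over_fixed)        # simple average in place of fusion

out = dual_exposure_correction(torch.rand(1, 3, 64, 64))
```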
This paper studies the problem of aligning a set of face images of the same individual into a normalized image while removing outliers such as partial occlusion, extreme facial expression, and significant illumination variation. Our model seeks an optimal image domain transformation such that the matrix of misaligned images can be decomposed as the sum of a sparse matrix of noise and a rank-one matrix of aligned images. The image transformation is learned in an unsupervised manner; that is, ground-truth aligned images are unnecessary for our model. Specifically, we exploit the remarkable non-linear transformation ability of a generative adversarial network (GAN) and guide it with a low-rank generation constraint as well as a sparse noise constraint to achieve face alignment. We verify the efficacy of the proposed model with extensive experiments on real-world face databases, demonstrating higher accuracy and efficiency than existing methods.
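A minimal sketch of the alignment objective implied by this decomposition: the stack of generator outputs should be (near) rank one, and the residual between inputs and outputs should be sparse. Nuclear and L1 norms serve as the usual convex surrogates; the adversarial term of the full model and the trade-off weight are assumptions here.

```python
import torch

def low_rank_sparse_loss(aligned, misaligned, lam=0.1):
    # aligned, misaligned: (N, H*W) flattened image stacks (one row per image)
    low_rank = torch.linalg.matrix_norm(aligned, ord='nuc')  # encourage rank one
    sparse = (misaligned - aligned).abs().sum()              # sparse noise residual
    return low_rank + lam * sparse

loss = low_rank_sparse_loss(torch.randn(10, 1024), torch.randn(10, 1024))
```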