Long Zhao

Rutgers University

COMPOSER: Compositional Learning of Group Activity in Videos

Dec 11, 2021

Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf

Abstract: Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip. The task requires the compositional understanding of scene entities and relational reasoning between them. We approach GAR by modeling the video as a series of tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer-based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, we use only the keypoint modality, which reduces scene biases and improves the generalization ability of the model. We improve the multi-scale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and novel data augmentations (e.g., Actor Dropout) to aid model training. We demonstrate the model's strength and interpretability on the challenging Volleyball dataset. COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality. COMPOSER outperforms the latest GAR methods that rely on RGB signals, and performs favorably compared against methods that exploit multiple modalities. Our code will be available.
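
The abstract names Actor Dropout as a data augmentation but does not define it. As a rough sketch of one plausible reading, assuming a clip is stored as a nested list indexed by frame, actor, and keypoint, the snippet below zeroes out every keypoint of randomly chosen actors. The function name `actor_dropout`, the drop probability, the data layout, and the zero-fill convention are all illustrative assumptions, not details taken from the paper.

```python
import random


def actor_dropout(clip, drop_prob=0.2, rng=None):
    """Return (augmented_clip, dropped_actor_indices).

    clip: list of frames; each frame is a list of actors; each actor is a
          list of (x, y) keypoints. This layout is an assumption.
    """
    rng = rng or random.Random()
    num_actors = len(clip[0]) if clip else 0
    # Decide once per clip which actors to drop, so a dropped actor is
    # missing consistently across all frames.
    dropped = {a for a in range(num_actors) if rng.random() < drop_prob}
    augmented = []
    for frame in clip:
        new_frame = []
        for a, keypoints in enumerate(frame):
            if a in dropped:
                # Hypothetical convention: a dropped actor becomes all zeros.
                new_frame.append([(0.0, 0.0)] * len(keypoints))
            else:
                new_frame.append(list(keypoints))
        augmented.append(new_frame)
    return augmented, dropped
```

The input clip is left unmodified, so the same sample can be augmented differently on each pass over the data.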

Out-of-domain Generalization from a Single Source: A Uncertainty Quantification Approach

Aug 05, 2021

Xi Peng, Fengchun Qiao, Long Zhao

Abstract: We study a worst-case scenario in generalization: out-of-domain generalization from a single source. The goal is to learn a robust model from a single source and expect it to generalize over many unknown distributions. This challenging problem has seldom been investigated, and existing solutions suffer from limitations such as ignoring uncertainty assessment and label augmentation. In this paper, we propose uncertainty-guided domain generalization to tackle these limitations. The key idea is to augment the source capacity in both feature and label spaces, with the augmentation guided by uncertainty assessment. To the best of our knowledge, this is the first work to (1) quantify the generalization uncertainty from a single source and (2) leverage it to guide both feature and label augmentation for robust generalization. The model training and deployment are effectively organized in a Bayesian meta-learning framework. We conduct extensive comparisons and ablation studies to validate our approach. The results demonstrate superior performance across a wide range of tasks including image classification, semantic segmentation, text classification, and speech recognition.

* 14 pages, 12 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (under review)

Improved Transformer for High-Resolution GANs

Jun 14, 2021

Long Zhao, Zizhao Zhang, Ting Chen, Dimitris N. Metaxas, Han Zhang

Abstract: Attention-based models, exemplified by the Transformer, can effectively model long-range dependency, but suffer from the quadratic complexity of the self-attention operation, which makes them difficult to adopt for high-resolution image generation based on Generative Adversarial Networks (GANs). In this paper, we introduce two key ingredients to the Transformer to address this challenge. First, in low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention, which allows efficient mixing of local and global attention. Second, in high-resolution stages, we drop self-attention while only keeping multi-layer perceptrons reminiscent of the implicit neural function. To further improve the performance, we introduce an additional self-modulation component based on cross-attention. The resulting model, denoted as HiT, has a linear computational complexity with respect to the image size and thus directly scales to synthesizing high-definition images. We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively, with a reasonable throughput. We believe the proposed HiT is an important milestone for generators in GANs that are completely free of convolutions.

* Preprint
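
To illustrate why global self-attention is called quadratic, the sketch below counts query-key pairs for full attention and for attention restricted to fixed-size blocks. It is generic cost arithmetic only: the function names and the block length of 64 are assumptions, and it does not reproduce HiT's multi-axis blocked self-attention.

```python
def global_attention_pairs(num_tokens):
    """Every token attends to every token: N * N query-key pairs."""
    return num_tokens * num_tokens


def blocked_attention_pairs(num_tokens, block_len):
    """Each token attends only within its own block: N * block_len pairs."""
    if num_tokens % block_len:
        raise ValueError("num_tokens must be divisible by block_len")
    return num_tokens * block_len


# One token per pixel of a 128 x 128 image (an assumption for illustration).
tokens = 128 * 128
print(global_attention_pairs(tokens))        # 268435456
print(blocked_attention_pairs(tokens, 64))   # 1048576
```

With the block length held fixed, the blocked count grows linearly in the number of tokens, while the global count grows with its square.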

Aggregating Nested Transformers

May 26, 2021

Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Tomas Pfister

Abstract: Although hierarchical structures are popular in recent vision transformers, they require sophisticated designs and massive datasets to work well. In this work, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires only minor code changes to the original vision transformer and obtains improved performance compared to existing methods. Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization. For example, a NesT with 68M parameters trained on ImageNet for 100/300 epochs achieves $82.3\%/83.8\%$ accuracy evaluated on $224\times 224$ image size, outperforming previous methods with up to $57\%$ parameter reduction. Training a NesT with 6M parameters from scratch on CIFAR10 achieves $96\%$ accuracy using a single GPU, setting a new state of the art for vision transformers. Beyond image classification, we extend the key idea to image generation and show NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators. Furthermore, we also propose a novel method for visually interpreting the learned model.

* Preprint
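
To make the idea of non-overlapping image blocks concrete, here is a minimal sketch that partitions a 2D grid into square blocks. The helper name `blockify` and the list-of-lists representation are assumptions for illustration; this is not the NesT implementation, and it leaves out the local transformers and the block aggregation function entirely.

```python
def blockify(grid, block_size):
    """Split a 2D grid (list of rows) into non-overlapping square blocks.

    Returns the blocks in row-major order; each block is a list of rows.
    """
    height, width = len(grid), len(grid[0])
    if height % block_size or width % block_size:
        raise ValueError("grid dimensions must be divisible by block_size")
    blocks = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            # Slice block_size rows, then block_size columns from each row.
            block = [row[left:left + block_size]
                     for row in grid[top:top + block_size]]
            blocks.append(block)
    return blocks


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = blockify(grid, 2)
print(len(blocks))  # 4
print(blocks[0])    # [[0, 1], [4, 5]]
```

Each cell lands in exactly one block, which is what "non-overlapping" means here; any communication between blocks would have to come from a separate aggregation step.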

More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints

May 20, 2021

Yuxiao Chen, Jianbo Yuan, Long Zhao, Rui Luo, Larry Davis, Dimitris N. Metaxas

Abstract: Attention mechanisms have been widely applied to cross-modal tasks such as image captioning and information retrieval, and have achieved remarkable improvements due to their capability to learn fine-grained relevance across different modalities. However, existing attention models can be sub-optimal and imprecise because there is no direct supervision involved during training. In this work, we propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation. These constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations. Additionally, we introduce three metrics, namely Attention Precision, Recall and F1-Score, to quantitatively evaluate the attention quality. We evaluate the proposed constraints on the cross-modal retrieval (image-text matching) task. The experiments on both Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves the model performance in terms of both retrieval accuracy and attention metrics.

SMIL: Multimodal Learning with Severely Missing Modality

Mar 09, 2021

Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, Xi Peng

Abstract: A common assumption in multimodal learning is the completeness of training data, i.e., full modalities are available in all training examples. Although there have been research efforts to develop methods that tackle incomplete testing data, e.g., modalities that are partially missing in testing examples, few of them can handle incomplete training modalities. The problem becomes even more challenging when modalities are severely missing, e.g., when 90% of training examples have incomplete modalities. For the first time in the literature, this paper formally studies multimodal learning with missing modality in terms of flexibility (missing modalities in training, testing, or both) and efficiency (most training data have incomplete modality). Technically, we propose a new method named SMIL that leverages Bayesian meta-learning to uniformly achieve both objectives. To validate our idea, we conduct a series of experiments on three popular benchmarks: MM-IMDb, CMU-MOSI, and avMNIST. The results demonstrate the state-of-the-art performance of SMIL over existing methods and generative baselines including autoencoders and generative adversarial networks. Our code is available at https://github.com/mengmenm/SMIL.

* In AAAI 2021 (9 pages)

Box Re-Ranking: Unsupervised False Positive Suppression for Domain Adaptive Pedestrian Detection

Feb 01, 2021

Weijie Chen, Yilu Guo, Shicai Yang, Zhaoyang Li, Zhenxin Ma, Binbin Chen, Long Zhao, Di Xie, Shiliang Pu, Yueting Zhuang

Abstract: False positives are one of the most serious problems caused by agnostic domain shift in domain adaptive pedestrian detection. However, it is impossible to label each box in countless target domains. We therefore turn our attention to suppressing false positives in each target domain in an unsupervised way. In this paper, we recast the object detection task as a ranking task among positive and negative boxes, and thus transform the false positive suppression problem into a box re-ranking problem, which makes it feasible to solve without manual annotation. A further problem that arises during box re-ranking is that no labeled validation data is available for cherry-picking. Because we aim to keep the detection of true positives unchanged, we propose box number alignment, a self-supervised evaluation metric, to prevent the capacity of the optimized model from degenerating. Extensive experiments conducted on cross-domain pedestrian detection datasets have demonstrated the effectiveness of our proposed framework. Furthermore, the extension to two general unsupervised domain adaptive object detection benchmarks also shows that our framework outperforms other state-of-the-art methods.

Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization

Dec 02, 2020

Long Zhao, Yuxiao Wang, Jiaping Zhao, Liangzhe Yuan, Jennifer J. Sun, Florian Schroff, Hartwig Adam, Xi Peng, Dimitris Metaxas, Ting Liu

Abstract: We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization (CV-MIM), which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. We further propose two regularization terms to ensure disentanglement and smoothness of the learned representations. The resulting pose representations can be used for cross-view action recognition. To evaluate the power of the learned representations, in addition to the conventional fully-supervised action recognition settings, we introduce a novel task called single-shot cross-view action recognition. This task trains models with actions from only a single viewpoint, while models are evaluated on poses captured from all possible viewpoints. We evaluate the learned representations on standard benchmarks for action recognition, and show that (i) CV-MIM performs competitively compared with the state-of-the-art models in the fully-supervised scenarios; (ii) CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting; and (iii) the learned representations can significantly boost the performance when reducing the amount of supervised training data.

View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose

Oct 23, 2020

Ting Liu, Jennifer J. Sun, Long Zhao, Jiaping Zhao, Liangzhe Yuan, Yuxiao Wang, Liang-Chieh Chen, Florian Schroff, Hartwig Adam

Abstract: Recognition of human poses and activities is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing work. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we use probabilistic embeddings. In order to enable our embeddings to work with partially visible input keypoints, we further investigate different keypoint occlusion augmentation strategies during training. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We further show that keypoint occlusion augmentation during training significantly improves retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that our embeddings, without any additional training, achieve competitive performance relative to other models specifically trained for each task.

* Code is available at https://github.com/google-research/google-research/tree/master/poem . Video synchronization results are available at https://drive.google.com/corp/drive/folders/1nhPuEcX4Lhe6iK3nv84cvSCov2eJ52Xy. arXiv admin note: text overlap with arXiv:1912.01001

Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Oct 15, 2020

Long Zhao, Ting Liu, Xi Peng, Dimitris Metaxas

Abstract: Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin.

* Accepted to NeurIPS 2020. Code is available at https://github.com/garyzhao/ME-ADA
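
The abstract says the regularizer favors perturbations that enlarge the model's predictive uncertainty. A common way to quantify predictive uncertainty is the Shannon entropy of the softmax output, and the sketch below computes that quantity in plain Python. It illustrates the entropy term only: the helper names are hypothetical, and the abstract does not state how the term is weighted or combined with the adversarial objective, so that part is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    probs = softmax(logits)
    # Skip zero probabilities, which contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = predictive_entropy([10.0, 0.0, 0.0])
uncertain = predictive_entropy([0.0, 0.0, 0.0])
print(confident < uncertain)  # True
```

Entropy is largest for a uniform prediction and shrinks as the model becomes more confident, so a perturbation that raises it is one the current model is less sure about.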
