Tao Mei

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Jan 11, 2022

Yehao Li, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, Tao Mei

Abstract: Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from a rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities in this hierarchy including, from simple to comprehensive, an individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it to four vision-language perception and generation downstream tasks.

* ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Smart Director: An Event-Driven Directing System for Live Broadcasting

Jan 11, 2022

Yingwei Pan, Yue Chen, Qian Bao, Ning Zhang, Ting Yao, Jingen Liu, Tao Mei

Abstract: Live video broadcasting normally requires a multitude of skills and expertise with domain knowledge to enable multi-camera productions. As the number of cameras keeps increasing, directing a live sports broadcast has become more complicated and challenging than ever before. Broadcast directors need to be much more concentrated, responsive, and knowledgeable during the production. To relieve directors of this intensive effort, we develop an innovative automated sports broadcast directing system, called Smart Director, which aims at mimicking the typical human-in-the-loop broadcasting process to automatically create near-professional broadcasting programs in real time by using a set of advanced multi-view video analysis algorithms. Inspired by the so-called "three-event" construction of sports broadcast, we build our system with an event-driven pipeline consisting of three consecutive novel components: 1) the Multi-view Event Localization to detect events by modeling multi-view correlations, 2) the Multi-view Highlight Detection to rank camera views by visual importance for view selection, and 3) the Auto-Broadcasting Scheduler to control the production of broadcasting videos. To the best of our knowledge, our system is the first end-to-end automated directing system for multi-camera sports broadcasting, completely driven by the semantic understanding of sports events. It is also the first system to solve the novel problem of multi-view joint event detection by cross-view relation modeling. We conduct both objective and subjective evaluations on a real-world multi-camera soccer dataset, which demonstrate that the quality of our auto-generated videos is comparable to that of human-directed videos. Thanks to its faster response, our system is able to capture more fast-passing and short-duration events that are usually missed by human directors.

* ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Boosting Video Representation Learning with Multi-Faceted Integration

Jan 11, 2022

Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xiao-Ping Zhang, Dong Wu, Tao Mei

Abstract: Video content is multifaceted, consisting of objects, scenes, interactions or actions. Existing datasets mostly label only one of the facets for model training, resulting in a video representation that is biased toward only one facet, depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, and whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps the video representation into a rich semantic embedding space, and jointly optimizes the video representation from two perspectives. The first capitalizes on the intra-facet supervision between each video and its own label descriptions, and the second predicts the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning a 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The pre-learnt 3D CNN with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.

* CVPR 2021
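
As a rough illustration of the visual-semantic embedding idea in the abstract above, the sketch below scores a video embedding against label embeddings by cosine similarity and returns the closest label. The function names, the toy 2-D vectors, and the choice of cosine similarity are assumptions made for illustration; the abstract does not specify MUFI's embedding model or loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_label(video_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the video embedding.

    label_embeddings is a hypothetical dict mapping label text to a vector
    in the same semantic space as the video embedding.
    """
    return max(
        label_embeddings,
        key=lambda name: cosine(video_embedding, label_embeddings[name]),
    )


# Toy example: labels from two different facets (an action and a scene).
labels = {"swimming": [1.0, 0.0], "kitchen": [0.0, 1.0]}
print(nearest_label([1.0, 0.1], labels))
```

Because labels from different datasets share one embedding space in this sketch, a video can be compared with labels of any facet, which is the property the abstract attributes to the semantic space.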

Condensing a Sequence to One Informative Frame for Video Recognition

Jan 11, 2022

Zhaofan Qiu, Ting Yao, Yan Shu, Chong-Wah Ngo, Tao Mei

Abstract: Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits an off-the-shelf image recognition system on the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization, and motion estimation, and two regularizers, i.e., adversarial learning and color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. With these, by jointly learning the frame synthesis in an end-to-end manner, the generated frame is expected to encapsulate the required spatio-temporal information useful for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. Compared with baseline methods that map a video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks, and achieves performance comparable to the state-of-the-art methods at lower computational cost.

* ICCV 2021

Optimization Planning for 3D ConvNets

Jan 11, 2022

Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

Abstract: It is not trivial to optimally learn 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and the various options of the training scheme. The most common hand-tuning process starts from learning 3D ConvNets using short video clips and is then followed by learning long-term temporal dependency using lengthy clips, while gradually decaying the learning rate from high to low as training progresses. The fact that such a process comes with several heuristic settings motivates the study to seek an optimal "path" to automate the entire training. In this paper, we decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state. The estimation of the knee point on the performance-epoch curve triggers the transition from one state to another. We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., the optimization path. Furthermore, we devise a new 3D ConvNet with a unique design of dual-head classifier to improve spatial and temporal discrimination. Extensive experiments on seven public video recognition benchmarks demonstrate the advantages of our proposal. With the optimization planning, our 3D ConvNet achieves superior results compared with the state-of-the-art recognition methods. More remarkably, we obtain top-1 accuracy of 80.5% and 82.7% on the Kinetics-400 and Kinetics-600 datasets, respectively. Source code is available at https://github.com/ZhaofanQiu/Optimization-Planning-for-3D-ConvNets.

* ICML 2021; Code is publicly available at: https://github.com/ZhaofanQiu/Optimization-Planning-for-3D-ConvNets
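
The abstract above does not say how the knee point on the performance-epoch curve is estimated. One common heuristic, shown here purely as an assumption and not as the authors' estimator, picks the point farthest from the straight line joining the first and last points of the curve. The function name `knee_point` and the accuracy values are hypothetical.

```python
def knee_point(values):
    """Return the index of the point farthest from the chord that joins
    the first and last points of the curve (a simple knee heuristic)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate a knee")
    dx = n - 1                      # horizontal extent of the chord
    dy = values[-1] - values[0]     # vertical extent of the chord
    best_idx, best_dist = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance to the chord, up to a constant factor
        # (the chord length), which does not change the argmax.
        dist = abs(dy * i - dx * (y - values[0]))
        if dist > best_dist:
            best_idx, best_dist = i, dist
    return best_idx


# Hypothetical validation accuracy per epoch within one training state.
accuracy = [0.0, 0.5, 0.8, 0.85, 0.88, 0.9]
print(knee_point(accuracy))  # epoch index after which gains flatten out
```

In a training loop, reaching the knee would be the signal to move to the next state, for example one with a lower learning rate or longer input clips.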

Responsive Listening Head Generation: A Benchmark Dataset and Baseline

Dec 27, 2021

Mohan Zhou, Yalong Bai, Wei Zhang, Tiejun Zhao, Tao Mei

Abstract: Responsive listening during face-to-face conversations is a critical element of social interaction and is well established in psychological research. Through non-verbal signals in response to the speakers' words, intonations, or behaviors in real time, listeners show how they are engaged in dialogue. In this work, we build the Responsive Listener Dataset (RLD), a conversation video corpus collected from public resources featuring 67 speakers and 76 listeners with three different attitudes. We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs, including the audio and visual signals of the speaker. Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields, including human-to-human interaction, video-to-video translation, cross-modal understanding, and generation. Furthermore, we release an attitude-conditioned listening head generation baseline. Project page: https://project.mhzhou.com/rld.

* 12 pages, 9 figures

Putting People in their Place: Monocular Regression of 3D People in Depth

Dec 15, 2021

Yu Sun, Wu Liu, Qian Bao, Yili Fu, Tao Mei, Michael J. Black

Abstract: Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g. from infants to adults. To solve this, we need several things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work that estimates multiple people does so by reasoning in the image plane, our method, called BEV, adds an imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combining these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset will be released for research purposes.

* Code will be available at https://github.com/Arthur151/ROMP

Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration

Dec 15, 2021

Yu Wang, Jingyang Lin, Jingjing Zou, Yingwei Pan, Ting Yao, Tao Mei

Abstract: Our work reveals a structured shortcoming of the existing mainstream self-supervised learning methods. Whereas self-supervised learning frameworks usually take the prevailing perfect instance-level invariance hypothesis for granted, we carefully investigate the pitfalls behind it. In particular, we argue that the existing augmentation pipeline for generating multiple positive views naturally introduces out-of-distribution (OOD) samples that undermine the learning of the downstream tasks. Generating diverse positive augmentations on the input does not always pay off in benefiting downstream tasks. To overcome this inherent deficiency, we introduce a lightweight latent variable model, UOTA, targeting the view sampling issue for self-supervised learning. UOTA adaptively searches for the most important sampling region to produce views, and provides a viable choice for outlier-robust self-supervised learning approaches. Our method directly generalizes to many mainstream self-supervised learning approaches, regardless of whether the loss is contrastive or not. We empirically show UOTA's advantage over the state-of-the-art self-supervised paradigms by an evident margin, which well justifies the existence of the OOD sample issue embedded in the existing approaches. In particular, we theoretically prove that the merits of the proposal boil down to guaranteed estimator variance and bias reduction. Code is available at https://github.com/ssl-codelab/uota.

* NeurIPS 2021; Code is publicly available at: https://github.com/ssl-codelab/uota

A Style and Semantic Memory Mechanism for Domain Generalization

Dec 14, 2021

Yang Chen, Yu Wang, Yingwei Pan, Ting Yao, Xinmei Tian, Tao Mei

Abstract: Mainstream state-of-the-art domain generalization algorithms tend to prioritize the assumption of semantic invariance across domains. Meanwhile, the inherent intra-domain style invariance is usually underappreciated and put on the shelf. In this paper, we reveal that leveraging intra-domain style invariance is also of pivotal importance in improving the efficiency of domain generalization. We verify that it is critical for the network to be informed about which domain features are invariant and shared among instances, so that the network sharpens its understanding and improves its semantic discriminative ability. Correspondingly, we also propose a novel "jury" mechanism, which is particularly effective in learning useful semantic feature commonalities among domains. Our complete model, called STEAM, can be interpreted as a novel probabilistic graphical model, for which the implementation requires convenient constructions of two kinds of memory banks: a semantic feature bank and a style feature bank. Empirical results show that our proposed framework surpasses the state-of-the-art methods by clear margins.

* ICCV 2021

Transferrable Contrastive Learning for Visual Domain Adaptation

Dec 14, 2021

Yang Chen, Yingwei Pan, Yu Wang, Ting Yao, Xinmei Tian, Tao Mei

Abstract: Self-supervised learning (SSL) has recently become the favorite among feature learning methodologies. It is therefore appealing for domain adaptation approaches to consider incorporating SSL. The intuition is to enforce instance-level feature consistency such that the predictor becomes somewhat invariant across domains. However, most existing SSL methods in the regime of domain adaptation are usually treated as standalone auxiliary components, leaving the signatures of domain adaptation unattended. In fact, the optimal region where the domain gap vanishes and the instance-level constraint that SSL pursues may not coincide at all. From this point, we present a particular paradigm of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive Learning (TCL), which links the SSL and the desired cross-domain transferability congruently. We find contrastive learning intrinsically a suitable candidate for domain adaptation, as its instance invariance assumption can be conveniently promoted to cross-domain class-level invariance favored by domain adaptation tasks. Based on particular memory bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class domain discrepancy between source and target through a clean and novel contrastive loss. The free lunch is that, thanks to the incorporation of contrastive learning, TCL relies on a moving-averaged key encoder that naturally achieves a temporally ensembled version of pseudo labels for target data, which avoids pseudo label error propagation at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet) for both single-source and multi-source domain adaptation tasks, TCL has demonstrated state-of-the-art performance.

* ACM Multimedia 2021
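
The "moving-averaged key encoder" mentioned in the abstract above can be sketched as an exponential moving average of the query encoder's parameters, as is common in momentum-based contrastive methods. This is a minimal sketch under that assumption: the function name, the flat parameter lists, and the momentum values are hypothetical, and TCL's actual update rule is not given in the abstract.

```python
def momentum_update(query_params, key_params, momentum=0.999):
    """Return new key-encoder parameters as an exponential moving average.

    Each key parameter moves a small step toward the matching query
    parameter: new_key = momentum * key + (1 - momentum) * query.
    """
    return [
        momentum * k + (1.0 - momentum) * q
        for q, k in zip(query_params, key_params)
    ]


# Toy example with two scalar "parameters" and a deliberately low momentum.
key = [0.0, 0.0]
query = [2.0, 4.0]
for _ in range(3):
    key = momentum_update(query, key, momentum=0.5)
print(key)  # drifts toward the query parameters without jumping to them
```

Because the key encoder changes slowly, predictions made with it average over recent training steps, which is the sense in which the abstract describes the resulting pseudo labels as temporally ensembled.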
