Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a deep benchmarking of different downstream tasks are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, our model achieves an average accuracy of 73.03%. For the image-text retrieval task, our model achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than the result of WenLan 2.0. More information can be found at https://wukong-dataset.github.io/wukong-dataset/.
The large-scale reflector array of programmable metasurfaces is capable of increasing the power efficiency of backscatter communications via passive beamforming and thus has the potential to revolutionize the low-data-rate nature of backscatter communications. In this paper, we propose to design the power-efficient higher-order constellation and reflection pattern under the amplitude constraint brought by backscatter communications. For constellation design, we adopt the amplitude and phase-shift keying (APSK) constellation and optimize the parameters of APSK such as ring number, ring radius, and inter-ring phase difference. Specifically, we derive closed-form solutions to the optimal ring radius and inter-ring phase difference for an arbitrary modulation order. For reflection pattern design, we propose to optimize the passive beamforming vector by solving a multi-objective optimization problem that maximizes reflection power and guarantees beam homogenization within the angle range of interest. To solve the problem, we propose a constant-modulus power iteration method, proven to monotonically increase the objective function at each iteration. Numerical results show that the proposed APSK constellation design and reflection pattern design outperform the existing modulation and beam pattern design in programmable metasurface-enabled backscatter communications.
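To make the APSK parameterization concrete, the following sketch builds a constellation from per-ring point counts, radii, and phase offsets. The specific values used below are arbitrary placeholders for illustration, not the closed-form optima derived in the paper.

```python
import numpy as np

def apsk_constellation(ring_sizes, ring_radii, ring_phases):
    """Build an APSK constellation as an array of complex points.

    ring_sizes : points per ring, e.g. [4, 12] for 16-APSK
    ring_radii : radius of each ring
    ring_phases: phase offset of each ring (sets the inter-ring
                 phase difference)
    """
    points = []
    for n, r, phi in zip(ring_sizes, ring_radii, ring_phases):
        # n points evenly spaced on a ring of radius r, rotated by phi
        angles = phi + 2 * np.pi * np.arange(n) / n
        points.append(r * np.exp(1j * angles))
    return np.concatenate(points)

# Illustrative 16-APSK: 4 inner + 12 outer points; the radii and the
# inter-ring phase offset here are arbitrary, not the paper's optima.
c = apsk_constellation([4, 12], [1.0, 2.5], [np.pi / 4, 0.0])
```

In the paper's setting, the ring radii would additionally be constrained by the reflection amplitude limit of the backscatter device.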
In this paper, we introduce VCSL (Video Copy Segment Localization), a new comprehensive segment-level annotated video copy dataset. Compared with existing copy detection datasets restricted to either video-level annotation or small scale, VCSL not only has two orders of magnitude more segment-level labelled data, with 160k realistic video copy pairs containing more than 280k localized copied segment pairs, but also covers a variety of video categories and a wide range of video durations. All the copied segments inside each collected video pair are manually extracted and accompanied by precisely annotated starting and ending timestamps. Alongside the dataset, we also propose a novel evaluation protocol that better measures the prediction accuracy of copy overlapping segments between a video pair and shows improved adaptability in different scenarios. By benchmarking several baseline and state-of-the-art segment-level video copy detection methods with the proposed dataset and evaluation metric, we provide a comprehensive analysis that uncovers the strengths and weaknesses of current approaches, hoping to open up promising directions for future work. The VCSL dataset, metric and benchmark codes are all publicly available at https://github.com/alipay/VCSL.
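The idea of scoring predicted copied segments against ground-truth segments can be sketched with a simple temporal-overlap precision/recall computation. This is a deliberately simplified stand-in, not the actual VCSL protocol, which handles two-dimensional (query/reference time) segment matching.

```python
def overlap(a, b):
    """Length of the temporal intersection of two (start, end) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def segment_precision_recall(preds, gts):
    """Simplified segment-level precision/recall by total overlap length.

    Illustrative only: assumes non-overlapping segments within each
    list; the real VCSL metric operates on 2-D copied segment pairs.
    """
    inter = sum(overlap(p, g) for p in preds for g in gts)
    pred_len = sum(e - s for s, e in preds)
    gt_len = sum(e - s for s, e in gts)
    precision = inter / pred_len if pred_len else 0.0
    recall = inter / gt_len if gt_len else 0.0
    return precision, recall
```

For example, a predicted segment (0 s, 10 s) against a ground-truth segment (5 s, 15 s) yields precision 0.5 and recall 0.5.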
People naturally produce spontaneous body motions to enhance their speech while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompose the co-speech motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture, i.e., a pose mode branch for primary posture generation and a rhythmic motion branch for rhythmic dynamics synthesis. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synced with the speech prosody. Extensive experiments demonstrate superior performance over several baselines in terms of motion diversity, quality, and synchronization with speech. Code and pre-trained models will be publicly available through https://github.com/TheTempAccount/Co-Speech-Motion-Generation.
Pseudo-label-based semi-supervised learning (SSL) has achieved great success in utilizing raw data. However, its training procedure suffers from confirmation bias due to the noise contained in self-generated artificial labels. Moreover, the model's judgment becomes noisier in real-world applications with extensive out-of-distribution data. To address this issue, we propose a general method named Class-aware Contrastive Semi-Supervised Learning (CCSSL), which is a drop-in helper to improve the pseudo-label quality and enhance the model's robustness in the real-world setting. Rather than treating real-world data as a union set, our method separately handles reliable in-distribution data with class-wise clustering for blending into downstream tasks, and noisy out-of-distribution data with image-wise contrastive learning for better generalization. Furthermore, by applying target re-weighting, we successfully emphasize clean-label learning and simultaneously reduce noisy-label learning. Despite its simplicity, our proposed CCSSL yields significant performance improvements over the state-of-the-art SSL methods on the standard datasets CIFAR100 and STL10. On the real-world dataset Semi-iNat 2021, we improve FixMatch by 9.80% and CoMatch by 3.18%.
In recent years, knowledge graphs have been widely applied as a uniform way to organize data and have enhanced many tasks requiring knowledge. In the online shopping platform Taobao, we built a billion-scale e-commerce product knowledge graph. It organizes data uniformly and provides item knowledge services for various tasks such as item recommendation. Usually, such knowledge services are provided through triple data, but this implementation involves (1) tedious data selection on the product knowledge graph and (2) task-specific model design to infuse the triple knowledge. More importantly, the product knowledge graph is far from complete, resulting in error propagation to knowledge-enhanced tasks. To avoid these problems, we propose a Pre-trained Knowledge Graph Model (PKGM) for the billion-scale product knowledge graph. On the one hand, it provides item knowledge services in a uniform way, via service vectors, for embedding-based and item-knowledge-related task models without accessing triple data. On the other hand, its service is based on an implicitly completed product knowledge graph, overcoming the common incompleteness issue. We also propose two general ways to integrate the service vectors from PKGM into downstream task models. We test PKGM on five knowledge-related tasks: item classification, item resolution, item recommendation, scene detection, and sequential recommendation. Experimental results show that PKGM brings significant performance gains on these tasks, illustrating the usefulness of its service vectors.
Depression is increasingly impacting individuals both physically and psychologically worldwide. It has become a major global public health problem and attracts attention from various research fields. Traditionally, the diagnosis of depression is formulated through semi-structured interviews and supplementary questionnaires, which makes it heavily reliant on the physician's experience and subject to bias. Mental health monitoring and cloud-based remote diagnosis can be implemented through an automated depression diagnosis system. In this article, we propose an attention-based multimodality speech and text representation for depression prediction. Our model is trained to estimate the depression severity of participants using the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset. For the audio modality, we use the collaborative voice analysis repository (COVAREP) features provided by the dataset and employ a Bidirectional Long Short-Term Memory Network (Bi-LSTM) followed by a Time-distributed Convolutional Neural Network (T-CNN). For the text modality, we use global vectors for word representation (GloVe) to perform word embeddings, and the embeddings are fed into the Bi-LSTM network. Results show that both audio and text models perform well on the depression severity estimation task, with a best sequence-level F1 score of 0.9870 and a patient-level F1 score of 0.9074 for the audio model over five classes (healthy, mild, moderate, moderately severe, and severe), as well as a sequence-level F1 score of 0.9709 and a patient-level F1 score of 0.9245 for the text model over five classes. Results are similar for the multimodality fused model, with the highest F1 score of 0.9580 on the patient-level depression detection task over five classes. Experiments show statistically significant improvements over previous works.
This paper studies capturability and push recovery for quadrupedal locomotion. Despite the rich literature on capturability analysis and push recovery control for legged robots, existing tools are developed mainly for bipeds or humanoids. Distinct quadrupedal features such as point contacts and multiple swinging legs prevent direct application of these methods. To address this gap, we propose a switched systems model for quadruped dynamics, and instantiate the abstract viability concept for quadrupedal locomotion with a time-based gait. Capturability is characterized through a novel specification of dynamically balanced states that addresses the time-varying nature of quadrupedal locomotion and balance. A linear inverted pendulum (LIP) model is adopted to demonstrate the theory and show how the newly developed quadrupedal capturability can be used in motion planning for quadrupedal push recovery. We formulate and solve an explicit model predictive control (EMPC) problem whose optimal solution fully characterizes quadrupedal capturability with the LIP. Given this analysis, an optimization-based planning scheme is devised for determining footsteps and center of mass references during push recovery. To validate the effectiveness of the overall framework, we conduct numerous simulation and hardware experiments. Simulation results illustrate the necessity of considering dynamic balance for quadrupedal capturability, and the significant improvement in disturbance rejection with the proposed strategy. Experimental validations on a replica of the Mini Cheetah quadruped demonstrate up to a 100% improvement compared with the state of the art.
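To make the LIP-based capturability idea concrete, here is a minimal one-dimensional sketch of a one-step capture-point check. The capture-point formula is the standard LIP result; the reachability test is a simplification I introduce for illustration, standing in for the paper's EMPC-based characterization of dynamically balanced states.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def capture_point(x, xdot, h):
    """Instantaneous capture point of a linear inverted pendulum (LIP)
    with center-of-mass height h: the ground point where a foot must be
    placed to bring the CoM asymptotically to rest."""
    omega = math.sqrt(G / h)  # LIP natural frequency
    return x + xdot / omega

def is_capturable(x, xdot, h, foot_x, reach):
    """Simplified 1-step capturability check: the capture point must
    lie within the leg's reachable region around the stance foot."""
    cp = capture_point(x, xdot, h)
    return abs(cp - foot_x) <= reach
```

For a quadruped, the analogous check must also account for point contacts, multiple swing legs, and the gait timing, which is what the paper's switched-systems formulation addresses.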
This paper presents the concept of "model-based neural network" (MNN), which is inspired by the classic artificial neural network (ANN) but serves different usages. Instead of being used as a data-driven classifier, an MNN serves as a modeling tool with artfully defined inputs, outputs, and activation functions which have explicit physical meanings. Owing to the same layered form as an ANN, an MNN can also be optimized using the back-propagation (BP) algorithm. As an interesting application, the classic problem of line spectral estimation can be modeled by an MNN. We propose to first initialize the MNN by the fast Fourier transform (FFT) based spectral estimation, and then optimize the MNN by the BP algorithm, which automatically yields the maximum likelihood (ML) parameter estimation of the frequency spectrum. We also design a method of merging and pruning the hidden-layer nodes of the MNN, which can be used for model-order selection, i.e., to estimate the number of sinusoids. Numerical simulations verify the effectiveness of the proposed method.
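The FFT-initialize-then-refine idea can be sketched for a single complex sinusoid as follows, with a generic numerical optimizer standing in for the MNN's back-propagation. In this noiseless single-tone setting, the refined periodogram peak coincides with the ML frequency estimate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def periodogram(y, f):
    """Periodogram of samples y at normalized frequency f in [0, 1)."""
    n = np.arange(len(y))
    return np.abs(np.sum(y * np.exp(-2j * np.pi * f * n))) ** 2 / len(y)

def estimate_freq(y):
    """Single-sinusoid ML frequency estimate: coarse FFT peak picking,
    then numerical refinement of the periodogram peak (a stand-in for
    the paper's BP optimization of the MNN)."""
    N = len(y)
    k = int(np.argmax(np.abs(np.fft.fft(y))))  # coarse FFT bin
    # refine within +/- one FFT bin of the coarse estimate
    res = minimize_scalar(lambda f: -periodogram(y, f),
                          bounds=((k - 1) / N, (k + 1) / N),
                          method="bounded")
    return res.x
```

The multi-sinusoid case, and the merging/pruning of hidden-layer nodes for model-order selection, are what the MNN structure in the paper is designed to handle beyond this one-tone sketch.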