Ruili Wang

Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization

Sep 13, 2023
Zhenguang Liu, Xinyang Yu, Ruili Wang, Shuai Ye, Zhe Ma, Jianfeng Dong, Sifeng He, Feng Qian, Xiaobo Zhang, Roger Zimmermann, Lei Yang

The self-media era provides us with a tremendous number of high-quality videos. Unfortunately, frequent video copyright infringements are now seriously damaging the interests and enthusiasm of video creators. Identifying infringing videos is therefore a compelling task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features. In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle an original high-dimensional feature into multiple exclusive lower-dimensional sub-features. We expect the sub-features to encode non-overlapping semantics of the original feature and remove redundant information. (2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyze the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature. Extensive experiments on two large-scale benchmark datasets (i.e., SVD and VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets a new state-of-the-art on the VCSL benchmark dataset. Our code and model have been released at https://github.com/yyyooooo/DMI/, hoping to contribute to the community.
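
The paper defines its own disentanglement modules and mutual-information objective; as a rough, hypothetical sketch of the general idea, the PyTorch snippet below splits an entangled video feature into several lower-dimensional sub-features and penalizes pairwise overlap between them. The projection heads and the decorrelation penalty are illustrative stand-ins, not the authors' loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDisentangler(nn.Module):
    """Split an entangled in_dim feature into num_sub lower-dimensional sub-features."""
    def __init__(self, in_dim=2048, num_sub=4, sub_dim=256):
        super().__init__()
        # one projection head per sub-feature
        self.heads = nn.ModuleList([nn.Linear(in_dim, sub_dim) for _ in range(num_sub)])

    def forward(self, x):                              # x: (batch, in_dim)
        subs = [F.normalize(h(x), dim=-1) for h in self.heads]
        return torch.stack(subs, dim=1)                # (batch, num_sub, sub_dim)

def overlap_penalty(subs):
    """Encourage non-overlapping semantics by penalizing pairwise cosine
    similarity between sub-features (a simple decorrelation proxy, not the
    paper's mutual-information loss)."""
    sim = torch.matmul(subs, subs.transpose(1, 2))     # (batch, K, K)
    eye = torch.eye(sim.size(-1), device=sim.device)
    return ((sim - eye) ** 2).mean()

feat = torch.randn(8, 2048)          # a batch of entangled video features
subs = FeatureDisentangler()(feat)
loss = overlap_penalty(subs)
```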

* Accepted by ACM MM 2023 

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Sep 11, 2023
Yujun Ma, Benjia Zhou, Ruili Wang, Pichao Wang

RGB-D action and gesture recognition remains an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information adds complexity and complicates entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following factorized spatio-temporal stages, which capture hierarchical spatial and temporal features through a Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and a Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.
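
Central difference convolution is an existing operator that the CDC-Stem builds on; a minimal 3D sketch is given below, assuming the common formulation in which the vanilla convolution response is combined with a theta-weighted central-difference term. The module name, kernel size, and theta value are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    """Sketch of a 3D central difference convolution: the vanilla 3D conv
    response minus a theta-weighted central-difference term."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0:
            return out
        # aggregate kernel weights into a 1x1x1 kernel for the difference term
        kernel_diff = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        out_diff = F.conv3d(x, kernel_diff, bias=self.conv.bias,
                            stride=self.conv.stride)
        return out - self.theta * out_diff

# usage: a batch of two clips, 8 RGB frames at 112x112
clip = torch.randn(2, 3, 8, 112, 112)
stem = CDC3d(3, 64)
feat = stem(clip)   # (2, 64, 8, 112, 112)
```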

* ACM MM'23 

A Novel Self-training Approach for Low-resource Speech Recognition

Aug 10, 2023
Satwinder Singh, Feng Hou, Ruili Wang

In this paper, we propose a self-training approach for automatic speech recognition (ASR) in low-resource settings. While self-training approaches have been extensively developed and evaluated for high-resource languages such as English, their application to low-resource languages like Punjabi has been limited, despite the language being spoken by millions globally. The scarcity of annotated data has hindered the development of accurate ASR systems, especially for low-resource languages (e.g., Punjabi and Māori). To address this issue, we propose an effective self-training approach that generates highly accurate pseudo-labels for unlabeled low-resource speech. Our experimental analysis demonstrates that our approach significantly reduces the word error rate, achieving a relative improvement of 14.94% over a baseline model across four real speech datasets. Furthermore, our proposed approach achieves the best results on the Common Voice Punjabi dataset.
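
As a hedged illustration of the generic self-training recipe (not the paper's exact pipeline), the sketch below pseudo-labels unlabeled audio with the current model, keeps only confident transcripts, and retrains. The `transcribe` and `retrain` callables are placeholders for whatever ASR training stack is used, and the confidence filter is one common ingredient, not necessarily the paper's selection criterion.

```python
from typing import Callable, List, Tuple

Audio = object                                         # placeholder for an audio utterance
Transcriber = Callable[[Audio], Tuple[str, float]]     # returns (text, confidence)

def self_train(
    transcribe: Transcriber,
    retrain: Callable[[List[Tuple[Audio, str]]], Transcriber],
    labeled: List[Tuple[Audio, str]],
    unlabeled: List[Audio],
    threshold: float = 0.9,
    rounds: int = 3,
) -> Transcriber:
    """Generic self-training loop: pseudo-label unlabeled audio with the
    current model, keep only confident transcripts, and retrain."""
    for _ in range(rounds):
        pseudo: List[Tuple[Audio, str]] = []
        for utt in unlabeled:
            text, confidence = transcribe(utt)
            if confidence >= threshold:
                pseudo.append((utt, text))
        # retrain on the labeled set augmented with retained pseudo-labels
        transcribe = retrain(labeled + pseudo)
    return transcribe
```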

* Accepted to Interspeech 2023 

How to Design Translation Prompts for ChatGPT: An Empirical Study

Apr 21, 2023
Yuan Gao, Ruili Wang, Feng Hou

The recently released ChatGPT has demonstrated surprising abilities in natural language understanding and natural language generation. Machine translation relies heavily on the abilities of language understanding and generation. Thus, in this paper, we explore how to assist machine translation with ChatGPT. We apply several translation prompts to a wide range of translation tasks. Our experimental results show that ChatGPT with well-designed translation prompts can achieve performance comparable to or better than commercial translation systems for high-resource language translation. We further evaluate the translation quality using multiple references, and ChatGPT achieves superior performance compared to the commercial systems. We also conduct experiments on domain-specific translation; the results show that ChatGPT is able to comprehend the provided domain keyword and adjust its output accordingly to produce proper translations. Finally, we evaluate few-shot prompts, which show consistent improvement across different base prompts. Our work provides empirical evidence that ChatGPT still has great potential for translation.
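
For illustration only, the snippet below shows what zero-shot and few-shot translation prompt templates might look like; the wording is our own and not the prompts evaluated in the paper.

```python
# Illustrative prompt templates; the paper's exact prompts are not reproduced here.

def zero_shot_prompt(text, src="English", tgt="German"):
    """A plain instruction-style translation prompt."""
    return f"Translate the following {src} sentence into {tgt}:\n{text}"

def few_shot_prompt(text, examples, src="English", tgt="German"):
    """Prepend (source, reference) demonstration pairs before the query."""
    demos = "\n".join(f"{src}: {s}\n{tgt}: {t}" for s, t in examples)
    return (f"Translate from {src} to {tgt}, following the examples.\n"
            f"{demos}\n{src}: {text}\n{tgt}:")

print(few_shot_prompt("The weather is nice today.",
                      [("Good morning.", "Guten Morgen."),
                       ("Thank you very much.", "Vielen Dank.")]))
```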

Unleashing the Power of ChatGPT for Translation: An Empirical Study

Apr 05, 2023
Yuan Gao, Ruili Wang, Feng Hou

The recently released ChatGPT has demonstrated surprising abilities in natural language understanding and natural language generation. Machine translation is an important and extensively studied task in natural language processing that relies heavily on the abilities of language understanding and generation. Thus, in this paper, we explore how to assist machine translation with ChatGPT. We apply several translation prompts to a wide range of translation tasks. Our experimental results show that ChatGPT with well-designed translation prompts can achieve performance comparable to or better than professional translation systems for high-resource language translation, but lags significantly behind on low-resource translation. We further evaluate the translation quality using multiple references, and ChatGPT achieves superior performance compared to the professional systems. We also conduct experiments on domain-specific translation; the results show that ChatGPT is able to comprehend the provided domain keyword and adjust its output accordingly to produce proper translations. Finally, we evaluate few-shot prompts, which show consistent improvement across different base prompts. Our work provides empirical evidence that ChatGPT still has great potential for translation.

Improved Meta Learning for Low Resource Speech Recognition

May 11, 2022
Satwinder Singh, Ruili Wang, Feng Hou

We propose a new meta-learning-based framework for low-resource speech recognition that improves upon the previous model-agnostic meta-learning (MAML) approach. MAML is a simple yet powerful meta-learning approach. However, it has some core deficiencies, such as training instability and slow convergence. To address these issues, we adopt a multi-step loss (MSL). MSL calculates a loss at every step of the MAML inner loop and then combines these losses using a weighted importance vector. The importance vector ensures that the loss at the last step carries more weight than those at earlier steps. Our empirical evaluation shows that MSL significantly improves the stability of the training procedure and thus also improves the accuracy of the overall system. Our proposed system outperforms the MAML-based low-resource ASR system on various languages in terms of character error rate and training stability.
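
A minimal sketch of the multi-step loss idea is shown below, assuming a common annealing schedule in which the importance vector starts near-uniform and gradually concentrates on the final inner-loop step; the paper's exact weighting schedule may differ.

```python
import torch

def multi_step_loss_weights(num_steps, epoch, anneal_epochs=10):
    """Importance vector over inner-loop steps: near-uniform early in
    training, gradually shifting all weight to the final step."""
    decay = min(epoch / anneal_epochs, 1.0)
    weights = torch.full((num_steps,), (1.0 - decay) / num_steps)
    weights[-1] += decay
    return weights                      # always sums to 1

def combined_inner_loss(per_step_losses, weights):
    # per_step_losses: one scalar loss per inner-loop adaptation step
    return sum(w * l for w, l in zip(weights, per_step_losses))

# usage: 5 inner-loop steps at epoch 3 of a 10-epoch anneal
w = multi_step_loss_weights(5, epoch=3)
print(w)   # ~[0.14, 0.14, 0.14, 0.14, 0.44]
```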

* ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4798-4802  
* Published in IEEE ICASSP 2022 

3D Human Motion Prediction: A Survey

Mar 07, 2022
Kedi Lyu, Haipeng Chen, Zhenguang Liu, Beiqi Zhang, Ruili Wang

3D human motion prediction, i.e., predicting future poses from a given pose sequence, is a significant and challenging problem in computer vision and machine intelligence that can help machines understand human behavior. Due to the increasing development and understanding of Deep Neural Networks (DNNs) and the availability of large-scale human motion datasets, human motion prediction has advanced remarkably, with a surge of interest from both academia and industry. In this context, we conduct a comprehensive survey of 3D human motion prediction in order to retrospect and analyze relevant works in the existing literature. In addition, we construct a pertinent taxonomy to categorize existing approaches to 3D human motion prediction. In this survey, relevant methods are grouped into three categories: human pose representation, network structure design, and prediction target. We systematically review all relevant journal and conference papers in the field of human motion prediction since 2015 and present them in detail based on the proposed categorization. Furthermore, we outline the public benchmark datasets, evaluation criteria, and performance comparisons. The limitations of state-of-the-art methods are discussed as well, hoping to pave the way for future exploration.

Improving Entity Linking through Semantic Reinforced Entity Embeddings

Jun 16, 2021
Feng Hou, Ruili Wang, Jun He, Yi Zhou

Entity embeddings, which represent different aspects of each entity with a single vector, like word embeddings, are a key component of neural entity linking models. Existing entity embeddings are learned from canonical Wikipedia articles and local contexts surrounding target entities. Such entity embeddings are effective, but too distinctive for linking models to learn contextual commonality. We propose a simple yet effective method, FGS2EE, to inject fine-grained semantic information into entity embeddings to reduce their distinctiveness and facilitate the learning of contextual commonality. FGS2EE first uses the embeddings of semantic type words to generate semantic embeddings, and then combines them with existing entity embeddings through linear aggregation. Extensive experiments show the effectiveness of such embeddings. Based on our entity embeddings, we achieve new state-of-the-art performance on entity linking.
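
As a toy illustration of the linear aggregation step (the mixing weight `alpha` and the example type words are hypothetical, not the paper's setting), the snippet below averages the embeddings of an entity's fine-grained semantic type words and blends the result with the existing entity embedding.

```python
import numpy as np

def semantic_reinforced_embedding(entity_vec, type_word_vecs, alpha=0.5):
    """Blend an existing entity embedding with the mean embedding of its
    semantic type words via a simple linear aggregation."""
    semantic_vec = np.mean(type_word_vecs, axis=0)
    return alpha * entity_vec + (1.0 - alpha) * semantic_vec

# usage with toy 4-dimensional vectors
entity = np.array([0.2, -0.1, 0.4, 0.3])        # existing entity embedding
types = np.array([[0.1, 0.0, 0.5, 0.2],         # e.g. word vectors for
                  [0.3, -0.2, 0.3, 0.4]])       # "politician", "lawyer"
print(semantic_reinforced_embedding(entity, types))
```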

* Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020  
* 6 pages, 3 figures, ACL 2020 

Towards the Objective Speech Assessment of Smoking Status based on Voice Features: A Review of the Literature

Jun 15, 2021
Zhizhong Ma, Chris Bullen, Joanna Ting Wai Chu, Ruili Wang, Yingchun Wang, Satwinder Singh

In smoking cessation clinical research and practice, objective validation of self-reported smoking status is crucial for ensuring the reliability of the primary outcome, that is, smoking abstinence. Speech signals convey important information about a speaker, such as age, gender, body size, emotional state, and health state. We investigated (1) whether smoking could measurably alter voice features, (2) whether smoking cessation could lead to changes in voice, and therefore (3) whether voice-based smoking status assessment has the potential to be used as an objective method for validating smoking cessation.
