Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qing Wang

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

Jan 31, 2024

Xingning Dong, Qingpei Guo, Tian Gan, Qing Wang, Jianlong Wu, Xiangyuan Ren, Yuan Cheng, Wei Chu

Abstract:We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applications or less efficient), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and could support various downstream applications. Second, based on the intuition that people always pay attention to several "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to promote the pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that, we establish a new state-of-the-art in pixel-level video-text pre-training; we also achieve a satisfactory balance between the pre-training efficiency and the fine-tuning performance. The codebase are available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.

* Accepted by TCSVT (IEEE Transactions on Circuits and Systems for Video Technology)

Via

Access Paper or Ask Questions

Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition

Jan 13, 2024

Hefeng Wu, Guangzhi Ye, Ziyang Zhou, Ling Tian, Qing Wang, Liang Lin

Figure 1 for Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition

Figure 2 for Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition

Figure 3 for Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition

Figure 4 for Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition

Abstract:Learning to recognize novel concepts from just a few image samples is very challenging as the learned model is easily overfitted on the few data and results in poor generalizability. One promising but underexplored solution is to compensate the novel classes by generating plausible samples. However, most existing works of this line exploit visual information only, rendering the generated data easy to be distracted by some challenging factors contained in the few available samples. Being aware of the semantic information in the textual modality that reflects human concepts, this work proposes a novel framework that exploits semantic relations to guide dual-view data hallucination for few-shot image recognition. The proposed framework enables generating more diverse and reasonable data samples for novel classes through effective information transfer from base classes. Specifically, an instance-view data hallucination module hallucinates each sample of a novel class to generate new data by employing local semantic correlated attention and global semantic feature fusion derived from base classes. Meanwhile, a prototype-view data hallucination module exploits semantic-aware measure to estimate the prototype of a novel class and the associated distribution from the few samples, which thereby harvests the prototype as a more stable sample and enables resampling a large number of samples. We conduct extensive experiments and comparisons with state-of-the-art methods on several popular few-shot benchmarks to verify the effectiveness of the proposed framework.

* 13 pages

Via

Access Paper or Ask Questions

Uplifting the Expressive Power of Graph Neural Networks through Graph Partitioning

Dec 14, 2023

Asela Hevapathige, Qing Wang

Abstract:Graph Neural Networks (GNNs) have paved its way for being a cornerstone in graph related learning tasks. From a theoretical perspective, the expressive power of GNNs is primarily characterised according to their ability to distinguish non-isomorphic graphs. It is a well-known fact that most of the conventional GNNs are upper-bounded by Weisfeiler-Lehman graph isomorphism test (1-WL). In this work, we study the expressive power of graph neural networks through the lens of graph partitioning. This follows from our observation that permutation invariant graph partitioning enables a powerful way of exploring structural interactions among vertex sets and subgraphs, and can help uplifting the expressive power of GNNs efficiently. Based on this, we first establish a theoretical connection between graph partitioning and graph isomorphism. Then we introduce a novel GNN architecture, namely Graph Partitioning Neural Networks (GPNNs). We theoretically analyse how a graph partitioning scheme and different kinds of structural interactions relate to the k-WL hierarchy. Empirically, we demonstrate its superior performance over existing GNN models in a variety of graph benchmark tasks.

Via

Access Paper or Ask Questions

Large Language Models are Complex Table Parsers

Dec 13, 2023

Bowen Zhao, Changkai Ji, Yuejie Zhang, Wen He, Yingwen Wang, Qing Wang, Rui Feng, Xiaobo Zhang

Figure 1 for Large Language Models are Complex Table Parsers

Figure 2 for Large Language Models are Complex Table Parsers

Figure 3 for Large Language Models are Complex Table Parsers

Figure 4 for Large Language Models are Complex Table Parsers

Abstract:With the Generative Pre-trained Transformer 3.5 (GPT-3.5) exhibiting remarkable reasoning and comprehension abilities in Natural Language Processing (NLP), most Question Answering (QA) research has primarily centered around general QA tasks based on GPT, neglecting the specific challenges posed by Complex Table QA. In this paper, we propose to incorporate GPT-3.5 to address such challenges, in which complex tables are reconstructed into tuples and specific prompt designs are employed for dialogues. Specifically, we encode each cell's hierarchical structure, position information, and content as a tuple. By enhancing the prompt template with an explanatory description of the meaning of each tuple and the logical reasoning process of the task, we effectively improve the hierarchical structure awareness capability of GPT-3.5 to better parse the complex tables. Extensive experiments and results on Complex Table QA datasets, i.e., the open-domain dataset HiTAB and the aviation domain dataset AIT-QA show that our approach significantly outperforms previous work on both datasets, leading to state-of-the-art (SOTA) performance.

* EMNLP 2023 Main

Via

Access Paper or Ask Questions

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Dec 07, 2023

Huan Zhao, Li Zhang, Yue Li, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

Figure 1 for Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Figure 2 for Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Figure 3 for Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Figure 4 for Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Abstract:The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised~(ResNet and ECAPA-TDNN) and self-supervised pre-trained models~(WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization~(AVSD) system. Then we explore the effectiveness of different frameworks, including Transformer, Conformer, and cross-attention mechanism, in the audio-visual decoder. To mitigate the degradation of performance caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in MISP Challenge 2022.

Via

Access Paper or Ask Questions

Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

Dec 01, 2023

Qing Wang, Kang Zhou, Qiao Qiao, Yuepei Li, Qi Li

Figure 1 for Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

Figure 2 for Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

Figure 3 for Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

Figure 4 for Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs

Abstract:Unsupervised relation extraction (URE) aims to extract relations between named entities from raw text without requiring manual annotations or pre-existing knowledge bases. In recent studies of URE, researchers put a notable emphasis on contrastive learning strategies for acquiring relation representations. However, these studies often overlook two important aspects: the inclusion of diverse positive pairs for contrastive learning and the exploration of appropriate loss functions. In this paper, we propose AugURE with both within-sentence pairs augmentation and augmentation through cross-sentence pairs extraction to increase the diversity of positive pairs and strengthen the discriminative power of contrastive learning. We also identify the limitation of noise-contrastive estimation (NCE) loss for relation representation learning and propose to apply margin loss for sentence pairs. Experiments on NYT-FB and TACRED datasets demonstrate that the proposed relation representation learning and a simple K-Means clustering achieves state-of-the-art performance.

* Accepted by EMNLP 2023 Main Conference

Via

Access Paper or Ask Questions

CoRec: An Easy Approach for Coordination Recognition

Nov 30, 2023

Qing Wang, Haojie Jia, Wenfei Song, Qi Li

Abstract:In this paper, we observe and address the challenges of the coordination recognition task. Most existing methods rely on syntactic parsers to identify the coordinators in a sentence and detect the coordination boundaries. However, state-of-the-art syntactic parsers are slow and suffer from errors, especially for long and complicated sentences. To better solve the problems, we propose a pipeline model COordination RECognizer (CoRec). It consists of two components: coordinator identifier and conjunct boundary detector. The experimental results on datasets from various domains demonstrate the effectiveness and efficiency of the proposed method. Further experiments show that CoRec positively impacts downstream tasks, improving the yield of state-of-the-art Open IE models.

* Accepted by EMNLP 2023 Main Conference (oral presentation)

Via

Access Paper or Ask Questions

HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Oct 02, 2023

Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, Qing Wang

Figure 1 for HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Figure 2 for HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Figure 3 for HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Figure 4 for HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Abstract:Recent text-to-3D methods employing diffusion models have made significant advancements in 3D human generation. However, these approaches face challenges due to the limitations of the text-to-image diffusion model, which lacks an understanding of 3D structures. Consequently, these methods struggle to achieve high-quality human generation, resulting in smooth geometry and cartoon-like appearances. In this paper, we observed that fine-tuning text-to-image diffusion models with normal maps enables their adaptation into text-to-normal diffusion models, which enhances the 2D perception of 3D geometry while preserving the priors learned from large-scale datasets. Therefore, we propose HumanNorm, a novel approach for high-quality and realistic 3D human generation by learning the normal diffusion model including a normal-adapted diffusion model and a normal-aligned diffusion model. The normal-adapted diffusion model can generate high-fidelity normal maps corresponding to prompts with view-dependent text. The normal-aligned diffusion model learns to generate color images aligned with the normal maps, thereby transforming physical geometry details into realistic appearance. Leveraging the proposed normal diffusion model, we devise a progressive geometry generation strategy and coarse-to-fine texture generation strategy to enhance the efficiency and robustness of 3D human generation. Comprehensive experiments substantiate our method's ability to generate 3D humans with intricate geometry and realistic appearances, significantly outperforming existing text-to-3D methods in both geometry and texture quality. The project page of HumanNorm is https://humannorm.github.io/.

* The project page of HumanNorm is https://humannorm.github.io/

Via

Access Paper or Ask Questions

PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Sep 28, 2023

Xiang Lyu, Yuhang Cao, Qing Wang, Jingjing Yin, Yuguang Yang, Pengpeng Zou, Yanni Hu, Heng Lu

Figure 1 for PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Figure 2 for PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Figure 3 for PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Figure 4 for PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Abstract:Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose PP-MeT system, a real-world personalized prompt based meeting transcription system, which consists of a clustering system, target-speaker voice activity detection (TS-VAD), and TS-ASR. Specifically, we utilize target-speaker embedding as a prompt in TS-VAD and TS-ASR modules in our proposed system. In constrast with previous system, we fully leverage pre-trained models for system initialization, thereby bestowing our approach with heightened generalizability and precision. Experiments on M2MeT2.0 Challenge dataset show that our system achieves a cp-CER of 11.27% on the test set, ranking first in both fixed and open training conditions.

Via

Access Paper or Ask Questions

MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

Sep 23, 2023

Rongxuan Tan, Qing Wang, Xueyan Wang, Chao Yan, Yang Sun, Youyang Feng

Figure 1 for MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

Figure 2 for MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

Figure 3 for MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

Figure 4 for MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

Abstract:Significant strides have been made in enhancing the accuracy of Multi-View Stereo (MVS)-based 3D reconstruction. However, untextured areas with unstable photometric consistency often remain incompletely reconstructed. In this paper, we propose a resilient and effective multi-view stereo approach (MP-MVS). We design a multi-scale windows PatchMatch (mPM) to obtain reliable depth of untextured areas. In contrast with other multi-scale approaches, which is faster and can be easily extended to PatchMatch-based MVS approaches. Subsequently, we improve the existing checkerboard sampling schemes by limiting our sampling to distant regions, which can effectively improve the efficiency of spatial propagation while mitigating outlier generation. Finally, we introduce and improve planar prior assisted PatchMatch of ACMP. Instead of relying on photometric consistency, we utilize geometric consistency information between multi-views to select reliable triangulated vertices. This strategy can obtain a more accurate planar prior model to rectify photometric consistency measurements. Our approach has been tested on the ETH3D High-res multi-view benchmark with several state-of-the-art approaches. The results demonstrate that our approach can reach the state-of-the-art. The associated codes will be accessible at https://github.com/RongxuanTan/MP-MVS.

Via

Access Paper or Ask Questions