Min Zhang

Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

Oct 31, 2022
Lei Zhang, Shilin Zhou, Chen Gong, Zhenghua Li, Zhefeng Wang, Baoxing Huai, Min Zhang

Chinese word segmentation (CWS) models achieve very high performance when the training data is sufficient and in-domain. However, performance drops drastically in cross-domain and low-resource scenarios due to data sparseness. Because constructing large-scale manually annotated data is time-consuming and labor-intensive, in this work we propose, for the first time, to mine word boundary information from pauses in speech to efficiently obtain large-scale naturally annotated CWS data. We present a simple yet effective complete-then-train method to utilize these natural annotations from speech for CWS model training. Extensive experiments demonstrate that CWS performance in cross-domain and low-resource scenarios can be significantly improved by leveraging our naturally annotated data extracted from speech.
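
As a rough illustration of reading word boundaries off pauses, the sketch below assumes each character has already been aligned to the audio with start and end times (for example by a forced aligner). The function names, the data layout, and the 0.05-second threshold are hypothetical choices for this example and are not taken from the paper. A gap at or above the threshold is marked as a boundary; every other position stays unlabeled, which is why the resulting annotation is only partial.

```python
from typing import List, Optional, Tuple

# (character, start time in seconds, end time in seconds); hypothetical layout
AlignedChar = Tuple[str, float, float]


def boundaries_from_pauses(
    chars: List[AlignedChar], min_pause: float = 0.05
) -> List[Optional[bool]]:
    """Return one label per gap between adjacent characters.

    True -> a pause of at least `min_pause` seconds, treated as a word boundary.
    None -> no evidence either way (no pause does not imply "same word").
    """
    labels: List[Optional[bool]] = []
    for (_, _, prev_end), (_, next_start, _) in zip(chars, chars[1:]):
        gap = next_start - prev_end
        labels.append(True if gap >= min_pause else None)
    return labels


def render(chars: List[AlignedChar], labels: List[Optional[bool]]) -> str:
    """Show known boundaries as '/' and unknown gaps as '?'."""
    out = [chars[0][0]]
    for (ch, _, _), label in zip(chars[1:], labels):
        out.append("/" if label else "?")
        out.append(ch)
    return "".join(out)


# Toy alignment with one pause after the second character (made-up timings).
aligned = [("我", 0.00, 0.20), ("爱", 0.20, 0.42), ("北", 0.60, 0.80), ("京", 0.80, 1.00)]
labels = boundaries_from_pauses(aligned)
print(render(aligned, labels))  # prints 我?爱/北?京
```

The `?` positions are the gaps a completion step would have to fill in before training; how the paper's complete-then-train method does this is not described in the abstract.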

* Submitted to ICASSP 2023

Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change

Oct 31, 2022
Zhaochen Su, Zecheng Tang, Xinyan Guan, Juntao Li, Lijun Wu, Min Zhang

Recent research has revealed that neural language models at scale suffer from poor temporal generalization, i.e., a language model pre-trained on static data from past years performs worse over time on emerging data. Existing methods mainly perform continual training to mitigate this misalignment. Although effective to some extent, this approach leaves the problem far from solved on both language modeling and downstream tasks. In this paper, we empirically observe that temporal generalization is closely associated with lexical semantic change, which is one of the essential phenomena of natural languages. Based on this observation, we propose a simple yet effective lexical-level masking strategy to post-train a converged language model. Experiments on two pre-trained language models, two different classification tasks, and four benchmark datasets demonstrate the effectiveness of our proposed method over existing temporal adaptation methods, i.e., continual training with new data. Our code is available at https://github.com/zhaochen0110/LMLM.
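
The abstract does not spell out how the lexical-level masking works. One way to picture such a strategy is to bias the choice of masked positions toward words whose meaning has shifted most over time. The sketch below does only that: the semantic-change scores are hypothetical inputs, and the function names and mask ratio are assumptions made for illustration, not the authors' implementation (see their repository for that).

```python
import random
from typing import Dict, List


def choose_mask_positions(
    tokens: List[str],
    change_score: Dict[str, float],
    mask_ratio: float = 0.15,
    seed: int = 0,
) -> List[int]:
    """Pick positions to mask, preferring tokens with high semantic change.

    `change_score` maps a word to a hypothetical semantic-change score;
    words without a score count as 0.0 and are only chosen if slots remain.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    # Shuffle first so ties are broken randomly, then sort by score
    # (Python's sort is stable, so the shuffled order survives among ties).
    order = list(range(len(tokens)))
    rng.shuffle(order)
    order.sort(key=lambda i: change_score.get(tokens[i], 0.0), reverse=True)
    return sorted(order[:n_mask])


def apply_mask(tokens: List[str], positions: List[int], mask_token: str = "[MASK]") -> List[str]:
    """Replace the chosen positions with the mask token."""
    chosen = set(positions)
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


# Made-up scores: higher means the word's usage has drifted more.
scores = {"viral": 0.9, "stream": 0.7, "virus": 0.4}
tokens = "the new virus variant is going viral on stream".split()
positions = choose_mask_positions(tokens, scores, mask_ratio=0.25)
masked = apply_mask(tokens, positions)
print(" ".join(masked))
```

With a 25% ratio over nine tokens, two positions are masked, and the two highest-scoring words are selected ahead of the rest.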

* EMNLP 2022, Long paper

Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

Oct 27, 2022
Peijie Jiang, Dingkun Long, Yanzhao Zhang, Pengjun Xie, Meishan Zhang, Min Zhang

Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to the use of a high-quality external lexicon, where lexicon items can offer explicit boundary information. However, to ensure the quality of the lexicon, great human effort is always necessary, which has been generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction of Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT can provide consistent improvements on all datasets. In addition, our method can complement previous supervised lexicon exploration, where further improvements can be achieved when integrated with external lexicon information.

* 12 pages, 2 figures, 7 tables, EMNLP 2022

Extending Phrase Grounding with Pronouns in Visual Dialogues

Oct 23, 2022
Panzhong Lu, Xin Zhang, Meishan Zhang, Min Zhang

Conventional phrase grounding aims to localize noun phrases mentioned in a given caption to their corresponding image regions, and it has achieved great success recently. However, grounding noun phrases alone is not enough for cross-modal visual language understanding. Here we extend the task by considering pronouns as well. First, we construct a dataset of phrase grounding that links both noun phrases and pronouns to image regions. Based on the dataset, we test the performance of phrase grounding using a state-of-the-art model from the literature. Then, we enhance the baseline grounding model with coreference information, which could potentially help our task, modeling the coreference structures with graph convolutional networks. Experiments on our dataset, interestingly, show that pronouns are easier to ground than noun phrases, possibly because these pronouns are much less ambiguous. Additionally, our final model with coreference information can significantly boost the grounding performance of both noun phrases and pronouns.

* Accepted by EMNLP 2022

SynGEC: Syntax-Enhanced Grammatical Error Correction with a Tailored GEC-Oriented Parser

Oct 22, 2022
Yue Zhang, Bo Zhang, Zhenghua Li, Zuyi Bao, Chen Li, Min Zhang

This work proposes a syntax-enhanced grammatical error correction (GEC) approach named SynGEC that effectively incorporates dependency syntactic information into the encoder part of GEC models. The key challenge for this idea is that off-the-shelf parsers are unreliable when processing ungrammatical sentences. To confront this challenge, we propose to build a tailored GEC-oriented parser (GOPar) using parallel GEC training data as a pivot. First, we design an extended syntax representation scheme that allows us to represent both grammatical errors and syntax in a unified tree structure. Then, we obtain parse trees of the source incorrect sentences by projecting trees of the target correct sentences. Finally, we train GOPar with such projected trees. For GEC, we employ the graph convolution network to encode source-side syntactic information produced by GOPar, and fuse them with the outputs of the Transformer encoder. Experiments on mainstream English and Chinese GEC datasets show that our proposed SynGEC approach consistently and substantially outperforms strong baselines and achieves competitive performance. Our code and data are all publicly available at https://github.com/HillZhang1999/SynGEC.

* Accepted by EMNLP 2022 (main conference)

Forging Multiple Training Objectives for Pre-trained Language Models via Meta-Learning

Oct 19, 2022
Hongqiu Wu, Ruixue Ding, Hai Zhao, Boli Chen, Pengjun Xie, Fei Huang, Min Zhang

Multiple pre-training objectives fill the vacancy of the understanding capability of single-objective language modeling, which serves the ultimate purpose of pre-trained language models (PrLMs), generalizing well on a mass of scenarios. However, learning multiple training objectives in a single model is challenging due to the unknown relative significance as well as the potential contrariety between them. Empirical studies have shown that the current objective sampling in an ad-hoc manual setting makes the learned language representation barely converge to the desired optimum. Thus, we propose MOMETAS, a novel adaptive sampler based on meta-learning, which learns the latent sampling pattern on arbitrary pre-training objectives. Such a design is lightweight with negligible additional training overhead. To validate our approach, we adopt five objectives and conduct continual pre-training with BERT-base and BERT-large models, where MOMETAS demonstrates universal performance gain over other rule-based sampling strategies on 14 natural language processing tasks.

* EMNLP 2022 (findings)

Massive MIMO Evolution Towards 3GPP Release 18

Oct 15, 2022
Huangping Jin, Kunpeng Liu, Gilwon Lee, Emad J. Farag, Min Zhang, Dalin Zhu, Leiming Zhang, Eko Onggosanusi, Mansoor Shafi, Harsh Tataria

Since the introduction of fifth-generation new radio (5G-NR) in Third Generation Partnership Project (3GPP) Release 15, swift progress has been made to evolve 5G with 3GPP Release 18 emerging. A critical aspect is the design of massive multiple-input multiple-output (MIMO) technology. In this line, this paper makes several important contributions: We provide a comprehensive overview of the evolution of standardized massive MIMO features from 3GPP Release 15 to 17 for both time/frequency-division duplex operation across bands FR-1 and FR-2. We analyze the progress on channel state information (CSI) frameworks, beam management frameworks and present enhancements for uplink CSI. We shed light on emerging 3GPP Release 18 problems requiring imminent attention. These include advanced codebook design and sounding reference signal design for coherent joint transmission (CJT) with multiple transmission/reception points (multi-TRPs). We discuss advancements in uplink demodulation reference signal design, enhancements for mobility to provide accurate CSI estimates, and unified transmission configuration indicator framework tailored for FR-2 bands. For each concept, we provide system level simulation results to highlight their performance benefits. Via field trials in an outdoor environment at Shanghai Jiaotong University, we demonstrate the gains of multi-TRP CJT relative to single TRP at 3.7 GHz.

* 23 pages, 37 Figures, one fig in the annex

SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training

Oct 11, 2022
Dan Qiao, Chenchen Dai, Yuyang Ding, Juntao Li, Qiang Chen, Wenliang Chen, Min Zhang

The conventional success of textual classification relies on annotated data, and the new paradigm of pre-trained language models (PLMs) still requires some labeled data for downstream tasks. However, in real-world applications, label noise inevitably exists in training data, damaging the effectiveness, robustness, and generalization of the models constructed on such data. Recently, remarkable achievements have been made to mitigate this dilemma in visual data, while only a few works explore textual data. To fill this gap, we present SelfMix, a simple yet effective method to handle label noise in text classification tasks. SelfMix uses the Gaussian Mixture Model to separate samples and leverages semi-supervised learning. Unlike previous works requiring multiple models, our method utilizes the dropout mechanism on a single model to reduce the confirmation bias in self-training and introduces a textual-level mixup training strategy. Experimental results on three text classification benchmarks with different types of text show that our proposed method outperforms strong baselines designed for both textual and visual data under different noise ratios and noise types. Our code is available at https://github.com/noise-learning/SelfMix.
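
To make the sample-separation step concrete, here is a minimal sketch of fitting a two-component Gaussian mixture and using the posterior of the low-mean component as a "probably clean" score. Fitting the mixture on per-sample training losses is an assumption borrowed from common noisy-label practice; the abstract only says that a Gaussian Mixture Model separates samples. The function names, the plain EM loop, and the toy loss values are all illustrative and are not the authors' code.

```python
import math
from typing import List, Tuple


def posterior(x: float, means: List[float], variances: List[float], weights: List[float]) -> List[float]:
    """Posterior probability of each mixture component given the value x."""
    dens = [
        w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
        for m, v, w in zip(means, variances, weights)
    ]
    total = sum(dens) or 1e-300  # guard against underflow in both components
    return [d / total for d in dens]


def fit_two_gaussians(xs: List[float], iters: int = 50) -> Tuple[List[float], List[float], List[float]]:
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    mean_all = sum(xs) / len(xs)
    var_all = max(1e-6, sum((x - mean_all) ** 2 for x in xs) / len(xs))
    means = [min(xs), max(xs)]  # start the components at the extremes
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft assignment of every point to both components.
        resp = [posterior(x, means, variances, weights) for x in xs]
        # M-step: re-estimate each component from its soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                1e-6, sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            )
            weights[k] = nk / len(xs)
    return means, variances, weights


# Made-up per-sample losses: six small (likely clean), four large (likely noisy).
losses = [0.05, 0.08, 0.10, 0.12, 0.07, 0.09, 1.9, 2.1, 2.3, 2.0]
means, variances, weights = fit_two_gaussians(losses)
clean_k = means.index(min(means))  # the low-loss component is treated as clean
is_clean = [posterior(x, means, variances, weights)[clean_k] > 0.5 for x in losses]
print(is_clean)
```

In a semi-supervised setup, samples flagged as clean would keep their labels while the rest would be treated as unlabeled; the 0.5 cut-off used here is an arbitrary choice for the example.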

* COLING-2022, oral presentation

Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes

Sep 16, 2022
Min Zhang, Hongyao Tang, Jianye Hao, Yan Zheng

Lying at the heart of intelligent decision-making systems, how a policy is represented and optimized is a fundamental problem. The root challenge in this problem is the large scale and high complexity of the policy space, which exacerbates the difficulty of policy learning, especially in real-world scenarios. Towards a desirable surrogate policy space, policy representation in a low-dimensional latent space has recently shown its potential in improving both the evaluation and optimization of policy. The key question involved in these studies is by what criterion we should abstract the policy space for the desired compression and generalization. However, both the theory of policy abstraction and the methodology of policy representation learning are less studied in the literature. In this work, we make a first effort to fill this gap. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies, for more convenient use in learning policy representation. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations in characterizing policy difference and conveying policy generalization, respectively. Our experiments are conducted on both policy optimization and evaluation problems, covering trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE). Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while the influence-irrelevance policy abstraction can be a generally preferred choice.

* Preprint version

SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

Aug 26, 2022
Saihao Huang, Lijie Wang, Zhenghua Li, Zeyang Liu, Chenhui Dou, Fukang Yan, Xinyan Xiao, Hua Wu, Min Zhang

As the first session-level Chinese dataset, CHASE contains two separate parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and 3,456 sessions translated from English SParC (CHASE-T). We find that the two parts are highly discrepant and incompatible as training and evaluation data. In this work, we present SeSQL, yet another large-scale session-level text-to-SQL dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch. To guarantee data quality, we adopt an iterative annotation workflow that facilitates intensive and timely review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round multi-DB text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL by employing three competitive session-level parsers, and present a detailed analysis.

* 12 pages, 4 figures