
Xiaodong Shi


How many preprints have actually been printed and why: a case study of computer science preprints on arXiv

Aug 03, 2023
Jialiang Lin, Yao Yu, Yu Zhou, Zhiyang Zhou, Xiaodong Shi

Preprints play an increasingly critical role in academic communities. There are many reasons driving researchers to post their manuscripts to preprint servers before formal submission to journals or conferences, but the use of preprints has also sparked considerable controversy, especially surrounding the claim of priority. In this paper, a case study of computer science preprints submitted to arXiv from 2008 to 2017 is conducted to quantify how many preprints have eventually been printed in peer-reviewed venues. Among the published manuscripts, some appear under different titles and without an update to their preprints on arXiv. For such manuscripts, the traditional fuzzy matching method is incapable of mapping the preprint to the final published version. In view of this issue, we introduce a semantics-based mapping method that employs Bidirectional Encoder Representations from Transformers (BERT). With this new mapping method and a plurality of data sources, we find that 66% of all sampled preprints are published under unchanged titles and 11% are published under different titles and with other modifications. A further analysis is then performed to investigate why these preprints, but not others, were accepted for publication. Our comparison reveals that, in the field of computer science, published preprints feature adequate revisions, multiple authorship, detailed abstracts and introductions, extensive and authoritative references, and available source code.
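
As a rough illustration of the semantics-based mapping, the sketch below embeds a preprint's text with BERT and matches it against candidate published records by cosine similarity; the model checkpoint, mean pooling, and threshold are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: BERT embeddings + cosine similarity for preprint mapping.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT embeddings for a list of strings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

def map_preprint(preprint_text, candidate_texts, threshold=0.9):
    """Index of the best-matching published record, or None if no match."""
    vecs = embed([preprint_text] + candidate_texts)
    sims = torch.cosine_similarity(vecs[:1], vecs[1:])
    best = int(sims.argmax())
    return best if sims[best] >= threshold else None
```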

* Scientometrics, Vol. 124, Issue 1, pp. 555-574 (2020)  
* Please cite the Scientometrics version

Layer-wise Representation Fusion for Compositional Generalization

Jul 20, 2023
Yafang Zheng, Lei Lin, Zhaohong Lai, Binling Wang, Shan Liu, Biao Fu, Wenhao Rao, Peigen Ye, Yidong Chen, Xiaodong Shi

Despite successes across a broad range of applications, sequence-to-sequence models are argued to construct solutions less compositionally than human-like generalization requires. There is mounting evidence that one of the reasons hindering compositional generalization is that the representations of the encoder's and decoder's uppermost layers are entangled; in other words, the syntactic and semantic representations of sequences are twisted inappropriately. However, most previous studies mainly concentrate on enhancing token-level semantic information to alleviate this entanglement, rather than composing and using the syntactic and semantic representations of sequences appropriately, as humans do. In addition, we explain why the entanglement problem exists from the perspective of recent studies on training deeper Transformers: it is mainly owing to the ``shallow'' residual connections and their simple, one-step operations, which fail to fuse previous layers' information effectively. Starting from this finding and inspired by humans' strategies, we propose \textsc{FuSion} (\textbf{Fu}sing \textbf{S}yntactic and Semant\textbf{i}c Representati\textbf{on}s), an extension to sequence-to-sequence models that learns to fuse previous layers' information back into the encoding and decoding process by introducing a \emph{fuse-attention module} at each encoder and decoder layer. \textsc{FuSion} achieves competitive and even \textbf{state-of-the-art} results on two realistic benchmarks, which empirically demonstrates the effectiveness of our proposal.
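
As a loose sketch of how such a fuse-attention module might be wired in PyTorch: the current layer's states query the concatenated outputs of all previous layers, so earlier-layer information is fused back in. The dimensions, residual placement, and normalization are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FuseAttention(nn.Module):
    """Hypothetical fuse-attention block placed at each encoder/decoder layer."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, prev_layers):
        # x: (B, T, d_model); prev_layers: list of (B, T, d_model) tensors.
        # Concatenate per-layer histories along the sequence axis so attention
        # can select information from any earlier layer at any position.
        memory = torch.cat(prev_layers, dim=1)           # (B, L*T, d_model)
        fused, _ = self.attn(query=x, key=memory, value=memory)
        return self.norm(x + fused)                      # residual + norm
```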

* Work in progress. arXiv admin note: substantial text overlap with arXiv:2305.12169 

Learn to Compose Syntactic and Semantic Representations Appropriately for Compositional Generalization

May 20, 2023
Lei Lin, Shuangtao Li, Biao Fu, Yafang Zheng, Shan Liu, Yidong Chen, Xiaodong Shi

Recent studies have shown that sequence-to-sequence (Seq2Seq) models are limited in solving compositional generalization (CG) tasks, failing to systematically generalize to unseen compositions of seen components. There is mounting evidence that one of the reasons hindering CG is that the representation of the encoder's uppermost layer is entangled; in other words, the syntactic and semantic representations of sequences are twisted inappropriately. However, most previous studies mainly concentrate on enhancing semantic information at the token level, rather than composing the syntactic and semantic representations of sequences appropriately, as humans do. In addition, we argue that the entanglement problem identified in prior work is not comprehensive, and further hypothesize that the source key and value representations passed into different decoder layers are also entangled. Starting from this intuition and inspired by humans' strategies for CG, we propose COMPSITION (Compose Syntactic and Semantic Representations), an extension to Seq2Seq models that learns to compose representations from different encoder layers appropriately, generating the distinct keys and values passed into different decoder layers, by introducing a composed layer between the encoder and decoder. COMPSITION achieves competitive and even state-of-the-art results on two realistic benchmarks, which empirically demonstrates the effectiveness of our proposal.
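
One way to read the "composed layer" idea is as a learned, per-decoder-layer mixture over encoder layers that produces distinct cross-attention keys and values. The sketch below is a minimal guess at such a layer, assuming a simple softmax-weighted combination; it is not COMPSITION's actual implementation.

```python
import torch
import torch.nn as nn

class ComposedLayer(nn.Module):
    """Hypothetical composed layer between encoder and decoder."""
    def __init__(self, n_enc_layers, n_dec_layers):
        super().__init__()
        # One mixing-weight vector over encoder layers per decoder layer.
        self.weights = nn.Parameter(torch.zeros(n_dec_layers, n_enc_layers))

    def forward(self, enc_outputs):
        # enc_outputs: (n_enc_layers, B, T, H) stacked encoder layer outputs.
        mix = torch.softmax(self.weights, dim=-1)        # (n_dec, n_enc)
        # Weighted sum over encoder layers, one result per decoder layer.
        return torch.einsum("de,ebth->dbth", mix, enc_outputs)
```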

* Work in progress 

LEAPT: Learning Adaptive Prefix-to-prefix Translation For Simultaneous Machine Translation

Mar 21, 2023
Lei Lin, Shuangtao Li, Xiaodong Shi

Simultaneous machine translation, which aims at real-time translation, is useful in many live scenarios but very challenging due to the trade-off between accuracy and latency. To balance the two, the model needs to wait for appropriate streaming text (READ policy) and then generate its translation (WRITE policy). However, the WRITE policies of previous work are either specific to the method itself, owing to end-to-end training, or suffer from an input mismatch between training and decoding under non-end-to-end training. It is therefore essential to learn a generic and better WRITE policy for simultaneous machine translation. Inspired by strategies utilized by human interpreters and by "wait" policies, we propose a novel adaptive prefix-to-prefix training policy called LEAPT, which allows our machine translation model to learn how to translate source sentence prefixes and make use of future context. Experiments show that our proposed methods greatly outperform competitive baselines and achieve promising results.
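
For readers unfamiliar with the setting, the toy loop below shows the fixed wait-k READ/WRITE baseline that adaptive policies like LEAPT improve upon; `model.translate_prefix` is a hypothetical helper that emits the next target token (or None to keep reading), not an interface from the paper.

```python
def wait_k_translate(stream, model, k=3):
    """Toy wait-k loop: READ k source tokens, then alternate WRITE and READ."""
    source, target = [], []
    for token in stream:                  # READ: consume the streaming source
        source.append(token)
        if len(source) >= k:              # after the initial wait...
            word = model.translate_prefix(source, target)  # WRITE one token
            if word is not None:
                target.append(word)
    # Flush: keep writing once the source is exhausted.
    while (word := model.translate_prefix(source, target, eos_ok=True)):
        target.append(word)
    return target
```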

* Accepted by ICASSP 2023 

Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference

Mar 14, 2023
Biao Fu, Kai Fan, Minpeng Liao, Zhongqiang Huang, Boxing Chen, Yidong Chen, Xiaodong Shi

A popular approach to streaming speech translation is to employ a single offline model with a \textit{wait-$k$} policy to support different latency requirements, which is simpler than training multiple online models with different latency constraints. However, there is a mismatch problem in using a model trained with complete utterances for streaming inference with partial input. We demonstrate that speech representations extracted at the end of a streaming input are significantly different from those extracted from a complete utterance. To address this issue, we propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input. FAST includes a Future-Aware Inference (FAI) strategy that incorporates future context through a trainable masked embedding, and a Future-Aware Distillation (FAD) framework that transfers future context from an approximation of full speech to streaming input. Our experiments on the MuST-C En→De, En→Es, and En→Fr benchmarks show that FAST achieves better trade-offs between translation quality and latency than strong baselines. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch problem between offline training and online inference.
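
A minimal sketch of the Future-Aware Inference idea, assuming the trainable "future" context is realized as a block of learned vectors appended to the partial feature sequence; the number of appended vectors and the module interface are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FutureAwareInference(nn.Module):
    """Pads partial speech features with trainable future embeddings."""
    def __init__(self, feat_dim, n_future=16):
        super().__init__()
        # Learned stand-in for the unseen future context.
        self.future = nn.Parameter(torch.zeros(n_future, feat_dim))

    def forward(self, partial_feats):
        # partial_feats: (T_partial, feat_dim) streaming features so far.
        # The encoder then sees an input shaped more like a full utterance.
        return torch.cat([partial_feats, self.future], dim=0)
```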

* Work in progress 

MOPRD: A multidisciplinary open peer review dataset

Dec 09, 2022
Jialiang Lin, Jiaxin Song, Zhangping Zhou, Yidong Chen, Xiaodong Shi

Open peer review is a growing trend in academic publications. Public access to peer review data can benefit both the academic and publishing communities. It also supports studies on review comment generation and, further, the realization of automated scholarly paper review. However, most of the existing peer review datasets do not provide data that cover the whole peer review process. Apart from this, their data are not diverse enough, as they are mainly collected from the field of computer science. These two drawbacks of the currently available peer review datasets need to be addressed to unlock more opportunities for related studies. In response to this problem, we construct MOPRD, a multidisciplinary open peer review dataset. This dataset consists of paper metadata, multi-version manuscripts, review comments, meta-reviews, authors' rebuttal letters, and editorial decisions. Moreover, we design a modular guided review comment generation method based on MOPRD. Experiments show that our method delivers better performance, as indicated by both automatic metrics and human evaluation. We also explore other potential applications of MOPRD, including meta-review generation, editorial decision prediction, author rebuttal generation, and scientometric analysis. MOPRD offers strong support for further studies in peer review-related research and other applications.
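
To make the dataset's structure concrete, a record covering the components listed in the abstract might look like the sketch below; the field names are assumptions for illustration, not MOPRD's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PeerReviewRecord:
    """Illustrative per-paper record mirroring the components MOPRD collects."""
    paper_id: str
    metadata: dict                                     # title, venue, discipline, ...
    manuscripts: list = field(default_factory=list)    # one text per version
    reviews: list = field(default_factory=list)        # reviewers' comments
    meta_reviews: list = field(default_factory=list)
    rebuttals: list = field(default_factory=list)      # authors' rebuttal letters
    decision: str = ""                                 # editorial decision
```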


Towards Better Document-level Relation Extraction via Iterative Inference

Nov 26, 2022
Liang Zhang, Jinsong Su, Yidong Chen, Zhongjian Miao, Zijun Min, Qingguo Hu, Xiaodong Shi

Document-level relation extraction (RE) aims to extract the relations between entities in an input document, which usually contains many hard-to-predict entity pairs whose relations can only be obtained through relational inference. Existing methods usually predict the relations of all entity pairs in the document directly, in a one-pass manner, ignoring the fact that predictions for some entity pairs heavily depend on the predicted results of others. To deal with this issue, we propose a novel document-level RE model with iterative inference. Our model is mainly composed of two modules: 1) a base module expected to provide preliminary relation predictions on entity pairs; 2) an inference module introduced to refine these preliminary predictions by iteratively dealing with hard-to-predict entity pairs that depend on other pairs, in an easy-to-hard manner. Unlike previous methods, which only consider the feature information of entity pairs, our inference module is equipped with two Extended Cross Attention units, allowing it to exploit both feature information and previous predictions of entity pairs during relational inference. Furthermore, we adopt a two-stage strategy to train our model: in the first stage, we train only the base module; in the second stage, we train the whole model, introducing contrastive learning to enhance the training of the inference module. Experimental results on three commonly used datasets show that our model consistently outperforms competitive baselines.
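
The control flow of the two-module design can be sketched as below, assuming the base module yields pairwise features plus preliminary logits and the inference module refines logits round by round; the module interfaces and iteration count are assumptions, not the paper's implementation.

```python
import torch

def iterative_inference(base_module, inference_module, doc, n_iters=3):
    """Hypothetical pipeline: preliminary predictions, then iterative refinement."""
    pair_feats = base_module.encode_pairs(doc)     # (n_pairs, H) pair features
    preds = base_module.classify(pair_feats)       # (n_pairs, n_rel) logits
    for _ in range(n_iters):
        # The inference module sees both features and previous predictions,
        # so easy pairs settled early can inform harder ones later.
        preds = inference_module(pair_feats, preds)
    return torch.sigmoid(preds)                    # multi-label relation scores
```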

* Accepted by EMNLP 2022 (long paper) 

Detecting and analyzing missing citations to published scientific entities

Oct 18, 2022
Jialiang Lin, Yao Yu, Jiaxin Song, Xiaodong Shi

Proper citation is of great importance in academic writing, for it enables knowledge accumulation and maintains academic integrity. However, citing properly is not an easy task. For published scientific entities, the ever-growing body of academic publications and over-familiarity with terms easily lead to missing citations. To deal with this situation, we design a method, Citation Recommendation for Published Scientific Entity (CRPSE), based on the co-occurrences between published scientific entities and in-text citations in the same sentences of previous researchers' papers. Experimental outcomes show the effectiveness of our method in recommending the source papers for published scientific entities. We further conduct a statistical analysis of missing citations among papers published in prestigious computer science conferences in 2020. In the 12,278 papers collected, 475 published scientific entities in computer science and mathematics are found to have missing citations, and many entities mentioned without citations prove to be well-accepted research results. On a median basis, the papers proposing these published scientific entities with missing citations were published 8 years earlier, which can be considered the time frame for a published scientific entity to develop into a well-accepted concept. For published scientific entities, we appeal for accurate and full citation of their source papers as required by academic standards.
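
The co-occurrence statistic at the heart of CRPSE can be sketched roughly as follows, assuming sentences from prior literature have already been parsed into entity mentions and cited paper IDs; this interface is an assumption for illustration.

```python
from collections import Counter, defaultdict

def build_cooccurrence(sentences):
    """Count which papers are cited in sentences that mention each entity.

    `sentences` yields (entity_mentions, cited_paper_ids) pairs extracted
    from previous researchers' papers (a hypothetical preprocessing step).
    """
    table = defaultdict(Counter)
    for entities, citations in sentences:
        for entity in entities:
            table[entity].update(citations)
    return table

def recommend_source(table, entity, top_n=1):
    """Recommend the paper(s) most frequently co-cited with the entity."""
    return [pid for pid, _ in table[entity].most_common(top_n)]
```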

* Scientometrics, Vol. 127, Issue 5, pp. 2395-2412 (2022)  
* Please cite the Scientometrics version

Automatic Analysis of Available Source Code of Top Artificial Intelligence Conference Papers

Sep 28, 2022
Jialiang Lin, Yingmin Wang, Yao Yu, Yu Zhou, Yidong Chen, Xiaodong Shi

Source code is essential for researchers to reproduce the methods and replicate the results of artificial intelligence (AI) papers. Some organizations and researchers manually collect AI papers with available source code to contribute to the AI community, but manual collection is a labor-intensive and time-consuming task. To address this issue, we propose a method to automatically identify papers with available source code and extract their source code repository URLs. With this method, we find that 20.5% of the regular papers of 10 top AI conferences published from 2010 to 2019 have available source code, and that 8.1% of these source code repositories are no longer accessible. We also create the XMU NLP Lab README Dataset, the largest dataset of labeled README files for source code documentation research. Through this dataset, we discover that quite a few README files provide no installation instructions or usage tutorials. Further, a large-scale, comprehensive statistical analysis is conducted to give a general picture of the source code of AI conference papers. The proposed solution can also go beyond AI conference papers to analyze other scientific papers from both journals and conferences, shedding light on more domains.
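
The URL-extraction step could plausibly start from a pattern match over the paper text, as in the sketch below; the hosts and regular expression are illustrative assumptions, and the paper's full method must additionally judge whether a link points to the paper's own code rather than a mere mention.

```python
import re

# Candidate repository hosts; an assumption, not the paper's exact list.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.com|bitbucket\.org)"
    r"/[\w.\-]+/[\w.\-]+",
    re.IGNORECASE,
)

def extract_repo_urls(paper_text):
    """Return de-duplicated candidate source code repository URLs."""
    return sorted(set(REPO_URL.findall(paper_text)))
```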

* International Journal of Software Engineering and Knowledge Engineering, Vol. 32, No. 07, pp. 947-970 (2022)  
* Please cite the IJSEKE version

ConSLT: A Token-level Contrastive Framework for Sign Language Translation

Apr 11, 2022
Biao Fu, Peigen Ye, Liang Zhang, Pei Yu, Cong Hu, Yidong Chen, Xiaodong Shi

Sign language translation (SLT) is an important technology that can bridge the communication gap between deaf and hearing people. The SLT task is essentially a low-resource problem due to the scarcity of publicly available parallel data. To this end, inspired by the success of neural machine translation methods based on contrastive learning, we propose ConSLT, a novel token-level \textbf{Con}trastive learning framework for \textbf{S}ign \textbf{L}anguage \textbf{T}ranslation. Unlike previous contrastive-learning-based works whose goal is to obtain better sentence representations, ConSLT aims to learn effective token representations by pushing apart tokens from different sentences. Concretely, our model follows the two-stage SLT method. First, in the recognition stage, we use a state-of-the-art continuous sign language recognition model to recognize glosses from sign frames. Then, in the translation stage, we adopt the Transformer framework while introducing contrastive learning. Specifically, we pass each gloss sequence through the Transformer model twice to obtain two different hidden-layer representations for each token as "positive examples", and we randomly sample K tokens that are not in the current sentence from the vocabulary as "negative examples" for each token. Experimental results demonstrate that ConSLT achieves new state-of-the-art performance on the PHOENIX14T dataset, with +1.48 BLEU improvements.
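
The token-level objective described above can be sketched as an InfoNCE-style loss, assuming two dropout-induced views of each token form the positive pair and K vocabulary embeddings serve as negatives; the shapes and temperature are assumptions, not ConSLT's exact formulation.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(h1, h2, neg, tau=0.1):
    """Hypothetical token-level contrastive loss.

    h1, h2: (T, H) two dropout views of the same tokens (positives).
    neg:    (T, K, H) embeddings of K out-of-sentence vocabulary tokens.
    """
    pos = F.cosine_similarity(h1, h2, dim=-1) / tau                  # (T,)
    negs = F.cosine_similarity(h1.unsqueeze(1), neg, dim=-1) / tau   # (T, K)
    logits = torch.cat([pos.unsqueeze(1), negs], dim=1)              # (T, 1+K)
    labels = torch.zeros(h1.size(0), dtype=torch.long)               # positive at 0
    return F.cross_entropy(logits, labels)
```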
