"Text": models, code, and papers

Resolution of Unidentified Words in Machine Translation

Jul 28, 2010
Sana Ullah, M. Asdaque Hussain, Kyung Sup Kwak

This paper presents a mechanism for resolving unidentified lexical units in Text-based Machine Translation (TBMT). A Machine Translation (MT) system is unlikely to have a complete lexicon, so there is a pressing need for a mechanism that handles unidentified words. These unknown words may be abbreviations, names, acronyms and newly introduced terms. We propose an algorithm for resolving unidentified words. The algorithm takes the discourse unit (primitive discourse) as its unit of analysis and provides real-time updates to the lexicon. We applied the algorithm manually to newspaper fragments. Along with anaphora and cataphora resolution, many unknown words, especially names and abbreviations, were added to the lexicon.

* 4 pages, 2 figures, 2 tables, The 4th annual International New Exploratory Technologies Conference 2007 (NEXT 2007), Seoul, South Korea 

Offline Arabic Handwriting Recognition Using Artificial Neural Network

Jun 14, 2010
A. A Zaidan, B. B Zaidan, Hamid. A. Jalab, Hamdan. O. Alanazi, Rami Alnaqeib

The aim of a character recognition system is to transform a text document typed on paper into a digital format that can be manipulated by word-processing software. Arabic has features that other languages lack, and its script is also used to write seven or eight other languages, such as Urdu, Jawi and Persian. Arabic has twenty-eight letters, each of which can be joined in three different ways or written in isolation, depending on the case. The difficulty of Arabic handwriting recognition is that the accuracy of character recognition affects the accuracy of word recognition; in addition, each character has two or three forms. The suggested solution, which uses an artificial neural network, can solve this problem and overcome the difficulty of Arabic handwriting recognition.

* Journal of Computer Science and Engineering, Volume 1, Issue 1, p55-58, May 2010 
* Submitted to Journal of Computer Science and Engineering

PageRank without hyperlinks: Structural re-ranking using links induced by language models

Jan 11, 2006
Oren Kurland, Lillian Lee

Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural re-ranking approach to ad hoc information retrieval: we reorder the documents in an initially retrieved set by exploiting asymmetric relationships between them. Specifically, we consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another; in doing so, we take care to prevent bias against long documents. We study a number of re-ranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks.

* Proceedings of SIGIR 2005, pp. 306--313 
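The idea in the abstract above (link documents when one document's language model assigns high probability to another's text, then rank by graph centrality) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Dirichlet-smoothed unigram model, the per-token averaging used to limit length bias, the top-k link rule, the edge direction (from the generated document to its generator), and all names (`build_generation_graph`, `pagerank`, `k`, `mu`) are hypothetical choices made for this example.

```python
import math
from collections import Counter


def build_generation_graph(docs, k=2, mu=10.0):
    """Return weights[o][g] > 0 when document g is a top-k generator of document o.

    docs is a list of non-empty token lists. All modelling choices here are
    assumptions for illustration (see the paragraph above).
    """
    collection = Counter(tok for d in docs for tok in d)
    total = sum(collection.values())
    counts = [Counter(d) for d in docs]

    def avg_log_prob(g, o):
        # Mean per-token log-probability of document o under document g's
        # Dirichlet-smoothed unigram model. Averaging over tokens keeps long
        # documents from being penalised simply for having more tokens.
        log_prob = 0.0
        for tok in docs[o]:
            p = (counts[g][tok] + mu * collection[tok] / total) / (len(docs[g]) + mu)
            log_prob += math.log(p)
        return log_prob / len(docs[o])

    n = len(docs)
    weights = [[0.0] * n for _ in range(n)]
    for o in range(n):
        scored = sorted(
            ((avg_log_prob(g, o), g) for g in range(n) if g != o), reverse=True
        )
        for score, g in scored[:k]:
            weights[o][g] = math.exp(score)
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a dense weight matrix."""
    n = len(weights)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for o in range(n):
            out = sum(weights[o])
            for g in range(n):
                if out > 0:
                    new[g] += damping * rank[o] * weights[o][g] / out
                else:
                    # Dangling node: spread its mass uniformly.
                    new[g] += damping * rank[o] / n
        rank = new
    return rank


docs = [
    "language model retrieval ranking".split(),
    "language model smoothing retrieval".split(),
    "cooking pasta recipe".split(),
]
centrality = pagerank(build_generation_graph(docs, k=1))
```

In this toy collection the third document shares no vocabulary with the others, so no document links to it and it receives the lowest centrality. The abstract describes integrating centrality with the initial retrieval scores; how to combine the two is left open here.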

Question Answering over Unstructured Data without Domain Restrictions

Jul 18, 2002
Jochen L. Leidner

Information needs are naturally represented as questions. Automatic Natural-Language Question Answering (NLQA) has only recently become a practical task on a larger scale and without domain constraints. This paper gives a brief introduction to the field, its history and the impact of systematic evaluation competitions. It is then demonstrated that an NLQA system for English can be built and evaluated in a very short time using off-the-shelf parsers and thesauri. The system is based on Robust Minimal Recursion Semantics (RMRS) and is portable with respect to the parser used as a frontend. It applies atomic term unification supported by question classification and WordNet lookup for semantic similarity matching of parsed question representation and free text.

* 8 pages, 6 figures, 5 tables. To appear in Proc. TaCoS'02, Potsdam, Germany 

Anaphora Resolution in Japanese Sentences Using Surface Expressions and Examples

Sep 19, 2000
Masaki Murata

Anaphora resolution is one of the major problems in natural language processing. It is also one of the important tasks in machine translation and man/machine dialogue. We solve the problem by using surface expressions and examples. Surface expressions are the words in sentences which provide clues for anaphora resolution. Examples are linguistic data which are actually used in conversations and texts. The method using surface expressions and examples is a practical method. This thesis handles almost all kinds of anaphora: (i) the referential property and number of a noun phrase; (ii) noun phrase direct anaphora; (iii) noun phrase indirect anaphora; (iv) pronoun anaphora; (v) verb phrase ellipsis.

* 156 pages. Doctoral thesis in Kyoto University, December 1996, supervised by M. Nagao 

With raised eyebrows or the eyebrows raised ? A Neural Network Approach to Grammar Checking for Definiteness

Jun 14, 1996
Gabriele Scheler

In this paper, we use a feature model of the semantics of plural determiners to present an approach to grammar checking for definiteness. Using neural network techniques, a semantics -- morphological category mapping was learned. We then applied a textual encoding technique to the 125 occurrences of the relevant category in a 10 000 word narrative text and learned a surface -- semantics mapping. By applying the learned generation function to the newly generated representations, we achieved a correct category assignment in many cases (87 %). These results are considerably better than a direct surface categorization approach (54 %), with a baseline (always guessing the dominant category) of 60 %. We discuss how these results could be used in multilingual NLP applications.

* 14 pages 

Evaluating Discourse Processing Algorithms

Oct 11, 1994
Marilyn A. Walker

In order to take steps towards establishing a methodology for evaluating Natural Language systems, we conducted a case study. We attempt to evaluate two different approaches to anaphoric processing in discourse by comparing the accuracy and coverage of two published algorithms for finding the co-specifiers of pronouns in naturally occurring texts and dialogues. We present the quantitative results of hand-simulating these algorithms, but this analysis naturally gives rise to both a qualitative evaluation and recommendations for performing such evaluations in general. We illustrate the general difficulties encountered with quantitative evaluation. These are problems with: (a) allowing for underlying assumptions, (b) determining how to handle underspecifications, and (c) evaluating the contribution of false positives and error chaining.

* Association of Computational Linguistics, 1989, p. 251-262 
* plain latex but includes psfig.tex, 11 pages with one psfig, published in 27th Annual Meeting of the ACL, 1989 

SubER: A Metric for Automatic Evaluation of Subtitle Quality

May 11, 2022
Patrick Wilken, Panayota Georgakopoulou, Evgeny Matusov

This paper addresses the problem of evaluating the quality of automatically generated subtitles, which includes not only the quality of the machine-transcribed or translated speech, but also the quality of line segmentation and subtitle timing. We propose SubER, a single novel metric based on edit distance with shifts that takes all of these subtitle properties into account. We compare it to existing metrics for evaluating transcription, translation, and subtitle quality. A careful human evaluation in a post-editing scenario shows that the new metric has a high correlation with the post-editing effort and direct human assessment scores, outperforming baseline metrics considering only the subtitle text, such as WER and BLEU, and existing methods to integrate segmentation and timing features.

* IWSLT 2022 
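For orientation, the text-only WER baseline that the abstract above compares against can be sketched as a word-level edit distance normalised by the reference length. This is not SubER: it has no shift operation and ignores line segmentation and timing, which are the properties the proposed metric adds. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words.

    Tokenisation by whitespace is an assumption of this sketch.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because the score depends only on the word sequence, two subtitle files with identical text but different line breaks or timings receive the same WER, which is the limitation the abstract points to.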

HumanAL: Calibrating Human Matching Beyond a Single Task

May 06, 2022
Roee Shraga

This work offers a novel view on the use of human input as labels, acknowledging that humans may err. We build a behavioral profile for human annotators which is used as a feature representation of the provided input. We show that by utilizing black-box machine learning, we can take into account human behavior and calibrate their input to improve the labeling quality. To support our claims and provide a proof-of-concept, we experiment with three different matching tasks, namely, schema matching, entity matching and text matching. Our empirical evaluation suggests that the method can improve the quality of gathered labels in multiple settings including cross-domain (across different matching tasks).

* To appear in HILDA, co-located with SIGMOD 2022

Automatic Speech recognition for Speech Assessment of Preschool Children

Mar 24, 2022
Amirhossein Abaskohi, Fatemeh Mortazavi, Hadi Moradi

The acoustic and linguistic features of preschool speech are investigated in this study to design an automated speech recognition (ASR) system. Acoustic fluctuation has been highlighted as a significant barrier to developing high-performance ASR applications for youngsters. Because of the epidemic, preschool speech assessment should be conducted online. Accordingly, there is a need for an automatic speech recognition system. We were confronted with new challenges in our cognitive system, including converting meaningless words from speech to text and recognizing word sequence. After testing and experimenting with several models we obtained a 3.1% phoneme error rate in Persian. Wav2Vec 2.0 is a paradigm that could be used to build a robust end-to-end speech recognition system.

* 10 pages, 5 figures 
