Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Learning to Describe Phrases with Local and Global Contexts

Nov 01, 2018
Shonosuke Ishiwatari, Hiroaki Hayashi, Naoki Yoshinaga, Graham Neubig, Masashi Toyoda, Masaru Kitsuregawa

When reading a text, it is common to become stuck on unfamiliar words and phrases, such as polysemous words with novel senses, rarely used idioms, Internet slang, or emerging entities. At first, we attempt to figure out the meaning of those expressions from their context, and ultimately we may consult a dictionary for their definitions. However, rarely-used senses or emerging entities are not always covered by the hand-crafted definitions in existing dictionaries, which can cause problems in text comprehension. This paper undertakes a task of describing (or defining) a given expression (word or phrase) based on its usage contexts, and presents a novel neural-network generator for expressing its meaning as a natural language description. Experimental results on four datasets (including WordNet, Oxford and Urban Dictionaries, non-standard English, and Wikipedia) demonstrate the effectiveness of our method over previous methods for definition generation[Noraset+17; Gadetsky+18; Ni+17].

  Access Paper or Ask Questions

An Improved Phrase-based Approach to Annotating and Summarizing Student Course Responses

May 25, 2018
Wencan Luo, Fei Liu, Diane Litman

Teaching large classes remains a great challenge, primarily because it is difficult to attend to all the student needs in a timely manner. Automatic text summarization systems can be leveraged to summarize the student feedback, submitted immediately after each lecture, but it is left to be discovered what makes a good summary for student responses. In this work we explore a new methodology that effectively extracts summary phrases from the student responses. Each phrase is tagged with the number of students who raise the issue. The phrases are evaluated along two dimensions: with respect to text content, they should be informative and well-formed, measured by the ROUGE metric; additionally, they shall attend to the most pressing student needs, measured by a newly proposed metric. This work is enabled by a phrase-based annotation and highlighting scheme, which is new to the summarization task. The phrase-based framework allows us to summarize the student responses into a set of bullet points and present to the instructor promptly.

* 11 pages 

  Access Paper or Ask Questions

Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

Aug 15, 2017
Måns Magnusson, Leif Jonsson, Mattias Villani, David Broman

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.

* Accepted for publication in Journal of Computational and Graphical Statistics 

  Access Paper or Ask Questions

Image Segmentation Using Overlapping Group Sparsity

Dec 21, 2016
Shervin Minaee, Yao Wang

Sparse decomposition has been widely used for different applications, such as source separation, image classification and image denoising. This paper presents a new algorithm for segmentation of an image into background and foreground text and graphics using sparse decomposition. First, the background is represented using a suitable smooth model, which is a linear combination of a few smoothly varying basis functions, and the foreground text and graphics are modeled as a sparse component overlaid on the smooth background. Then the background and foreground are separated using a sparse decomposition framework and imposing some prior information, which promote the smoothness of background, and the sparsity and connectivity of foreground pixels. This algorithm has been tested on a dataset of images extracted from HEVC standard test sequences for screen content coding, and is shown to outperform prior methods, including least absolute deviation fitting, k-means clustering based segmentation in DjVu, and shape primitive extraction and coding algorithm.

* arXiv admin note: substantial text overlap with arXiv:1602.02434. appears in IEEE Signal Processing in Medicine and Biology Symposium, 2016 

  Access Paper or Ask Questions

Supervised learning Methods for Bangla Web Document Categorization

Oct 08, 2014
Ashis Kumar Mandal, Rikta Sen

This paper explores the use of machine learning approaches, or more specifically, four supervised learning Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Na\"ive Bays (NB), and Support Vector Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy document feature vectors.

* 13 pages, International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 5, No. 5, September 2014 

  Access Paper or Ask Questions

Corpus-based Web Document Summarization using Statistical and Linguistic Approach

Apr 09, 2013
Rushdi Shams, M. M. A. Hashem, Afrina Hossain, Suraiya Rumana Akter, Monika Gope

Single document summarization generates summary by extracting the representative sentences from the document. In this paper, we presented a novel technique for summarization of domain-specific text from a single web document that uses statistical and linguistic analysis on the text in a reference corpus and the web document. The proposed summarizer uses the combinational function of Sentence Weight (SW) and Subject Weight (SuW) to determine the rank of a sentence, where SW is the function of number of terms (t_n) and number of words (w_n) in a sentence, and term frequency (t_f) in the corpus and SuW is the function of t_n and w_n in a subject, and t_f in the corpus. 30 percent of the ranked sentences are considered to be the summary of the web document. We generated three web document summaries using our technique and compared each of them with the summaries developed manually from 16 different human subjects. Results showed that 68 percent of the summaries produced by our approach satisfy the manual summaries.

* Procs. of the IEEE International Conference on Computer and Communication Engineering (ICCCE10), pp. 115-120, Kuala Lumpur, Malaysia, May 11-13, (2010) 

  Access Paper or Ask Questions

Precision-biased Parsing and High-Quality Parse Selection

May 20, 2012
Yoav Goldberg, Michael Elhadad

We introduce precision-biased parsing: a parsing task which favors precision over recall by allowing the parser to abstain from decisions deemed uncertain. We focus on dependency-parsing and present an ensemble method which is capable of assigning parents to 84% of the text tokens while being over 96% accurate on these tokens. We use the precision-biased parsing task to solve the related high-quality parse-selection task: finding a subset of high-quality (accurate) trees in a large collection of parsed text. We present a method for choosing over a third of the input trees while keeping unlabeled dependency parsing accuracy of 97% on these trees. We also present a method which is not based on an ensemble but rather on directly predicting the risk associated with individual parser decisions. In addition to its efficiency, this method demonstrates that a parsing system can provide reasonable estimates of confidence in its predictions without relying on ensembles or aggregate corpus counts.

* Rejected from EMNLP 2012 (among others) 

  Access Paper or Ask Questions

LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents

Apr 01, 2022
Debanjan Mahata, Navneet Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah

Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. Vast majority of the benchmark datasets for this task are from the scientific domain containing only the document title and abstract information. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identify keyphrases from human-written summaries that are often very short (approx 8 sentences). This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract. Therefore, we release two extensive corpora mapping KPs of ~1.3M and ~100K scientific articles with their fully extracted text and additional metadata including publication venue, year, author, field of study, and citations for facilitating research on this real-world problem.

  Access Paper or Ask Questions

Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer

Mar 17, 2022
Woojeong Jin, Dong-Ho Lee, Chenguang Zhu, Jay Pujara, Xiang Ren

Pre-trained language models are still far from human performance in tasks that need understanding of properties (e.g. appearance, measurable quantity) and affordances of everyday objects in the real world since the text lacks such information due to reporting bias. In this work, we study whether integrating visual knowledge into a language model can fill the gap. We investigate two types of knowledge transfer: (1) text knowledge transfer using image captions that may contain enriched visual knowledge and (2) cross-modal knowledge transfer using both images and captions with vision-language training objectives. On 5 downstream tasks that may need visual knowledge to solve the problem, we perform extensive empirical comparisons over the presented objectives. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.

* Accepted to ACL 2022, 13 pages, 4 figures 

  Access Paper or Ask Questions

Variational Autoencoder with Disentanglement Priors for Low-Resource Task-Specific Natural Language Generation

Mar 15, 2022
Zhuang Li, Lizhen Qu, Qiongkai Xu, Tongtong Wu, Tianyang Zhan, Gholamreza Haffari

In this paper, we propose a variational autoencoder with disentanglement priors, VAE-DPRIOR, for conditional natural language generation with none or a handful of task-specific labeled examples. In order to improve compositional generalization, our model performs disentangled representation learning by introducing a prior for the latent content space and another prior for the latent label space. We show both empirically and theoretically that the conditional priors can already disentangle representations even without specific regularizations as in the prior work. We can also sample diverse content representations from the content space without accessing data of the seen tasks, and fuse them with the representations of novel tasks for generating diverse texts in the low-resource settings. Our extensive experiments demonstrate the superior performance of our model over competitive baselines in terms of i) data augmentation in continuous zero/few-shot learning, and ii) text style transfer in both zero/few-shot settings.

* 11 pages 

  Access Paper or Ask Questions