
"Text": models, code, and papers

An Ontology for Modelling and Supporting the Process of Authoring Technical Assessments

Feb 19, 2013
Khalil Riad Bouzidi, Bruno Fies, Marc Bourdeau, Catherine Faron-Zucker, Nhan Le-Thanh

In this paper, we present a Semantic Web approach to modelling the process of creating new technical and regulatory documents in the building sector. This industry, among others, is currently experiencing phenomenal growth in the volume of its technical and regulatory texts. It is therefore urgent and crucial to improve the process of creating regulations by automating it as much as possible. We focus on the creation of particular technical documents issued by the French Scientific and Technical Centre for Building (CSTB), called Technical Assessments, and we propose services based on Semantic Web models and techniques for modelling the process of their creation.

* In the International Council for Building Conference, CIB 2011 (2011) 


Matrix Approximation under Local Low-Rank Assumption

Jan 15, 2013
Joonseok Lee, Seungyeon Kim, Guy Lebanon, Yoram Singer

Matrix approximation is a common tool in machine learning for building accurate prediction models for recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low-rank. We propose a new matrix approximation model where we assume instead that the matrix is only locally of low-rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy in recommendation tasks.

* 3 pages, 2 figures, Workshop submission to the First International Conference on Learning Representations (ICLR) 
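The weighted-sum representation described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the rank-1 local models, the anchor entries, and the triangular smoothing kernel below are all invented for demonstration, and real systems would fit the local factors to observed entries.

```python
# Toy sketch of a locally low-rank representation: the estimated matrix is a
# kernel-weighted combination of rank-1 local models, each anchored at an entry.

def outer(u, v):
    """Rank-1 matrix built from the outer product of two vectors."""
    return [[ui * vj for vj in v] for ui in u]

def kernel(i, j, anchor, width=4.0):
    """Illustrative triangular smoothing weight around an anchor entry."""
    ai, aj = anchor
    d = abs(i - ai) + abs(j - aj)
    return max(0.0, 1.0 - d / width)

def local_low_rank(local_models, anchors, n, m):
    """Mhat[i][j] = sum_t K_t(i,j) * T_t[i][j] / sum_t K_t(i,j)."""
    est = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            wsum = acc = 0.0
            for T, a in zip(local_models, anchors):
                w = kernel(i, j, a)
                wsum += w
                acc += w * T[i][j]
            est[i][j] = acc / wsum if wsum > 0 else 0.0
    return est

# Two toy rank-1 local models on a 4x4 matrix, anchored at opposite corners.
T1 = outer([1, 1, 0, 0], [1, 1, 0, 0])
T2 = outer([0, 0, 1, 1], [0, 0, 1, 1])
Mhat = local_low_rank([T1, T2], [(0, 0), (3, 3)], 4, 4)
```

Near each anchor the estimate reduces to that anchor's local model, which is the intuition behind the local low-rank assumption.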


A Practical Algorithm for Topic Modeling with Provable Guarantees

Dec 19, 2012
Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.

* 26 pages 


Alberti's letter counts

Oct 26, 2012
Bernard Ycart

Four centuries before modern statistical linguistics was born, Leon Battista Alberti (1404--1472) compared the frequency of vowels in Latin poems and orations, making the first quantified observation of a stylistic difference ever. Using a corpus of 20 Latin texts (over 5 million letters), Alberti's observations are statistically assessed. Letter counts prove that poets used significantly more a's, e's, and y's, whereas orators used more of the other vowels. The sample sizes needed to justify the assertions are studied, and proved to be within reach for Alberti's scholarship.

* Literary and Linguistic Computing (2013), doi:10.1093/llc/fqt034 


Asymptotic Analysis of Generative Semi-Supervised Learning

Feb 26, 2010
Joshua V Dillon, Krishnakumar Balasubramanian, Guy Lebanon

Semi-supervised learning has emerged as a popular framework for improving modeling accuracy while controlling labeling cost. Based on an extension of stochastic composite likelihood, we quantify the asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free analysis by providing an alternative framework to measure the value associated with different labeling policies and resolve the fundamental question of how much data to label and in what manner. We demonstrate our approach with both simulation studies and real-world experiments using naive Bayes for text classification and MRFs and CRFs for structured prediction in NLP.

* 12 pages, 9 figures 
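The generative semi-supervised setting referenced in the abstract can be illustrated with a toy sketch: fit a naive Bayes text classifier on a few labeled documents, then fold in unlabeled documents with soft labels via one EM-style pass. The tiny corpus, the single iteration, and all counts are purely illustrative; the paper's contribution is the asymptotic analysis, not this procedure.

```python
# Toy generative semi-supervised naive Bayes: label the unlabeled document
# softly with the current model, then retrain on labeled + soft-labeled data.
import math
from collections import Counter

def train_nb(docs, smoothing=1.0):
    """docs: list of (word_list, {class: weight}) pairs.
    Returns per-class log-priors and per-word log-likelihoods."""
    classes = sorted({c for _, w in docs for c in w})
    class_w = {c: 0.0 for c in classes}
    word_w = {c: Counter() for c in classes}
    vocab = set()
    for words, weights in docs:
        vocab.update(words)
        for c, w in weights.items():
            class_w[c] += w
            for t in words:
                word_w[c][t] += w
    total = sum(class_w.values())
    log_prior = {c: math.log(class_w[c] / total) for c in classes}
    log_cond = {}
    for c in classes:
        denom = sum(word_w[c].values()) + smoothing * len(vocab)
        log_cond[c] = {t: math.log((word_w[c][t] + smoothing) / denom)
                       for t in vocab}
    return log_prior, log_cond

def posterior(words, log_prior, log_cond):
    """Class posterior for a document; out-of-vocabulary words are skipped."""
    scores = {c: log_prior[c] + sum(log_cond[c].get(t, 0.0) for t in words)
              for c in log_prior}
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

labeled = [(["great", "fun"], {"pos": 1.0}),
           (["dull", "bad"], {"neg": 1.0})]
unlabeled = [["great", "great", "fun"]]

lp, lc = train_nb(labeled)
# E-step: soft-label the unlabeled document; M-step: retrain on everything.
soft = [(d, posterior(d, lp, lc)) for d in unlabeled]
lp2, lc2 = train_nb(labeled + soft)
pred = max(posterior(["great"], lp2, lc2).items(), key=lambda kv: kv[1])[0]
```

The question the paper studies is exactly how much such unlabeled data is worth, asymptotically, under different labeling policies.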


Machine Transliteration

Apr 14, 1997
Kevin Knight, Jonathan Graehl

It is challenging to translate names and technical terms across languages with different alphabets and sound inventories. These items are commonly transliterated, i.e., replaced with approximate phonetic equivalents. For example, "computer" in English comes out as "konpyuutaa" in Japanese. Translating such items from Japanese back to English is even more challenging, and of practical interest, as transliterated items make up the bulk of text phrases not found in bilingual dictionaries. We describe and evaluate a method for performing backwards transliterations by machine. This method uses a generative model, incorporating several distinct stages in the transliteration process.

* 8 pages, postscript, to appear, ACL-97/EACL-97 
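The generative, staged view of back-transliteration described in the abstract can be sketched as a noisy-channel decision: score each candidate English source by a source prior times a channel probability. The candidate list and all probabilities below are invented for illustration; the actual model composes several finite-state stages (pronunciation, sound mapping, katakana spelling).

```python
# Toy noisy-channel back-transliteration: choose argmax_e P(e) * P(observed | e).
import math

# Hypothetical prior over English candidates (e.g. from a language model).
prior = {"computer": 0.6, "commuter": 0.4}

# Hypothetical channel probabilities P(romanized Japanese | English source).
channel = {
    ("konpyuutaa", "computer"): 0.7,
    ("konpyuutaa", "commuter"): 0.05,
}

def back_transliterate(observed, candidates):
    """Score candidates in log space; unseen pairs get a tiny floor prob."""
    def score(e):
        return math.log(prior[e]) + math.log(channel.get((observed, e), 1e-9))
    return max(candidates, key=score)

best = back_transliterate("konpyuutaa", ["computer", "commuter"])
```

Decomposing the problem this way lets each stage be trained separately, which is the design choice the paper's multi-stage generative model makes.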


Towards a Workbench for Acquisition of Domain Knowledge from Natural Language

Apr 30, 1996
Andrei Mikheev, Steven Finch

In this paper we describe the architecture and functionality of the main components of a workbench for the acquisition of domain knowledge from large text corpora. The workbench supports an incremental process of corpus analysis, starting from a rough automatic extraction and organization of lexico-semantic regularities and ending with computer-supported analysis of the extracted data and semi-automatic refinement of the obtained hypotheses. To do this, the workbench employs methods from computational linguistics, information retrieval, and knowledge engineering. Although the workbench is still under development, some of its components are already implemented, and their performance is illustrated with samples from knowledge engineering for a medical domain.

* 8 pages, compressed postscript; Proceedings of EACL-95 Dublin, Ireland 


Transformer based ensemble for emotion detection

Apr 10, 2022
Aditya Kane, Shantanu Patankar, Sahil Khose, Neeraja Kirtane

Detecting emotions in language is important for achieving complete interaction between humans and machines. This paper describes our contribution to the WASSA 2022 shared task, which addresses this crucial task of emotion detection. The task is to identify one of the following emotions from a given essay text: sadness, surprise, neutral, anger, fear, disgust, or joy. We use an ensemble of ELECTRA and BERT models to tackle this problem, achieving an F1 score of $62.76\%$. Our codebase and our WandB project are publicly available.

* Accepted at WASSA, ACL 2022 
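The ensembling mentioned in the abstract can be sketched at its simplest: average the per-class probabilities from the two models and take the argmax. The abstract does not specify the combination rule, so this probability-averaging scheme and the probability vectors below are illustrative assumptions; only the emotion label set comes from the task.

```python
# Minimal probability-level ensemble over the shared task's seven emotions.
EMOTIONS = ["sadness", "surprise", "neutral", "anger", "fear", "disgust", "joy"]

def ensemble_predict(prob_a, prob_b, labels=EMOTIONS):
    """Average two probability distributions over the same label set
    and return the label with the highest mean probability."""
    avg = [(pa + pb) / 2 for pa, pb in zip(prob_a, prob_b)]
    return labels[avg.index(max(avg))]

# Hypothetical outputs of an ELECTRA-style and a BERT-style classifier
# for one essay: the two models disagree, and the ensemble arbitrates.
electra_probs = [0.10, 0.05, 0.20, 0.40, 0.10, 0.05, 0.10]
bert_probs    = [0.10, 0.05, 0.45, 0.15, 0.10, 0.05, 0.10]

label = ensemble_predict(electra_probs, bert_probs)
```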


Generative Spoken Dialogue Language Modeling

Mar 30, 2022
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It combines recent work on unsupervised spoken unit discovery with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. It is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously, and it reproduces naturalistic turn-taking. Generation samples are available online.


A Feasibility Study of Answer-Agnostic Question Generation for Education

Mar 29, 2022
Liam Dugan, Eleni Miltsakaki, Shriyash Upadhyay, Etan Ginsberg, Hannah Gonzalez, Dayheon Choi, Chuning Yuan, Chris Callison-Burch

We conduct a feasibility study into the applicability of answer-agnostic question generation models to textbook passages. We show that a significant portion of errors in such systems arise from asking irrelevant or uninterpretable questions and that such errors can be ameliorated by providing summarized input. We find that giving these models human-written summaries instead of the original text results in a significant increase in acceptability of generated questions (33% $\rightarrow$ 83%) as determined by expert annotators. We also find that, in the absence of human-written summaries, automatic summarization can serve as a good middle ground.

* To be published in 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) 
