"Text": models, code, and papers

Fine-Grained Sentence Functions for Short-Text Conversation

Jul 26, 2019
Wei Bi, Jun Gao, Xiaojiang Liu, Shuming Shi

Sentence function is an important linguistic feature referring to a user's purpose in uttering a specific sentence. Using sentence functions has shown promising results in improving the performance of conversation models. However, there is no large conversation dataset annotated with sentence functions. In this work, we collect a new Short-Text Conversation dataset with manually annotated SEntence FUNctions (STC-Sefun). Classification models are trained on this dataset to (i) recognize the sentence function of new data in a large corpus of short-text conversations; (ii) estimate a proper sentence function of the response given a test query. We later train conversation models conditioned on the sentence functions, including information retrieval-based and neural generative models. Experimental results demonstrate that the use of sentence functions can help improve the quality of the returned responses.

* This is a revised version of our paper accepted at ACL 2019


An Algorithm Based on Empirical Methods, for Text-to-Tuneful-Speech Synthesis of Sanskrit Verse

Sep 15, 2014
Rama N., Meenakshi Lakshmanan

The rendering of Sanskrit poetry from text to speech is a problem that has not been solved before. One reason may be the complications in the language itself. We present unique algorithms, based on extensive empirical analysis, to synthesize speech from a given text input of Sanskrit verses. Using a database of pre-recorded audio units that is itself tremendously reduced in size compared to the colossal size that would otherwise be required, the algorithms work on producing the best possible, tunefully rendered chanting of the given verse. This would enable the visually impaired and those with reading disabilities to easily access the contents of Sanskrit verses otherwise available only in writing.

* International Journal of Computer Science and Network Security, Vol. 10, No. 1, January 2010


Finite-sample Analysis of M-estimators using Self-concordance

Oct 16, 2018
Dmitrii Ostrovskii, Francis Bach

We demonstrate how self-concordance of the loss can be exploited to obtain asymptotically optimal rates for M-estimators in finite-sample regimes. We consider two classes of losses: (i) canonically self-concordant losses in the sense of Nesterov and Nemirovski (1994), i.e., with the third derivative bounded by the $3/2$ power of the second; (ii) pseudo self-concordant losses, for which the power is removed, as introduced by Bach (2010). These classes contain some losses arising in generalized linear models, including logistic regression; in addition, the second class includes some common pseudo-Huber losses. Our results consist in establishing the critical sample size sufficient to reach the asymptotically optimal excess risk for both classes of losses. Denoting by $d$ the parameter dimension, and by $d_{\text{eff}}$ the effective dimension, which takes into account possible model misspecification, we find the critical sample size to be $O(d_{\text{eff}} \cdot d)$ for canonically self-concordant losses, and $O(\rho \cdot d_{\text{eff}} \cdot d)$ for pseudo self-concordant losses, where $\rho$ is a problem-dependent local curvature parameter. In contrast to existing results, we only impose local assumptions on the data distribution, assuming that the calibrated design, i.e., the design scaled with the square root of the second derivative of the loss, is subgaussian at the best predictor $\theta_*$. Moreover, we obtain improved bounds on the critical sample size, scaling near-linearly in $\max(d_{\text{eff}},d)$, under the extra assumption that the calibrated design is subgaussian in the Dikin ellipsoid of $\theta_*$. Motivated by these findings, we construct canonically self-concordant analogues of the Huber and logistic losses with improved statistical properties. Finally, we extend some of these results to $\ell_1$-regularized M-estimators in high dimensions.
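
As a hedged illustration of the two loss classes named in the abstract, the conditions can be written for a one-dimensional loss $\phi$ as follows. The constants (2 for the canonical case, 1 for the pseudo case) are the conventional ones and are an assumption here; the paper's exact normalization may differ.

$$
|\phi'''(t)| \le 2\,[\phi''(t)]^{3/2} \quad \text{(canonically self-concordant)}, \qquad |\phi'''(t)| \le \phi''(t) \quad \text{(pseudo self-concordant)}.
$$

For the logistic loss $\phi(t) = \log(1 + e^{-t})$, writing $\sigma(t) = 1/(1 + e^{-t})$:

$$
\phi''(t) = \sigma(t)\,(1 - \sigma(t)), \qquad \phi'''(t) = \sigma(t)\,(1 - \sigma(t))\,(1 - 2\sigma(t)),
$$

so $|\phi'''(t)| \le \phi''(t)$ because $|1 - 2\sigma(t)| \le 1$. This places the logistic loss in the pseudo self-concordant class.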


Preliminary experiments on automatic gender recognition based on online capital letters

Mar 11, 2022
Marcos Faundez-Zanuy, Enric Sesa-Nogueras

In this paper we present some experiments to automatically classify online handwritten text based on capital letters. Although handwritten text is not as discriminative as face or voice, we still found some potential for gender classification based on handwritten text. Accuracies are up to 74%, even in the most challenging case of capital letters.

* In: Bassis S., Esposito A., Morabito F. (eds) Recent Advances of Neural Network Models and Applications. Smart Innovation, Systems and Technologies, vol 26. Springer, Cham. 2014 
* 9 pages 


PET: A new Dataset for Process Extraction from Natural Language Text

Mar 09, 2022
Patrizio Bellan, Han van der Aa, Mauro Dragoni, Chiara Ghidini, Simone Paolo Ponzetto

Although there is a long tradition of work in NLP on extracting entities and relations from text, to date there exists little work on the acquisition of business processes from unstructured data such as textual corpora of process descriptions. With this work we aim to fill this gap and establish the first steps towards bridging data-driven information extraction methodologies from Natural Language Processing and the model-based formalization that Business Process Management aims for. For this, we develop the first corpus of business process descriptions annotated with activities, gateways, actors and flow information. We present our new resource, including a detailed overview of the annotation schema and guidelines, as well as a variety of baselines to benchmark the difficulty and challenges of business process extraction from text.


Plan-then-Generate: Controlled Data-to-Text Generation via Planning

Aug 31, 2021
Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, Nigel Collier

Recent developments in neural networks have led to advances in data-to-text generation. However, the inability of neural models to control the structure of the generated output can be limiting in certain real-world applications. In this study, we propose a novel Plan-then-Generate (PlanGen) framework to improve the controllability of neural data-to-text models. Extensive experiments and analyses are conducted on two benchmark datasets, ToTTo and WebNLG. The results show that our model is able to control both the intra-sentence and inter-sentence structure of the generated output. Furthermore, empirical comparisons against previous state-of-the-art methods show that our model improves the generation quality as well as the output diversity as judged by human and automatic evaluations.

* Accepted to Findings of EMNLP 2021 


Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval

Apr 13, 2021
Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, Hal Daumé III

Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming that text corpora are semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BeamDR) that iteratively forms an evidence chain through beam search in dense representations. When evaluated on multi-hop question answering, BeamDR is competitive with state-of-the-art systems, without using any semi-structured information. Through query composition in dense space, BeamDR captures the implicit relationships between evidence in the reasoning chain. The code is available at henryzhao5852/BeamDR.

* NAACL 2021 
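
The idea of forming an evidence chain through beam search in dense representations can be sketched roughly as below. This is a toy illustration, not the BeamDR implementation: the function names `compose` and `beam_dense_retrieval` are hypothetical, passages are given as precomputed vectors, and query composition is assumed to be a normalized vector sum, whereas the paper composes queries with a learned encoder.

```python
import numpy as np

def compose(query_vec, passage_vec):
    """Hypothetical query composition: normalized sum of query and passage vectors."""
    combined = query_vec + passage_vec
    return combined / np.linalg.norm(combined)

def beam_dense_retrieval(query_vec, passage_vecs, hops=2, beam_size=2):
    """Toy beam search over dense vectors; returns (chain, score) pairs, best first."""
    # Each beam holds: indices of passages chosen so far, current query vector, score.
    beams = [((), query_vec, 0.0)]
    for _ in range(hops):
        candidates = []
        for chain, q, score in beams:
            sims = passage_vecs @ q              # inner-product relevance scores
            if chain:
                sims[list(chain)] = -np.inf      # do not revisit passages in this chain
            for idx in np.argsort(-sims)[:beam_size]:
                new_q = compose(q, passage_vecs[idx])
                candidates.append((chain + (int(idx),), new_q, score + float(sims[idx])))
        # Keep only the highest-scoring partial chains for the next hop.
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = candidates[:beam_size]
    return [(chain, score) for chain, _, score in beams]

# Made-up passage vectors, for illustration only.
passages = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])
results = beam_dense_retrieval(query, passages, hops=2, beam_size=2)
```

Keeping several partial chains per hop, rather than only the single best passage, lets a chain that starts with a slightly weaker first passage still win if its second hop scores well.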


Limits of Detecting Text Generated by Large-Scale Language Models

Feb 09, 2020
Lav R. Varshney, Nitish Shirish Keskar, Richard Socher

Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a standard measure of language generation performance. Under the assumption that human language is stationary and ergodic, the formulation is extended from considering specific language models to considering maximum likelihood language models, among the class of k-order Markov approximations; error probabilities are characterized. Some discussion of incorporating semantic side information is also given.

* ITA 2020 
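
The link between detection and perplexity can be illustrated with a toy sketch. It assumes i.i.d. tokens drawn from small, made-up categorical distributions, which is far simpler than the stationary ergodic and Markov settings the abstract describes; all function names are hypothetical. It uses the identity that the KL divergence from the genuine distribution to the model equals the log-perplexity of the model on genuine text minus the entropy of the genuine text, so a model with lower perplexity is closer to the genuine source and harder to detect.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy (in nats) of model q measured on source p."""
    return -sum(p[t] * math.log(q[t]) for t in p if p[t] > 0)

def perplexity(p, q):
    """Perplexity of model q on text drawn from p."""
    return math.exp(cross_entropy(p, q))

def kl_divergence(p, q):
    """KL(p || q) = log-perplexity of q on p minus the entropy of p."""
    return cross_entropy(p, q) - cross_entropy(p, p)

def log_likelihood_ratio(tokens, p, q):
    """Positive values favor 'genuine' (p); negative values favor 'generated' (q)."""
    return sum(math.log(p[t] / q[t]) for t in tokens)

# Made-up distributions for illustration only.
genuine = {"a": 0.5, "b": 0.5}
model = {"a": 0.25, "b": 0.75}

divergence = kl_divergence(genuine, model)
llr_a = log_likelihood_ratio(["a", "a", "a", "a"], genuine, model)
llr_b = log_likelihood_ratio(["b", "b", "b", "b"], genuine, model)
```

In this i.i.d. setting, the Chernoff-Stein lemma says the best achievable type II error decays exponentially in the sequence length at a rate given by the KL divergence, which is why a bound on perplexity translates into a bound on the error exponent.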


DocSCAN: Unsupervised Text Classification via Learning from Neighbors

May 11, 2021
Dominik Stammbach, Elliott Ash

We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. In datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts.
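
The use of neighboring datapoints as a weak learning signal can be sketched as below. This is a minimal illustration under stated assumptions, not the DocSCAN code: document embeddings are taken as given, the function names are hypothetical, and the objective follows the general shape of the SCAN loss from image clustering (a neighbor-consistency term plus an entropy term that discourages putting every document in one cluster), with an arbitrary `entropy_weight`.

```python
import numpy as np

def mine_neighbor_pairs(embeddings, k=1):
    """Return (anchor, neighbor) index pairs using cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)              # a document is not its own neighbor
    neighbors = np.argsort(-sims, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(embeddings)) for j in neighbors[i]]

def scan_style_loss(probs, pairs, entropy_weight=2.0):
    """Lower is better: neighbors agree and clusters are used evenly."""
    anchors = probs[[i for i, _ in pairs]]
    neighbors = probs[[j for _, j in pairs]]
    # Consistency: neighbors should receive the same cluster with high confidence.
    consistency = -np.mean(np.log(np.sum(anchors * neighbors, axis=1) + 1e-8))
    # Negative entropy of the average assignment penalizes collapsed solutions.
    mean_probs = probs.mean(axis=0)
    neg_entropy = np.sum(mean_probs * np.log(mean_probs + 1e-8))
    return consistency + entropy_weight * neg_entropy

# Made-up 2-D "document embeddings" forming two groups, for illustration only.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pairs = mine_neighbor_pairs(embeddings, k=1)

good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
collapsed = np.array([[1.0, 0.0]] * 4)
mismatched = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

In a real setup the soft assignments `probs` would come from a trainable classification head over the embeddings and the loss would be minimized by gradient descent; here they are fixed arrays to show which assignments the objective prefers.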
