Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

NER-MQMRC: Formulating Named Entity Recognition as Multi Question Machine Reading Comprehension

May 12, 2022
Anubhav Shrimal, Avi Jain, Kartik Mehta, Promod Yenigalla

NER has been traditionally formulated as a sequence labeling task. However, there has been recent trend in posing NER as a machine reading comprehension task (Wang et al., 2020; Mengge et al., 2020), where entity name (or other information) is considered as a question, text as the context and entity value in text as answer snippet. These works consider MRC based on a single question (entity) at a time. We propose posing NER as a multi-question MRC task, where multiple questions (one question per entity) are considered at the same time for a single text. We propose a novel BERT-based multi-question MRC (NER-MQMRC) architecture for this formulation. NER-MQMRC architecture considers all entities as input to BERT for learning token embeddings with self-attention and leverages BERT-based entity representation for further improving these token embeddings for NER task. Evaluation on three NER datasets show that our proposed architecture leads to average 2.5 times faster training and 2.3 times faster inference as compared to NER-SQMRC framework based models by considering all entities together in a single pass. Further, we show that our model performance does not degrade compared to single-question based MRC (NER-SQMRC) (Devlin et al., 2019) leading to F1 gain of +0.41%, +0.32% and +0.27% for AE-Pub, Ecommerce5PT and Twitter datasets respectively. We propose this architecture primarily to solve large scale e-commerce attribute (or entity) extraction from unstructured text of a magnitude of 50k+ attributes to be extracted on a scalable production environment with high performance and optimised training and inference runtimes.

* NAACL 2022 Industry Track 

  Access Paper or Ask Questions

Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

Jul 10, 2017
Jiyan Yang, Yin-Lam Chow, Christopher Ré, Michael W. Mahoney

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems---e.g., $\ell_2$ and $\ell_1$ regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity. Particularly, when solving $\ell_1$ regression with size $n$ by $d$, pwSGD returns an approximate solution with $\epsilon$ relative error in the objective value in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d)/\epsilon^2)$ time. This complexity is uniformly better than that of RLA methods in terms of both $\epsilon$ and $d$ when the problem is unconstrained. For $\ell_2$ regression, pwSGD returns an approximate solution with $\epsilon$ relative error in the objective value and the solution vector measured in prediction norm in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d) \log(1/\epsilon) /\epsilon)$ time. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that still new ideas will be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets.

* A conference version of this paper appears under the same title in Proceedings of ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, 2016 

  Access Paper or Ask Questions

Text Embeddings for Retrieval From a Large Knowledge Base

Oct 24, 2018
Tolgahan Cakaloglu, Christian Szegedy, Xiaowei Xu

Text embedding representing natural language documents in a semantic vector space can be used for document retrieval using nearest neighbor lookup. In order to study the feasibility of neural models specialized for retrieval in a semantically meaningful way, we suggest the use of the Stanford Question Answering Dataset (SQuAD) in an open-domain question answering context, where the first task is to find paragraphs useful for answering a given question. First, we compare the quality of various text-embedding methods on the performance of retrieval and give an extensive empirical comparison on the performance of various non-augmented base embedding with, and without IDF weighting. Our main results are that by training deep residual neural models specifically for retrieval purposes can yield significant gains when it is used to augment existing embeddings. We also establish that deeper models are superior to this task. The best base baseline embeddings augmented by our learned neural approach improves the top-1 paragraph recall of the system by 14%.

* 12 pages, 7 figures, Submitted to Seventh International Conference on Learning Representations (ICLR 2019), May 6-9, New Orleans 

  Access Paper or Ask Questions

Private Eye: On the Limits of Textual Screen Peeking via Eyeglass Reflections in Video Conferencing

May 08, 2022
Yan Long, Chen Yan, Shivan Prasad, Wenyuan Xu, Kevin Fu

Personal video conferencing has become the new norm after COVID-19 caused a seismic shift from in-person meetings and phone calls to video conferencing for daily communications and sensitive business. Video leaks participants' on-screen information because eyeglasses and other reflective objects unwittingly expose partial screen contents. Using mathematical modeling and human subjects experiments, this research explores the extent to which emerging webcams might leak recognizable textual information gleamed from eyeglass reflections captured by webcams. The primary goal of our work is to measure, compute, and predict the factors, limits, and thresholds of recognizability as webcam technology evolves in the future. Our work explores and characterizes the viable threat models based on optical attacks using multi-frame super resolution techniques on sequences of video frames. Our experimental results and models show it is possible to reconstruct and recognize on-screen text with a height as small as 10 mm with a 720p webcam. We further apply this threat model to web textual content with varying attacker capabilities to find thresholds at which text becomes recognizable. Our user study with 20 participants suggests present-day 720p webcams are sufficient for adversaries to reconstruct textual content on big-font websites. Our models further show that the evolution toward 4K cameras will tip the threshold of text leakage to reconstruction of most header texts on popular websites. Our research proposes near-term mitigations, and justifies the importance of following the principle of least privilege for long-term defense against this attack. For privacy-sensitive scenarios, it's further recommended to develop technologies that blur all objects by default, then only unblur what is absolutely necessary to facilitate natural-looking conversations.

  Access Paper or Ask Questions

KnowBias: Detecting Political Polarity in Long Text Content

Sep 22, 2019
Aditya Saligrama

We introduce a classification scheme for detecting political bias in long text content such as newspaper opinion articles. Obtaining long text data and annotations at sufficient scale for training is difficult, but it is relatively easy to extract political polarity from tweets through their authorship; as such, we train on tweets and perform inference on articles. Universal sentence encoders and other existing methods that aim to address this domain-adaptation scenario deliver inaccurate and inconsistent predictions on articles, which we show is due to a difference in opinion concentration between tweets and articles. We propose a two-step classification scheme that utilizes a neutral detector trained on tweets to remove neutral sentences from articles in order to align opinion concentration and therefore improve accuracy on that domain. Our implementation is available for public use at

* 2 pages, 2 figures 

  Access Paper or Ask Questions

Effective Feature Representation for Clinical Text Concept Extraction

Oct 31, 2018
Yifeng Tao, Bruno Godefroy, Guillaume Genthial, Christopher Potts

Crucial information about the practice of healthcare is recorded only in free-form text, which creates an enormous opportunity for high-impact NLP. However, annotated healthcare datasets tend to be small and expensive to obtain, which raises the question of how to make maximally efficient uses of the available data. To this end, we develop an LSTM-CRF model for combining unsupervised word representations and hand-built feature representations derived from publicly available healthcare ontologies. We show that this combined model yields superior performance on five datasets of diverse kinds of healthcare text (clinical, social, scientific, commercial). Each involves the labeling of complex, multi-word spans that pick out different healthcare concepts. We also introduce a new labeled dataset for identifying the treatment relations between drugs and diseases.

  Access Paper or Ask Questions

textTOvec: Deep Contextualized Neural Autoregressive Models of Language with Distributed Compositional Prior

Oct 10, 2018
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-word" and consequently the semantics of words in the context is lost. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation from the entire document and discovers the underlying thematic structure. We unite two complementary paradigms of learning the meaning of word occurrences by combining a TM and a LM in a unified probabilistic framework, named as ctx-DocNADE. (2) Limited Context and/or Smaller training corpus of documents: In settings with a small number of word occurrences (i.e., lack of context) in short text or data sparsity in a corpus of few documents, the application of TMs is challenging. We address this challenge by incorporating external knowledge into neural autoregressive topic models via a language modelling approach: we use word embeddings as input of a LSTM-LM with the aim to improve the word-topic mapping on a smaller and/or short-text corpus. The proposed DocNADE extension is named as ctx-DocNADEe. We present novel neural autoregressive topic model variants coupled with neural LMs and embeddings priors that consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.

* #ICLR2019 under review 

  Access Paper or Ask Questions

Interactive Tools and Tasks for the Hebrew Bible

Oct 24, 2017
Nicolai Winther-Nielsen

This contribution to a special issue on "Computer-aided processing of intertextuality" in ancient texts will illustrate how using digital tools to interact with the Hebrew Bible offers new promising perspectives for visualizing the texts and for performing tasks in education and research. This contribution explores how the corpus of the Hebrew Bible created and maintained by the Eep Talstra Centre for Bible and Computer can support new methods for modern knowledge workers within the field of digital humanities and theology be applied to ancient texts, and how this can be envisioned as a new field of digital intertextuality. The article first describes how the corpus was used to develop the Bible Online Learner as a persuasive technology to enhance language learning with, in, and around a database that acts as the engine driving interactive tasks for learners. Intertextuality in this case is a matter of active exploration and ongoing practice. Furthermore, interactive corpus-technology has an important bearing on the task of textual criticism as a specialized area of research that depends increasingly on the availability of digital resources. Commercial solutions developed by software companies like Logos and Accordance offer a market-based intertextuality defined by the production of advanced digital resources for scholars and students as useful alternatives to often inaccessible and expensive printed versions. It is reasonable to expect that in the future interactive corpus technology will allow scholars to do innovative academic tasks in textual criticism and interpretation. We have already seen the emergence of promising tools for text categorization, analysis of translation shifts, and interpretation. Broadly speaking, interactive tools and tasks within the three areas of language learning, textual criticism, and Biblical studies illustrate a new kind of intertextuality emerging within digital humanities.

  Access Paper or Ask Questions

Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation

Jun 01, 1994
Kuang-hua Chen, Hsin-Hsi Chen

To acquire noun phrases from running texts is useful for many applications, such as word grouping,terminology indexing, etc. The reported literatures adopt pure probabilistic approach, or pure rule-based noun phrases grammar to tackle this problem. In this paper, we apply a probabilistic chunker to deciding the implicit boundaries of constituents and utilize the linguistic knowledge to extract the noun phrases by a finite state mechanism. The test texts are SUSANNE Corpus and the results are evaluated by comparing the parse field of SUSANNE Corpus automatically. The results of this preliminary experiment are encouraging.

* 8 pages, Postscript file, Unix compressed, uuencoded 

  Access Paper or Ask Questions

Recovering Patient Journeys: A Corpus of Biomedical Entities and Relations on Twitter (BEAR)

Apr 21, 2022
Amelie Wührl, Roman Klinger

Text mining and information extraction for the medical domain has focused on scientific text generated by researchers. However, their direct access to individual patient experiences or patient-doctor interactions can be limited. Information provided on social media, e.g., by patients and their relatives, complements the knowledge in scientific text. It reflects the patient's journey and their subjective perspective on the process of developing symptoms, being diagnosed and offered a treatment, being cured or learning to live with a medical condition. The value of this type of data is therefore twofold: Firstly, it offers direct access to people's perspectives. Secondly, it might cover information that is not available elsewhere, including self-treatment or self-diagnoses. Named entity recognition and relation extraction are methods to structure information that is available in unstructured text. However, existing medical social media corpora focused on a comparably small set of entities and relations and particular domains, rather than putting the patient into the center of analyses. With this paper we contribute a corpus with a rich set of annotation layers following the motivation to uncover and model patients' journeys and experiences in more detail. We label 14 entity classes (incl. environmental factors, diagnostics, biochemical processes, patients' quality-of-life descriptions, pathogens, medical conditions, and treatments) and 20 relation classes (e.g., prevents, influences, interactions, causes) most of which have not been considered before for social media data. The publicly available dataset consists of 2,100 tweets with approx. 6,000 entity and 3,000 relation annotations. In a corpus analysis we find that over 80 % of documents contain relevant entities. Over 50 % of tweets express relations which we consider essential for uncovering patients' narratives about their journeys.

* Accepted at LREC 2022 

  Access Paper or Ask Questions