"Text": models, code, and papers

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Jun 07, 2019
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at:

Quality-Efficiency Trade-offs in Machine Learning for Text Processing

Nov 07, 2017
Ricardo Baeza-Yates, Zeinab Liaghat

Data mining, machine learning, and natural language processing are powerful techniques that can be used together to extract information from large texts. Depending on the task or problem at hand, there are many different approaches that can be used. The methods available are continuously being optimized, but not all these methods have been tested and compared in a set of problems that can be solved using supervised machine learning algorithms. The question is what happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss by just being able to process more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs involving training data size, learning time, and quality obtained. We propose a performance trade-off framework and apply it to three important text processing problems: Named Entity Recognition, Sentiment Analysis and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use different data subsets of increasing size ranging from 50 MB to several GB. We also consider the impact of the data set and the evaluation technique. We find that the results do not change significantly and that most of the time the best algorithm is the fastest. However, we also show that the results for small data (say less than 100 MB) are different from the results for big data and in those cases the best algorithm is much harder to determine.

* Ten pages, long version of paper that will be presented at IEEE Big Data 2017 (8 pages) 

Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning

Jan 19, 2022
Christian Reul, Stefan Tomasek, Florian Langhanki, Uwe Springmann

This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training a new model on a few pages of transcribed text (ground truth). To train the mixed models we collected a corpus of 35 manuscripts and ca. 12.5k text lines for two widely used handwriting styles, Gothic and Bastarda cursives. Evaluating the mixed models out-of-the-box on four unseen manuscripts resulted in an average Character Error Rate (CER) of 6.22%. After training on 2, 4 and eventually 32 pages the CER dropped to 3.27%, 2.58%, and 1.65%, respectively. While the in-domain recognition and training of models (Bastarda model to Bastarda material, Gothic to Gothic) unsurprisingly yielded the best results, finetuning out-of-domain models to unseen scripts was still shown to be superior to training from scratch. Our new mixed models have been made openly available to the community.
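The Character Error Rate quoted above is conventionally computed as the character-level edit distance between the recognized text and the ground truth, divided by the number of ground-truth characters. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `levenshtein` and `cer` are illustrative, and the paper may normalize or aggregate differently.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character of ref
                            curr[j - 1] + 1,      # insert a character of hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total ground-truth characters
    (assumed aggregation; per-line averaging is another common choice)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Under this definition, a CER of 6.22% corresponds to roughly six character edits per hundred ground-truth characters.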

Review Regularized Neural Collaborative Filtering

Aug 20, 2020
Zhimeng Pan, Wenzheng Tao, Qingyao Ai

In recent years, text-aware collaborative filtering methods have been proposed to address essential challenges in recommendations such as data sparsity, the cold start problem, and long-tail distributions. However, many of these text-oriented methods rely heavily on the availability of text information for every user and item, which does not hold in real-world scenarios. Furthermore, specially designed network structures for text processing are highly inefficient for on-line serving and are hard to integrate into current systems. In this paper, we propose a flexible neural recommendation framework, named Review Regularized Recommendation, or R3 for short. It consists of a neural collaborative filtering part that focuses on prediction output, and a text processing part that serves as a regularizer. This modular design incorporates text information as a richer data source in the training phase while remaining well suited to on-line serving, as it needs no on-the-fly text processing at serving time. Our preliminary results show that, even with a simple text processing approach, R3 can achieve better prediction performance than state-of-the-art text-aware methods.

An automated domain-independent text reading, interpreting and extracting approach for reviewing the scientific literature

Aug 04, 2021
Amauri J Paula

This work presents a machine learning (ML)-based natural language processing (NLP) approach capable of automatically recognizing and extracting categorical and numerical parameters from a corpus of articles. The approach (named a.RIX) combines, concomitantly and interchangeably, ML models such as neural networks (NNs), latent semantic analysis (LSA), and naive Bayes classifiers (NBC) with a pattern recognition model based on regular expressions (REGEX). A corpus of 7,873 scientific articles dealing with natural products (NPs) was used to demonstrate the efficiency of the a.RIX engine. The engine automatically extracts categorical and numerical parameters such as (i) the plant species from which active molecules are extracted, (ii) the microorganism species against which these molecules are active, and (iii) the values of minimum inhibitory concentration (MIC) against these microorganisms. The parameters are extracted without part-of-speech (POS) tagging or named entity recognition (NER), i.e. without the need for text annotation, and the models are trained with unsupervised approaches. As a result, a.RIX can in principle be used on articles from any scientific field. Finally, it could make the current article reviewing process obsolete in some areas, especially those in which machine learning models capture text structure, text semantics, and latent knowledge.

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

Jul 07, 2021
Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng

This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. The autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient with the parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, either through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed model of VAENAR-TTS is an end-to-end approach that does not require phoneme-level durations. The VAENAR-TTS model does not contain recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while the synthesis speed is comparable with other NAR-TTS models.

SSIM-Based CTU-Level Joint Optimal Bit Allocation and Rate Distortion Optimization

Apr 28, 2020
Yang Li, Xuanqin Mou

Structural similarity (SSIM)-based distortion $D_\text{SSIM}$ is more consistent with human perception than the traditional mean squared error $D_\text{MSE}$. To achieve better video quality, many studies on optimal bit allocation (OBA) and rate-distortion optimization (RDO) used $D_\text{SSIM}$ as the distortion metric. However, many of them failed to optimize OBA and RDO jointly based on SSIM, thus causing a non-optimal R-$D_\text{SSIM}$ performance. This problem is due to the lack of an accurate R-$D_\text{SSIM}$ model that can be used uniformly in both OBA and RDO. To solve this problem, we first propose a $D_\text{SSIM}$-$D_\text{MSE}$ model. Based on this model, the complex R-$D_\text{SSIM}$ cost in RDO can be calculated as a simpler R-$D_\text{MSE}$ cost with a new SSIM-related Lagrange multiplier. This not only reduces the computation burden of SSIM-based RDO, but also enables the R-$D_\text{SSIM}$ model to be uniformly used in OBA and RDO. Moreover, with the new SSIM-related Lagrange multiplier in hand, the joint relationship of R-$D_\text{SSIM}$-$\lambda_\text{SSIM}$ (the negative derivative of R-$D_\text{SSIM}$) can be built, based on which the R-$D_\text{SSIM}$ model parameters can be calculated accurately. With an accurate and unified R-$D_\text{SSIM}$ model, SSIM-based OBA and SSIM-based RDO are unified in our scheme, called SOSR. Compared with the HEVC reference encoder HM16.20, SOSR saves 4%, 10%, and 14% bitrate under the same SSIM in all-intra, hierarchical and non-hierarchical low-delay-B configurations, which is superior to other state-of-the-art schemes.
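For readers unfamiliar with the notation, the relationship between the rate-distortion cost and the Lagrange multiplier mentioned above can be written in the standard Lagrangian form of RDO. This is the generic textbook formulation, assumed here for illustration; the abstract does not give the paper's specific $D_\text{SSIM}$-$D_\text{MSE}$ model, so it is not reproduced.

```latex
\begin{align}
  J_\text{SSIM} &= D_\text{SSIM} + \lambda_\text{SSIM}\, R, \\
  \lambda_\text{SSIM} &= -\frac{\partial D_\text{SSIM}}{\partial R}.
\end{align}
```

Here $R$ is the bitrate and $J_\text{SSIM}$ is the cost the encoder minimizes over candidate coding options. The second line is the "negative derivative of R-$D_\text{SSIM}$" referred to in the abstract: at the optimum, $\lambda_\text{SSIM}$ equals the negative slope of the rate-distortion curve.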

Leveraging Table Content for Zero-shot Text-to-SQL with Meta-Learning

Sep 12, 2021
Yongrui Chen, Xinnan Guo, Chaojie Wang, Jian Qiu, Guilin Qi, Meng Wang, Huiying Li

Single-table text-to-SQL aims to transform a natural language question into a SQL query over a single table. Recent work has made promising progress on this task using pre-trained language models and a multi-submodule framework. However, zero-shot tables, that is, tables that do not appear in the training set, are currently the most critical bottleneck restricting the application of existing approaches to real-world scenarios. Although some work has utilized auxiliary tasks to help handle zero-shot tables, expensive extra manual annotation limits their practicality. In this paper, we propose a new approach for the zero-shot text-to-SQL task which does not rely on any additional manual annotations. Our approach consists of two parts. First, we propose a new model that leverages the abundant information of table content to help establish the mapping between questions and zero-shot tables. Second, we propose a simple but efficient meta-learning strategy to train our model. The strategy uses a two-step gradient update to force the model to learn to generalize to zero-shot tables. We conduct extensive experiments on a public open-domain text-to-SQL dataset, WikiSQL, and a domain-specific dataset, ESQL. Compared to existing approaches using the same pre-trained model, our approach achieves significant improvements on both datasets. Compared to a larger pre-trained model and a tabular-specific pre-trained model, our approach is still competitive. More importantly, on the zero-shot subsets of both datasets, the improvements are even larger.

* Accepted to AAAI 2021 
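The abstract does not spell out the "two-step gradient update". One common reading is a first-order, MAML-style step: take a temporary gradient step on a support batch, then update the original parameters using the gradient of a query batch evaluated at the temporary parameters. The sketch below illustrates that reading on a toy one-parameter regression model; the model, the function names and the learning rates `alpha` and `beta` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def grad_mse(w: float, batch: Batch) -> float:
    """Gradient with respect to w of the mean squared error of y ~ w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)


def two_step_update(w: float, support: Batch, query: Batch,
                    alpha: float, beta: float) -> float:
    """One first-order meta-learning step (assumed form, see lead-in).

    Step 1: temporary update on the support batch.
    Step 2: update the original parameter with the query-batch gradient
            evaluated at the temporary parameter.
    """
    w_temp = w - alpha * grad_mse(w, support)
    return w - beta * grad_mse(w_temp, query)


if __name__ == "__main__":
    # Toy data drawn from y = 2x; support and query play the roles of
    # "seen" and "held-out" tables respectively.
    w = 0.0
    for _ in range(200):
        w = two_step_update(w, [(1.0, 2.0)], [(2.0, 4.0)], alpha=0.1, beta=0.05)
    print(w)
```

The point of the second step is that the parameters are rewarded for performing well on data that was not used for the temporary update, which is the kind of generalization pressure the abstract describes for zero-shot tables.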

Fishing for Clickbaits in Social Images and Texts with Linguistically-Infused Neural Network Models

Oct 17, 2017
Maria Glenski, Ellyn Ayton, Dustin Arendt, Svitlana Volkova

This paper presents the results and conclusions of our participation in the Clickbait Challenge 2017 on automatic clickbait detection in social media. We first describe linguistically-infused neural network models and identify informative representations to predict the level of clickbaiting present in Twitter posts. Our models answer not only whether a post is clickbait, but also to what extent, e.g., not at all, slightly, considerably, or heavily clickbaity, using a score ranging from 0 to 1. We evaluate the predictive power of models trained on varied text and image representations extracted from tweets. Our best performing model, which relies on the tweet text and linguistic markers of biased language extracted from the tweet and the corresponding page, yields a mean squared error (MSE) of 0.04, a mean absolute error (MAE) of 0.16 and an $R^2$ of 0.43 on the held-out test data. For the binary classification setup (clickbait vs. non-clickbait), our model achieved an F1 score of 0.69. We have not yet found that image representations combined with text yield a significant performance improvement. Nevertheless, this work is the first to present a preliminary analysis of objects extracted using the Google TensorFlow Object Detection API from images in clickbait vs. non-clickbait Twitter posts. Finally, we outline several steps to improve model performance as part of future work.

* Pineapplefish Clickbait Detector, Clickbait Challenge 2017 
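The regression and classification metrics reported in this abstract (MSE, MAE, coefficient of determination, F1) have standard definitions, sketched below in plain Python. This is an illustration of those definitions only, not the challenge's official scorer; the function names are illustrative, and how continuous scores are turned into binary labels is not stated in the abstract, so `f1_binary` simply takes binary labels as input.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Because the clickbait scores lie in [0, 1], MSE and MAE are bounded by 1, which helps put the reported 0.04 and 0.16 in context.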
