"Text": models, code, and papers

Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

Mar 10, 2022
Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

Recently, one-stage visual grounders have attracted considerable attention because they offer accuracy comparable to, and efficiency significantly higher than, two-stage grounders. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, is not necessarily performed among all the objects within the image, as only a part of them are related to the text query and may confuse the model. We call these objects "suspected objects". However, exploring relationships among these suspected objects in the one-stage visual grounding paradigm is non-trivial due to two core problems: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) compared with those irrelevant to the text query, suspected objects are more confusing, as they may share similar semantics, be entangled with certain relationships, etc., and thereby more easily mislead the model's prediction. To address the above issues, this paper proposes a Suspected Object Graph (SOG) approach to encourage the correct referred object selection among the suspected ones in one-stage visual grounding. Suspected objects are dynamically selected from a learned activation map as nodes to adapt to the current discrimination ability of the model during training. Afterward, on top of the suspected objects, a Keyword-aware Node Representation module (KNR) and an Exploration by Random Connection strategy (ERC) are concurrently proposed within the SOG to help the model rethink its initial prediction. Extensive ablation studies and comparison with state-of-the-art approaches on prevalent visual grounding benchmarks demonstrate the effectiveness of our proposed method.


EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Nov 17, 2021
Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, Zheng-Jun Zha

Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions. However, they have some insuperable drawbacks, e.g., they cannot handle videos without references, and they may result in biased evaluation due to the one-to-many nature of video-to-text and the neglect of visual relevance. From the human evaluator's viewpoint, a high-quality caption should be consistent with the provided video, but not necessarily be literally or semantically similar to the reference. Inspired by human evaluation, we propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning, which directly measures similarity between video and candidate captions. Benefiting from the recent development of large-scale pre-training models, we exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore. Specifically, EMScore combines matching scores of both coarse-grained (video and caption) and fine-grained (frames and words) levels, which takes the overall understanding and detailed characteristics of the video into account. Furthermore, considering the potential information gain, EMScore can be flexibly extended to the conditions where human-labeled references are available. Last but not least, we collect the VATEX-EVAL and ActivityNet-FOIL datasets to systematically evaluate the existing metrics. VATEX-EVAL experiments demonstrate that EMScore has higher human correlation and lower reference dependency. The ActivityNet-FOIL experiment verifies that EMScore can effectively identify "hallucinating" captions. The datasets will be released to facilitate the development of video captioning metrics. The code is available at:
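
For illustration only (this is not the authors' released code), the idea of combining a coarse-grained and a fine-grained matching score can be sketched as follows. The abstract does not give the formulas, so the greedy best-match precision/recall, the F-score, and the plain average of the two levels are all assumptions, as are the function names. Embeddings are taken to be plain lists of floats that some pre-trained vision-language model has already produced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coarse_score(video_emb, caption_emb):
    """Coarse level: one video embedding against one caption embedding."""
    return cosine(video_emb, caption_emb)


def fine_score(frame_embs, word_embs):
    """Fine level (assumed form): greedy best-match F-score.

    Precision matches every word to its most similar frame;
    recall matches every frame to its most similar word.
    """
    precision = sum(
        max(cosine(w, f) for f in frame_embs) for w in word_embs
    ) / len(word_embs)
    recall = sum(
        max(cosine(f, w) for w in word_embs) for f in frame_embs
    ) / len(frame_embs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def em_score_sketch(video_emb, frame_embs, caption_emb, word_embs):
    """Assumed combination: unweighted mean of the two levels."""
    coarse = coarse_score(video_emb, caption_emb)
    fine = fine_score(frame_embs, word_embs)
    return 0.5 * (coarse + fine)
```

No reference caption appears anywhere in the computation, which is what makes a score of this shape reference-free.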


Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Jun 25, 2021
Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote the inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms the state-of-the-art VLP performance, but also shows benefits on the IMF metric.


A Neural Network Perturbation Theory Based on the Born Series

Sep 07, 2020
Bastian Kaspschak, Ulf-G. Meißner

Deep Learning has become an attractive approach towards various data-based problems of theoretical physics in the past decade. Its protagonists, the deep neural networks (DNNs), are capable of making accurate predictions for data of arbitrarily high complexity. A well-known issue most DNNs share is their lack of interpretability. In order to explain their behavior and extract physical laws they have discovered during training, a suitable interpretation method has, therefore, to be applied post-hoc. Due to its simplicity and ubiquity in quantum physics, we decide to present a rather general interpretation method in the context of two-body scattering: We find a one-to-one correspondence between the $n^\text{th}$-order Born approximation and the $n^\text{th}$-order Taylor approximation of deep multilayer perceptrons (MLPs) that predict S-wave scattering lengths $a_0$ for discretized, attractive potentials of finite range. This defines a perturbation theory for MLPs, similarly to Born approximations defining a perturbation theory for $a_0$. In the case of shallow potentials, lower-order approximations, which can be argued to be local interpretations of the respective MLPs, reliably reproduce $a_0$. As deep MLPs are highly nested functions, the computation of higher-order partial derivatives, which is essential for a Taylor approximation, is an effortful endeavour. By introducing quantities that we refer to as propagators and vertices and that depend on the MLP's weights and biases, we establish a graph-theoretical approach towards partial derivatives and local interpretability. Similar to Feynman rules in quantum field theories, we find rules that systematically assign diagrams consisting of propagators and vertices to the corresponding order of the MLP perturbation theory.

* 22+10 pages, 3 figures 


Learning Image Information for eCommerce Queries

Apr 29, 2019
Utkarsh Porwal

Computing similarity between a query and a document is fundamental in any information retrieval system. In search engines, computing query-document similarity is an essential step in both retrieval and ranking stages. In eBay search, a document is an item, and the query-item similarity can be computed by comparing different facets of the query-item pair. Query text can be compared with the text of the item title. Likewise, a category constraint applied on the query can be compared with the listing category of the item. However, images are one signal that is usually present in the items but not in the query. Images are one of the most intuitive signals used by users to determine the relevance of the item given a query. Including this signal in estimating similarity between the query-item pair is likely to improve the relevance of the search engine. We propose a novel way of deriving image information for queries. We attempt to learn image information for queries from item images instead of generating explicit image features or an image for queries. We use canonical correlation analysis (CCA) to learn a new subspace where projecting the original data will give us a new query and item representation. We hypothesize that this new query representation will also have image information about the query. We estimate the query-item similarity using a vector space model and report the performance of the proposed method on eBay's search data. We show 11.89% relevance improvement over the baseline using area under the receiver operating characteristic curve (AUROC) as the evaluation metric. We also show 3.1% relevance improvement over the baseline with area under the precision recall curve (AUPRC).


No Padding Please: Efficient Neural Handwriting Recognition

Feb 28, 2019
Gideon Maillette de Buy Wenniger, Lambert Schomaker, Andy Way

Neural handwriting recognition (NHR) is the recognition of handwritten text with deep learning models, such as multi-dimensional long short-term memory (MDLSTM) recurrent neural networks. Models with MDLSTM layers have achieved state-of-the-art results on handwritten text recognition tasks. While multi-directional MDLSTM layers have an unbeaten ability to capture the complete context in all directions, this strength limits the possibilities for parallelization, and therefore comes at a high computational cost. In this work we develop methods to create efficient MDLSTM-based models for NHR, particularly a method aimed at eliminating computation waste that results from padding. This proposed method, called example-packing, replaces wasteful stacking of padded examples with efficient tiling in a 2-dimensional grid. For word-based NHR this yields a speed improvement of factor 6.6 over an already efficient baseline of minimal padding for each batch separately. For line-based NHR the savings are more modest, but still significant. In addition to example-packing, we propose: 1) a technique to optimize parallelization for dynamic graph definition frameworks including PyTorch, using convolutions with grouping, 2) a method for parallelization across GPUs for variable-length example batches. All our techniques are thoroughly tested on our own PyTorch re-implementation of MDLSTM-based NHR models. A thorough evaluation on the IAM dataset shows that our models perform similarly to earlier implementations of state-of-the-art models. Our efficient NHR model and some of the reusable techniques discussed with it offer ways to realize relatively efficient models for the omnipresent scenario of variable-length inputs in deep learning.


ASR Context-Sensitive Error Correction Based on Microsoft N-Gram Dataset

Mar 23, 2012
Youssef Bassil, Paul Semaan

At the present time, computers are employed to solve complex tasks and problems ranging from simple calculations to intensive digital image processing and intricate algorithmic optimization problems to computationally-demanding weather forecasting problems. ASR, short for Automatic Speech Recognition, is yet another type of computational problem whose purpose is to recognize human spoken speech and convert it into text that can be processed by a computer. Although ASR has many versatile and pervasive real-world applications, it is still relatively error-prone and not perfectly solved, as it tends to produce spelling errors in the recognized text, especially if the ASR system is operating in a noisy environment, its vocabulary size is limited, and its input speech is of bad or low quality. This paper proposes a post-editing ASR error correction method based on the Microsoft N-Gram dataset for detecting and correcting spelling errors generated by ASR systems. The proposed method comprises an error detection algorithm for detecting word errors; a candidate corrections generation algorithm for generating correction suggestions for the detected word errors; and a context-sensitive error correction algorithm for selecting the best candidate for correction. The virtue of using the Microsoft N-Gram dataset is that it contains real-world data and word sequences extracted from the web which can mimic a comprehensive dictionary of words having a large and all-inclusive vocabulary. Experiments conducted on numerous speeches, performed by different speakers, showed a remarkable reduction in ASR errors. Future research can improve upon the proposed algorithm so that it can be parallelized to take advantage of multiprocessor and distributed systems.
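
The three stages named in the abstract (error detection, candidate generation, context-sensitive selection) can be illustrated with a toy sketch. It is an assumption-laden stand-in rather than the authors' algorithm: the tiny `UNIGRAMS` and `BIGRAMS` tables are hypothetical replacements for the Microsoft N-Gram dataset, candidates come from the standard library's `difflib` string similarity, and context is scored with neighbouring bigram counts only.

```python
import difflib

# Hypothetical toy counts standing in for a web-scale n-gram dataset.
UNIGRAMS = {"the": 100, "weather": 20, "whether": 15,
            "is": 80, "nice": 10, "today": 12}
BIGRAMS = {("the", "weather"): 9, ("weather", "is"): 7,
           ("is", "nice"): 4, ("nice", "today"): 3}


def detect_errors(words):
    """Stage 1: flag positions whose word is absent from the unigram table."""
    return [i for i, w in enumerate(words) if w not in UNIGRAMS]


def generate_candidates(word, limit=5):
    """Stage 2: propose known words with similar spelling."""
    return difflib.get_close_matches(word, list(UNIGRAMS), n=limit, cutoff=0.6)


def context_score(words, i, candidate):
    """Stage 3 helper: bigram counts with the left and right neighbours."""
    left = BIGRAMS.get((words[i - 1], candidate), 0) if i > 0 else 0
    right = (BIGRAMS.get((candidate, words[i + 1]), 0)
             if i + 1 < len(words) else 0)
    return left + right


def correct(sentence):
    """Run the three stages over a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    for i in detect_errors(words):
        candidates = generate_candidates(words[i])
        if candidates:
            words[i] = max(candidates,
                           key=lambda c: context_score(words, i, c))
    return " ".join(words)


print(correct("the wether is nice today"))
```

Here "wether" is flagged because it is not in the table, and "weather" wins over "whether" because only "weather" co-occurs with its neighbours "the" and "is". A sketch like this cannot catch real-word errors, where a valid but wrong word was recognized, since detection relies purely on vocabulary lookup.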

* Journal of Computing, Vol.4, No.1, January 2012 
* LACSC - Lebanese Association for Computational Sciences - 


Towards a Theoretical Understanding of Word and Relation Representation

Feb 01, 2022
Carl Allen

Representing words by vectors, or embeddings, enables computational reasoning and is foundational to automating natural language tasks. For example, if word embeddings of similar words contain similar values, word similarity can be readily assessed, whereas judging that from their spelling is often impossible (e.g. cat/feline) and predetermining and storing similarities between all words is prohibitively time-consuming, memory-intensive and subjective. We focus on word embeddings learned from text corpora and knowledge graphs. Several well-known algorithms learn word embeddings from text on an unsupervised basis by learning to predict those words that occur around each word, e.g. word2vec and GloVe. Parameters of such word embeddings are known to reflect word co-occurrence statistics, but how they capture semantic meaning has been unclear. Knowledge graph representation models learn representations both of entities (words, people, places, etc.) and relations between them, typically by training a model to predict known facts in a supervised manner. Despite steady improvements in fact prediction accuracy, little is understood of the latent structure that enables this. The limited understanding of how latent semantic structure is encoded in the geometry of word embeddings and knowledge graph representations makes a principled means of improving their performance, reliability or interpretability unclear. To address this: 1. we theoretically justify the empirical observation that particular geometric relationships between word embeddings learned by algorithms such as word2vec and GloVe correspond to semantic relations between words; and 2. we extend this correspondence between semantics and geometry to the entities and relations of knowledge graphs, providing a model for the latent structure of knowledge graph representation linked to that of word embeddings.

* PhD thesis 


Efficient Knowledge Graph Validation via Cross-Graph Representation Learning

Aug 16, 2020
Yaqing Wang, Fenglong Ma, Jing Gao

Recent advances in information extraction have motivated the automatic construction of huge Knowledge Graphs (KGs) by mining large-scale text corpora. However, automatic extraction unavoidably introduces noisy facts into KGs. To validate the correctness of facts (i.e., triplets) inside a KG, one possible approach is to map the triplets into vector representations by capturing the semantic meanings of facts. Although many representation learning approaches have been developed for knowledge graphs, these methods are not effective for validation. They usually assume that facts are correct, and thus may overfit noisy facts and fail to detect such facts. Towards effective KG validation, we propose to leverage an external human-curated KG as an auxiliary information source to help detect the errors in a target KG. The external KG is built upon human-curated knowledge repositories and tends to have high precision. On the other hand, although the target KG built by information extraction from texts has low precision, it can cover new or domain-specific facts that are not in any human-curated repositories. To tackle this challenging task, we propose a cross-graph representation learning framework, i.e., CrossVal, which can leverage an external KG to validate the facts in the target KG efficiently. This is achieved by embedding triplets based on their semantic meanings, drawing cross-KG negative samples and estimating a confidence score for each triplet based on its degree of correctness. We evaluate the proposed framework on datasets across different domains. Experimental results show that the proposed framework achieves the best performance compared with the state-of-the-art methods on large-scale KGs.

* CIKM 2020 


Optimal Estimation of Change in a Population of Parameters

Nov 28, 2019
Ramya Korlakai Vinayak, Weihao Kong, Sham M. Kakade

Paired estimation of change in parameters of interest over a population plays a central role in several application domains including those in the social sciences, epidemiology, medicine and biology. In these domains, the size of the population under study is often very large; however, the number of observations available per individual in the population is very small (sparse observations), which makes the problem challenging. Consider the setting with $N$ independent individuals, each with unknown parameters $(p_i, q_i)$ drawn from some unknown distribution on $[0, 1]^2$. We observe $X_i \sim \text{Bin}(t, p_i)$ before an event and $Y_i \sim \text{Bin}(t, q_i)$ after the event. Given these paired observations, $\{(X_i, Y_i) \}_{i=1}^N$, our goal is to accurately estimate the distribution of the change in parameters, $\delta_i := q_i - p_i$, over the population and properties of interest like the $\ell_1$-magnitude of the change with sparse observations ($t\ll N$). We provide information-theoretic lower bounds on the error in estimating the distribution of change and the $\ell_1$-magnitude of change. Furthermore, we show that the following two-step procedure achieves the optimal error bounds: first, estimate the full joint distribution of the paired parameters using the maximum likelihood estimator (MLE) and then estimate the distribution of change and the $\ell_1$-magnitude of change using the joint MLE. Notably, and perhaps surprisingly, these error bounds are of the same order as the minimax optimal error bounds for learning the full joint distribution itself (in Wasserstein-1 distance); in other words, estimating the magnitude of the change of parameters over the population is, in a minimax sense, as difficult as estimating the full joint distribution itself.
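
To make the observation model concrete, the following sketch simulates paired binomial data and compares the true change with a naive per-individual plug-in estimate. It does not implement the MLE-based procedure from the abstract. The uniform draw of $(p_i, q_i)$, the reading of the $\ell_1$-magnitude as the population average of $|\delta_i|$, and all function names are assumptions made for illustration.

```python
import random


def simulate_pairs(n_individuals, t, rng):
    """Draw (p_i, q_i) uniformly on [0, 1]^2 (an assumed prior), then
    X_i ~ Bin(t, p_i) and Y_i ~ Bin(t, q_i) as sums of Bernoulli trials."""
    data = []
    for _ in range(n_individuals):
        p = rng.random()
        q = rng.random()
        x = sum(rng.random() < p for _ in range(t))
        y = sum(rng.random() < q for _ in range(t))
        data.append((p, q, x, y))
    return data


def true_l1_change(data):
    """Average |q_i - p_i| over the population (assumed definition)."""
    return sum(abs(q - p) for p, q, _, _ in data) / len(data)


def naive_l1_change(data, t):
    """Naive plug-in: average |Y_i/t - X_i/t|, using observations only."""
    return sum(abs(y - x) / t for _, _, x, y in data) / len(data)


rng = random.Random(0)
t = 3  # sparse regime: very few observations per individual
data = simulate_pairs(10_000, t, rng)
print("true  :", true_l1_change(data))
print("naive :", naive_l1_change(data, t))
```

By Jensen's inequality the expected value of $|Y_i/t - X_i/t|$ is at least $|q_i - p_i|$, so the naive estimate is biased upward, and the binomial noise behind that bias is largest when $t$ is small. This is one way to see why the sparse regime calls for estimating the distribution over the population rather than each individual's change.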
