Kyusong Lee

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

Aug 25, 2023
Yiyang Yao, Peng Liu, Tiancheng Zhao, Qianqian Zhang, Jiajia Liao, Chunxin Fang, Kyusong Lee, Qing Wang

Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets test generalization only over object types and referral expressions, and therefore do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weaknesses of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas the traditional AP metric yields deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval
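
To make the metric concrete, here is a minimal sketch of the idea behind an NMS-based AP: apply class-agnostic non-maximum suppression across all predicted labels before standard AP scoring, so a model cannot inflate its score by attaching every fine-grained label to the same box. The exact procedure and thresholds are defined in the OVDEval repository; this is only an illustration of the suppression step.

```python
# Minimal sketch of the suppression step assumed by an NMS-based AP metric.
from typing import List, Tuple

def iou(a: Tuple[float, float, float, float], b: Tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def class_agnostic_nms(dets: List[dict], iou_thr: float = 0.5) -> List[dict]:
    """Suppress overlapping boxes regardless of label, keeping the highest score.

    Each detection is a dict: {"box": (x1, y1, x2, y2), "score": float, "label": str}.
    A model that hedges by emitting every fine-grained label on the same object
    keeps only its single most confident label per box.
    """
    kept: List[dict] = []
    for det in sorted(dets, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) < iou_thr for k in kept):
            kept.append(det)
    return kept

# The surviving detections would then be scored with a standard COCO-style
# per-class AP computation (e.g. pycocotools), omitted here.
```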

OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training

Sep 10, 2022
Tiancheng Zhao, Peng Liu, Xiaopeng Lu, Kyusong Lee

Advancing object detection to open-vocabulary and few-shot transfer has long been a challenge for computer vision research. This work explores a continual learning approach that enables a detector to expand its zero/few-shot capabilities via multi-dataset vision-language pre-training. Using natural language as the knowledge representation, we explore methods to accumulate a "visual vocabulary" from different training datasets and unify the task under a language-conditioned detection framework. Specifically, we propose OmDet, a novel language-aware detector, together with a novel training mechanism. The proposed multimodal detection network resolves the technical challenges of multi-dataset joint training and generalizes to an arbitrary number of training datasets without requiring manual label taxonomy merging. Experimental results on COCO, Pascal VOC, and Wider Face/Pedestrian confirm its efficacy, achieving on-par or higher scores in joint training than in separate training. Moreover, we pre-train on more than 20 million images with a vocabulary of 4 million unique objects, and the resulting model is evaluated on the 35 downstream tasks of ODinW. Results show that OmDet achieves state-of-the-art fine-tuned performance on ODinW, and analysis shows that, by scaling up the proposed pre-training method, OmDet continues to improve its zero/few-shot tuning performance, suggesting a promising path to further scaling.
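
The sketch below illustrates what a language-conditioned detection interface of this kind could look like; the class and method names are hypothetical, not the actual OmDet API.

```python
# Illustrative-only sketch of a language-conditioned detection interface.
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2)
    score: float
    label: str    # free-form text label, not a fixed class index

class LanguageConditionedDetector:
    """Detector conditioned on a natural-language 'task': a list of label strings.

    Because labels are text rather than fixed indices, datasets with different
    taxonomies can be mixed at training time without merging their label sets:
    each batch is conditioned on the vocabulary of the dataset it came from.
    """

    def detect(self, image, vocabulary: Sequence[str]) -> List[Detection]:
        # 1. Encode the image into region features with a visual backbone.
        # 2. Encode each label string with a text encoder.
        # 3. Score each region against each label embedding and decode boxes.
        raise NotImplementedError("model weights and encoders omitted in this sketch")

# Usage: the same model answers different "tasks" just by changing the vocabulary.
# detector.detect(img, ["person", "helmet"])    # safety-inspection task
# detector.detect(img, ["cat", "dog", "bird"])  # pets task
```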

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

Jul 01, 2022
Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, Jianwei Yin

Vision-Language Pretraining (VLP) models have recently facilitated many cross-modal downstream tasks. Most existing works evaluate their systems by comparing fine-tuned downstream task performance. However, average downstream task accuracy alone provides little information about the pros and cons of each VLP method, let alone insights on how the community can improve these systems in the future. Inspired by CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework to understand the capabilities of VLP models. The proposed method divides the image-text matching ability of a VLP model into three categories, objects, attributes, and relations, and uses a novel taxonomy to further break down these three aspects. We conduct comprehensive studies of seven popular recent VLP models via the proposed framework. Results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that were not visible from downstream-task-only evaluation. Further results point to promising research directions for building better VLP models. Data and Code: https://github.com/om-ai-lab/VL-CheckList
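
As a rough illustration of the probing style such a framework implies, the sketch below compares a model's image-text matching score for a true caption against a caption perturbed along one axis (object, attribute, or relation). `itm_score` is a placeholder for whatever matching head a given VLP model exposes, not a real API.

```python
# Sketch of a CheckList-style probe: does the model prefer the correct caption
# over a minimally perturbed negative?
from typing import Callable, Iterable, Tuple

def checklist_accuracy(
    itm_score: Callable[[object, str], float],
    samples: Iterable[Tuple[object, str, str]],  # (image, positive_text, negative_text)
) -> float:
    """Fraction of samples where the model scores the true caption higher."""
    correct = total = 0
    for image, pos_text, neg_text in samples:
        correct += itm_score(image, pos_text) > itm_score(image, neg_text)
        total += 1
    return correct / max(total, 1)

# Hypothetical perturbations per axis:
#   object:    "a man riding a horse"  -> "a man riding a camel"
#   attribute: "a red car on the road" -> "a blue car on the road"
#   relation:  "a cup on the table"    -> "a cup under the table"
```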

* 9 pages, preprint 

When is it permissible for artificial intelligence to lie? A trust-based approach

Mar 14, 2021
Tae Wan Kim, Tong Lu, Kyusong Lee, Zhaoqi Cheng, Yanhan Tang, John Hooker

Conversational Artificial Intelligence (AI) used in industry settings can be trained to closely mimic human behaviors, including lying and deception. However, lying is often a necessary part of negotiation. To address this, we develop a normative framework for when it is ethical or unethical for a conversational AI to lie to humans, based on whether there is what we call an "invitation of trust" in a particular scenario. Importantly, cultural norms play a central role in determining whether there is an invitation of trust in a given negotiation setting, and thus an AI trained in one culture may not generalize to others. Moreover, individuals may have different expectations regarding the invitation of trust and the propensity to lie for human versus AI negotiators, and these expectations may vary across cultures as well. Finally, we outline how a conversational chatbot can be trained to negotiate ethically by applying autoregressive models to large dialog and negotiation datasets.

SF-QA: Simple and Fair Evaluation Library for Open-domain Question Answering

Jan 06, 2021
Xiaopeng Lu, Kyusong Lee, Tiancheng Zhao

Although open-domain question answering (QA) has drawn great attention in recent years, it requires large amounts of resources to build a full system, and previous results are often difficult to reproduce due to complex configurations. In this paper, we introduce SF-QA: a simple and fair evaluation framework for open-domain QA. SF-QA modularizes the pipeline of open-domain QA systems, making the task easily accessible and reproducible for research groups without large computing resources. The proposed evaluation framework is publicly available and anyone can contribute to the code and evaluations.
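
A minimal sketch of the modular ranker/reader pipeline such a framework evaluates is shown below; the interfaces are illustrative, not the actual SF-QA API.

```python
# Sketch of a two-stage open-domain QA pipeline with swappable components.
from typing import List, Protocol

class Ranker(Protocol):
    def retrieve(self, question: str, top_k: int) -> List[str]: ...

class Reader(Protocol):
    def extract(self, question: str, passages: List[str]) -> str: ...

def answer(question: str, ranker: Ranker, reader: Reader, top_k: int = 10) -> str:
    """Retrieve candidate passages, then extract an answer span."""
    passages = ranker.retrieve(question, top_k)
    return reader.extract(question, passages)

# Evaluation then iterates over a QA dataset, calls `answer`, and computes
# exact-match / F1 against gold answers, holding one stage fixed while the
# other is varied, which is what keeps comparisons fair and reproducible.
```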

* 7 pages 

VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search

Jan 01, 2021
Xiaopeng Lu, Tiancheng Zhao, Kyusong Lee

Text-to-image retrieval is an essential task in multi-modal information retrieval, i.e., retrieving relevant images from a large, unlabelled image dataset given textual queries. In this paper, we propose VisualSparta, a novel text-to-image retrieval model that shows substantial improvements over existing models in both accuracy and efficiency. We show that VisualSparta outperforms all previous scalable methods on MSCOCO and Flickr30K. It also shows substantial retrieval speed advantages: for an index of 1 million images, VisualSparta achieves over a 391x speedup compared to standard vector search. Experiments show that this speed advantage grows even larger for bigger datasets because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that achieves real-time search over very large datasets, with significant accuracy improvements over previous state-of-the-art methods.
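
The sketch below shows why an inverted index keeps query time low: per-image term weights are assumed to be precomputed offline by the model, so search reduces to merging posting lists. The index layout is illustrative, not the paper's exact data structure.

```python
# Sketch of query-time scoring against a precomputed term -> image index.
from collections import defaultdict
from typing import Dict, List, Tuple

# term -> list of (image_id, weight); built once, offline, from model outputs.
InvertedIndex = Dict[str, List[Tuple[int, float]]]

def search(query_terms: List[str], index: InvertedIndex, top_k: int = 10) -> List[Tuple[int, float]]:
    """Rank images by summing precomputed term weights for the query's terms.

    Query-time cost depends only on query length and posting-list sizes,
    not on running the vision model, which is why a 1M-image index stays fast.
    """
    scores: Dict[int, float] = defaultdict(float)
    for term in query_terms:
        for image_id, weight in index.get(term, []):
            scores[image_id] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```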

* 9 pages 

SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval

Sep 28, 2020
Tiancheng Zhao, Xiaopeng Lu, Kyusong Lee

We introduce SPARTA, a novel neural retrieval method that shows great promise in performance, generalization, and interpretability for open-domain question answering. Unlike many neural ranking methods that use dense vector nearest-neighbor search, SPARTA learns a sparse representation that can be efficiently implemented as an inverted index. The resulting representation enables scalable neural retrieval that does not require expensive approximate vector search and outperforms its dense counterpart. We validate our approach on 4 open-domain question answering (OpenQA) tasks and 11 retrieval question answering (ReQA) tasks. SPARTA achieves new state-of-the-art results across a variety of open-domain question answering tasks on both English and Chinese datasets, including open SQuAD, Natural Questions, and CMRC. Analysis also confirms that the proposed method creates human-interpretable representations and allows flexible control over the trade-off between performance and efficiency.
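
A minimal sketch of turning such a sparse representation into an inverted index is shown below; `token_weight` stands in for SPARTA's learned token-level scoring, and the layout is illustrative only.

```python
# Sketch of building an inverted index from sparse, learned token weights.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def build_index(
    answers: List[str],
    token_weight: Callable[[str, str], float],  # (vocab_term, answer) -> learned score
    vocab: List[str],
    min_weight: float = 0.0,
) -> Dict[str, List[Tuple[int, float]]]:
    """Precompute term -> (answer_id, weight) postings.

    Only weights above the threshold are stored, so the representation stays
    sparse and query-time retrieval becomes a postings-list merge rather than
    an approximate dense vector search.
    """
    index: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for answer_id, answer in enumerate(answers):
        for term in vocab:
            w = token_weight(term, answer)
            if w > min_weight:
                index[term].append((answer_id, w))
    return index
```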

* 11 pages 

Talk to Papers: Bringing Neural Question Answering to Academic Search

Apr 13, 2020
Tiancheng Zhao, Kyusong Lee

We introduce Talk to Papers, which exploits recent open-domain question answering (QA) techniques to improve the experience of academic search. It is designed to enable researchers to use natural language queries to find precise answers and extract insights from a massive collection of academic papers. We show a large improvement over a classic search engine baseline on several standard QA datasets and provide the community with a collaborative data collection tool to curate the first natural language processing research QA dataset via a community effort.

* demo paper accepted at ACL 2020 

Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation

Apr 22, 2018
Tiancheng Zhao, Kyusong Lee, Maxine Eskenazi

The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as traditional systems do, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can be integrated with any existing encoder-decoder dialog model for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST, that improve on VAEs and can discover interpretable semantics via either autoencoding or context prediction. Our methods have been validated on real-world dialog datasets, where they discover semantic representations and enhance encoder-decoder models with interpretable generation.
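
As a rough sketch of a discrete sentence latent of this kind, the snippet below uses a Gumbel-Softmax relaxation to sample one categorical code per latent variable; the dimensions and the encoder itself are placeholders, not the paper's exact architecture.

```python
# Sketch of a discrete sentence encoder: M categorical latents of K classes each,
# sampled with a Gumbel-Softmax relaxation so training stays differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteSentenceEncoder(nn.Module):
    def __init__(self, hidden_size: int = 256, n_latents: int = 10, n_classes: int = 20):
        super().__init__()
        self.n_latents, self.n_classes = n_latents, n_classes
        self.to_logits = nn.Linear(hidden_size, n_latents * n_classes)

    def forward(self, sentence_repr: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # sentence_repr: (batch, hidden_size), e.g. the final RNN state of an utterance.
        logits = self.to_logits(sentence_repr).view(-1, self.n_latents, self.n_classes)
        # One (relaxed) one-hot code per latent variable; hard=True yields discrete
        # codes in the forward pass with a straight-through gradient.
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)

# A decoder would then condition response generation on these M discrete codes,
# which is what makes the latent actions inspectable.
```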

* Accepted as a long paper in ACL 2018 