Sheng-Chieh Lin

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Feb 19, 2021
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.
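
The hybrid retrieval mentioned in the abstract can be illustrated with a small, library-free sketch. It does not use Pyserini's API: the function name `hybrid_fuse`, the weight `alpha`, the toy scores, and the rule for documents missing from one list are all assumptions made for illustration. The idea shown is only that a sparse score (e.g., BM25) and a dense score (e.g., an inner product) can be combined linearly before re-sorting.

```python
from typing import Dict, List, Tuple


def hybrid_fuse(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Combine two score maps as: fused = dense + alpha * sparse."""
    # Assumption: a document absent from one list receives that list's
    # lowest observed score instead of being dropped.
    sparse_floor = min(sparse.values()) if sparse else 0.0
    dense_floor = min(dense.values()) if dense else 0.0

    fused = {}
    for doc_id in set(sparse) | set(dense):
        s = sparse.get(doc_id, sparse_floor)
        d = dense.get(doc_id, dense_floor)
        fused[doc_id] = d + alpha * s

    # Sort by fused score (descending); break ties by document id.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy scores, invented for the example.
sparse_hits = {"d1": 12.0, "d2": 9.5, "d3": 7.0}
dense_hits = {"d2": 0.82, "d4": 0.80, "d1": 0.40}
top = hybrid_fuse(sparse_hits, dense_hits, alpha=0.1, k=3)
print([doc_id for doc_id, _ in top])
```

Because sparse and dense scores live on different scales, `alpha` would normally be tuned on held-out data; the value used here is arbitrary.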

Optical Wavelength Guided Self-Supervised Feature Learning For Galaxy Cluster Richness Estimate

Dec 04, 2020
Gongbo Liang, Yuanyuan Su, Sheng-Chieh Lin, Yu Zhang, Yuanyuan Zhang, Nathan Jacobs

Most galaxies in the nearby Universe are gravitationally bound to a cluster or group of galaxies. The optical contents of these clusters and groups, such as optical richness, are crucial for understanding the co-evolution of galaxies and large-scale structures in modern astronomy and cosmology. However, determining optical richness can be challenging. We propose a self-supervised approach for estimating optical richness from multi-band optical images. The method uses the data properties of the multi-band optical images for pre-training, which enables learning feature representations from a large but unlabeled dataset. We apply the proposed method to the Sloan Digital Sky Survey. The results show that our estimate of optical richness lowers the mean absolute error and intrinsic scatter by 11.84% and 20.78%, respectively, while reducing the need for labeled training data by up to 60%. We believe the proposed method will benefit astronomy and cosmology, where a large number of unlabeled multi-band images are available but acquiring image labels is costly.

* Accepted to NeurIPS 2020 Workshop on Machine Learning and the Physical Sciences

Distilling Dense Representations for Ranking using Tightly-Coupled Teachers

Oct 22, 2020
Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin

We present an approach to ranking with dense representations that applies knowledge distillation to improve the recently proposed late-interaction ColBERT model. Specifically, we distill the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product, thus enabling single-step ANN search. Our key insight is that during distillation, tight coupling between the teacher model and the student model enables more flexible distillation strategies and yields better learned representations. We empirically show that our approach improves query latency and greatly reduces the onerous storage requirements of ColBERT, while only making modest sacrifices in terms of effectiveness. By combining our dense representations with sparse representations derived from document expansion, we are able to approach the effectiveness of a standard cross-encoder reranker using BERT that is orders of magnitude slower.
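
To make the contrast in the abstract concrete, the sketch below compares a MaxSim-style late-interaction score (one vector per token) with a single dot product (one vector per text). It is a toy illustration in plain Python, not the authors' code: the function names and the two-dimensional vectors are assumptions, and how a student model would actually produce its single vectors is not shown.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token keeps its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """One dot product per query-document pair, which is the form that
    standard nearest-neighbor indexes can search in a single step."""
    return dot(query_vec, doc_vec)


# Toy token embeddings, invented for the example.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.5, 0.5], [1.0, 0.0]]

print(maxsim_score(query_tokens, doc_tokens))
print(single_vector_score([1.0, 1.0], [1.5, 0.5]))
```

The MaxSim score needs every token vector of every document to be stored, whereas the single-vector score needs one vector per document; this is the storage and latency trade-off the abstract refers to.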

Personalized TV Recommendation: Fusing User Behavior and Preferences

Aug 30, 2020
Sheng-Chieh Lin, Ting-Wei Lin, Jing-Kai Lou, Ming-Feng Tsai, Chuan-Ju Wang

In this paper, we propose a two-stage ranking approach for recommending linear TV programs. The proposed approach first leverages user viewing patterns regarding time and TV channels to identify potential candidates for recommendation and then further leverages user preferences to rank these candidates given textual information about programs. To evaluate the method, we conduct empirical studies on a real-world TV dataset, the results of which demonstrate the superior performance of our model in terms of both recommendation accuracy and time efficiency.

* 8 pages
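
The two-stage structure described in the abstract (candidate identification from viewing patterns, then preference-based ranking over program text) can be sketched roughly as follows. Everything concrete here is a hypothetical stand-in: the log format, the count-based candidate rule, and the term-overlap scorer are assumptions chosen to keep the example small, not the model from the paper.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Toy viewing log of (hour_of_day, channel) pairs; format is an assumption.
ViewLog = List[Tuple[int, str]]


def candidate_channels(log: ViewLog, hour: int, n: int = 2) -> List[str]:
    """Stage 1: channels the user watched most often at this hour."""
    counts = Counter(channel for h, channel in log if h == hour)
    # Sort by count (descending), then by name so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [channel for channel, _ in ordered[:n]]


def rank_programs(candidates: Dict[str, str], liked_terms: Set[str]) -> List[str]:
    """Stage 2: order candidate channels by how many of the user's liked
    terms appear in the description of the program currently airing."""

    def score(channel: str) -> int:
        return len(set(candidates[channel].lower().split()) & liked_terms)

    return sorted(candidates, key=lambda c: (-score(c), c))


log = [
    (20, "news"), (20, "news"),
    (20, "sports"), (20, "sports"), (20, "sports"),
    (20, "drama"), (21, "movies"),
]
on_air = {
    "news": "evening world news",
    "sports": "live football match",
    "drama": "family drama series",
}

stage1 = candidate_channels(log, hour=20, n=2)
stage2 = rank_programs({c: on_air[c] for c in stage1}, liked_terms={"news", "world"})
print(stage1, stage2)
```

Restricting the second stage to a handful of candidates is what makes the text-based scoring cheap enough to run per request.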

Query Reformulation using Query History for Passage Retrieval in Conversational Search

May 05, 2020
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

Passage retrieval in a conversational context is essential for many downstream applications; it is, however, extremely challenging due to limited data resources. To address this problem, we present an effective multi-stage pipeline for passage ranking in conversational search that integrates a widely used IR system with a conversational query reformulation module. Along these lines, we propose two simple yet effective query reformulation approaches: historical query expansion (HQE) and neural transfer reformulation (NTR). Whereas HQE applies query expansion, a traditional IR query reformulation technique, NTR transfers human knowledge of conversational query understanding to a neural query reformulation model. The proposed HQE method was the top-performing submission among automatic systems in the CAsT Track at TREC 2019. Building on this, our NTR approach improves by an additional 18% over that best entry in terms of NDCG@3. We further analyze the distinct behaviors of the two approaches and show that fusing their output reduces the performance gap (measured in NDCG@3) between the manually rewritten and automatically generated queries from 22 points to 4 points when compared with the best CAsT submission.

* 11 pages
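
Since the abstract reports its gains in NDCG@3, a short sketch of that metric may help. The version below uses the relevance grade itself as the gain and a log2 rank discount; that gain choice is an assumption (some formulations use 2^rel - 1 instead), and the function names and toy grades are illustrative only.

```python
import math
from typing import List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranks (rank starts at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: List[float], all_gains: List[float], k: int = 3) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance grades of the returned passages, in rank order.
    all_gains: grades of every judged passage for the query (any order).
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Toy grades, invented for the example.
judged = [3.0, 2.0, 1.0, 0.0]
print(ndcg_at_k([3.0, 2.0, 1.0], judged))  # ideal ordering
print(ndcg_at_k([1.0, 3.0, 2.0], judged))  # same passages, worse ordering
```

Only the top three ranks contribute, so the metric rewards placing the most relevant passages first rather than merely retrieving them.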

Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models

Apr 04, 2020
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs). We leverage PLMs to address the strong token-to-token independence assumption made in the common objective, maximum likelihood estimation, for the CQR task. In CQR benchmarks of task-oriented dialogue systems, we evaluate fine-tuned PLMs on the recently introduced CANARD dataset as an in-domain task and validate the models using data from the TREC 2019 CAsT Track as an out-of-domain task. Examining a variety of architectures with different numbers of parameters, we demonstrate that the recent text-to-text transfer transformer (T5) achieves the best results on both CANARD and CAsT with fewer parameters, compared to similar transformer architectures.

TTTTTackling WinoGrande Schemas

Mar 18, 2020
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the "entailment" token as a score of the hypothesis. Our first (and only) submission to the official leaderboard yielded 0.7673 AUC on March 13, 2020, which is the best known result at this time and beats the previous state of the art by over five points.
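
The decomposition described above can be sketched as follows. The abstract does not give the exact input format, so filling the `_` blank with each option is an assumption, and `entailment_prob` is a hypothetical stand-in for whatever model returns the probability of the "entailment" token; here it is replaced by a toy lookup so the example runs without a model.

```python
from typing import Callable, Tuple


def decompose(sentence: str, option1: str, option2: str) -> Tuple[str, str]:
    """Turn one example into two strings, one hypothesis per option,
    by filling the single '_' blank (assumed format)."""
    if sentence.count("_") != 1:
        raise ValueError("expected exactly one '_' placeholder")
    return sentence.replace("_", option1), sentence.replace("_", option2)


def choose(
    sentence: str,
    option1: str,
    option2: str,
    entailment_prob: Callable[[str], float],
) -> str:
    """Score each hypothesis and return the option with the higher score."""
    h1, h2 = decompose(sentence, option1, option2)
    return option1 if entailment_prob(h1) >= entailment_prob(h2) else option2


# Toy scorer with made-up probabilities; a real system would query a model.
toy_scores = {
    "The trophy did not fit in the suitcase because the trophy was too large.": 0.91,
    "The trophy did not fit in the suitcase because the suitcase was too large.": 0.12,
}


def toy_entailment_prob(text: str) -> float:
    return toy_scores.get(text, 0.0)


example = "The trophy did not fit in the suitcase because the _ was too large."
print(choose(example, "trophy", "suitcase", toy_entailment_prob))
```

Scoring the two hypotheses independently means the model never has to output the option text itself; it only has to rank two complete sentences.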
