Roberto Lotufo

NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost

Oct 26, 2022

Thales Sales Almeida, Thiago Laitz, João Seródio, Luiz Henrique Bonifacio, Roberto Lotufo, Rodrigo Nogueira

Abstract: The widespread availability of search APIs (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much more cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices. The system is available at https://neuralsearchx.nsx.ai.

* DESIRES 2022: 3rd International Conference on Design of Experimental Search and Information REtrieval Systems, 30-31 August 2022, San Jose, CA, USA
* Published as a full paper at the DESIRES 2022 Conference. 13 pages

Open-source tool for Airway Segmentation in Computed Tomography using 2.5D Modified EfficientDet: Contribution to the ATM22 Challenge

Oct 03, 2022

Diedre Carmo, Leticia Rittner, Roberto Lotufo

Abstract: Airway segmentation in computed tomography images can be used to analyze pulmonary diseases; however, manual segmentation is labor intensive and relies on expert knowledge. This manuscript details our contribution to MICCAI's 2022 Airway Tree Modelling challenge, a competition of fully automated methods for airway segmentation. We employed a previously developed deep learning architecture based on a modified EfficientDet (MEDSeg), training from scratch for binary airway segmentation using the provided annotations. Our method achieved 90.72 Dice in internal validation, 95.52 Dice on external validation, and 93.49 Dice in the final test phase, while not being specifically designed or tuned for airway segmentation. Open source code and a pip package for predictions with our model and trained weights are available at https://github.com/MICLab-Unicamp/medseg.

* Open source code, a graphical user interface, and a pip package for predictions with our model and trained weights are available at https://github.com/MICLab-Unicamp/medseg
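
For readers unfamiliar with the Dice scores quoted in this abstract, the following is a minimal sketch of the standard Dice coefficient for two binary masks. It is not taken from the MEDSeg code base; the function name `dice_score` and the flat-list mask representation are assumptions made for illustration only.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1.

    Illustrative helper (hypothetical name), not the MEDSeg implementation.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Number of voxels marked as foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```

The function returns a value between 0 and 1; the abstract appears to report the same quantity scaled to a percentage (for example, 93.49).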

mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

Sep 27, 2022

Vitor Jeronymo, Mauricio Nascimento, Roberto Lotufo, Rodrigo Nogueira

Abstract: Robust 2004 is an information retrieval benchmark whose large number of judgments per query makes it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 that was translated into 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset. The dataset is available at https://huggingface.co/datasets/unicamp-dl/mrobust

* 4 pages

MonoByte: A Pool of Monolingual Byte-level Language Models

Sep 27, 2022

Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

Abstract: The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the costs of pretraining, most research uses public models whose pretraining methodology, such as the choice of tokenization, corpus size, and computational budget, might differ drastically. When researchers pretrain their own models, they often do so under a constrained budget, and the resulting models might underperform significantly compared to SOTA models. These experimental differences led to various inconsistent conclusions about the nature of the cross-lingual ability of these models. To help further research on the topic, we released 10 monolingual byte-level models rigorously pretrained under the same configuration with a large compute budget (equivalent to 420 days on a V100) and corpora that are 4 times larger than the original BERT's. Because they are tokenizer-free, the problem of unseen token embeddings is eliminated, thus allowing researchers to try a wider range of cross-lingual experiments in languages with different scripts. Additionally, we release two models pretrained on non-natural language texts that can be used in sanity-check experiments. Experiments on QA and NLI tasks show that our monolingual models achieve performance competitive with the multilingual one, and hence can serve to strengthen our understanding of cross-lingual transferability in language models.
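
To illustrate why a byte-level ("tokenizer-free") model has no unseen token embeddings, here is a minimal sketch of byte-level encoding: every string in any script maps to UTF-8 byte values drawn from a fixed set of 256. The function names and the `offset` of 3 reserved ids (mirroring a common convention for special tokens) are assumptions for illustration, not details confirmed by the MonoByte abstract.

```python
def encode_bytes(text, offset=3):
    """Map text to integer ids, one per UTF-8 byte.

    `offset` reserves the lowest ids for special tokens; the value 3 is an
    assumption for this sketch, not a documented MonoByte setting.
    """
    return [b + offset for b in text.encode("utf-8")]


def decode_bytes(ids, offset=3):
    """Invert encode_bytes, skipping any reserved special-token ids."""
    raw = bytes(i - offset for i in ids if i >= offset)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids = encode_bytes("ação")
    # Accented characters take two bytes each, so 4 characters become 6 ids.
    print(len(ids), decode_bytes(ids))
```

Because the vocabulary is fixed at 256 byte values plus a few reserved ids, text in a script never seen during pretraining still maps to embeddings that exist in the model, at the cost of longer input sequences.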

Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models

Aug 24, 2022

Mirelle Bueno, Carlos Gemmel, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira

Abstract: The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for current deep learning models. Recent work shows that this limitation persists in state-of-the-art Transformer-based models. Most solutions to this problem use specific architectures or training methods that do not generalize to other tasks. We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure. Experimental results show that generating step-by-step rationales and introducing marker tokens are both required for effective extrapolation. First, we induce the model to produce step-by-step rationales before outputting the answer, which effectively communicates the task to the model. However, as sequences become longer, we find that current models struggle to keep track of token positions. To address this issue, we interleave output tokens with markup tokens that act as explicit positional and counting symbols. Our findings show how these two complementary approaches enable remarkable sequence extrapolation and highlight a limitation of current architectures to effectively generalize without explicit surface form guidance. Code available at https://github.com/MirelleB/induced-rationales-markup-tokens

A Boring-yet-effective Approach for the Product Ranking Task of the Amazon KDD Cup 2022

Aug 09, 2022

Vitor Jeronymo, Guilherme Rosa, Surya Kallumadi, Roberto Lotufo, Rodrigo Nogueira

Abstract: In this work we describe our submission to the product ranking task of the Amazon KDD Cup 2022. We rely on a recipe that proved effective in previous competitions: we focus our efforts on efficiently training and deploying large language models, such as mT5, while reducing to a minimum the number of task-specific adaptations. Despite the simplicity of our approach, our best model was less than 0.004 nDCG@20 below the top submission. As the top 20 teams achieved an nDCG@20 close to 0.90, we argue that we need more difficult e-Commerce evaluation datasets to discriminate retrieval methods.
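
Since several abstracts on this page report nDCG@k, the following is a minimal sketch of how that metric is commonly computed. It assumes linear gain and computes the ideal ranking from the retrieved list only; actual benchmark evaluations (including the KDD Cup scoring) may use exponential gain or derive the ideal ranking from all judged items, so the numbers are not directly comparable. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """nDCG@k with linear gain.

    Assumption: the ideal ordering is the same labels sorted descending,
    i.e. only the retrieved items are considered.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Graded relevance labels in the order the system ranked the items.
    print(ndcg_at_k([3, 2, 3, 0, 1], k=5))
```

A perfectly ordered list scores 1.0, which helps put the abstract's observation in context: when the top 20 teams all sit near 0.90, differences of a few thousandths separate the leaderboard positions.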

No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval

Jun 06, 2022

Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of parameters and early query-document interaction play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that rerankers largely outperform dense ones of similar size in several tasks. Our largest reranker reaches the state of the art in 12 of the 18 datasets of the Benchmark-IR (BEIR) and surpasses the previous state of the art by 3 average points. Finally, we confirm that in-domain effectiveness is not a good indicator of zero-shot effectiveness. Code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git

Billions of Parameters Are Worth More Than In-domain Training Data: A case study in the Legal Case Entailment Task

May 30, 2022

Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot result by more than 6 points, suggesting that stronger zero-shot capability may be a characteristic of larger models, at least for this task. Our 3B-parameter zero-shot model outperforms all models, including ensembles, in the COLIEE 2021 test set and also achieves the best performance of a single model in the COLIEE 2022 competition, second only to the ensemble composed of the 3B model itself and a smaller version of the same model. Despite the challenges posed by large language models, mainly due to latency constraints in real-time applications, we provide a demonstration of our zero-shot monoT5-3b model being used in production as a search engine, including for legal documents. The code for our submission and the demo of our system are available at https://github.com/neuralmind-ai/coliee and https://neuralsearchx.neuralmind.ai, respectively.

On the ability of monolingual models to learn language-agnostic representations

Sep 04, 2021

Leandro Rodrigues de Souza, Rodrigo Nogueira, Roberto Lotufo

Abstract: Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with shared parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language. That is, we find that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to the ones that use the same target language. Surprisingly, the models show similar performance on the same task regardless of the pretraining language. For example, models pretrained on distant languages such as German and Portuguese perform similarly on English tasks.

mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset

Aug 31, 2021

Luiz Henrique Bonifacio, Israel Campiotti, Roberto Lotufo, Rodrigo Nogueira

Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 8 languages that was created using machine translation. We evaluated mMARCO by fine-tuning mono and multilingual re-ranking models on it. Experimental results demonstrate that multilingual models fine-tuned on our translated dataset achieve greater effectiveness than models fine-tuned on the original English version alone. Also, our distilled multilingual re-ranker is competitive with non-distilled models while having 5.4 times fewer parameters. The translated datasets as well as fine-tuned models are available at https://github.com/unicamp-dl/mMARCO.git.
