Cell segmentation in histopathological images plays a crucial role in understanding, diagnosing, and treating many diseases. However, data annotation for this is expensive since there can be a large number of cells per image, and expert pathologists are needed for labelling images. Instead, our paper focuses on using weak supervision -- annotation from related tasks -- to induce a segmenter. Recent foundation models, such as Segment Anything (SAM), can use prompts to leverage additional supervision during inference. SAM has performed remarkably well in natural image segmentation tasks; however, its applicability to cell segmentation has not been explored. In response, we investigate guiding the prompting procedure in SAM for weakly supervised cell segmentation when only bounding box supervision is available. We develop two workflows: (1) an object detector's output as a test-time prompt to SAM (D-SAM), and (2) SAM as pseudo mask generator over training data to train a standalone segmentation model (SAM-S). On finding that both workflows have some complementary strengths, we develop an integer programming-based approach to reconcile the two sets of segmentation masks, achieving yet higher performance. We experiment on three publicly available cell segmentation datasets namely, ConSep, MoNuSeg, and TNBC, and find that all SAM-based solutions hugely outperform existing weakly supervised image segmentation models, obtaining 9-15 pt Dice gains.
We address the zero-shot transfer learning setting for the knowledge base question answering (KBQA) problem, where a large volume of labeled training data is available for the source domain, but no such labeled examples are available for the target domain. Transfer learning for KBQA makes use of large volumes of unlabeled data in the target in addition to the labeled data in the source. More recently, few-shot in-context learning using Black-box Large Language Models (BLLMs) has been adapted for KBQA without considering any source domain data. In this work, we show how to meaningfully combine these two paradigms for KBQA so that their benefits add up. Specifically, we preserve the two stage retrieve-then-generate pipeline of supervised KBQA and introduce interaction between in-context learning using BLLMs and transfer learning from the source for both stages. In addition, we propose execution-guided self-refinement using BLLMs, decoupled from the transfer setting. With the help of experiments using benchmark datasets GrailQA as the source and WebQSP as the target, we show that the proposed combination brings significant improvements to both stages and also outperforms by a large margin state-of-the-art supervised KBQA models trained on the source. We also show that in the in-domain setting, the proposed BLLM augmentation significantly outperforms state-of-the-art supervised models, when the volume of labeled data is limited, and also outperforms these marginally even when using the entire large training dataset.
End-to-end conversational recommendation systems (CRS) generate responses by leveraging both dialog history and a knowledge base (KB). A CRS mainly faces three key challenges: (1) at each turn, it must decide if recommending a KB entity is appropriate; if so, it must identify the most relevant KB entity to recommend; and finally, it must recommend the entity in a fluent utterance that is consistent with the conversation history. Recent CRSs do not pay sufficient attention to these desiderata, often generating unfluent responses or not recommending (relevant) entities at the right turn. We introduce a new CRS we call CoRE-CoG. CoRE-CoG addresses the limitations in prior systems by implementing (1) a recommendation trigger that decides if the system utterance should include an entity, (2) a type pruning module that improves the relevance of recommended entities, and (3) a novel constrained response generator to make recommendations while maintaining fluency. Together, these modules ensure simultaneous accurate recommendation decisions and fluent system utterances. Experiments with recent benchmarks show the superiority particularly on conditional generation sub-tasks with close to 10 F1 and 4 Recall@1 percent points gain over baselines.
We consider two popular approaches to Knowledge Graph Completion (KGC): textual models that rely on textual entity descriptions, and structure-based models that exploit the connectivity structure of the Knowledge Graph (KG). Preliminary experiments show that these approaches have complementary strengths: structure-based models perform well when the gold answer is easily reachable from the query head in the KG, while textual models exploit descriptions to give good performance even when the gold answer is not reachable. In response, we explore ensembling as a way of combining the best of both approaches. We propose a novel method for learning query-dependent ensemble weights by using the distributions of scores assigned by individual models to all candidate entities. Our ensemble baseline achieves state-of-the-art results on three standard KGC datasets, with up to 6.8 pt MRR and 8.3 pt Hits@1 gains over best individual models.
We tackle the problem of zero-shot cross-lingual transfer in NLP tasks via the use of language adapters (LAs). Most of the earlier works have explored training with adapter of a single source (often English), and testing either using the target LA or LA of another related language. Training target LA requires unlabeled data, which may not be readily available for low resource unseen languages: those that are neither seen by the underlying multilingual language model (e.g., mBERT), nor do we have any (labeled or unlabeled) data for them. We posit that for more effective cross-lingual transfer, instead of just one source LA, we need to leverage LAs of multiple (linguistically or geographically related) source languages, both at train and test-time - which we investigate via our novel neural architecture, ZGUL. Extensive experimentation across four language groups, covering 15 unseen target languages, demonstrates improvements of up to 3.2 average F1 points over standard fine-tuning and other strong baselines on POS tagging and NER tasks. We also extend ZGUL to settings where either (1) some unlabeled data or (2) few-shot training examples are available for the target language. We find that ZGUL continues to outperform baselines in these settings too.
Large language models (LLMs) are now available in various sizes and configurations from cloud API providers. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present AutoMix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to AutoMix is a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring training. Given that verifications can be noisy, we employ a meta verifier in AutoMix to refine the accuracy of these assessments. Our experiments using LLAMA2-13/70B, on five context-grounded reasoning datasets demonstrate that AutoMix surpasses established baselines, improving the incremental benefit per cost by up to 89%. Our code and data are available at https://github.com/automix-llm/automix.
Recent literature has seen a significant focus on building machine learning models with specific properties such as fairness, i.e., being non-biased with respect to a given set of attributes, calibration i.e., model confidence being aligned with its predictive accuracy, and explainability, i.e., ability to be understandable to humans. While there has been work focusing on each of these aspects individually, researchers have shied away from simultaneously addressing more than one of these dimensions. In this work, we address the problem of building models which are both fair and calibrated. We work with a specific definition of fairness, which closely matches [Biswas et. al. 2019], and has the nice property that Bayes optimal classifier has the maximum possible fairness under our definition. We show that an existing negative result towards achieving a fair and calibrated model [Kleinberg et. al. 2017] does not hold for our definition of fairness. Further, we show that ensuring group-wise calibration with respect to the sensitive attributes automatically results in a fair model under our definition. Using this result, we provide a first cut approach for achieving fair and calibrated models, via a simple post-processing technique based on temperature scaling. We then propose modifications of existing calibration losses to perform group-wise calibration, as a way of achieving fair and calibrated models in a variety of settings. Finally, we perform extensive experimentation of these techniques on a diverse benchmark of datasets, and present insights on the pareto-optimality of the resulting solutions.
Discovery of new materials has a documented history of propelling human progress for centuries and more. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread in multiple formats, such as tables, text, and images, and with little or no uniformity in reporting style giving rise to several machine learning challenges. Here, we discuss, quantify, and document these outstanding challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing to fillip to IE for the materials knowledge base.
Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials domain that can evaluate the understanding of the key concepts by these language models. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials student who has cleared their undergraduate degree. We classify these questions based on their structure and the materials science domain-based subcategories. Further, we evaluate the performance of GPT-3.5 and GPT-4 models on solving these questions via zero-shot and chain of thought prompting. It is observed that GPT-4 gives the best performance (~62% accuracy) as compared to GPT-3.5. Interestingly, in contrast to the general observation, no significant improvement in accuracy is observed with the chain of thought prompting. To evaluate the limitations, we performed an error analysis, which revealed conceptual errors (~64%) as the major contributor compared to computational errors (~36%) towards the reduced performance of LLMs. We hope that the dataset and analysis performed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.
Task-oriented dialog (TOD) agents often ground their responses on external knowledge bases (KBs). These KBs can be dynamic and may be updated frequently. Existing approaches for learning TOD agents assume the KB snapshot contemporary to each individual dialog is available during training. However, in real-world scenarios, only the latest KB snapshot is available during training and as a result, the train dialogs may contain facts conflicting with the latest KB. These dialog-KB inconsistencies in the training data may potentially confuse the TOD agent learning algorithm. In this work, we define the novel problem of learning a TOD agent with dialog-KB inconsistencies in the training data. We propose a Dialog-KB Arbitration Framework (DKAF) which reduces the dialog-KB inconsistencies by predicting the contemporary KB snapshot for each train dialog. These predicted KB snapshots are then used for training downstream TOD agents. As there are no existing datasets with dialog-KB inconsistencies, we systematically introduce inconsistencies in two publicly available dialog datasets. We show that TOD agents trained with DKAF perform better than existing baselines on both these datasets