Bang Liu

CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches

Sep 26, 2024

Sifan Wu, Amir Khasahmadi, Mor Katz, Pradeep Kumar Jayaraman, Yewen Pu, Karl Willis, Bang Liu

Abstract:Parametric Computer-Aided Design (CAD) is central to contemporary mechanical design. However, it encounters challenges in achieving precise parametric sketch modeling and lacks practical evaluation metrics suitable for mechanical design. We harness the capabilities of pre-trained foundation models, renowned for their successes in natural language processing and computer vision, to develop generative models specifically for CAD. These models are adept at understanding complex geometries and design reasoning, a crucial advancement in CAD technology. In this paper, we propose CadVLM, an end-to-end vision language model for CAD generation. Our approach involves adapting pre-trained foundation models to manipulate engineering sketches effectively, integrating both sketch primitive sequences and sketch images. Extensive experiments demonstrate superior performance on multiple CAD sketch generation tasks such as CAD autocompletion, CAD autoconstraint, and image conditional generation. To our knowledge, this is the first instance of a multimodal Large Language Model (LLM) being successfully applied to parametric CAD generation, representing a pioneering step in the field of computer-aided mechanical design.

Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration

Sep 05, 2024

Jeremy Qin, Bang Liu, Quoc Dinh Nguyen

Abstract:Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
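
For readers unfamiliar with calibration error, the sketch below computes expected calibration error (ECE), one commonly used metric. The abstract does not say which calibration metric the authors report, so treating it as ECE is an assumption, and the function name, the equal-width binning, and the toy inputs are illustrative only. It is not the Atypical Presentations Recalibration method itself.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Illustrative helper (hypothetical name); confidences are expected in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each prediction into a confidence bin; 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example: a model that always says 90% but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```

A lower value means stated confidence tracks observed accuracy more closely; an overconfident model, as described in the abstract, shows mean confidence above accuracy in its high-confidence bins.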

Pairing Analogy-Augmented Generation with Procedural Memory for Procedural Q&A

Sep 02, 2024

K Roth, Rushil Gupta, Simon Halle, Bang Liu

Abstract:While LLMs in the RAG paradigm have shown remarkable performance on a variety of tasks, they still under-perform on unseen domains, especially on complex tasks like procedural question answering. In this work, we introduce a novel formalism and structure for manipulating text-based procedures. Based on this formalism, we further present a novel dataset called LCStep, scraped from the LangChain Python docs. Moreover, we extend the traditional RAG system to propose a novel system called analogy-augmented generation (AAG), that draws inspiration from human analogical reasoning and ability to assimilate past experiences to solve unseen problems. The proposed method uses a frozen language model with a custom procedure memory store to adapt to specialized knowledge. We demonstrate that AAG outperforms few-shot and RAG baselines on LCStep, RecipeNLG, and CHAMP datasets under a pairwise LLM-based evaluation, corroborated by human evaluation in the case of RecipeNLG.

HoneyComb: A Flexible LLM-Based Agent System for Materials Science

Aug 29, 2024

Huan Zhang, Yu Song, Ziyu Hou, Santiago Miret, Bang Liu

Abstract:The emergence of specialized large language models (LLMs) has shown promise in addressing complex tasks for materials science. Many LLMs, however, often struggle with distinct complexities of material science tasks, such as materials science computational tasks, and often rely heavily on outdated implicit knowledge, leading to inaccuracies and hallucinations. To address these challenges, we introduce HoneyComb, the first LLM-based agent system specifically designed for materials science. HoneyComb leverages a novel, high-quality materials science knowledge base (MatSciKB) and a sophisticated tool hub (ToolHub) to enhance its reasoning and computational capabilities tailored to materials science. MatSciKB is a curated, structured knowledge collection based on reliable literature, while ToolHub employs an Inductive Tool Construction method to generate, decompose, and refine API tools for materials science. Additionally, HoneyComb leverages a retriever module that adaptively selects the appropriate knowledge source or tools for specific tasks, thereby ensuring accuracy and relevance. Our results demonstrate that HoneyComb significantly outperforms baseline models across various tasks in materials science, effectively bridging the gap between current LLM capabilities and the specialized needs of this domain. Furthermore, our adaptable framework can be easily extended to other scientific domains, highlighting its potential for broad applicability in advancing scientific research and applications.

* Under review at EMNLP 2024

Improving Clinical Note Generation from Complex Doctor-Patient Conversation

Aug 26, 2024

Yizhan Li, Sifan Wu, Christopher Smith, Thomas Lo, Bang Liu

Abstract:Writing clinical notes and documenting medical exams is a critical task for healthcare professionals, serving as a vital component of patient care documentation. However, manually writing these notes is time-consuming and can impact the amount of time clinicians can spend on direct patient interaction and other tasks. Consequently, the development of automated clinical note generation systems has emerged as a clinically meaningful area of research within AI for health. In this paper, we present three key contributions to the field of clinical note generation using large language models (LLMs). First, we introduce CliniKnote, a comprehensive dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes. This dataset, created and curated by medical experts with the help of modern neural networks, provides a valuable resource for training and evaluating models in clinical note generation tasks. Second, we propose the K-SOAP (Keyword, Subjective, Objective, Assessment, and Plan) note format, which enhances traditional SOAP (Subjective, Objective, Assessment, and Plan) notes by adding a keyword section at the top, allowing for quick identification of essential information. Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs using various metrics. Our results demonstrate significant improvements in efficiency and performance compared to standard LLM finetuning methods.
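
As a rough illustration of the K-SOAP layout described in the abstract (a keyword section placed above the four SOAP sections), the sketch below models a note as a small data structure. The class name, field types, rendering format, and the sample note are all hypothetical; the actual CliniKnote schema may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KSoapNote:
    """Hypothetical container for a K-SOAP note: keywords first, then S, O, A, P."""

    keywords: List[str]
    subjective: str
    objective: str
    assessment: str
    plan: str

    def render(self) -> str:
        # The keyword section comes first so essential terms can be scanned quickly.
        sections = [
            ("Keywords", ", ".join(self.keywords)),
            ("Subjective", self.subjective),
            ("Objective", self.objective),
            ("Assessment", self.assessment),
            ("Plan", self.plan),
        ]
        return "\n".join(f"{name}: {text}" for name, text in sections)


# Invented example content, for illustration only.
note = KSoapNote(
    keywords=["cough", "fever"],
    subjective="Patient reports three days of cough and fever.",
    objective="Temperature 38.2 C; lungs clear on auscultation.",
    assessment="Likely viral upper respiratory infection.",
    plan="Rest, fluids, and follow-up if symptoms persist.",
)
rendered = note.render()
```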

T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

Aug 21, 2024

Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu

Abstract:Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost and will increase notably with the increase of candidates. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://github.com/Lilidamowang/T2VIndexer-generativeSearch.

Enhancing Agent Learning through World Dynamics Modeling

Jul 25, 2024

Zhiyuan Sun, Haochen Shi, Marc-Alexandre Côté, Glen Berseth, Xingdi Yuan, Bang Liu

Abstract:While large language models (LLMs) have been increasingly deployed across tasks in language understanding and interactive decision-making, their impressive performance is largely due to the comprehensive and in-depth domain knowledge embedded within them. However, the extent of this knowledge can vary across different domains. Existing methods often assume that LLMs already possess such comprehensive and in-depth knowledge of their environment, overlooking potential gaps in their understanding of actual world dynamics. To address this gap, we introduce Discover, Verify, and Evolve (DiVE), a framework that discovers world dynamics from a small number of demonstrations, verifies the correctness of these dynamics, and evolves new, advanced dynamics tailored to the current situation. Through extensive evaluations, we analyze the impact of each component on performance and compare the automatically generated dynamics from DiVE with human-annotated world dynamics. Our results demonstrate that LLMs guided by DiVE can make better decisions, achieving rewards comparable to human players in the Crafter environment.

Can Pre-trained Language Models Understand Chinese Humor?

Jul 04, 2024

Yuyan Chen, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Bang Liu, Yunwen Chen

Abstract:Humor understanding is an important and challenging research topic in natural language processing. With the growing popularity of pre-trained language models (PLMs), some recent work has made preliminary attempts to adopt PLMs for humor recognition and generation. However, these simple attempts do not substantially answer the question of whether PLMs are capable of humor understanding. This paper is the first work that systematically investigates the humor understanding ability of PLMs. For this purpose, a comprehensive framework with three evaluation steps and four evaluation tasks is designed. We also construct a comprehensive Chinese humor dataset, which can fully meet all the data requirements of the proposed evaluation framework. Our empirical study on the Chinese humor dataset yields some valuable observations, which are of great guiding value for future optimization of PLMs in humor understanding and generation.

* Accepted to WSDM 2022

MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization

Jul 04, 2024

Yuyan Chen, Zhihao Wen, Ge Fan, Zhengyu Chen, Wei Wu, Dayiheng Liu, Zhixu Li, Bang Liu, Yanghua Xiao

Abstract:Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLMs), has drawn a lot of attention from the research community. Existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. However, a good prompt is not solely defined by its wording, but also depends on the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. We then propose a novel model-adaptive prompt optimizer (MAPO) method that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements over various downstream tasks.

* Accepted to EMNLP 2023 (Findings)

VCR: Visual Caption Restoration

Jun 10, 2024

Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio

Abstract:We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.

* 18 pages, 2 figures