Siyan Li

Benchmarking and Improving Generator-Validator Consistency of Language Models

Oct 03, 2023
Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, Percy Liang

As of September 2023, ChatGPT correctly answers "what is 7+8" with 15, but when asked "7+8=15, True or False" it responds with "False". This inconsistency between generating and validating an answer is prevalent in language models (LMs) and erodes trust. In this paper, we propose a framework for measuring the consistency between generation and validation (which we call generator-validator consistency, or GV-consistency), finding that even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time. To improve the consistency of LMs, we propose to finetune on the filtered generator and validator responses that are GV-consistent, and call this approach consistency fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%, and the improvement extrapolates to unseen tasks and domains (e.g., GV-consistency for positive style transfers extrapolates to unseen styles like humor). In addition to improving consistency, consistency fine-tuning improves both generator quality and validator accuracy without using any labeled data. Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves the generator quality by 16% and the validator accuracy by 6.3% across all tasks.
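The core recipe is easy to sketch: sample a generator answer, ask the same model to validate it, and keep only the pairs where the two agree. Below is a minimal illustration of that filtering step, not the paper's code; `query_lm` and the prompt templates are placeholder assumptions.

```python
# Minimal sketch of the GV-consistency check and data-filtering step.
# `query_lm` is a hypothetical stand-in for any LM completion call.

def query_lm(prompt: str) -> str:
    raise NotImplementedError("plug in your LM API here")

def gv_consistent(question: str) -> tuple[bool, str, str]:
    """Generate an answer, then ask the same LM to validate it."""
    answer = query_lm(f"Q: {question}\nA:").strip()
    verdict = query_lm(
        f"Q: {question}\nProposed answer: {answer}\n"
        "Is the proposed answer correct? Answer True or False:"
    ).strip()
    # Consistent iff the validator endorses the generator's own answer.
    return verdict.startswith("True"), answer, verdict

def build_finetuning_set(questions):
    """Keep only (prompt, response) pairs where generator and validator agree."""
    data = []
    for q in questions:
        ok, answer, verdict = gv_consistent(q)
        if ok:
            data.append({"prompt": f"Q: {q}\nA:", "response": answer})
            data.append({
                "prompt": f"Q: {q}\nProposed answer: {answer}\n"
                          "Is the proposed answer correct?",
                "response": verdict,
            })
    return data
```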

* preprint 

Color Me Intrigued: Quantifying Usage of Colors in Fiction

Jan 09, 2023
Siyan Li

We present preliminary results from quantitative analyses of color usage in selected authors' works from LitBank. Using the Glasgow Norms, a set of human ratings on over 5,000 words, we measure attributes of the nouns that color terms modify. Early results demonstrate a significant increase in noun concreteness over time. We also propose future research directions for computational literary color analytics.
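As a concrete illustration of the extraction step, the sketch below uses spaCy to find nouns modified by color adjectives and looks up their concreteness in a word-to-rating dictionary; the toy `norms` values are invented stand-ins for the real Glasgow ratings.

```python
# A minimal sketch, assuming spaCy for dependency parsing and a
# {word: concreteness} dict loaded from the Glasgow Norms
# (the norms file itself must be obtained separately).
import spacy

COLOR_TERMS = {"red", "blue", "green", "yellow", "black", "white",
               "purple", "orange", "pink", "brown", "grey", "gray"}

nlp = spacy.load("en_core_web_sm")

def color_modified_nouns(text: str, concreteness: dict[str, float]):
    """Yield (color, noun, concreteness) for nouns with a color adjective."""
    for token in nlp(text):
        if token.dep_ == "amod" and token.lemma_.lower() in COLOR_TERMS:
            noun = token.head.lemma_.lower()
            if noun in concreteness:
                yield token.lemma_.lower(), noun, concreteness[noun]

norms = {"dress": 6.4, "mood": 2.1}  # toy values, not real Glasgow ratings
print(list(color_modified_nouns("She wore a red dress in a blue mood.", norms)))
```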

* Accepted into the Creative AI Across Modalities workshop at AAAI2023 

Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference

Nov 21, 2022
Eric Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn, Christopher D. Manning

While large pre-trained language models are powerful, their predictions often lack logical consistency across test inputs. For example, a state-of-the-art Macaw question-answering (QA) model answers 'Yes' to 'Is a sparrow a bird?' and 'Does a bird have feet?' but answers 'No' to 'Does a sparrow have feet?'. To address this failure mode, we propose a framework, Consistency Correction through Relation Detection, or ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models using pre-trained natural language inference (NLI) models without fine-tuning or re-training. Given a batch of test inputs, ConCoRD samples several candidate outputs for each input and instantiates a factor graph that accounts for both the model's belief about the likelihood of each answer choice in isolation and the NLI model's beliefs about pair-wise answer choice compatibility. We show that a weighted MaxSAT solver can efficiently compute high-quality answer choices under this factor graph, improving over the raw model's predictions. Our experiments demonstrate that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models using off-the-shelf NLI models, notably increasing accuracy of LXMERT on ConVQA by 5% absolute. See https://ericmitchell.ai/emnlp-2022-concord/ for code and data.
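To make the inference step concrete, here is a toy brute-force analogue of ConCoRD's objective: choose one candidate answer per question to maximize the QA model's log-likelihoods plus pairwise NLI compatibility. The real system solves this with a weighted MaxSAT solver; `nli_score` and the weight `alpha` are illustrative assumptions.

```python
# Toy brute-force version of ConCoRD-style inference over a small batch.
import itertools
import math

def choose_answers(candidates, logprobs, nli_score, alpha=1.0):
    """
    candidates: list of lists of answer strings, one list per question
    logprobs:   parallel list of lists with the QA model's log-probabilities
    nli_score:  fn(ans_i, ans_j) -> compatibility score in [-1, 1]
    """
    best, best_score = None, -math.inf
    for choice in itertools.product(*[range(len(c)) for c in candidates]):
        answers = [candidates[i][j] for i, j in enumerate(choice)]
        # Unary factors: the model's own confidence in each choice.
        score = sum(logprobs[i][j] for i, j in enumerate(choice))
        # Pairwise factors: NLI compatibility across questions.
        for a, b in itertools.combinations(answers, 2):
            score += alpha * nli_score(a, b)
        if score > best_score:
            best, best_score = answers, score
    return best
```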

* 16 pages. EMNLP 2022 Camera Ready. See https://ericmitchell.ai/emnlp-2022-concord/ for code and data 

Systematicity in GPT-3's Interpretation of Novel English Noun Compounds

Oct 18, 2022
Siyan Li, Riley Carlson, Christopher Potts

Levin et al. (2019) show experimentally that the interpretations of novel English noun compounds (e.g., stew skillet), while not fully compositional, are highly predictable based on whether the modifier and head refer to artifacts or natural kinds. Is the large language model GPT-3 governed by the same interpretive principles? To address this question, we first compare Levin et al.'s experimental data with GPT-3 generations, finding a high degree of similarity. However, this evidence is consistent with GPT-3 reasoning only about specific lexical items rather than the more abstract conceptual categories of Levin et al.'s theory. To probe more deeply, we construct prompts that require the relevant kind of conceptual reasoning. Here, we fail to find convincing evidence that GPT-3 is reasoning about more than just individual lexical items. These results highlight the importance of controlling for low-level distributional regularities when assessing whether a large language model latently encodes a deeper theory.
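For flavor, here is a hypothetical prompt builder in the spirit of the probes described above; the template is an invented illustration, not one of the paper's actual prompts.

```python
# Hypothetical paraphrase-elicitation prompt for a novel modifier-head
# compound; the wording is an assumption for illustration only.
def compound_prompt(modifier: str, head: str) -> str:
    return (
        f'What does the phrase "{modifier} {head}" most likely mean?\n'
        f'A "{modifier} {head}" is a {head} that'
    )

for modifier, head in [("stew", "skillet"), ("oak", "bowl")]:
    print(compound_prompt(modifier, head))
    print("---")
```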

* Findings of EMNLP 2022 

When can I Speak? Predicting initiation points for spoken dialogue agents

Aug 07, 2022
Siyan Li, Ashwin Paranjape, Christopher D. Manning

Current spoken dialogue systems initiate their turns after a long period of silence (700-1000ms), which leads to little real-time feedback, sluggish responses, and an overall stilted conversational flow. Humans typically respond within 200ms and successfully predicting initiation points in advance would allow spoken dialogue agents to do the same. In this work, we predict the lead-time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio and word features from a pre-trained language model (GPT-2) operating on incremental transcriptions. To evaluate errors, we propose two metrics w.r.t. predicted and true lead times. We train and evaluate the models on the Switchboard Corpus and find that our method outperforms features from prior work on both metrics and vastly outperforms the common approach of waiting for 700ms of silence.
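A minimal sketch of the fusion model implied by this setup, assuming pooled wav2vec prosodic features and pooled GPT-2 transcript features as inputs; the dimensions and architecture are illustrative, not the paper's exact configuration.

```python
# Sketch: regress the lead-time to initiation from fused audio/text features.
import torch
import torch.nn as nn

class LeadTimePredictor(nn.Module):
    def __init__(self, audio_dim=512, text_dim=768, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted lead-time to initiation (seconds)
        )

    def forward(self, audio_feat, text_feat):
        # audio_feat: (batch, audio_dim) pooled wav2vec prosodic features
        # text_feat:  (batch, text_dim) pooled GPT-2 incremental-transcript features
        return self.fuse(torch.cat([audio_feat, text_feat], dim=-1)).squeeze(-1)

model = LeadTimePredictor()
pred = model(torch.randn(4, 512), torch.randn(4, 768))
```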

* SIGDIAL 2022 

Learning Efficient Representations for Enhanced Object Detection on Large-scene SAR Images

Jan 22, 2022
Siyan Li, Yue Xiao, Yuhang Zhang, Lei Chu, Robert C. Qiu

Detecting and recognizing targets in complex, large-scene Synthetic Aperture Radar (SAR) images is a challenging problem. Recently developed deep learning algorithms can automatically learn the intrinsic features of SAR images, but they still leave much room for improvement on large-scene SAR images with limited data. In this paper, we propose an efficient and robust deep-learning-based target detection method built on learned representations and multi-scale features of SAR images. In particular, by leveraging an adversarial autoencoder (AAE), which explicitly shapes the distribution of the investigated data, the raw SAR dataset is augmented into an enhanced version with greater quantity and diversity. In addition, an auto-labeling scheme is proposed to improve labeling efficiency. Finally, by jointly training on small target chips and large-scene images, an integrated YOLO network that applies non-maximum suppression across sub-images is used to detect multiple targets in high-resolution images. Numerical experiments on the MSTAR dataset show that our method can detect and recognize targets in large-scene images accurately and efficiently; its superior anti-noise performance is also confirmed experimentally.
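The large-scene inference pattern (detect on sub-images, then merge with non-maximum suppression) can be sketched as below; `detect` stands in for any YOLO-style detector, and the tile and stride sizes are illustrative.

```python
# Sketch of tiled inference with global NMS over a large-scene image.
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou < iou_thr]
    return keep

def detect_large_scene(image, detect, tile=512, stride=384):
    """Run `detect` on overlapping tiles, shift boxes back, merge with NMS."""
    all_boxes, all_scores = [], []
    h, w = image.shape[:2]
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            boxes, scores = detect(image[y:y + tile, x:x + tile])
            for (x1, y1, x2, y2), s in zip(boxes, scores):
                all_boxes.append((x1 + x, y1 + y, x2 + x, y2 + y))
                all_scores.append(s)
    if not all_boxes:
        return [], []
    boxes, scores = np.array(all_boxes, float), np.array(all_scores, float)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]
```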

Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning

May 04, 2021
Xiangyu Peng, Siyan Li, Sarah Wiegreffe, Mark Riedl

Transformer-based language model approaches to automated story generation currently provide state-of-the-art results. However, they still suffer from plot incoherence when generating narratives over time, and critically lack basic commonsense reasoning. Furthermore, existing methods generally focus only on single-character stories, or fail to track characters at all. To improve the coherence of generated narratives and to expand the scope of character-centric narrative generation, we introduce Commonsense-inference Augmented neural StoryTelling (CAST), a framework for introducing commonsense reasoning into the generation process while modeling the interaction between multiple characters. We find that our CAST method produces significantly more coherent and on-topic two-character stories, outperforming baselines in dimensions including plot plausibility and staying on topic. We also show how the CAST method can be used to further train language models that generate more coherent stories and reduce computation cost.
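A toy rendering of the CAST-style generate-then-filter loop, with `generate_candidates` and `infer` standing in for a language model and a COMET-style commonsense inference model; the matching rule here is a simplification of the paper's method.

```python
# Toy sketch: sample candidate continuations, keep one whose commonsense
# inferences overlap with the reader's expectations about a character.
def cast_step(story, character, expectations, generate_candidates, infer):
    for candidate in generate_candidates(story, character):
        inferences = infer(candidate, character)  # e.g. inferred intents, effects
        if expectations & set(inferences):         # any overlap -> keep as coherent
            return candidate
    return None  # no candidate passed the filter; caller may resample
```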

Growing 3D Artefacts and Functional Machines with Neural Cellular Automata

Mar 15, 2021
Shyam Sudhakaran, Djordje Grbic, Siyan Li, Adam Katona, Elias Najarro, Claire Glanois, Sebastian Risi

Neural Cellular Automata (NCAs) have been proven effective in simulating morphogenetic processes, the continuous construction of complex structures from very few starting cells. Recent developments in NCAs lie in the 2D domain, namely reconstructing target images from a single pixel or infinitely growing 2D textures. In this work, we propose an extension of NCAs to 3D, utilizing 3D convolutions in the proposed neural network architecture. Minecraft is selected as the environment for our automaton since it allows the generation of both static structures and moving machines. We show that despite their simplicity, NCAs are capable of growing complex entities such as castles, apartment blocks, and trees, some of which are composed of over 3,000 blocks. Additionally, when trained for regeneration, the system is able to regrow parts of simple functional machines, significantly expanding the capabilities of simulated morphogenetic systems.
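A minimal 3D NCA update rule in PyTorch, assuming the standard 2D recipe lifted to 3D with `Conv3d` perception and stochastic per-cell updates; the channel counts and fire rate are illustrative, not the paper's configuration.

```python
# Minimal sketch of a 3D Neural Cellular Automaton step.
import torch
import torch.nn as nn

class NCA3D(nn.Module):
    def __init__(self, channels=16, hidden=64):
        super().__init__()
        self.perceive = nn.Conv3d(channels, hidden, kernel_size=3, padding=1)
        self.update = nn.Conv3d(hidden, channels, kernel_size=1)
        nn.init.zeros_(self.update.weight)  # start as identity dynamics
        nn.init.zeros_(self.update.bias)

    def forward(self, grid, fire_rate=0.5):
        # grid: (batch, channels, depth, height, width)
        delta = self.update(torch.relu(self.perceive(grid)))
        # Stochastic update: each cell fires independently per step.
        mask = (torch.rand_like(grid[:, :1]) < fire_rate).float()
        return grid + delta * mask

nca = NCA3D()
state = torch.zeros(1, 16, 32, 32, 32)
state[:, :, 16, 16, 16] = 1.0  # grow from a single seed cell
for _ in range(20):
    state = nca(state)
```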

Automatic Story Generation: Challenges and Attempts

Feb 25, 2021
Amal Alabdulkarim, Siyan Li, Xiangyu Peng

This survey explores the challenges in automatic story generation. We hope to contribute in the following ways: 1. Explore how previous research in story generation has addressed those challenges. 2. Discuss future research directions and new technologies that may aid further advancements. 3. Shed light on emerging and often overlooked challenges such as creativity and discourse.

Fine-Tuning a Transformer-Based Language Model to Avoid Generating Non-Normative Text

Jan 23, 2020
Xiangyu Peng, Siyan Li, Spencer Frazier, Mark Riedl

Large-scale, transformer-based language models such as GPT-2 are pretrained on diverse corpora scraped from the internet. Consequently, they are prone to generating content that one might find inappropriate or non-normative (i.e., in violation of social norms). In this paper, we describe a technique for fine-tuning GPT-2 such that the amount of non-normative content generated is significantly reduced. A model capable of classifying normative behavior is used to produce an additional reward signal; a policy gradient reinforcement learning technique uses that reward to fine-tune the language model's weights. Applying this fine-tuning technique to 24,000 sentences from a science-fiction plot summary dataset halves the percentage of generated text containing non-normative behavior, from 35.1% to 15.7%.
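A sketch of one such update, assuming a Hugging Face-style causal LM and a hypothetical `normative_prob` classifier returning P(normative); the paper's exact RL formulation may differ.

```python
# REINFORCE-style sketch: sample a continuation, score it with the
# normative-behavior classifier, and reinforce it proportionally.
import torch

def policy_gradient_step(model, tokenizer, optimizer, prompt, normative_prob):
    """One REINFORCE update on a single sampled continuation."""
    enc = tokenizer(prompt, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        seq = model.generate(**enc, max_new_tokens=40, do_sample=True)
    text = tokenizer.decode(seq[0, prompt_len:], skip_special_tokens=True)
    reward = normative_prob(text) - 0.5  # centered classifier score

    # Differentiable log-probs of the sampled continuation.
    logits = model(seq).logits[0, :-1]   # position i predicts token i+1
    logps = torch.log_softmax(logits, dim=-1)
    idx = torch.arange(prompt_len - 1, seq.shape[1] - 1)
    token_logps = logps[idx, seq[0, prompt_len:]]

    loss = -reward * token_logps.sum()   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return text, reward
```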