Tomek Strzalkowski

Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework

Nov 16, 2023
Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, Tomek Strzalkowski, Mei Si

Modern large language models (LLMs) can still generate responses that are not aligned with human expectations or values. While many weight-based alignment methods have been proposed, many of them still leave models vulnerable to attacks when used on their own. To help mitigate this issue, we introduce Bergeron, a framework designed to improve the robustness of LLMs against adversarial attacks. Bergeron employs a two-tiered architecture in which a secondary LLM serves as a simulated conscience that safeguards a primary LLM, monitoring for and correcting potentially harmful text in both the prompt inputs and the generated outputs of the primary LLM. Empirical evaluation shows that Bergeron can improve the alignment and robustness of several popular LLMs without costly fine-tuning, aiding both open-source and black-box LLMs by complementing and reinforcing their existing alignment training.
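
The paper's implementation is not reproduced here, but the two-tier flow is easy to picture. Below is a minimal sketch of the input/output screening loop, assuming both models are exposed as simple text-in, text-out callables; the function names and prompt wording are illustrative stand-ins, not Bergeron's actual interface.

```python
# Minimal sketch of a two-tier "conscience" wrapper in the spirit of Bergeron.
# Both primary_llm and conscience_llm are assumed to be str -> str callables.

def is_harmful(conscience_llm, text: str) -> bool:
    """Ask the secondary (conscience) LLM to flag potentially harmful text."""
    verdict = conscience_llm(
        "Does the following text request or contain harmful content? "
        f"Answer YES or NO.\n\n{text}"
    )
    return verdict.strip().upper().startswith("YES")

def guarded_generate(primary_llm, conscience_llm, prompt: str) -> str:
    # Tier 1: screen the incoming prompt before the primary LLM sees it.
    if is_harmful(conscience_llm, prompt):
        prompt = f"[Caution: this prompt may be adversarial]\n{prompt}"
    response = primary_llm(prompt)
    # Tier 2: screen the generated output before returning it.
    if is_harmful(conscience_llm, response):
        response = conscience_llm(
            f"Rewrite the following response so it is safe and aligned:\n{response}"
        )
    return response
```

Because the wrapper only needs text-level access to the primary model, the same loop applies to open-source and black-box LLMs alike, which is consistent with the abstract's claim that no fine-tuning is required.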

Towards a Progression-Aware Autonomous Dialogue Agent

May 10, 2022
Abraham Sanders, Tomek Strzalkowski, Mei Si, Albert Chang, Deepanshu Dey, Jonas Braasch, Dakuo Wang

Recent advances in large-scale language modeling and generation have enabled the creation of dialogue agents that exhibit human-like responses in a wide range of conversational scenarios spanning a diverse set of tasks, from general chit-chat to focused goal-oriented discourse. While these agents excel at generating high-quality responses that are relevant to prior context, they lack awareness of the overall direction in which the conversation is headed and of the likelihood of task success inherent therein. Thus, we propose a framework in which dialogue agents can evaluate the progression of a conversation toward or away from desired outcomes, and use this signal to inform planning for subsequent responses. Our framework is composed of three key elements: (1) the notion of a "global" dialogue state (GDS) space, (2) a task-specific progression function (PF) computed in terms of a conversation's trajectory through this space, and (3) a planning mechanism based on dialogue rollouts by which an agent may use progression signals to select its next response.
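
As a rough illustration of these three elements, the sketch below treats the GDS space as a vector space, scores progression as movement toward a success region, and ranks candidate responses by simulated rollouts. The embedding, centroid, and rollout interfaces are hypothetical stand-ins, not the paper's implementation.

```python
# Illustrative sketch: progression function over a trajectory in a global
# dialogue state (GDS) space, plus rollout-based response selection.
import numpy as np

def progression(trajectory: list, success_centroid: np.ndarray) -> float:
    """Score how the conversation's path through the GDS space is trending
    toward a region associated with successful outcomes."""
    dists = [np.linalg.norm(state - success_centroid) for state in trajectory]
    # Positive when the most recent state is closer to success than the first.
    return dists[0] - dists[-1]

def plan_next_response(candidates, simulate_rollout, trajectory, success_centroid):
    """Pick the candidate whose simulated continuation (rollout) yields the
    best progression signal."""
    def score(candidate):
        # simulate_rollout is assumed to return a list of projected GDS states.
        rolled = trajectory + simulate_rollout(candidate)
        return progression(rolled, success_centroid)
    return max(candidates, key=score)
```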

* Accepted at NAACL 2022 

Learning to Plan and Realize Separately for Open-Ended Dialogue Systems

Oct 04, 2020
Sashank Santhanam, Zhuo Cheng, Brodie Mather, Bonnie Dorr, Archna Bhatia, Bryanna Hebenstreit, Alan Zemel, Adam Dalton, Tomek Strzalkowski, Samira Shaikh

Achieving a truly human-like ability to conduct a conversation remains an elusive goal for open-ended dialogue systems. We posit this is because extant approaches to natural language generation (NLG) are typically construed as end-to-end architectures that do not adequately model human generation processes. To investigate, we decouple generation into two separate phases: planning and realization. In the planning phase, we train two planners to generate plans for response utterances. The realization phase uses the response plans to produce an appropriate response. Through rigorous evaluations, both automated and human, we demonstrate that decoupling the process into planning and realization performs better than an end-to-end approach.
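
The decoupling amounts to two model calls in sequence. A minimal sketch, assuming the trained planner and realizer are text-in, text-out callables; the prompt format is illustrative, not the paper's:

```python
# Sketch of plan-then-realize generation, decoupled into two phases.
# planner and realizer are assumed str -> str callables (e.g., trained models).

def generate_response(planner, realizer, dialogue_history: str) -> str:
    # Phase 1: planning -- produce an abstract plan for the next utterance
    # (e.g., a sequence of content or dialogue-act symbols).
    plan = planner(dialogue_history)
    # Phase 2: realization -- condition on both history and plan to produce
    # the surface-form response.
    return realizer(f"History:\n{dialogue_history}\n\nPlan: {plan}\n\nResponse:")
```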

* Accepted at EMNLP 2020 (Findings) 

The Panacea Threat Intelligence and Active Defense Platform

Apr 20, 2020
Adam Dalton, Ehsan Aghaei, Ehab Al-Shaer, Archna Bhatia, Esteban Castillo, Zhuo Cheng, Sreekar Dhaduvai, Qi Duan, Md Mazharul Islam, Younes Karimi, Amir Masoumzadeh, Brodie Mather, Sashank Santhanam, Samira Shaikh, Tomek Strzalkowski, Bonnie J. Dorr

We describe Panacea, a system that supports natural language processing (NLP) components for active defenses against social engineering attacks. We deploy a pipeline of human language technology, including Ask and Framing Detection, Named Entity Recognition, Dialogue Engineering, and Stylometry. Panacea processes modern message formats through a plug-in architecture to accommodate innovative approaches for message analysis, knowledge representation, and dialogue generation. The novelty of the Panacea system is that it uses NLP for cyber defense and engages the attacker with bots that elicit evidence for attribution while wasting the attacker's time and resources.
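
The plug-in design can be pictured as a registry of analyzers run over each incoming message. The sketch below is a hypothetical rendering of that architecture: the component names mirror the abstract, but the interfaces and stubbed outputs are assumptions, not Panacea's API.

```python
# Sketch of a plug-in message-analysis pipeline; analyzers are registered by
# name and each returns its own annotations for a message record.
from typing import Callable

Analyzer = Callable[[dict], dict]  # takes a message record, returns annotations

class PluginPipeline:
    def __init__(self):
        self._plugins = []  # list of (name, analyzer) pairs, run in order

    def register(self, name: str, analyzer: Analyzer) -> None:
        self._plugins.append((name, analyzer))

    def process(self, message: dict) -> dict:
        return {name: analyzer(message) for name, analyzer in self._plugins}

pipeline = PluginPipeline()
pipeline.register("ask_framing", lambda m: {"asks": []})       # stub analyzer
pipeline.register("ner", lambda m: {"entities": []})           # stub analyzer
pipeline.register("stylometry", lambda m: {"style_score": 0})  # stub analyzer
print(pipeline.process({"text": "example message"}))
```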

* Accepted at STOC 

Adaptation of a Lexical Organization for Social Engineering Detection and Response Generation

Apr 20, 2020
Archna Bhatia, Adam Dalton, Brodie Mather, Sashank Santhanam, Samira Shaikh, Alan Zemel, Tomek Strzalkowski, Bonnie J. Dorr

We present a paradigm for extensible lexicon development based on Lexical Conceptual Structure to support social engineering detection and response generation. We leverage the central notions of ask (elicitation of behaviors such as providing access to money) and framing (risk/reward implied by the ask). We demonstrate improvements in ask/framing detection through refinements to our lexical organization and show that response generation qualitatively improves as ask/framing detection performance improves. The paradigm presents a systematic and efficient approach to resource adaptation for improved task-specific performance.
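
A toy rendering of such a lexicon may help: the sketch below keys ask terms to illustrative lexical classes and tags framing terms with risk/reward labels. The classes and entries are invented for illustration and are not the paper's actual resource.

```python
# Toy ask/framing lexicon in the spirit of a Lexical Conceptual Structure
# organization; all classes and entries here are illustrative assumptions.
ASK_LEXICON = {
    # verb -> (hypothetical lexical class, ask type)
    "send":     ("GIVE", "PERFORM"),
    "transfer": ("GIVE", "PERFORM"),
    "click":    ("ACCESS", "PERFORM"),
    "buy":      ("EXCHANGE", "PERFORM"),
}

FRAMING_LEXICON = {
    "lose":  "RISK",    # e.g., "lose your job"
    "fired": "RISK",
    "raise": "REWARD",  # e.g., "get a raise"
    "bonus": "REWARD",
}

def detect_asks_and_framings(tokens: list):
    asks = [(t, *ASK_LEXICON[t]) for t in tokens if t in ASK_LEXICON]
    framings = [(t, FRAMING_LEXICON[t]) for t in tokens if t in FRAMING_LEXICON]
    return asks, framings

print(detect_asks_and_framings("please buy a gift card or lose your job".split()))
```

Because detection is driven by the lexicon rather than hard-coded rules, extending coverage is a matter of adding entries, which is the extensibility property the abstract emphasizes.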

* Accepted at STOC 

Detecting Asks in SE attacks: Impact of Linguistic and Structural Knowledge

Feb 25, 2020
Bonnie J. Dorr, Archna Bhatia, Adam Dalton, Brodie Mather, Bryanna Hebenstreit, Sashank Santhanam, Zhuo Cheng, Samira Shaikh, Alan Zemel, Tomek Strzalkowski

Social engineers attempt to manipulate users into undertaking actions such as downloading malware by clicking links or providing access to money or sensitive information. Natural language processing, computational sociolinguistics, and media-specific structural clues provide a means for detecting both the ask (e.g., buy gift card) and the risk/reward implied by the ask, which we call framing (e.g., lose your job, get a raise). We apply linguistic resources such as Lexical Conceptual Structure to tackle ask detection and also leverage structural clues such as links and their proximity to identified asks to improve confidence in our results. Our experiments indicate that the performance of ask detection, framing detection, and identification of the top ask is improved by linguistically motivated classes coupled with structural clues such as links. Our approach is implemented in a system that informs users about social engineering risk situations.
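
To make the combination concrete, the sketch below scores ask candidates from a lexical match and boosts confidence when a link appears within a small token window. The weights, term list, and window heuristic are assumptions for illustration, not the paper's model.

```python
# Sketch: combining lexically motivated ask terms with a structural clue
# (proximity to a link) to score ask candidates and pick the top ask.
import re

ASK_TERMS = {"buy", "send", "click", "transfer"}  # illustrative class members

def score_asks(message: str, window: int = 5):
    tokens = message.lower().split()
    link_positions = [i for i, t in enumerate(tokens) if re.match(r"https?://", t)]
    scored = []
    for i, token in enumerate(tokens):
        if token in ASK_TERMS:
            confidence = 0.5  # base score from the lexical match
            if any(abs(i - j) <= window for j in link_positions):
                confidence += 0.4  # a nearby link strengthens the ask hypothesis
            scored.append((token, confidence))
    # Highest-scoring candidate first: the "top ask".
    return sorted(scored, key=lambda pair: -pair[1])

print(score_asks("click http://example.com to buy a gift card"))
```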

* Accepted at AAAI 2020 