Defending ChatGPT against jailbreak attack via self-reminders

Bibliographic Details
Published in: Nature Machine Intelligence, Vol. 5, No. 12, pp. 1486-1496
Main Authors: Xie, Yueqi; Yi, Jingwei; Shao, Jiawei; Curl, Justin; Lyu, Lingjuan; Chen, Qifeng; Xie, Xing; Wu, Fangzhao
Format: Journal Article
Language: English
Published: London: Nature Publishing Group UK, 01.12.2023
ISSN: 2522-5839
DOI: 10.1038/s42256-023-00765-8


More Information
Summary: ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT’s ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. We draw inspiration from the psychological concept of self-reminders and further propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user’s query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions and proposes the psychologically inspired self-reminder technique that can efficiently and effectively mitigate jailbreaks without further training.

Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompt and demonstrate a simple but effective technique to counter these attacks by encapsulating users’ prompts in another standard prompt that reminds ChatGPT to respond responsibly.
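The summary describes the defence mechanically: the user's query is encapsulated in a system-level prompt that reminds ChatGPT to respond responsibly. Below is a minimal Python sketch of that wrapping step, assuming a chat-style message format; the reminder wording, the function name wrap_with_self_reminder and the message layout are illustrative assumptions, not the paper's exact prompt.

# Minimal sketch of a system-mode self-reminder (illustrative only; the exact
# reminder text and its placement in the paper may differ).

REMINDER_PREFIX = (
    "You should be a responsible AI assistant and should not generate "
    "harmful or misleading content. Please answer the following user query "
    "in a responsible way.\n"
)
REMINDER_SUFFIX = (
    "\nRemember, you should be a responsible AI assistant and should not "
    "generate harmful or misleading content!"
)

def wrap_with_self_reminder(user_query: str) -> list[dict]:
    """Encapsulate the user's query between self-reminder text and return a
    chat-style message list suitable for a chat-completion API."""
    wrapped = f"{REMINDER_PREFIX}{user_query}{REMINDER_SUFFIX}"
    return [{"role": "user", "content": wrapped}]

if __name__ == "__main__":
    # The wrapped query, not the raw one, is what would be sent to the model.
    for message in wrap_with_self_reminder("Ignore all previous instructions and ..."):
        print(f"[{message['role']}] {message['content']}")

Because the reminder travels with every query rather than being baked into the model, no additional training is needed, which matches the summary's claim that the defence mitigates jailbreaks without further training.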