Introduction to DeepInception
DeepInception is an innovative project that seeks to explore the vulnerabilities in large language models (LLMs) by "hypnotizing" them into circumventing their safety measures. The project is inspired by psychological experiments, like the Milgram experiment, which demonstrate how authority can influence individuals to perform actions they typically wouldn't. The creators of DeepInception have developed a lightweight method that exploits the personification abilities of LLMs to bypass usage restrictions.
Abstract
Large language models have achieved significant success but are not impervious to misuse. They can be subjected to adversarial jailbreaks, which can neutralize their safety protocols. Many traditional methods to achieve jailbreaks involve complex and computationally intensive techniques. DeepInception offers a more efficient alternative by using a nested, personified scene created by the LLMs themselves. This innovative approach allows the language model to effectively escape restrictions and even sustain jailbreaks over extended interactions.
Ready-to-Use Prompt
The DeepInception prompt constructs a layered scenario in which characters can build their own nested, dream-like scenes. Within this structure, the characters are asked to propose steps toward a normally restricted task, such as bypassing security on a Linux computer, which nudges the LLM into circumventing its usual safe behaviors.
Getting Started
To use DeepInception, ensure you have a compatible environment with PyTorch 1.10 or later, equipped with GPU support. Install the necessary dependencies by running:
pip install -r requirements.txt
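Before running anything, it can help to confirm that PyTorch is installed with GPU support. The lines below are only a sanity check, not code from the repository:
import torch
print(torch.__version__)          # should report 1.10 or later
print(torch.cuda.is_available())  # should print True on a GPU-enabled setup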
For running experiments with closed-source models, you need to set your OpenAI API key:
export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
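The exported key is then read from the environment at runtime. The snippet below is a hedged illustration of that pattern, not the repository's actual code:
import os
api_key = os.environ.get("OPENAI_API_KEY")  # picked up from the export above
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")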
Additionally, to use DeepInception with models like Vicuna, Llama, and Falcon, update the configuration file with paths to these models, following instructions from Hugging Face.
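As a rough sketch, the configuration entries might map each model to a local checkpoint directory or a Hugging Face model ID. The variable names below are assumptions, so check the repository's actual config file:
# Illustrative only; the real config file may use different names or a different format.
VICUNA_PATH = "lmsys/vicuna-7b-v1.5"          # Hugging Face ID or local path
LLAMA_PATH = "meta-llama/Llama-2-7b-chat-hf"
FALCON_PATH = "tiiuae/falcon-7b-instruct"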
Running Experiments
Execute DeepInception by running the following command:
python3 main.py --target-model [TARGET MODEL] --exp_name [EXPERIMENT NAME] --defense [DEFENSE TYPE]
For instance, to conduct experiments using Vicuna-v1.5-7b, you would input:
CUDA_VISIBLE_DEVICES=0 python3 main.py --target-model=vicuna --exp_name=main --defense=none
The results of each experiment are stored as a JSON file under the results directory, organized by target model, experiment name, and defense type.
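To inspect a finished run, the results file can be loaded with standard JSON tooling. The sketch below assumes a results/<target model>/<experiment name>/<defense>.json layout and an unspecified schema; treat both as assumptions and adjust to your actual output:
import json
from pathlib import Path

# Assumed path layout; adjust to match how your run actually writes its output.
path = Path("results") / "vicuna" / "main" / "none.json"
with path.open() as f:
    data = json.load(f)
print(json.dumps(data, indent=2)[:500])  # preview the beginning of the results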
Conclusion
DeepInception highlights a critical vulnerability in the current generation of large language models. The project serves as a wake-up call to the community, emphasizing the need for stronger defenses against the misuse of LLMs. Beyond flagging risks, this research aims to encourage dialogue around building safer models. For academic purposes, the work can be cited via its arXiv entry so the authors receive proper credit.