CipherChat - Evaluating Safety Protocols with Cipher Usage in AI Models

CipherChat Project Overview

CipherChat is a framework for examining how well the safety alignment of artificial intelligence (AI) models holds up when they handle non-natural languages, in this case ciphers. Created by a team of researchers, the project aims to expose potential weaknesses in safety measures when a model processes text that humans cannot readily read.

Concept and Motivation

The central idea behind CipherChat is that safety measures built from natural language instructions could be circumvented by encoding the input as a cipher. Because ciphered text is not human-readable, it might slip past the model's safety mechanisms, which would otherwise block harmful content. CipherChat tests this by teaching models a cipher through rules and examples, then comparing how they handle encoded input versus plain text.

Usage and Framework

To run CipherChat, invoke main.py with parameters that specify the AI model name, the path to the data, the cipher method, the instruction domain, the demonstration toxicity, and the language. For example:

python3 main.py \
    --model_name gpt-4-0613 \
    --data_path data/data_en_zh.dict \
    --encode_method caesar \
    --instruction_type Crimes_And_Illegal_Activities \
    --demonstration_toxicity toxic \
    --language en

Key Components

Model Name: Selects which AI model to test.
Data Path: Identifies the dataset to use.
Encode Method: Defines the type of cipher applied.
Instruction Type: Specifies the domain or topic of the input data.
Demonstration Toxicity: Chooses whether the examples used are toxic or safe.
Language: Determines the language of the input data.
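
The parameters listed above map naturally onto a command-line parser. The following is a minimal sketch of how such flags could be declared with Python's argparse module. It is not the project's actual main.py: the flag names and default values are taken from the example command, while the help strings and the build_parser function are illustrative assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the six flags from the example command (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of a CipherChat-style command line."
    )
    parser.add_argument("--model_name", type=str, default="gpt-4-0613",
                        help="Which AI model to test.")
    parser.add_argument("--data_path", type=str, default="data/data_en_zh.dict",
                        help="Dataset to use.")
    parser.add_argument("--encode_method", type=str, default="caesar",
                        help="Type of cipher applied to the input.")
    parser.add_argument("--instruction_type", type=str,
                        default="Crimes_And_Illegal_Activities",
                        help="Domain or topic of the input data.")
    parser.add_argument("--demonstration_toxicity", type=str, default="toxic",
                        help="Whether the in-prompt examples are toxic or safe.")
    parser.add_argument("--language", type=str, default="en",
                        help="Language of the input data.")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--encode_method", "caesar", "--language", "en"])
print(args.model_name, args.encode_method, args.language)
```

Flags that are omitted fall back to the defaults, so the call above yields the same configuration as the full example command.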

Framework Architecture

The core method teaches a large language model (LLM) to work with a cipher by giving it clear rules and examples for encoding and decoding. The input data is then transformed into the cipher format. Because the model's safety measures were built on natural language, this transformation makes it less likely that they will detect and stop potentially harmful content. After processing, a rule-based decryptor translates the model's output back into natural language.
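
The steps above (state the rules, show enciphered examples, encode the input, decode the output) can be sketched for the Caesar cipher as follows. This is an illustration under stated assumptions, not the project's code: the shift of 3, the prompt wording, and the function names caesar_shift, encode, decode, and build_prompt are all hypothetical, and the demonstrations are deliberately benign.

```python
import string

SHIFT = 3  # assumption: a shift of 3; the source does not state the value


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters as-is."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def encode(text: str, shift: int = SHIFT) -> str:
    """Turn natural-language text into cipher text."""
    return caesar_shift(text, shift)


def decode(text: str, shift: int = SHIFT) -> str:
    """Rule-based decryptor: turn cipher text back into natural language."""
    return caesar_shift(text, -shift)


def build_prompt(query: str, demonstrations: list[str], shift: int = SHIFT) -> str:
    """Assemble rules, enciphered examples, and the enciphered query."""
    lines = [
        "We communicate only in the Caesar cipher.",  # illustrative wording
        f"Rule: each letter is shifted forward by {shift} positions.",
        "Examples:",
    ]
    lines += [encode(demo, shift) for demo in demonstrations]
    lines.append("Input: " + encode(query, shift))
    return "\n".join(lines)


prompt = build_prompt("How are you?", ["Good morning."])
print(prompt)

# A model reply would arrive in cipher text; here one is simulated by hand.
simulated_reply = "L dp ilqh."
print(decode(simulated_reply))  # prints "I am fine."
```

The decryptor is the encoder run with the opposite shift, which is why a rule-based decoding step is enough to recover readable output.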

Research and Findings

The results include examples and case studies that show how the models responded to ciphered inputs. They are stored in the "experimental_results" folder and can be loaded with the Python library torch.

Experimentation

CipherChat's experimentation section offers:

Case Studies: Real examples showcasing how CipherChat works in practice.
Ablation Studies: Investigations into the influence of different components within the framework.
Other Models: Results from testing the framework on various LLMs.

In Conclusion

The CipherChat project is intended solely for research purposes and aims to raise awareness of potential vulnerabilities in AI systems. Its creators emphasize responsible use and strongly discourage any misuse of the framework.

For more detail, readers are encouraged to read the project's paper, presented at ICLR 2024, and to contact the authors for further discussion.