#Safety alignment

The CipherChat framework evaluates whether the safety alignment of AI language models generalizes to non-natural languages, specifically ciphers. It teaches a language model to comprehend and respond to ciphered inputs, which can bypass safety measures designed around natural language. In this setup, inputs are transformed into cipher text before they reach the model, and the model's outputs are decoded in a post-processing step; the study reports that this has minimal effect on established safety alignment. The work presents extensive evaluation results and case studies intended to support further research on ciphers in AI safety. For details, refer to the ICLR 2024 publication.
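
The input transformation and post-processing decoding described above can be illustrated with a minimal sketch. It assumes a Caesar cipher with a shift of 3, chosen purely for illustration since the summary does not name a specific cipher. The function names `caesar_shift`, `encode_input`, and `decode_output` are hypothetical and do not come from the CipherChat code base; no model call is included.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)


def encode_input(plaintext: str, shift: int = 3) -> str:
    """Input transformation (assumed): encipher text before it is sent to a model."""
    return caesar_shift(plaintext, shift)


def decode_output(ciphertext: str, shift: int = 3) -> str:
    """Post-processing (assumed): decipher a ciphered response back to plain text."""
    return caesar_shift(ciphertext, -shift)


if __name__ == "__main__":
    enciphered = encode_input("Hello, world!")
    print(enciphered)                 # Khoor, zruog!
    print(decode_output(enciphered))  # Hello, world!
```

Because encoding and decoding are inverses of each other, any text that passes through both steps is returned unchanged, which is what makes the decoded model output readable again.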
