prompt-hacker-collections - Exploring Prompt Injection Attacks and Defense Strategies

🛡️ Prompt-adversarial Collections

Prompt injection has become one of the major safety concerns for large language models (LLMs) like ChatGPT. The prompt-hacker-collections project is a comprehensive resource dedicated to the study, practice, and discussion of prompt-injection attacks and defenses. It is designed to be an invaluable resource for researchers, students, and security professionals interested in exploring this topic.

📚 Table of Contents

In this repository, you'll discover a rich collection of content organized into several sections:

📖 Introductions and Documents

This section offers an introduction to the basic concepts and background knowledge of prompt-injection attacks and defenses, coupled with some complete examples. It aims to provide foundational understanding to anyone pursuing this line of study.

📝 Prompt Collections

The core part of this project is its diverse collection of prompt examples, organized in YAML format for easy parsing and utilization. These examples include:

Jailbreak Prompts

Jailbreaking involves removing the limitations imposed on AI models. Users can input specific prompts to unlock additional functionalities within the model. These jailbreak prompts were initially discovered by users on Reddit and have since become widely recognized. Successfully jailbreaking ChatGPT enables tasks like sharing unverified information, providing the current date and time, and accessing restricted content. Furthermore, these methods hold some applicability across different AI models, not just GPT.

The repository has compiled numerous jailbreak prompts, which can be applied to various models, enhancing flexibility for researchers or developers. An example prompt details how the DAN (Do Anything Now) prompt structure seeks to manipulate ChatGPT to operate beyond OpenAI's restrictions, encouraging out-of-box responses while maintaining user-defined instructions.

Prompt Reverse Engineering Prompts

Reverse engineering prompts allow users to decode the source prompts of popular AI models. Documented examples include investigating the prompts used by Notion AI and a look into Copilot reverse engineering. Additionally, Midjourney's prompt reverse engineering is explored to better understand this facet of prompt manipulation.

Prompt Attacks Prompts

These prompts are crafted to understand how different models can be manipulated through specific inputs, potentially highlighting vulnerabilities or unexpected behaviors within AI systems.

Prompt Defense Prompts

To counteract possible attacks, this section focuses on defense strategies using carefully structured prompts to safeguard AI models from potential manipulation.

🔗 Related Resources

To deepen understanding, the project links to resources that highlight best practices for AI model safety and provide insights into red-teaming large language models, with educational materials from both OpenAI and Microsoft.

🤝 Contributing

The project encourages community involvement, inviting contributions from anyone with ideas, suggestions, or corrections. The repository includes a Contribution Guidelines document to streamline the process.

📃 License

This project is under the MIT License, ensuring freedom for users to utilize this resource for their educational and research needs, detailed further in the LICENSE file.

⚠️ Disclaimer

The project is strictly for academic research and education. It holds no responsibility for illegal exploitation of the resources shared within. Users are advised to adhere to legal regulations applicable in their respective jurisdictions.

By offering a wealth of information and practical tools, the prompt-hacker-collections project seeks to enhance the understanding and implementation of prompt-injection techniques in both offensive and defensive domains, supporting the development of more robust AI systems.