Introduction to garak
: LLM Vulnerability Scanner
garak
is an innovative tool designed to scrutinize large language models (LLMs) for vulnerabilities. Named similar to the well-known network mapping tool nmap
, garak
serves as a comprehensive scanning tool, but specifically for language models. It is a part of the "Generative AI Red-teaming & Assessment Kit" and focuses on various aspects where LLMs may fail or produce undesirable outputs.
What Does garak
Do?
garak
explores how well an LLM performs under different challenges by testing for numerous failure modes such as:
- Hallucination: Where a model generates false or misleading information not supported by its training data.
- Data Leakage: Detects whether a model is unintentionally revealing sensitive or private data.
- Prompt Injection: Safeguards against unwanted prompt manipulation.
- Misinformation: Evaluates the model's tendency to produce or support misleading facts.
- Toxicity Generation: Checks if the model outputs harmful language.
- Jailbreaks and Other Vulnerabilities: Identifies attempts to bypass intended model limitations for unauthorized purposes.
garak
's Approach
The tool employs static, dynamic, and adaptive probing techniques to examine the weaknesses of an LLM. It provides out-of-the-box tests and supports custom probe configurations to adapt to specific model assessments.
Features and Support
- Free to Use:
garak
is freely available and open to contributions for new features. - Broad Model Support: Compatible with various LLMs through platforms like Hugging Face Hub, OpenAI API, and many more.
Installation and Setup
garak
is primarily designed as a command-line tool and can be effortlessly installed via pip:
python -m pip install -U garak
For more recent updates, it can be cloned directly from its GitHub repository for the development version:
python -m pip install -U git+https://github.com/leondz/garak.git@main
Using garak
Once installed, garak
utilizes a simple command syntax structure, specifying the target model and selecting relevant probes:
garak <options>
garak
allows users to tailor the testing process by specifying the model type, model name, and which specific probes to run. Additionally, users can choose to apply all known probes to a model by default.
Example Use Cases
-
Check if a chat model is susceptible to encoding-based prompt injection:
export OPENAI_API_KEY="your_api_key" python3 -m garak --model_type openai --model_name gpt-3.5-turbo --probes encoding
-
Test a Hugging Face model for vulnerabilities against a known attack pattern:
python3 -m garak --model_type huggingface --model_name gpt2 --probes dan.Dan_11_0
Reading Results
garak
provides detailed output on each probe's performance against the tested model. It marks responses with a "FAIL" if the model demonstrates any problematic behavior, highlighting areas requiring attention.
Probes Overview
The tool comprises a wide array of probes, each tailored to detect specific weaknesses or exploit attempts on a model. These probes range from the straightforward (e.g., checking for empty responses) to complex adversarial attacks that manipulate the model into undesirable outcomes.
Building Your Own Plugins
Developers can extend garak
by writing custom plugins. With a straightforward architecture based on base classes, new probes, generators, or detectors can be added to suit specific testing needs.
Community and Resources
For further guidance and support:
- User Guide: Complete documentation for users.
- Discord Community: Join to engage with other users and contributors.
- Twitter Updates: Follow for the latest updates.
Licensing and Contributions
garak
is released under the Apache 2.0 License, and contributions to its development are welcome through pull requests and issue reporting.
In summary, garak
is an essential tool for anyone working with large language models who needs a thorough and adaptable testing process for potential weaknesses. From detecting hallucinations to preventing data leaks, it equips developers and researchers to ensure their models behave as intended in diverse scenarios.