Tonic Validate Project Introduction
Tonic Validate is an innovative framework designed to evaluate the outputs of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) systems. With a focus on simplicity and performance, Tonic Validate enables users to effectively evaluate, track, and monitor their LLM and RAG applications. It provides a set of metrics that measure various aspects of LLM performance, including answer correctness and the presence of hallucinations. An optional user interface is available to visualize evaluation results, making tracking and monitoring straightforward.
Key Features
Evaluation Metrics
Tonic Validate comes equipped with numerous metrics suited for RAG systems, allowing users to gauge LLM performance accurately. Key metrics include the following (see the configuration sketch after this list):
- Answer Similarity Score: Measures how well the LLM answer matches a reference answer.
- Retrieval Precision: Evaluates if the retrieved context is relevant to the question.
- Augmentation Precision and Augmentation Accuracy: Check whether the relevant retrieved context appears in the LLM answer, and whether all of the retrieved context does, respectively.
- Answer Consistency: Assesses if the LLM answer contains only information sourced from the provided context.
- Latency: Tracks the time the LLM takes to respond.
- Text Content Checks: Determines if specific text is present in the LLM response.
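To make the list concrete, here is a minimal sketch of selecting a subset of these metrics when constructing a scorer. The import path and metric class names follow the pattern used in recent tonic-validate releases, but treat them as illustrative and confirm them against the documentation for your installed version:

  from tonic_validate import ValidateScorer
  # Metric class names are illustrative; check your version's docs
  from tonic_validate.metrics import (
      AnswerSimilarityMetric,
      RetrievalPrecisionMetric,
      AnswerConsistencyMetric,
  )

  # Score only the metrics you care about instead of the defaults
  scorer = ValidateScorer(metrics=[
      AnswerSimilarityMetric(),
      RetrievalPrecisionMetric(),
      AnswerConsistencyMetric(),
  ])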
Preparing Data for RAG Systems
High-quality, secure data is essential for a high-performing RAG system. Tonic Validate's companion tool, Tonic Textual, handles data pre-processing for RAG systems: it extracts text from unstructured data, de-identifies sensitive information, and optimizes formatting for RAG applications, all while enriching the data with metadata and contextual entity tags. These tags help build semantic entity graphs that ground RAG systems in factual data, reducing hallucinations and improving output quality.
Getting Started
To start using Tonic Validate locally, follow these steps:
- Install Tonic Validate:

  pip install tonic-validate

- Code Setup Example:

  from tonic_validate import ValidateScorer, Benchmark
  import os

  # The scorer's LLM-assisted metrics call OpenAI by default
  os.environ["OPENAI_API_KEY"] = "your-openai-key"

  # Stand-in for your RAG application: return the LLM answer and the
  # list of retrieved context used to produce it
  def get_llm_response(question):
      return {
          "llm_answer": "Paris",
          "llm_context_list": ["Paris is the capital of France."]
      }

  benchmark = Benchmark(
      questions=["What is the capital of France?"],
      answers=["Paris"]
  )

  scorer = ValidateScorer()
  run = scorer.score(benchmark, get_llm_response)
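ValidateScorer uses an LLM-assisted judge (OpenAI by default) to compute most of its metrics, which is why an OpenAI API key is set before scoring; the score call returns a run object holding per-question results and overall scores.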
Continuous Integration/Continuous Deployment (CI/CD)
Tonic Validate can be integrated into a CI/CD pipeline; many users run evaluations during the code review or pull request process. The framework can be set up using the provided GitHub Action, available on the GitHub Marketplace, enabling automated evaluations that help ensure reliability and performance before changes are merged.
Usage Examples
Tonic Validate supports configurable metrics and custom metric development. Users must supply relevant input data from their RAG applications for accurate performance measurement. Example metrics cover aspects such as answer similarity, augmentation precision, and latency. Users can apply metrics with various LLM services, including OpenAI, Azure, and others, by setting the appropriate API keys.
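As a sketch of pointing the LLM-assisted metrics at a particular service, the judge model and credentials are typically supplied through a constructor argument and environment variables. The model_evaluator parameter and the Azure variable names below reflect common tonic-validate usage but are assumptions to verify against your version's documentation:

  import os
  from tonic_validate import ValidateScorer

  # OpenAI: set the API key and optionally choose the judge model
  os.environ["OPENAI_API_KEY"] = "your-openai-key"
  scorer = ValidateScorer(model_evaluator="gpt-4-turbo")  # model name is illustrative

  # Azure OpenAI: variable names are assumptions; confirm in the docs
  os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-key"
  os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"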
Viewing Results
Results from Tonic Validate can be viewed directly in Python by printing scores for each question, answer, and LLM output, alongside the overall performance scores. This structured approach aids in understanding how well the LLM performs across various metrics and inputs.
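Continuing the Getting Started example, here is a minimal sketch of printing a run's results, assuming the run object exposes run_data and overall_scores attributes as in recent releases:

  # Per-question results: question, reference answer, LLM answer, scores
  for item in run.run_data:
      print("Question:", item.reference_question)
      print("Reference answer:", item.reference_answer)
      print("LLM answer:", item.llm_answer)
      print("Scores:", item.scores)

  # Aggregate scores across the whole benchmark
  print("Overall scores:", run.overall_scores)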
Conclusion
Tonic Validate offers a comprehensive solution for evaluating the efficacy of LLM and RAG systems. By incorporating Tonic Validate into their workflow, developers can enhance the trustworthiness and reliability of their applications. The framework's flexibility, combined with its detailed metrics and easy integration into existing workflows, makes Tonic Validate an essential tool for developers working with RAG and LLM technologies. For more details, explore the Tonic Validate documentation.