RAGAS: Retrieval-Augmented Generation Assessment
RAGAS is a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics to assess the quality of RAG pipelines, helping developers and researchers understand how well their systems perform across different dimensions.
Overview
RAGAS (Retrieval-Augmented Generation Assessment) is an open-source Python library designed to evaluate RAG systems systematically. It focuses on three main aspects:
- Faithfulness: How well the generated answer is grounded in the retrieved context
- Answer Relevance: How relevant the generated answer is to the user's question
- Context Relevance: How relevant the retrieved context is to the user's question
Key Features
🔍 Core Evaluation Metrics
- Faithfulness: Measures if the generated answer is factually consistent with the retrieved context
- Answer Relevance: Evaluates how well the answer addresses the user's question
- Context Relevance: Assesses the relevance of retrieved documents to the query
- Context Recall: Measures how much of the information needed for the reference (ground-truth) answer is present in the retrieved context
- Answer Correctness: Evaluates factual accuracy of the generated answers
🚀 Advanced Capabilities
- Reference-Free Evaluation: The core metrics (faithfulness, answer relevance, context relevance) need no human-annotated ground truth; only context recall and answer correctness require reference answers
- Custom Metrics: Extensible framework for custom evaluation metrics
- Batch Processing: Efficient evaluation of large datasets
- Multiple Formats: Support for various data formats and RAG frameworks
- Reproducible Results: Consistent evaluation across different runs
Installation
Basic Installation
# Install RAGAS
pip install ragas
# Install with additional dependencies for advanced features
pip install "ragas[all]"
Development Installation
# Clone the repository
git clone https://github.com/explodinggradients/ragas.git
cd ragas
# Install in development mode
pip install -e .
Optional Dependencies
The optional extras that ragas exposes vary between releases, so check the pyproject.toml of the version you install. The commonly used pattern is:
# Install all optional features
pip install "ragas[all]"
Quick Start
Basic Usage
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset
# Prepare your data
data = {
"question": ["What is the capital of France?"],
"contexts": [["Paris is the capital of France.", "France is a country in Europe."]],
"answer": ["Paris is the capital of France."]
}
dataset = Dataset.from_dict(data)
# Evaluate your RAG system
results = evaluate(
dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy]
)
print(results)
Advanced Usage with Custom Data
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset
import pandas as pd
# Load your RAG evaluation data
df = pd.read_csv("rag_evaluation_data.csv")
# Convert to RAGAS format
dataset = Dataset.from_pandas(df)
# Define evaluation metrics
metrics = [
faithfulness,
    answer_relevancy,
    context_relevancy
]
# Run evaluation
results = evaluate(
dataset,
metrics=metrics,
batch_size=8
)
# Print detailed results
for metric_name, score in results.items():
print(f"{metric_name}: {score:.4f}")
Data Format
Required Schema
RAGAS expects data in the following format:
{
"question": ["What is machine learning?"],
"contexts": [["Machine learning is a subset of AI.", "It involves training models on data."]],
"answer": ["Machine learning is a subset of artificial intelligence that involves training models on data."]
}
Extended Schema with Metadata
{
"question": ["What is machine learning?"],
"contexts": [["Machine learning is a subset of AI.", "It involves training models on data."]],
"answer": ["Machine learning is a subset of artificial intelligence that involves training models on data."],
"ground_truth": ["Machine learning is a subset of AI."], # Optional
"metadata": [{"source": "wikipedia", "confidence": 0.95}] # Optional
}
Evaluation Metrics
1. Faithfulness
Measures if the generated answer is factually consistent with the retrieved context.
from ragas.metrics import faithfulness
# Evaluate faithfulness
results = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness Score: {results['faithfulness']:.4f}")
Interpretation:
- High Score (0.8-1.0): Generated answer is well-grounded in the context
- Medium Score (0.5-0.8): Some inconsistencies or hallucinations
- Low Score (0.0-0.5): Significant factual inconsistencies
2. Answer Relevance
Evaluates how well the answer addresses the user's question.
from ragas.metrics import answer_relevancy
# Evaluate answer relevance
results = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevance Score: {results['answer_relevancy']:.4f}")
Interpretation:
- High Score (0.8-1.0): Answer directly addresses the question
- Medium Score (0.5-0.8): Partially relevant answer
- Low Score (0.0-0.5): Answer doesn't address the question
3. Context Relevance
Assesses the relevance of retrieved documents to the query.
from ragas.metrics import context_relevancy  # superseded by context_precision in newer ragas releases
# Evaluate context relevance
results = evaluate(dataset, metrics=[context_relevancy])
print(f"Context Relevance Score: {results['context_relevancy']:.4f}")
Interpretation:
- High Score (0.8-1.0): Retrieved context is highly relevant
- Medium Score (0.5-0.8): Some relevant information
- Low Score (0.0-0.5): Retrieved context is not relevant
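The same 0.8/0.5 bands are used for all three scores above. A small helper keeps reports consistent; the thresholds below are the rules of thumb listed here, not values defined by the library.
def interpret(score: float) -> str:
    """Map a metric score to the rough quality bands used above."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"
for name, score in results.items():
    print(f"{name}: {score:.4f} ({interpret(score)})")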
4. Context Recall
Measures how much of the information in the ground-truth answer is covered by the retrieved context; the dataset therefore needs a ground_truth column.
from ragas.metrics import context_recall
# Evaluate context recall
results = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall Score: {results['context_recall']:.4f}")
5. Answer Correctness
Evaluates the factual accuracy of the generated answer against a ground-truth reference, so it also requires the ground_truth column.
from ragas.metrics import answer_correctness
# Evaluate answer correctness
results = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness Score: {results['answer_correctness']:.4f}")
Integration with RAG Frameworks
LangChain Integration
from langchain.llms import OpenAI
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset
# Your RAG pipeline: any LangChain vector store can serve as the retriever
retriever = vectorstore.as_retriever()  # vectorstore: your existing LangChain vector store (placeholder)
llm = OpenAI(...)
# Generate evaluation data
questions = ["What is machine learning?"]
evaluation_data = []
for question in questions:
# Retrieve context
contexts = retriever.get_relevant_documents(question)
    # Generate an answer by stuffing the retrieved context into a simple prompt
    context_block = "\n".join(ctx.page_content for ctx in contexts)
    prompt = f"Answer the question using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {question}"
    answer = llm.invoke(prompt)  # llm(prompt) on older LangChain versions
evaluation_data.append({
"question": question,
"contexts": [ctx.page_content for ctx in contexts],
"answer": answer
})
# Evaluate
dataset = Dataset.from_list(evaluation_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
LlamaIndex Integration
from llama_index import VectorStoreIndex  # llama_index.core in newer releases
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset
# Your LlamaIndex setup
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
# Generate evaluation data
questions = ["What is machine learning?"]
evaluation_data = []
for question in questions:
response = query_engine.query(question)
evaluation_data.append({
"question": question,
"contexts": [node.text for node in response.source_nodes],
"answer": str(response)
})
# Evaluate
dataset = Dataset.from_list(evaluation_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
Custom Metrics
Creating Custom Evaluation Metrics
from ragas.metrics.base import Metric
from typing import Dict, Any
import numpy as np
class CustomMetric(Metric):
    # The exact base-class hooks (score vs. _ascore, required attributes) differ
    # between ragas versions; treat this as an illustrative sketch.
    name: str = "custom_metric"
    def score(self, data: Dict[str, Any]) -> float:
        # Your custom evaluation logic here
        answers = data["answer"]
        # Example: reward longer answers, capped at 1.0
        answer_lengths = [len(answer.split()) for answer in answers]
        return min(np.mean(answer_lengths) / 100, 1.0)
# Use custom metric
custom_metric = CustomMetric()
results = evaluate(dataset, metrics=[custom_metric])
Custom Faithfulness Metric
from ragas.metrics.faithfulness import Faithfulness
from transformers import pipeline
class CustomFaithfulness(Faithfulness):
    # The override point below is illustrative; the internal scoring hooks of
    # Faithfulness vary between ragas versions.
    def __init__(self):
        super().__init__()
        # A text-classification pipeline backed by an NLI model fine-tuned on MNLI
        self.nli_pipeline = pipeline("text-classification", model="roberta-large-mnli")
    def _compute_score(self, question: str, contexts: list, answer: str) -> float:
        # Custom faithfulness computation: is the answer entailed by the context?
        context_text = " ".join(contexts)
        result = self.nli_pipeline({"text": context_text, "text_pair": answer})
        top = result[0] if isinstance(result, list) else result
        return top["score"] if top["label"] == "ENTAILMENT" else 0.0
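Once defined, the subclass is passed to evaluate like any built-in metric. A quick usage sketch, assuming a dataset prepared as above:
custom_faithfulness = CustomFaithfulness()
results = evaluate(dataset, metrics=[custom_faithfulness])
print(results)  # scores are keyed by the metric's name attribute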
Best Practices
1. Data Preparation
# Clean and validate your data
def prepare_ragas_data(questions, contexts, answers):
"""Prepare data for RAGAS evaluation."""
cleaned_data = []
for q, ctx, ans in zip(questions, contexts, answers):
# Validate data
if not q or not ctx or not ans:
continue
# Clean text
q = q.strip()
ans = ans.strip()
ctx = [c.strip() for c in ctx if c.strip()]
if q and ans and ctx:
cleaned_data.append({
"question": q,
"contexts": ctx,
"answer": ans
})
return cleaned_data
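The cleaned records plug directly into the evaluation flow shown earlier. A short usage sketch; questions, contexts_list, and answers are placeholders for your own raw lists:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness
# questions, contexts_list, answers: your raw inputs (placeholders)
cleaned = prepare_ragas_data(questions, contexts_list, answers)
dataset = Dataset.from_list(cleaned)
results = evaluate(dataset, metrics=[faithfulness])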
2. Comprehensive Evaluation
from ragas.metrics import (
    faithfulness, answer_relevancy, context_relevancy,
    context_recall, answer_correctness
)
# Evaluate all metrics (context_recall and answer_correctness need a ground_truth column)
all_metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    answer_correctness
]
results = evaluate(dataset, metrics=all_metrics)
# Analyze results
for metric_name, score in results.items():
print(f"{metric_name}: {score:.4f}")
# Provide interpretation
if score >= 0.8:
print(f" ✅ Excellent performance")
elif score >= 0.6:
print(f" ⚠️ Good performance, room for improvement")
else:
print(f" ❌ Needs significant improvement")
3. Batch Processing for Large Datasets
# Process large datasets in batches
import numpy as np
def evaluate_large_dataset(dataset, metrics, batch_size=32):
    """Evaluate large datasets efficiently by scoring fixed-size slices."""
    results_list = []
    for i in range(0, len(dataset), batch_size):
        batch = dataset.select(range(i, min(i + batch_size, len(dataset))))
        batch_results = evaluate(batch, metrics=metrics)
        results_list.append(batch_results)
    # Aggregate: simple mean of the per-batch scores (the last batch may be smaller)
    aggregated_results = {}
    for metric_name in results_list[0].keys():
        scores = [r[metric_name] for r in results_list]
        aggregated_results[metric_name] = float(np.mean(scores))
    return aggregated_results
4. Error Handling and Logging
import logging
from ragas import evaluate
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def safe_evaluate(dataset, metrics):
"""Safely evaluate with error handling."""
try:
results = evaluate(dataset, metrics=metrics)
logger.info("Evaluation completed successfully")
return results
except Exception as e:
logger.error(f"Evaluation failed: {str(e)}")
return None
Performance Optimization
GPU Acceleration
# Use GPU for faster evaluation. This only matters for metrics backed by local
# models; LLM-as-judge metrics that call a hosted API are unaffected.
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# How a metric is moved to the GPU depends on the ragas version and on the
# metric's backing model; the attribute below is illustrative, not a stable API.
faithfulness_metric = faithfulness
faithfulness_metric.device = device
Parallel Processing
from concurrent.futures import ThreadPoolExecutor
import multiprocessing
# Use multiple CPU cores
num_workers = multiprocessing.cpu_count()
def parallel_evaluate(datasets, metrics):
"""Evaluate multiple datasets in parallel."""
with ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = [
executor.submit(evaluate, dataset, metrics)
for dataset in datasets
]
results = [future.result() for future in futures]
return results
Troubleshooting
Common Issues
- Memory Issues
# Reduce batch size
results = evaluate(dataset, metrics=metrics, batch_size=4)
# Depending on your ragas version, you may also be able to configure the metric
# to use a smaller judge model; check the metric's constructor arguments.
- CUDA Out of Memory
# Force CPU usage
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
# Or use smaller batches
results = evaluate(dataset, metrics=metrics, batch_size=1)
- Data Format Issues
# Validate data format
def validate_ragas_data(data):
    required_fields = ["question", "contexts", "answer"]
    for field in required_fields:
        if field not in data:
            raise ValueError(f"Missing required field: {field}")
Debug Mode
# Enable debug logging
import logging
logging.getLogger("ragas").setLevel(logging.DEBUG)
# Run evaluation with debug info
results = evaluate(dataset, metrics=metrics, verbose=True)
Integration Examples
With Streamlit Web App
import streamlit as st
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
def create_evaluation_app():
st.title("RAG System Evaluation")
# File upload
uploaded_file = st.file_uploader("Upload evaluation data (CSV)")
if uploaded_file:
df = pd.read_csv(uploaded_file)
dataset = Dataset.from_pandas(df)
        # Run evaluation
        if st.button("Evaluate"):
            with st.spinner("Evaluating..."):
                results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
            # Display results
            st.write("## Evaluation Results")
            for metric, score in results.items():
                st.metric(metric, f"{score:.4f}")
With MLflow Tracking
import mlflow
from ragas import evaluate
def track_rag_evaluation(dataset, metrics, experiment_name="rag_evaluation"):
"""Track RAG evaluation results with MLflow."""
mlflow.set_experiment(experiment_name)
with mlflow.start_run():
results = evaluate(dataset, metrics=metrics)
# Log metrics
for metric_name, score in results.items():
mlflow.log_metric(metric_name, score)
# Log parameters
mlflow.log_param("dataset_size", len(dataset))
mlflow.log_param("metrics_used", [m.name for m in metrics])
return results
Advanced Use Cases
A/B Testing RAG Systems
def compare_rag_systems(system_a_data, system_b_data, metrics):
"""Compare two RAG systems using RAGAS metrics."""
# Evaluate system A
results_a = evaluate(system_a_data, metrics=metrics)
# Evaluate system B
results_b = evaluate(system_b_data, metrics=metrics)
# Compare results
comparison = {}
for metric_name in results_a.keys():
score_a = results_a[metric_name]
score_b = results_b[metric_name]
improvement = ((score_b - score_a) / score_a) * 100
comparison[metric_name] = {
"system_a": score_a,
"system_b": score_b,
"improvement_percent": improvement
}
return comparison
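A short usage sketch; system_a_data and system_b_data are placeholders for datasets in the schema described earlier, and the metrics are the ones imported above:
comparison = compare_rag_systems(system_a_data, system_b_data, metrics=[faithfulness, answer_relevancy])
for metric_name, scores in comparison.items():
    print(
        f"{metric_name}: A={scores['system_a']:.4f} "
        f"B={scores['system_b']:.4f} "
        f"({scores['improvement_percent']:+.1f}%)"
    )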
Continuous Evaluation Pipeline
from datetime import datetime
import json
def continuous_evaluation_pipeline(evaluation_data, metrics):
"""Run continuous evaluation and store results."""
# Run evaluation
results = evaluate(evaluation_data, metrics=metrics)
    # Store results with a timestamp; coerce scores to plain floats so the record is JSON-serializable
    evaluation_record = {
        "timestamp": datetime.now().isoformat(),
        "results": {name: float(score) for name, score in results.items()},
        "dataset_size": len(evaluation_data)
    }
# Save to file
with open("evaluation_history.json", "a") as f:
f.write(json.dumps(evaluation_record) + "\n")
return results
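Because each record is appended as a single JSON line, the history file can be read back for trend analysis. A minimal sketch using pandas:
import pandas as pd
# Each line of evaluation_history.json is one evaluation record
history = pd.read_json("evaluation_history.json", lines=True)
# Expand the per-metric scores into columns, indexed by run timestamp
scores = pd.DataFrame(history["results"].tolist(), index=history["timestamp"])
print(scores.tail())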
Resources and References
Official Resources
- GitHub Repository: github.com/explodinggradients/ragas
- Documentation: docs.ragas.io
- Paper: RAGAS: Automated Evaluation of Retrieval Augmented Generation
Community
- Discord: Join the RAGAS community for discussions
- Issues: Report bugs and request features on GitHub
- Contributions: Welcome contributions and improvements
Related Tools
- LangChain: Popular RAG framework
- LlamaIndex: Data framework for LLM applications
- Weights & Biases: Experiment tracking and evaluation
- MLflow: Machine learning lifecycle management
License
RAGAS is licensed under the Apache 2.0 License, making it suitable for both commercial and academic use.