RAGAS: Retrieval-Augmented Generation Assessment

RAGAS is a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics to assess the quality of RAG pipelines, helping developers and researchers understand how well their systems perform across different dimensions.

Overview

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source Python library designed to evaluate RAG systems systematically. It focuses on three main aspects:

  • Faithfulness: How well the generated answer is grounded in the retrieved context
  • Answer Relevance: How relevant the generated answer is to the user's question
  • Context Relevance: How relevant the retrieved context is to the user's question
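
To make these three dimensions concrete, here is a single illustrative evaluation sample (not tied to any particular RAGAS version) and the question each dimension asks of it:

# Illustrative only: one evaluation sample and what each dimension checks.
sample = {
    "question": "When was the Eiffel Tower completed?",
    "contexts": ["The Eiffel Tower was completed in 1889 for the World's Fair."],
    "answer": "It was completed in 1889.",
}
# Faithfulness:      is "completed in 1889" supported by the retrieved context?
# Answer relevance:  does the answer actually address the completion date that was asked about?
# Context relevance: does the retrieved passage talk about the Eiffel Tower's completion?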

Key Features

🔍 Core Evaluation Metrics

  • Faithfulness: Measures if the generated answer is factually consistent with the retrieved context
  • Answer Relevance: Evaluates how well the answer addresses the user's question
  • Context Relevance: Assesses the relevance of retrieved documents to the query
  • Context Recall: Measures how much of the relevant information is captured in the retrieved context
  • Answer Correctness: Evaluates factual accuracy of the generated answers

🚀 Advanced Capabilities

  • Reference-Free Evaluation: Core metrics such as faithfulness and answer relevance need no human-annotated ground truth (answer correctness and context recall do use a reference)
  • Custom Metrics: Extensible framework for custom evaluation metrics
  • Batch Processing: Efficient evaluation of large datasets
  • Multiple Formats: Support for various data formats and RAG frameworks
  • Reproducible Results: Consistent evaluation across different runs

Installation

Basic Installation

# Install RAGAS
pip install ragas

# Install with additional dependencies for advanced features
pip install "ragas[all]"

Development Installation

# Clone the repository
git clone https://github.com/explodinggradients/ragas.git
cd ragas

# Install in development mode
pip install -e .

Optional Dependencies

# For specific evaluation metrics
pip install ragas[faithfulness]
pip install ragas[relevance]
pip install ragas[context_recall]

# For all features
pip install "ragas[all]"

Quick Start

Basic Usage

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

# Prepare your data
data = {
    "question": ["What is the capital of France?"],
    "contexts": [["Paris is the capital of France.", "France is a country in Europe."]],
    "answer": ["Paris is the capital of France."]
}

dataset = Dataset.from_dict(data)

# Evaluate your RAG system
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy],
)

print(results)

Advanced Usage with Custom Data

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset
import pandas as pd

# Load your RAG evaluation data
df = pd.read_csv("rag_evaluation_data.csv")

# Convert to RAGAS format
dataset = Dataset.from_pandas(df)

# Define evaluation metrics
metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
]

# Run evaluation
results = evaluate(
    dataset,
    metrics=metrics,
    batch_size=8
)

# Print detailed results
for metric_name, score in results.items():
    print(f"{metric_name}: {score:.4f}")

Data Format

Required Schema

RAGAS expects data in the following format:

{
    "question": ["What is machine learning?"],
    "contexts": [["Machine learning is a subset of AI.", "It involves training models on data."]],
    "answer": ["Machine learning is a subset of artificial intelligence that involves training models on data."]
}

Extended Schema with Metadata

{
    "question": ["What is machine learning?"],
    "contexts": [["Machine learning is a subset of AI.", "It involves training models on data."]],
    "answer": ["Machine learning is a subset of artificial intelligence that involves training models on data."],
    "ground_truth": ["Machine learning is a subset of AI."],  # Optional
    "metadata": [{"source": "wikipedia", "confidence": 0.95}]  # Optional
}
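
When the optional ground_truth column is present, reference-based metrics can be evaluated alongside the reference-free ones. A minimal sketch (metric and column names follow the conventions used elsewhere in this guide; exact names can vary between RAGAS releases):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_recall, faithfulness

data = {
    "question": ["What is machine learning?"],
    "contexts": [["Machine learning is a subset of AI.", "It involves training models on data."]],
    "answer": ["Machine learning is a subset of artificial intelligence that involves training models on data."],
    "ground_truth": ["Machine learning is a subset of AI."],
}

dataset = Dataset.from_dict(data)

# faithfulness is reference-free; context_recall and answer_correctness compare against ground_truth
results = evaluate(dataset, metrics=[faithfulness, context_recall, answer_correctness])
print(results)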

Evaluation Metrics

1. Faithfulness

Measures if the generated answer is factually consistent with the retrieved context.

from ragas.metrics import faithfulness

# Evaluate faithfulness
results = evaluate(dataset, metrics=[faithfulness])
print(f"Faithfulness Score: {results['faithfulness']:.4f}")

Interpretation:

  • High Score (0.8-1.0): Generated answer is well-grounded in the context
  • Medium Score (0.5-0.8): Some inconsistencies or hallucinations
  • Low Score (0.0-0.5): Significant factual inconsistencies
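
If you want to apply these bands programmatically, a small helper such as the sketch below can label each score (boundary values fall into the higher band; interpret_score is just an illustrative name):

def interpret_score(score: float) -> str:
    """Map a metric score in [0, 1] to the qualitative bands used in this guide."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

print(interpret_score(results["faithfulness"]))  # e.g. "high"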

2. Answer Relevance

Evaluates how well the answer addresses the user's question.

from ragas.metrics import answer_relevancy

# Evaluate answer relevance
results = evaluate(dataset, metrics=[answer_relevancy])
print(f"Answer Relevance Score: {results['answer_relevancy']:.4f}")

Interpretation:

  • High Score (0.8-1.0): Answer directly addresses the question
  • Medium Score (0.5-0.8): Partially relevant answer
  • Low Score (0.0-0.5): Answer doesn't address the question

3. Context Relevance

Assesses the relevance of retrieved documents to the query.

from ragas.metrics import context_relevancy

# Evaluate context relevance
results = evaluate(dataset, metrics=[context_relevancy])
print(f"Context Relevance Score: {results['context_relevancy']:.4f}")

Interpretation:

  • High Score (0.8-1.0): Retrieved context is highly relevant
  • Medium Score (0.5-0.8): Some relevant information
  • Low Score (0.0-0.5): Retrieved context is not relevant

4. Context Recall

Measures how much of the information needed to produce the ground-truth answer is captured in the retrieved context; the dataset therefore needs a ground_truth column (see the Extended Schema above).

from ragas.metrics import context_recall

# Evaluate context recall
results = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall Score: {results['context_recall']:.4f}")

5. Answer Correctness

Evaluates the factual accuracy of the generated answer against a ground-truth reference, so the dataset must include a ground_truth column (see the Extended Schema above).

from ragas.metrics import answer_correctness

# Evaluate answer correctness
results = evaluate(dataset, metrics=[answer_correctness])
print(f"Answer Correctness Score: {results['answer_correctness']:.4f}")

Integration with RAG Frameworks

LangChain Integration

from langchain.retrievers import VectorStoreRetriever
from langchain.llms import OpenAI
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Your RAG pipeline
retriever = VectorStoreRetriever(...)
llm = OpenAI(...)

# Generate evaluation data
questions = ["What is machine learning?"]
evaluation_data = []

for question in questions:
    # Retrieve context
    contexts = retriever.get_relevant_documents(question)

    # Generate an answer grounded in the retrieved context
    prompt = "Answer the question using the context below.\n\nContext:\n"
    prompt += "\n".join(ctx.page_content for ctx in contexts)
    prompt += f"\n\nQuestion: {question}"
    answer = llm(prompt)  # use llm.invoke(prompt) on newer LangChain versions

    evaluation_data.append({
        "question": question,
        "contexts": [ctx.page_content for ctx in contexts],
        "answer": answer,
    })

# Evaluate
dataset = Dataset.from_list(evaluation_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

LlamaIndex Integration

from llama_index import VectorStoreIndex
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Your LlamaIndex setup
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Generate evaluation data
questions = ["What is machine learning?"]
evaluation_data = []

for question in questions:
    response = query_engine.query(question)

    evaluation_data.append({
        "question": question,
        "contexts": [node.text for node in response.source_nodes],
        "answer": str(response),
    })

# Evaluate
dataset = Dataset.from_list(evaluation_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

Custom Metrics

Creating Custom Evaluation Metrics

from ragas.metrics.base import Metric
from typing import Dict, Any
import numpy as np

class CustomMetric(Metric):
    name: str = "custom_metric"

    def score(self, data: Dict[str, Any]) -> float:
        # Your custom evaluation logic here
        questions = data["question"]
        answers = data["answer"]

        # Example: score by average answer length, capped at 1.0
        answer_lengths = [len(answer.split()) for answer in answers]
        return min(np.mean(answer_lengths) / 100, 1.0)  # Normalize to the 0-1 range

# Use the custom metric
custom_metric = CustomMetric()
results = evaluate(dataset, metrics=[custom_metric])

Custom Faithfulness Metric

from ragas.metrics.faithfulness import Faithfulness
from transformers import pipeline

class CustomFaithfulness(Faithfulness):
    def __init__(self):
        super().__init__()
        # Use an NLI model; "roberta-large-mnli" returns ENTAILMENT/NEUTRAL/CONTRADICTION labels
        self.nli_pipeline = pipeline("text-classification", model="roberta-large-mnli")

    def _compute_score(self, question: str, contexts: list, answer: str) -> float:
        # Custom faithfulness computation
        context_text = " ".join(contexts)

        # Use NLI to check whether the answer is entailed by the context
        outputs = self.nli_pipeline({"text": context_text, "text_pair": answer})
        result = outputs[0] if isinstance(outputs, list) else outputs

        return result["score"] if result["label"] == "ENTAILMENT" else 0.0

Best Practices

1. Data Preparation

# Clean and validate your data
def prepare_ragas_data(questions, contexts, answers):
    """Prepare data for RAGAS evaluation."""
    cleaned_data = []

    for q, ctx, ans in zip(questions, contexts, answers):
        # Validate data
        if not q or not ctx or not ans:
            continue

        # Clean text
        q = q.strip()
        ans = ans.strip()
        ctx = [c.strip() for c in ctx if c.strip()]

        if q and ans and ctx:
            cleaned_data.append({
                "question": q,
                "contexts": ctx,
                "answer": ans,
            })

    return cleaned_data
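
For example, feeding your own pipeline outputs through the helper and into a Dataset (questions, retrieved_contexts, and answers are placeholder names for your data):

from datasets import Dataset

# questions, retrieved_contexts, answers come from your own RAG pipeline
cleaned = prepare_ragas_data(questions, retrieved_contexts, answers)
dataset = Dataset.from_list(cleaned)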

2. Comprehensive Evaluation

from ragas.metrics import (
    faithfulness, answer_relevancy, context_relevancy,
    context_recall, answer_correctness
)

# Evaluate all metrics
all_metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    answer_correctness,
]

results = evaluate(dataset, metrics=all_metrics)

# Analyze results
for metric_name, score in results.items():
    print(f"{metric_name}: {score:.4f}")

    # Provide interpretation
    if score >= 0.8:
        print("  ✅ Excellent performance")
    elif score >= 0.6:
        print("  ⚠️ Good performance, room for improvement")
    else:
        print("  ❌ Needs significant improvement")

3. Batch Processing for Large Datasets

import numpy as np

# Process large datasets in batches
def evaluate_large_dataset(dataset, metrics, batch_size=32):
    """Evaluate large datasets efficiently."""
    results_list = []

    for i in range(0, len(dataset), batch_size):
        batch = dataset.select(range(i, min(i + batch_size, len(dataset))))
        batch_results = evaluate(batch, metrics=metrics)
        results_list.append(batch_results)

    # Aggregate results by averaging each metric across batches
    aggregated_results = {}
    for metric_name in results_list[0].keys():
        scores = [r[metric_name] for r in results_list]
        aggregated_results[metric_name] = np.mean(scores)

    return aggregated_results

4. Error Handling and Logging

import logging
from ragas import evaluate

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def safe_evaluate(dataset, metrics):
    """Safely evaluate with error handling."""
    try:
        results = evaluate(dataset, metrics=metrics)
        logger.info("Evaluation completed successfully")
        return results
    except Exception as e:
        logger.error(f"Evaluation failed: {str(e)}")
        return None
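
Usage is then a drop-in replacement for calling evaluate directly (assuming the metric imports from the earlier examples):

results = safe_evaluate(dataset, metrics=[faithfulness, answer_relevancy])
if results is not None:
    print(results)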

Performance Optimization

GPU Acceleration

# Use GPU for faster evaluation
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Point metrics at the GPU; whether this attribute is honored depends on the
# metric implementation and the ragas version in use.
faithfulness_metric = faithfulness
faithfulness_metric.device = device

Parallel Processing

from concurrent.futures import ThreadPoolExecutor
import multiprocessing

# Use multiple CPU cores
num_workers = multiprocessing.cpu_count()

def parallel_evaluate(datasets, metrics):
    """Evaluate multiple datasets in parallel."""
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(evaluate, dataset, metrics=metrics)
            for dataset in datasets
        ]
        results = [future.result() for future in futures]
    return results
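
One way to drive it is to shard a single Dataset (Dataset.shard is part of the datasets library) and evaluate the shards concurrently; whether this actually speeds things up depends on the metrics' own batching and any LLM API rate limits. This sketch reuses dataset, num_workers, and the metric names from the surrounding examples:

# Split the dataset into num_workers shards and evaluate them concurrently.
shards = [dataset.shard(num_shards=num_workers, index=i) for i in range(num_workers)]
shard_results = parallel_evaluate(shards, metrics=[faithfulness, answer_relevancy])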

Troubleshooting

Common Issues

  1. Memory Issues

     # Reduce batch size
     results = evaluate(dataset, metrics=metrics, batch_size=4)

     # Or use a smaller/cheaper judge model: recent ragas versions accept an llm
     # argument on evaluate() (smaller_llm is a placeholder for your wrapped model)
     results = evaluate(dataset, metrics=metrics, llm=smaller_llm)
  2. CUDA Out of Memory

     # Force CPU usage
     import os
     os.environ["CUDA_VISIBLE_DEVICES"] = ""

     # Or use smaller batches
     results = evaluate(dataset, metrics=metrics, batch_size=1)
  3. Data Format Issues

     # Validate data format
     def validate_ragas_data(data):
         required_fields = ["question", "contexts", "answer"]
         for field in required_fields:
             if field not in data:
                 raise ValueError(f"Missing required field: {field}")

Debug Mode

# Enable debug logging
import logging
logging.getLogger("ragas").setLevel(logging.DEBUG)

# Run evaluation with debug info
results = evaluate(dataset, metrics=metrics, verbose=True)

Integration Examples

With Streamlit Web App

import streamlit as st
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

def create_evaluation_app():
    st.title("RAG System Evaluation")

    # File upload
    uploaded_file = st.file_uploader("Upload evaluation data (CSV)")

    if uploaded_file:
        df = pd.read_csv(uploaded_file)
        dataset = Dataset.from_pandas(df)

        # Run evaluation
        if st.button("Evaluate"):
            with st.spinner("Evaluating..."):
                results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

            # Display results
            st.write("## Evaluation Results")
            for metric, score in results.items():
                st.metric(metric, f"{score:.4f}")

With MLflow Tracking

import mlflow
from ragas import evaluate

def track_rag_evaluation(dataset, metrics, experiment_name="rag_evaluation"):
    """Track RAG evaluation results with MLflow."""
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run():
        results = evaluate(dataset, metrics=metrics)

        # Log metrics
        for metric_name, score in results.items():
            mlflow.log_metric(metric_name, score)

        # Log parameters
        mlflow.log_param("dataset_size", len(dataset))
        mlflow.log_param("metrics_used", [m.name for m in metrics])

    return results

Advanced Use Cases

A/B Testing RAG Systems

def compare_rag_systems(system_a_data, system_b_data, metrics):
    """Compare two RAG systems using RAGAS metrics."""

    # Evaluate system A
    results_a = evaluate(system_a_data, metrics=metrics)

    # Evaluate system B
    results_b = evaluate(system_b_data, metrics=metrics)

    # Compare results
    comparison = {}
    for metric_name in results_a.keys():
        score_a = results_a[metric_name]
        score_b = results_b[metric_name]
        improvement = ((score_b - score_a) / score_a) * 100

        comparison[metric_name] = {
            "system_a": score_a,
            "system_b": score_b,
            "improvement_percent": improvement,
        }

    return comparison
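
Example call and report (system_a_dataset and system_b_dataset are placeholder names for the two evaluation datasets you built, one per system):

comparison = compare_rag_systems(system_a_dataset, system_b_dataset,
                                 metrics=[faithfulness, answer_relevancy])

for metric_name, row in comparison.items():
    print(f"{metric_name}: A={row['system_a']:.4f}  B={row['system_b']:.4f}  "
          f"change={row['improvement_percent']:+.1f}%")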

Continuous Evaluation Pipeline

from datetime import datetime
import json

def continuous_evaluation_pipeline(evaluation_data, metrics):
    """Run continuous evaluation and store results."""

    # Run evaluation
    results = evaluate(evaluation_data, metrics=metrics)

    # Store results with a timestamp (cast to a plain dict so it serializes cleanly)
    evaluation_record = {
        "timestamp": datetime.now().isoformat(),
        "results": dict(results),
        "dataset_size": len(evaluation_data),
    }

    # Append to a JSON Lines history file
    with open("evaluation_history.json", "a") as f:
        f.write(json.dumps(evaluation_record) + "\n")

    return results
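
Because the history file is JSON Lines (one record per line), it can be read back for simple trend tracking:

import json

# Load the evaluation history written by the pipeline above.
with open("evaluation_history.json") as f:
    history = [json.loads(line) for line in f if line.strip()]

for record in history:
    print(record["timestamp"], record["results"])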

Resources and References

Official Resources

  • GitHub Repository: https://github.com/explodinggradients/ragas
  • Documentation: https://docs.ragas.io

Community

  • Discord: Join the RAGAS community for discussions
  • Issues: Report bugs and request features on GitHub
  • Contributions: Welcome contributions and improvements

Related Tools

  • LangChain: Popular RAG framework
  • LlamaIndex: Data framework for LLM applications
  • Weights & Biases: Experiment tracking and evaluation
  • MLflow: Machine learning lifecycle management

License

RAGAS is licensed under the Apache 2.0 License, making it suitable for both commercial and academic use.