- Test agent behavior: Run agents against predefined test cases
- Measure performance: Use custom evaluators to score responses
- Track experiments: All results are logged to Langfuse for analysis
- Automate CI/CD: Integrate evaluations into your deployment pipeline
The framework is built on Langfuse experiments, so evaluation results are automatically visualized in your Langfuse dashboard for comparison and analysis.
## Prerequisites

Before using the evaluation framework, ensure you have:

- Langfuse credentials configured in your environment (or a `.env` file)
- BB AI SDK installed in your project
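A minimal `.env` sketch using the standard Langfuse environment variables (the host below assumes Langfuse Cloud; adjust it for a self-hosted instance):

```bash
# .env — standard Langfuse credentials
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted URL
```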
## Quick start

Get evaluations running in 5 steps:

### Initialize evals folder

Run the CLI command to scaffold the evals structure. This creates an `evals/` folder with:

| File | Purpose |
|---|---|
| `__init__.py` | Initializes observability for your framework |
| `agents.py` | Registers your task functions |
| `evaluators.py` | Defines custom evaluators |
| `evals_config.yaml` | Experiment configuration |
| `datasets/` | CSV dataset files |

### Configure observability

Configure observability in `evals/__init__.py` based on your framework:

- Agno
- OpenAI SDK
- LangChain/LangGraph
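As an illustration for the OpenAI SDK case, a minimal `evals/__init__.py` could lean on Langfuse's drop-in OpenAI integration; the file your scaffold generates may initialize observability differently, so treat this as a sketch:

```python
# evals/__init__.py — sketch for the OpenAI SDK option (assumes the Langfuse
# drop-in OpenAI integration; your scaffolded file may initialize
# observability through the BB AI SDK instead)
from langfuse.openai import OpenAI  # traced OpenAI client

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment
client = OpenAI()
```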
### Register your agent task

Edit `evals/agents.py` to register your agent as a task function:
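A minimal sketch of what that file could contain — the `register_task` import path and the agent call are assumptions, not the SDK's documented API:

```python
# evals/agents.py — sketch; import path and agent call are illustrative
from bb_ai_sdk.evals import register_task  # hypothetical import path

from my_project.agents import my_agent  # hypothetical: your existing agent


@register_task("my_agent")  # must match the `name` field in evals_config.yaml
def run_my_agent(*, item, **kwargs) -> str:
    # `item.input` carries the dataset row's input column
    response = my_agent.run(item.input)  # hypothetical agent invocation
    return str(response)
```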
The task name in `@register_task("my_agent")` must match the `name` field in your config file.

### Create a custom evaluator
Edit `evals/evaluators.py` to define how responses are scored:
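A minimal sketch, assuming `register_evaluator` lives next to `register_task` in the SDK and that `Evaluation` is the Langfuse experiments type (its import path may differ across Langfuse SDK versions):

```python
# evals/evaluators.py — sketch; import paths are assumptions
from bb_ai_sdk.evals import register_evaluator  # hypothetical import path
from langfuse import Evaluation  # Langfuse experiments type; path may vary by SDK version


@register_evaluator("accuracy_evaluator")  # must match the evaluator name in evals_config.yaml
def accuracy_evaluator(*, input, output, expected_output, metadata, **kwargs) -> Evaluation:
    # Exact-match scoring against the expected output
    is_match = bool(expected_output) and output.strip() == expected_output.strip()
    return Evaluation(
        name="accuracy",
        value=1.0 if is_match else 0.0,
        comment="exact match" if is_match else "output differs from expected",
    )
```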
The custom evaluator name in `@register_evaluator("accuracy_evaluator")` must match the name used in your config file.

## Registering task functions
Task functions connect your agents to the evaluation framework. They define how to invoke your agent and return results.

### Task function signature

All task functions must:

- Accept keyword arguments, including `item` (with an `.input` attribute)
- Return a string result
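A bare-bones example of that shape (names are illustrative, and the `register_task` import path is assumed as above):

```python
# Minimal task function satisfying the required signature
from bb_ai_sdk.evals import register_task  # hypothetical import path


@register_task("echo_task")
def echo_task(*, item, **kwargs) -> str:
    # Accepts keyword arguments, reads item.input, and returns a string
    return f"echo: {item.input}"
```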
## Creating custom evaluators

Evaluators score agent responses against expected outputs or custom criteria.

### Evaluator function signature

Evaluators must:

- Accept keyword arguments: `input`, `output`, `expected_output`, `metadata`
- Return a Langfuse `Evaluation` object with `name`, `value` (score), and an optional `comment`
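As an illustration, an evaluator that scores keyword coverage using the dataset's `metadata` column (the `keywords` key and the import paths are assumptions):

```python
# Keyword-coverage evaluator — illustrative only
from bb_ai_sdk.evals import register_evaluator  # hypothetical import path
from langfuse import Evaluation  # path may vary by Langfuse SDK version


@register_evaluator("keyword_coverage")
def keyword_coverage(*, input, output, expected_output, metadata, **kwargs) -> Evaluation:
    keywords = (metadata or {}).get("keywords", [])  # `keywords` is a hypothetical metadata key
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    value = hits / len(keywords) if keywords else 1.0
    return Evaluation(
        name="keyword_coverage",
        value=value,
        comment=f"{hits}/{len(keywords)} keywords found" if keywords else "no keywords provided",
    )
```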
## Configuration

The `evals_config.yaml` file defines which agents to evaluate, their datasets, and evaluators.

### Configuration structure
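A sketch of what that file might contain, tying together the names used in this guide; the exact key names come from your scaffolded file, so treat everything except the name-matching rules as illustrative:

```yaml
# evals_config.yaml — illustrative sketch; confirm key names against the scaffolded file
agents:
  - name: my_agent              # must match @register_task("my_agent")
    dataset: qa_dataset         # expects evals/datasets/qa_dataset.csv
    evaluators:
      - accuracy_evaluator      # must match @register_evaluator("accuracy_evaluator")
```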
## Dataset format

Datasets are CSV files stored in `evals/datasets/`.

### CSV structure

| Column | Required | Description |
|---|---|---|
| `input` | Yes | The input prompt/question |
| `expected_output` | No | Expected response (for comparison evaluators) |
| `metadata` | No | Added to the metadata dictionary |
### Example datasets

- Basic Q&A
- With Metadata
- Input Only
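For instance, a basic Q&A dataset (`qa_dataset.csv`) with the columns described above — the rows themselves are illustrative:

```csv
input,expected_output
"What is the capital of France?","Paris"
"How many days are in a leap year?","366"
```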
## Running evaluations

### Using the CLI

### Using Python
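The code below is only a guess at the shape of a programmatic runner; the module and function names are assumptions, so check the BB AI SDK reference for the actual Python API:

```python
# Hypothetical programmatic runner — names are assumptions, not the SDK's documented API
from bb_ai_sdk.evals import run_evals  # hypothetical

# Discovers evals/agents.py and evals/evaluators.py, uploads datasets if needed,
# and runs the experiments defined in evals_config.yaml
run_evals(config_path="evals/evals_config.yaml")
```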
## How it works

The evaluation framework flow has three stages: auto-discovery, dataset management, and experiment execution.

### Auto-discovery

The framework automatically imports the `evals.agents` and `evals.evaluators` modules to discover registered functions.

### Dataset management

For each agent, the framework checks whether the dataset exists in Langfuse. If not, it uploads the CSV file automatically.

### Experiment execution

The framework calls the task function for each dataset item and captures the results as Langfuse traces.
## Troubleshooting

### Task not found error

**Error:** `ValueError: Task 'my_agent' is not registered`

**Cause:** The task name in the config doesn't match the `@register_task` decorator.

**Solution:** Ensure the names match exactly:
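For example, the decorator string and the agent's `name` in `evals_config.yaml` must be identical:

```python
# evals/agents.py
@register_task("my_agent")  # this string must equal the agent's `name` in evals_config.yaml
def run_my_agent(*, item, **kwargs) -> str:
    ...
```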
### Dataset CSV not found

**Error:** `FileNotFoundError: CSV file not found at default path`

**Cause:** The CSV file doesn't exist at the expected location.

**Solution:** Ensure the CSV file exists at `evals/datasets/{dataset_name}.csv`:
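Using the `qa_dataset` example from above, the expected layout is:

```text
evals/
  datasets/
    qa_dataset.csv    # referenced by the agent's dataset entry in evals_config.yaml
```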
### Evaluator not found

**Error:** `ValueError: Evaluator 'my_evaluator' is not registered`

**Cause:** The evaluator name in the config doesn't match the `@register_evaluator` decorator.

**Solution:** Ensure the names match exactly in `evaluators.py` and the configuration file.
### Langfuse credentials error

**Error:** `ValueError: Langfuse credentials not configured`

**Solution:** Set the Langfuse environment variables, or add them to a `.env` file:
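For example, export the standard Langfuse variables in your shell (Cloud host shown; adjust for self-hosting), or place the same three variables in `.env` as shown in Prerequisites:

```bash
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com
```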
### Empty results in Langfuse

**Cause:** The task function returns `None` or an empty string.

**Solution:** Ensure your task function returns a valid string:
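For example, coerce the agent's response and fall back to a non-empty message (sketch; the agent call is hypothetical):

```python
@register_task("my_agent")
def run_my_agent(*, item, **kwargs) -> str:
    response = my_agent.run(item.input)  # hypothetical agent call
    return str(response) if response else "No response generated"
```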