Building AI agents without evaluations is like shipping software without tests. You don’t know whether they work correctly until users complain. An evaluation framework provides a structured way to test your agents against datasets and measure their performance using custom evaluators. The SDK includes a Langfuse-based evaluation framework that lets you:
  • Test agent behavior: Run agents against predefined test cases
  • Measure performance: Use custom evaluators to score responses
  • Track experiments: All results are logged to Langfuse for analysis
  • Automate CI/CD: Integrate evaluations into your deployment pipeline
The framework is built on Langfuse experiments, so your evaluation results are automatically visualized in your Langfuse dashboard for comparison and analysis.

Prerequisites

Before using the evaluation framework, ensure you have:
  1. Langfuse credentials configured:
    .env
    LANGFUSE_PUBLIC_KEY=pk-xxx
    LANGFUSE_SECRET_KEY=sk-xxx
    LANGFUSE_HOST=https://cloud.langfuse.com  # Optional, defaults to cloud
    
  2. BB AI SDK installed in your project

Quick start

Get evaluations running in 5 steps:
Step 1: Initialize evals folder

Run the CLI command to scaffold the evals structure:
bb-ai-sdk evals init
This creates an evals/ folder with:
File               Purpose
__init__.py        Initializes observability for your framework
agents.py          Registers your task functions
evaluators.py      Defines custom evaluators
evals_config.yaml  Experiment configuration
datasets/          CSV dataset files
Configure observability in evals/__init__.py based on your framework:
from bb_ai_sdk.observability import get_tracer_provider, init
from openinference.instrumentation.agno import AgnoInstrumentor

init(agent_name="my-agent")
AgnoInstrumentor().instrument(tracer_provider=get_tracer_provider())
For more information about configuring observability, see the Observability documentation.
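If your agent uses a different framework, swap in the matching OpenInference instrumentor. A minimal sketch for a LangChain-based agent, assuming the openinference-instrumentation-langchain package is installed:
from bb_ai_sdk.observability import get_tracer_provider, init
from openinference.instrumentation.langchain import LangChainInstrumentor

init(agent_name="my-agent")
LangChainInstrumentor().instrument(tracer_provider=get_tracer_provider())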
Step 2: Register your agent task

Edit evals/agents.py to register your agent as a task function:
agents.py
from evals import register_task
from src.agents.my_agent import create_agent

# Create your agent instance
agent = create_agent()

@register_task("my_agent")
def my_agent_task(*, item, **kwargs):
    """
    Task function for Langfuse experiments.
    
    Args:
        item: Langfuse dataset item with .input attribute
        **kwargs: Additional keyword arguments
        
    Returns:
        String result from agent execution
    """
    result = agent.run(item.input)
    return result.content
The task name in @register_task("my_agent") must match the name field in your config file.
Step 3: Create a custom evaluator

Edit evals/evaluators.py to define how responses are scored:
evaluators.py
from langfuse.experiment import Evaluation
from evals import register_evaluator

@register_evaluator("accuracy_evaluator")
def accuracy_evaluator(
    *,
    input: str,
    output: str | None,
    expected_output: str | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> Evaluation:
    """Check if output matches expected output."""
    if output is None or expected_output is None:
        return Evaluation(name="accuracy", value=0.0, comment="Missing output or expected")
    
    is_match = output.strip().lower() == expected_output.strip().lower()
    return Evaluation(
        name="accuracy",
        value=1.0 if is_match else 0.0,
        comment="Match" if is_match else "No match",
    )
The custom evaluator name in @register_evaluator("accuracy_evaluator") must match the name used in your config file.
Step 4: Configure the experiment

Edit evals/evals_config.yaml to define your evaluation:
evals_config.yaml
agents:
  - name: "my_agent"              # Must match @register_task name
    skipEval: false               # Set to true to skip this task
    dataset:
      name: "my_dataset"          # CSV file at evals/datasets/my_dataset.csv
    evaluators:
      - "accuracy_evaluator"      # Must match @register_evaluator name
Step 5: Create a dataset

Add a CSV file at evals/datasets/my_dataset.csv:
my_dataset.csv
input,expected_output
"What is 2+2?","4"
"Hello, how are you?","I'm doing well, thank you!"
"What is the capital of France?","Paris"
Run evaluations with bb-ai-sdk evals run and view results in your Langfuse dashboard!

Registering task functions

Task functions connect your agents to the evaluation framework. They define how to invoke your agent and return results.

Task function signature

All task functions must:
  • Accept keyword arguments including item (with .input attribute)
  • Return a string result
    @register_task("task_name")
    def task_function(*, item, **kwargs) -> str:
        # item.input contains the test case input
        result = your_agent.invoke(item.input)
        return str(result)
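The same pattern works for agents built on other frameworks. A hedged sketch wrapping a LangChain runnable, where build_chain and its module path are hypothetical placeholders for your own code:
from evals import register_task
from src.chains.my_chain import build_chain  # hypothetical helper in your project

chain = build_chain()

@register_task("my_chain")
def my_chain_task(*, item, **kwargs) -> str:
    # Invoke the runnable with the dataset item's input and coerce the
    # result to a string so evaluators always receive text.
    result = chain.invoke(item.input)
    return str(result)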
    

Creating custom evaluators

Evaluators score agent responses against expected outputs or custom criteria.

Evaluator function signature

Evaluators must:
  • Accept keyword arguments: input, output, expected_output, metadata
  • Return a Langfuse Evaluation object with name, value (score), and optional comment
    from langfuse.experiment import Evaluation
    from evals import register_evaluator
    
    @register_evaluator("evaluator_name")
    def my_evaluator(
        *,
        input: str,
        output: str | None,
        expected_output: str | None = None,
        metadata: dict | None = None,
        **kwargs,
    ) -> Evaluation:
        # Calculate score (0.0 to 1.0)
        score = calculate_score(output, expected_output)
        
        return Evaluation(
            name="evaluator_name",
            value=score,
            comment="Optional explanation"
        )
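Evaluators aren't limited to exact matching. A minimal sketch of a keyword-coverage evaluator that awards partial credit; the keyword list is illustrative:
evaluators.py
from langfuse.experiment import Evaluation
from evals import register_evaluator

@register_evaluator("keyword_coverage")
def keyword_coverage(
    *,
    input: str,
    output: str | None,
    expected_output: str | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> Evaluation:
    """Score the fraction of expected keywords present in the output."""
    keywords = ["refund", "14 days", "support"]  # illustrative keywords
    if not output:
        return Evaluation(name="keyword_coverage", value=0.0, comment="Empty output")
    hits = [kw for kw in keywords if kw.lower() in output.lower()]
    return Evaluation(
        name="keyword_coverage",
        value=len(hits) / len(keywords),
        comment=f"Matched {len(hits)}/{len(keywords)} keywords",
    )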
    

Configuration

The evals_config.yaml file defines which agents to evaluate, their datasets, and evaluators.

Configuration structure

evals_config.yaml
agents:
  - name: "agent_name"            # Required: matches @register_task name
    skipEval: false               # Optional: skip this agent (default: false)
    dataset:
      name: "dataset_name"        # Required: CSV filename (without .csv)
    evaluators:                   # Optional: list of evaluator names
      - "evaluator_1"
      - "evaluator_2"

Dataset format

Datasets are CSV files stored in evals/datasets/.

CSV structure

Column           Required  Description
input            Yes       The input prompt/question
expected_output  No        Expected response (for comparison evaluators)
metadata         No        Added to the metadata dictionary

Example datasets

qa_dataset.csv
input,expected_output
"What is 2+2?","4"
"What color is the sky?","blue"
"How many days in a week?","7"

Running evaluations

Using the CLI

# Run with default config (evals/evals_config.yaml)
bb-ai-sdk evals run

# Run with custom config path
bb-ai-sdk evals run --config path/to/config.yaml

Using Python

# Run with default config
python -m evals

# Run with custom config
python -m evals --config path/to/config.yaml
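To automate evaluations in CI/CD, run the same command in your pipeline and supply the Langfuse credentials as secrets. A hedged sketch for GitHub Actions; the workflow name, trigger, and install step are assumptions about your project setup:
.github/workflows/evals.yml
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt   # assumes the BB AI SDK is listed here
      - run: bb-ai-sdk evals run
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}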

How it works

An evaluation run moves through the following four stages:
1. Auto-discovery

The framework automatically imports evals.agents and evals.evaluators modules to discover registered functions.
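Conceptually, register_task and register_evaluator act as a decorator-based registry keyed by name. A simplified sketch of that pattern, not the SDK's actual implementation:
# Illustrative registry pattern; the SDK's real internals may differ.
_TASKS: dict = {}

def register_task(name: str):
    def decorator(fn):
        _TASKS[name] = fn   # store the function under its config name
        return fn           # return the function unchanged
    return decorator

def get_task(name: str):
    try:
        return _TASKS[name]
    except KeyError:
        raise ValueError(f"Task '{name}' is not registered")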
2. Dataset management

For each agent, the framework checks if the dataset exists in Langfuse. If not, it uploads the CSV file automatically.
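The upload is roughly equivalent to creating the dataset and its items yourself through the Langfuse Python SDK. A hedged sketch of that manual path, assuming the client's create_dataset and create_dataset_item methods:
import csv
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

langfuse.create_dataset(name="my_dataset")
with open("evals/datasets/my_dataset.csv", newline="") as f:
    for row in csv.DictReader(f):
        langfuse.create_dataset_item(
            dataset_name="my_dataset",
            input=row["input"],
            expected_output=row.get("expected_output"),
        )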
3. Experiment execution

The framework calls the task function for each dataset item and captures the results as Langfuse traces.
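Under the hood this maps onto the Langfuse experiments API. A rough sketch of a single run, assuming the v3 dataset.run_experiment method; the framework normally does this for you:
from langfuse import Langfuse
from evals.agents import my_agent_task
from evals.evaluators import accuracy_evaluator

langfuse = Langfuse()
dataset = langfuse.get_dataset("my_dataset")

# Each dataset item is passed to the task; each evaluator scores the output.
result = dataset.run_experiment(
    name="my_agent",
    task=my_agent_task,
    evaluators=[accuracy_evaluator],
)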
4. Evaluation

Each evaluator runs on the task output, and the framework logs scores to Langfuse.

Troubleshooting

Error: ValueError: Task 'my_agent' is not registered
Cause: Task name in config doesn't match the @register_task decorator.
Solution: Ensure names match exactly:
# evals_config.yaml
agents:
  - name: "my_agent"  # Must match decorator
# agents.py
@register_task("my_agent")  # Must match config
def my_task(*, item, **kwargs):
    ...
Error: FileNotFoundError: CSV file not found at default path
Cause: The CSV file doesn't exist at the expected location.
Solution: Ensure the CSV file exists at evals/datasets/{dataset_name}.csv:
evals/
└── datasets/
    └── my_dataset.csv  # Must match config dataset.name
Error: ValueError: Evaluator 'my_evaluator' is not registered
Cause: Evaluator name in config doesn't match the @register_evaluator decorator.
Solution: Ensure names match exactly in evaluators.py and the configuration file.
Error: ValueError: Langfuse credentials not configured
Solution: Set environment variables:
export LANGFUSE_PUBLIC_KEY=pk-xxx
export LANGFUSE_SECRET_KEY=sk-xxx
Or in .env file:
.env
LANGFUSE_PUBLIC_KEY=pk-xxx
LANGFUSE_SECRET_KEY=sk-xxx
Cause: Task function returns None or an empty string.
Solution: Ensure your task function returns a valid string:
@register_task("my_agent")
def my_task(*, item, **kwargs):
    result = agent.run(item.input)
    # Ensure we return a string
    return str(result.content) if result.content else "No response"

Next steps