Building AI agents without evaluations is like shipping software without tests. You don’t know whether they work correctly until users complain. An evaluation framework provides a structured way to test your agents against datasets and measure their performance using custom evaluators. The SDK includes a Langfuse-based evaluation framework that lets you:
  • Test agent behavior: Run agents against predefined test cases
  • Measure performance: Use custom evaluators to score responses
  • Track experiments: All results are logged to Langfuse for analysis
  • Automate CI/CD: Integrate evaluations into your deployment pipeline
The framework is built on Langfuse experiments, so your evaluation results are automatically visualized in your Langfuse dashboard for comparison and analysis.

Prerequisites

Before using the evaluation framework, ensure you have:
  1. Langfuse credentials configured:
    .env
    LANGFUSE_PUBLIC_KEY=pk-xxx
    LANGFUSE_SECRET_KEY=sk-xxx
    LANGFUSE_HOST=https://cloud.langfuse.com  # Optional, defaults to cloud
    
  2. BB AI SDK installed in your project

Quick start

Get evaluations running in 5 steps:
Step 1: Initialize evals folder

Run the CLI command to scaffold the evals structure:
bb-ai-sdk evals init
This creates an evals/ folder with:
File               Purpose
__init__.py        Initializes observability for your framework
agents.py          Registers your task functions
evaluators.py      Defines custom evaluators
evals_config.yaml  Experiment configuration
datasets/          CSV dataset files
Configure observability in evals/__init__.py based on your framework:
from bb_ai_sdk.observability import get_tracer_provider, init
from openinference.instrumentation.agno import AgnoInstrumentor

init(agent_name="my-agent")
AgnoInstrumentor().instrument(tracer_provider=get_tracer_provider())
For more information about configuring observability, see the Observability documentation.
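If your agent uses a different framework, swap in the matching OpenInference instrumentor. A minimal sketch for a LangChain-based agent, assuming the openinference-instrumentation-langchain package is installed:
from bb_ai_sdk.observability import get_tracer_provider, init
from openinference.instrumentation.langchain import LangChainInstrumentor

init(agent_name="my-agent")
LangChainInstrumentor().instrument(tracer_provider=get_tracer_provider())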
Step 2: Register your agent task

Edit evals/agents.py to register your agent as a task function:
agents.py
from evals import register_task
from src.agents.my_agent import create_agent

# Create your agent instance
agent = create_agent()

@register_task("my_agent")
def my_agent_task(*, item, **kwargs):
    """
    Task function for Langfuse experiments.
    
    Args:
        item: Langfuse dataset item with .input attribute
        **kwargs: Additional keyword arguments
        
    Returns:
        String result from agent execution
    """
    result = agent.run(item.input)
    return result.content
The task name in @register_task("my_agent") must match the name field in your config file.
Step 3: Create a custom evaluator

Edit evals/evaluators.py to define how responses are scored:
evaluators.py
from langfuse.experiment import Evaluation
from evals import register_evaluator

@register_evaluator("accuracy_evaluator")
def accuracy_evaluator(
    *,
    input: str,
    output: str | None,
    expected_output: str | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> Evaluation:
    """Check if output matches expected output."""
    if output is None or expected_output is None:
        return Evaluation(name="accuracy", value=0.0, comment="Missing output or expected")
    
    is_match = output.strip().lower() == expected_output.strip().lower()
    return Evaluation(
        name="accuracy",
        value=1.0 if is_match else 0.0,
        comment="Match" if is_match else "No match",
    )
The custom evaluator name in @register_evaluator("accuracy_evaluator") must match the name used in your config file.
Step 4: Configure the experiment

Edit evals/evals_config.yaml to define your evaluation:
evals_config.yaml
agents:
  - name: "my_agent"              # Must match @register_task name
    skipEval: false               # Set to true to skip this task
    dataset:
      name: "my_dataset"          # CSV file at evals/datasets/my_dataset.csv
    evaluators:
      - "accuracy_evaluator"      # Must match @register_evaluator name
Step 5: Create a dataset

Add a CSV file at evals/datasets/my_dataset.csv:
my_dataset.csv
input,expected_output
"What is 2+2?","4"
"Hello, how are you?","I'm doing well, thank you!"
"What is the capital of France?","Paris"
Run evaluations with bb-ai-sdk evals run and view results in your Langfuse dashboard!

Registering task functions

Task functions connect your agents to the evaluation framework. They define how to invoke your agent and return results.

Task function signature

All task functions must:
  • Accept keyword arguments including item (with .input attribute)
  • Return a string result
    @register_task("task_name")
    def task_function(*, item, **kwargs) -> str:
        # item.input contains the test case input
        result = your_agent.invoke(item.input)
        return str(result)
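The same pattern works for agents built on other frameworks. A hedged sketch wrapping a LangChain runnable, where build_chain and its module path are hypothetical placeholders for your own code:
from evals import register_task
from src.chains.my_chain import build_chain  # hypothetical helper in your project

chain = build_chain()

@register_task("my_chain")
def my_chain_task(*, item, **kwargs) -> str:
    # Invoke the runnable with the dataset item's input and coerce the
    # result to a string so evaluators always receive text.
    result = chain.invoke(item.input)
    return str(result)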
    

Creating custom evaluators

Evaluators score agent responses against expected outputs or custom criteria.

Evaluator function signature

Evaluators must:
  • Accept keyword arguments: input, output, expected_output, metadata
  • Return a Langfuse Evaluation object with name, value (score), and optional comment
    from langfuse.experiment import Evaluation
    from evals import register_evaluator
    
    @register_evaluator("evaluator_name")
    def my_evaluator(
        *,
        input: str,
        output: str | None,
        expected_output: str | None = None,
        metadata: dict | None = None,
        **kwargs,
    ) -> Evaluation:
        # Calculate score (0.0 to 1.0)
        score = calculate_score(output, expected_output)
        
        return Evaluation(
            name="evaluator_name",
            value=score,
            comment="Optional explanation"
        )
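Evaluators aren't limited to exact matching. A minimal sketch of a keyword-coverage evaluator that awards partial credit; the keyword list is illustrative:
evaluators.py
from langfuse.experiment import Evaluation
from evals import register_evaluator

@register_evaluator("keyword_coverage")
def keyword_coverage(
    *,
    input: str,
    output: str | None,
    expected_output: str | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> Evaluation:
    """Score the fraction of expected keywords present in the output."""
    keywords = ["refund", "14 days", "support"]  # illustrative keywords
    if not output:
        return Evaluation(name="keyword_coverage", value=0.0, comment="Empty output")
    hits = [kw for kw in keywords if kw.lower() in output.lower()]
    return Evaluation(
        name="keyword_coverage",
        value=len(hits) / len(keywords),
        comment=f"Matched {len(hits)}/{len(keywords)} keywords",
    )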
    

Configuration

The evals_config.yaml file defines which agents to evaluate, their datasets, and evaluators.

Configuration structure

evals_config.yaml
agents:
  - name: "agent_name"            # Required: matches @register_task name
    skipEval: false               # Optional: skip this agent (default: false)
    dataset:
      name: "dataset_name"        # Required: CSV filename (without .csv)
    evaluators:                   # Optional: list of evaluator names
      - "evaluator_1"
      - "evaluator_2"

Dataset format

Datasets are CSV files stored in evals/datasets/.

CSV structure

Column           Required  Description
input            Yes       The input prompt/question
expected_output  No        Expected response (for comparison evaluators)
metadata         No        Added to the metadata dictionary

Example datasets

qa_dataset.csv
input,expected_output
"What is 2+2?","4"
"What color is the sky?","blue"
"How many days in a week?","7"

Running evaluations

Using the CLI

# Run with default config (evals/evals_config.yaml)
bb-ai-sdk evals run

# Run with custom config path
bb-ai-sdk evals run --config path/to/config.yaml

Using Python

# Run with default config
python -m evals

# Run with custom config
python -m evals --config path/to/config.yaml
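To automate evaluations in CI/CD, run the same command in your pipeline and supply the Langfuse credentials as secrets. A hedged sketch for GitHub Actions; the workflow name, trigger, and install step are assumptions about your project setup:
.github/workflows/evals.yml
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt   # assumes the BB AI SDK is listed here
      - run: bb-ai-sdk evals run
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}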

How it works

An evaluation run moves through the following four stages:
1. Auto-discovery

The framework automatically imports evals.agents and evals.evaluators modules to discover registered functions.
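Conceptually, register_task and register_evaluator act as a decorator-based registry keyed by name. A simplified sketch of that pattern, not the SDK's actual implementation:
# Illustrative registry pattern; the SDK's real internals may differ.
_TASKS: dict = {}

def register_task(name: str):
    def decorator(fn):
        _TASKS[name] = fn   # store the function under its config name
        return fn           # return the function unchanged
    return decorator

def get_task(name: str):
    try:
        return _TASKS[name]
    except KeyError:
        raise ValueError(f"Task '{name}' is not registered")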
2. Dataset management

For each agent, the framework checks if the dataset exists in Langfuse. If not, it uploads the CSV file automatically.
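The upload is roughly equivalent to creating the dataset and its items yourself through the Langfuse Python SDK. A hedged sketch of that manual path, assuming the client's create_dataset and create_dataset_item methods:
import csv
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

langfuse.create_dataset(name="my_dataset")
with open("evals/datasets/my_dataset.csv", newline="") as f:
    for row in csv.DictReader(f):
        langfuse.create_dataset_item(
            dataset_name="my_dataset",
            input=row["input"],
            expected_output=row.get("expected_output"),
        )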
3. Experiment execution

The framework calls the task function for each dataset item and captures the results as Langfuse traces.
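Under the hood this maps onto the Langfuse experiments API. A rough sketch of a single run, assuming the v3 dataset.run_experiment method; the framework normally does this for you:
from langfuse import Langfuse
from evals.agents import my_agent_task
from evals.evaluators import accuracy_evaluator

langfuse = Langfuse()
dataset = langfuse.get_dataset("my_dataset")

# Each dataset item is passed to the task; each evaluator scores the output.
result = dataset.run_experiment(
    name="my_agent",
    task=my_agent_task,
    evaluators=[accuracy_evaluator],
)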
4. Evaluation

Each evaluator runs on the task output, and the framework logs scores to Langfuse.

Troubleshooting

Error: ValueError: Task 'my_agent' is not registered
Cause: Task name in config doesn't match the @register_task decorator.
Solution: Ensure names match exactly:
# evals_config.yaml
agents:
  - name: "my_agent"  # Must match decorator
# agents.py
@register_task("my_agent")  # Must match config
def my_task(*, item, **kwargs):
    ...
Error: FileNotFoundError: CSV file not found at default path
Cause: The CSV file doesn't exist at the expected location.
Solution: Ensure the CSV file exists at evals/datasets/{dataset_name}.csv:
evals/
└── datasets/
    └── my_dataset.csv  # Must match config dataset.name
Error: ValueError: Evaluator 'my_evaluator' is not registered
Cause: Evaluator name in config doesn't match the @register_evaluator decorator.
Solution: Ensure names match exactly in evaluators.py and the configuration file.
Error: ValueError: Langfuse credentials not configured
Solution: Set environment variables:
export LANGFUSE_PUBLIC_KEY=pk-xxx
export LANGFUSE_SECRET_KEY=sk-xxx
Or in .env file:
.env
LANGFUSE_PUBLIC_KEY=pk-xxx
LANGFUSE_SECRET_KEY=sk-xxx
Cause: Task function returns None or an empty string.
Solution: Ensure your task function returns a valid string:
@register_task("my_agent")
def my_task(*, item, **kwargs):
    result = agent.run(item.input)
    # Ensure we return a string
    return str(result.content) if result.content else "No response"

Next steps