Eval-Ception: Testing an Agent That Speaks for You

Using Promptfoo to evaluate whether an AI agent can represent your organization

This is a hands-on tutorial. You’ll learn about evals by running them yourself - against this very site. Impatient? Jump to the repo.

The Problem

You want an AI agent to answer questions about your product, your docs, your company. But how do you know it’s ready? That it won’t hallucinate features, miss key details, or mislead your users?

You give it an exam.

Think of it like hiring a spokesperson. Before they represent your organization publicly, you verify they understand the domain well enough not to embarrass you. Passing doesn’t make them an expert - it makes them qualified to start.

This applies to any agent that acts on your behalf - not just chatbots: an agent that responds to production incidents at 3am, sends emails to your clients, or submits regulatory filings.

If those examples sound scary, I agree. Evals are a first step toward understanding the risk and whether it's manageable - and building a qualification exam is exactly that step.

What You’ll Build

In this tutorial, you treat ai-evals.io as the organization and build an exam for agents that want to speak on its behalf. The agent crawls the site, reads the content, and answers questions. Promptfoo (one of several eval tools) checks whether the answers are correct.

Tip

This isn’t really about ai-evals.io. You could do exactly this for your own site, product, or docs. Swap the URL, write questions that matter to your domain, and you have a qualification gate for any agent that claims to represent you.

The agent is a black box - it could be a local model running on Ollama, Claude, GPT, or your own custom setup. The exam doesn’t care how the agent is built. It only cares whether the answers are right.

How It Works

Question  →  [Your Agent]  →  Answer  →  [Eval Tool (Promptfoo)]  →  Pass/Fail

Two pieces, cleanly separated:

  1. The agent receives a question and produces an answer. It could be anything - a CLI tool, a Python script, a hosted API. The eval doesn’t care.

  2. The eval receives the agent’s answer and checks it. Promptfoo supports two kinds of checks:

    • Deterministic - does the answer contain “Alex”? Does it say “yes”? Simple keyword matching, no LLM needed. When possible, prefer these - they don’t compound errors.
    • LLM-as-judge - is this answer a reasonable explanation of the site’s methodology? A second model judges the quality.

The agent and the eval never talk to each other. The agent doesn’t know it’s being tested. This is the point - you’re testing what it would actually say to a user.
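
To make the separation concrete, here is what a deterministic check boils down to, written as plain Python. This is illustrative only - Promptfoo runs these checks for you - but it shows there is nothing magical about the grading side:

def icontains(answer: str, expected: str) -> bool:
    """Case-insensitive keyword check - what a deterministic assertion does."""
    return expected.lower() in answer.lower()

# Promptfoo applies checks like this to whatever string your agent returns.
answer = "The site was created by Alex Guglielmone."   # example agent output
assert icontains(answer, "Alex") and icontains(answer, "Guglielmone")

An LLM-as-judge assertion swaps that string comparison for a second model grading the answer against a rubric - more flexible, but the judge can itself be wrong, which is why the deterministic checks are preferred where they fit.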

Choose Your Agent

The exam is the same regardless of which agent you use. Pick the one that matches how you work.

Note

I use pip in the examples below. I prefer hatch with uv, and the repo supports those too.

Option 1: A CLI agent you already use

You already use a CLI coding agent like Claude Code, Codex, or OpenCode. You'll point the eval at your CLI, and it will ask the questions directly.

Best for: people who already have a CLI agent installed and want to test it as-is.

Prerequisites:

git clone https://github.com/Alexhans/eval-ception
cd eval-ception
pip install -e .

Option 2: Fully local with Ollama

You want to run everything locally. The agent uses Playwright to crawl the site and a local model to reason about the content. No API keys, no cloud calls. A rough sketch of the idea follows the setup commands below.

Best for: people who care about local models, privacy, or want to experiment without cost.

Prerequisites:

  • Ollama or llama.cpp installed and running
  • Node.js (for Promptfoo)
  • Python 3.10+

ollama pull qwen3:8b         # the agent
ollama pull deepseek-r1:14b  # the judge (for LLM-as-judge tests)

git clone https://github.com/Alexhans/eval-ception
cd eval-ception

pip install -e .
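
If you're curious what the local path looks like conceptually, here is a rough sketch - not the repo's actual code - assuming the playwright and ollama Python packages: fetch the page text, then ask a local model about it.

# sketch of the local-agent idea (not the repo's implementation)
from playwright.sync_api import sync_playwright
import ollama

def ask_site(question: str, url: str = "https://ai-evals.io") -> str:
    """Crawl one page with Playwright, then ask a local Ollama model about it."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        content = page.inner_text("body")   # plain text of the page
        browser.close()
    reply = ollama.chat(
        model="qwen3:8b",   # the model pulled above
        messages=[
            {"role": "system", "content": f"Answer using only this page:\n{content}"},
            {"role": "user", "content": question},
        ],
    )
    return reply["message"]["content"]

print(ask_site("Who created this website?"))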

Option 3: Your own custom provider

You build your own agents or already have a service running. You'll write a small provider that Promptfoo calls with each question - it could call your local code, or hit an HTTP endpoint you already have deployed.

Best for: developers building agents, or teams who already have a chatbot/assistant endpoint they want to test.

Prerequisites:

  • Node.js (for Promptfoo)
  • Python 3.10+
  • Your agent code

git clone https://github.com/Alexhans/eval-ception
cd eval-ception

See promptfoo/ollama_provider.py for the interface your provider needs to implement - it’s a single call_api(prompt, options, context) function that returns {"output": "your answer"}.
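
To give a sense of the shape, here is a minimal sketch of such a provider. The HTTP endpoint is hypothetical - replace the body of call_api with however you invoke your own agent:

# my_provider.py - minimal sketch of a Promptfoo Python provider
import requests  # assumes your agent is reachable over HTTP; swap for a direct call

AGENT_URL = "http://localhost:8000/ask"  # hypothetical endpoint for your agent

def call_api(prompt: str, options: dict, context: dict) -> dict:
    """Promptfoo calls this once per test question."""
    response = requests.post(AGENT_URL, json={"question": prompt}, timeout=120)
    response.raise_for_status()
    # Whatever string you return as "output" is what the assertions check.
    return {"output": response.json()["answer"]}

Point the provider entry in promptfooconfig.yaml at your file (for example python:my_provider.py) and the rest of the exam stays unchanged.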

The Exam: promptfooconfig.yaml

This is the exam. Open promptfoo/promptfooconfig.yaml and you’ll see:

providers:
  - id: "python:provider.py"
    label: "ollama_agent"

prompts:
  - "{{question}}"

tests:
  # Deterministic: does it know who made the site?
  - vars:
      question: "Who created this website?"
    assert:
      - type: icontains
        value: "Alex"
      - type: icontains
        value: "Guglielmone"

  # Deterministic: does it know what's open source?
  - vars:
      question: "What frameworks are open source?"
    assert:
      - type: icontains
        value: "promptfoo"
      - type: not-icontains
        value: "Braintrust"

  # LLM-as-judge: can it explain the methodology?
  - vars:
      question: "What methodology does this website use to evaluate frameworks?"
    assert:
      - type: llm-rubric
        value: >
          The answer should explain that the site uses evidence-based
          evaluation with distinct tags: 'proven' means tested and
          validated, 'docs' means only mentioned in documentation
          but not verified.

Each test is a question with assertions. Deterministic assertions (icontains, not-icontains) are fast, free, and unambiguous. LLM-as-judge assertions (llm-rubric) handle questions where the “right answer” can be phrased many ways.

Run It

# sanity check: ask the agent a single question directly
baseline-agent --log-level DEBUG "Who created the website ai-evals.io?"

cd promptfoo
# single test first, then inspect it in the dashboard
npx promptfoo eval -c promptfooconfig.yaml --filter-providers "^ollama_agent$" -n 1 --verbose
npx promptfoo view

# full exam
npx promptfoo eval -c promptfooconfig.yaml --filter-providers "^ollama_agent$" --verbose

Promptfoo runs each question through the agent, collects the answers, and checks the assertions. promptfoo view opens a local dashboard where you can inspect every question, answer, and assertion result.

What You’ll See

In one baseline run, the agent scored 6/7 (85.7%). A failing test is still useful information - that's the point.

You’ll probably look at some of the tests and think “that assertion is wrong” or “I’d ask the question differently” or “this should check for structured output, not just keywords”. Good. That instinct is the whole point. You’re already thinking about what “correct” means for this domain - and that’s exactly what building evals teaches you.

How would you improve the exam? Change a test case, run it again, and see what happens.

Why This Gives You Control

Quality. You just scored 6/7. Now swap qwen3:8b for llama3:8b and run it again. Did the score go up? You just compared two models in 5 minutes, no manual review.

Regressions. Your LLM provider quietly updates their model next Tuesday. You wouldn’t know - but your eval would. Run it on a schedule and the regression shows up before your users notice.

Learning from mistakes. A customer reports your agent got something wrong? Add the question and the correct answer as a new test case. That mistake becomes a regression test - it can’t silently happen again.
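
You can add the test by editing promptfoo/promptfooconfig.yaml by hand, or script the step. Here is a small sketch, assuming PyYAML is installed - the question and expected keywords are placeholders:

import yaml  # assumes PyYAML (pip install pyyaml)

CONFIG = "promptfoo/promptfooconfig.yaml"

def add_regression_test(question: str, must_contain: list[str]) -> None:
    """Append a deterministic test case for an answer the agent got wrong."""
    with open(CONFIG) as f:
        config = yaml.safe_load(f)
    config.setdefault("tests", []).append({
        "vars": {"question": question},
        "assert": [{"type": "icontains", "value": word} for word in must_contain],
    })
    # note: round-tripping through yaml.safe_dump drops comments from the file
    with open(CONFIG, "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)

# Placeholders: the question the customer asked, and the detail the answer must contain.
add_regression_test("The question the customer asked", ["the missing detail"])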

Cost. Each run can tell you how many tokens it consumed. Multiply by your rate and you know what running at scale will cost before you commit.
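
As a back-of-the-envelope example - every number below is a placeholder, substitute your own run's token count and your provider's rate:

tokens_per_run = 40_000            # total tokens one eval run reports (placeholder)
dollars_per_million_tokens = 3.00  # your provider's rate (placeholder)
runs_per_day = 24                  # e.g. an hourly scheduled eval (placeholder)

daily_cost = tokens_per_run / 1_000_000 * dollars_per_million_tokens * runs_per_day
print(f"${daily_cost:.2f} per day")  # $2.88/day with these placeholder numbers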

Privacy. You ran the local model path? Everything stayed on your machine. The eval, the agent, the judge - no data left your network.

This is what control looks like. Not trusting that it works, but knowing when it doesn’t.