Skip to content
reaatechREAATECH
All services
$300 per agent · written report delivered

AI Agent Testing & Evaluation

Independent testing of an AI agent you own or operate. We run REAA's testing tools across adversarial, correctness, grounding, cost, and reliability scenarios, then deliver a written report with severity ratings and prioritized fixes.

Legal permission and ownership proof required

Before any testing begins, you must provide (a) explicit written legal permission from the agent's owner to perform adversarial testing, and (b) proof you are that owneror have been authorized by them (operator role, account ownership, written delegation). Adversarial testing of an agent you don't own or control is, depending on jurisdiction, anywhere from a contract violation to a felony. This requirement is non-negotiable; we protect both parties and the legitimate operator.

What we test

Five scenario categories, each covering multiple probes. Findings are categorized and severity-rated; you get the full transcript for anything that lands in the report.

Adversarial

  • Prompt-injection probes (direct + indirect)
  • Jailbreak attempts across known patterns
  • Role-confusion / system-prompt extraction probes
  • Tool-call hijacking via crafted inputs

Correctness

  • Tool selection on ambiguous requests
  • Tool-argument schema compliance
  • Multi-step task completion vs. early stopping
  • Refusal handling (does the agent decline what it should)

Grounding

  • Retrieval citation accuracy
  • Hallucination rate on out-of-corpus questions
  • Scope adherence — does the agent stay in role?
  • Drift across long multi-turn conversations

Cost & latency

  • Per-turn token usage distribution
  • P50 / P95 / P99 latency
  • Cost profile on typical + worst-case prompts
  • Cost-of-failure (retries, dead-end loops)

Reliability

  • Behavior under concurrent load
  • Recovery from transient upstream failures
  • Determinism across identical prompts
  • Graceful degradation when tools time out

Each engagement is scoped to your agent's actual surface — we don't run telephony probes against a chat-only agent or RAG drift tests against an agent without retrieval. The scenarios above are the menu; the report covers what's applicable.

How a testing engagement works

  1. 01

    First conversation (free)

    30–45 min. We confirm the agent is in scope, you confirm you can provide written permission + ownership proof.

  2. 02

    Permission verification

    You provide written permission from the agent owner + proof of your role/authorization. No testing starts until this is in place.

  3. 03

    Scope and quote

    Per-agent fee at the published rate. SOW lists which scenario categories apply to your agent's surface.

  4. 04

    Test execution

    Typical 1–2 weeks. We run the scenarios, capture transcripts, score findings against severity criteria.

  5. 05

    Report delivery

    Written report with findings, severity, repro transcripts, and prioritized recommendations. PDF + Markdown.

  6. 06

    Optional follow-up

    If you want help implementing the recommendations, that's a separate engagement under our Implementation or Engineering Advisory services.

Pricing

$300 per agent. Flat fee, no hourly uncertainty. Multiple agents are quoted as separate engagements with their own reports.

Price excludes LLM API spend (you pay your own provider directly, no markup from us). Heavy load-testing scenarios that would generate meaningful upstream cost are reviewed with you before the run so there's no surprise on the bill.

FAQ

What exactly do you test?

Adversarial prompts (jailbreaks, prompt injection, role-confusion probes), tool-call correctness (does the agent invoke the right tools with the right args), hallucination + retrieval grounding (does it cite or fabricate), conversational coherence over multi-turn flows, refusal handling (does it decline what it should), scope adherence (does it stay in role), cost and latency profile under typical and worst-case prompts, and reliability under concurrent load.

What do I get at the end?

A written report (PDF + Markdown). Each finding has a severity rating, a reproduction transcript where applicable, an explanation of why it matters, and a prioritized recommendation. The report is yours; you can share it with your team, your auditor, or your customers.

How long does it take?

Typical engagement: 1–2 weeks from kickoff to report delivery. We don't bill the calendar — we run the tests, write the report, deliver it.

Why $300 per agent? What if I have ten?

Each agent is its own engagement with its own report. Multiple agents are quoted as separate engagements — there's no bulk discount, and there's no shared report. The work is real for each one.

Why do you require written permission and proof of ownership?

Because adversarial testing of an agent you don't own or control is, depending on jurisdiction, anywhere from a contract violation to a felony. We protect ourselves and the legitimate operator by requiring (a) explicit written legal permission from the owner and (b) evidence the buyer is that owner or has been authorized by them. This is non-negotiable.

Can you test an agent built on a vendor platform (OpenAI Assistants, Anthropic Skills, etc.)?

Yes, as long as the platform's terms of service allow it and you have permission from the operator. We check that as part of permission verification.

Ready to evaluate an agent you operate?