Evals & Quality

Agent and RAG eval harnesses, LLM-as-judge, prompt versioning.

8 repos

reaatech/agent-eval-harness

★ 0

These packages give you a full evaluation pipeline for AI agent trajectories: loading multi-turn conversations, scoring them on quality, tool correctness, cost, and latency, then running those scores through CI/CD regression gates. You'd adopt them to catch regressions in agent behavior before deploying, replacing ad-hoc manual review or single-metric checks with structured, repeatable evaluation. The monorepo is organized as independent packages (trajectory loading, tool-use validation, cost tracking, LLM-as-judge, golden comparisons, suite orchestration, CI gates, MCP server, observability) that each export plain TypeScript functions and Zod schemas, so you compose exactly the pieces you need without framework lock-in.
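To make the regression-gate idea concrete, here is a minimal sketch of a gate that compares candidate scores against a baseline. Every name in it (`GateRule`, `Scores`, `checkRegressionGate`, the metric keys, the thresholds) is a hypothetical chosen for illustration, not the actual agent-eval-harness API.

```typescript
// Hypothetical types and names for illustration; not the agent-eval-harness API.
type MetricName = "quality" | "toolCorrectness" | "costUsd" | "latencyMs";

interface GateRule {
  metric: MetricName;
  // "higher": larger values are better (quality). "lower": smaller is better (cost).
  direction: "higher" | "lower";
  // Maximum tolerated relative regression versus the baseline (0.05 = 5%).
  maxRegression: number;
}

type Scores = Record<MetricName, number>;

interface GateResult {
  passed: boolean;
  failures: string[];
}

function checkRegressionGate(
  baseline: Scores,
  candidate: Scores,
  rules: GateRule[]
): GateResult {
  const failures: string[] = [];
  for (const rule of rules) {
    const base = baseline[rule.metric];
    const cand = candidate[rule.metric];
    // A positive delta always means "got worse", whichever direction is better.
    const delta = rule.direction === "higher" ? base - cand : cand - base;
    const regression =
      base === 0 ? (delta > 0 ? Infinity : 0) : delta / Math.abs(base);
    if (regression > rule.maxRegression) {
      failures.push(
        `${rule.metric}: ${base} -> ${cand} (${(regression * 100).toFixed(1)}% worse)`
      );
    }
  }
  return { passed: failures.length === 0, failures };
}

// Made-up thresholds and scores, purely to exercise the function.
const rules: GateRule[] = [
  { metric: "quality", direction: "higher", maxRegression: 0.05 },
  { metric: "toolCorrectness", direction: "higher", maxRegression: 0.05 },
  { metric: "costUsd", direction: "lower", maxRegression: 0.1 },
  { metric: "latencyMs", direction: "lower", maxRegression: 0.1 },
];
const baseline: Scores = { quality: 0.9, toolCorrectness: 0.95, costUsd: 0.02, latencyMs: 1200 };
const candidate: Scores = { quality: 0.88, toolCorrectness: 0.8, costUsd: 0.021, latencyMs: 1100 };

// Only toolCorrectness regresses beyond its threshold, so the gate fails.
const result = checkRegressionGate(baseline, candidate, rules);
console.log(result.passed, result.failures);
```

In a CI job, a failing result would typically be turned into a non-zero exit code so the build stops before deployment.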

packages: 13
updated: 13 days ago

agent-replay

reaatech/agent-replay

★ 0

These packages give you a deterministic recording and replay system for AI agent interactions. You'd adopt them to debug agent behavior without burning LLM tokens on every iteration: record a trace once, then replay it in stubbed, partial, or diff modes for zero-cost debugging and regression testing. The system is built around a trace-based data model of hierarchical spans and events, with interceptors that transparently monkey-patch the OpenAI and Anthropic SDKs, so recording happens without modifying your agent code.

packages: 7
updated: 7 days ago

agents-md-kit

reaatech/agents-md-kit

★ 1

These packages give you a linter, validator, and scaffolder for AGENTS.md and SKILL.md files, the markdown documents that define how AI agents behave and what skills they have. You'd adopt them to enforce a consistent, machine-readable structure across agent definitions in a multi-agent system, catching formatting errors, missing sections, and broken skill references before they cause runtime issues. The toolkit is built as a pipeline of independent packages (parser → validator → linter → reporter) that share canonical Zod schemas, with an MCP server exposing the same tools directly to AI agents over stdio or HTTP.

packages: 9
updated: 14 days ago

classifier-evals

reaatech/classifier-evals

★ 0

These packages give you a complete offline evaluation harness for intent classification systems, covering dataset loading, metrics calculation, LLM-as-judge evaluation, regression quality gates, and result export. You'd adopt them to run rigorous, repeatable evaluations of classifier models in CI pipelines and production workflows, catching regressions before they ship. Its most distinctive feature is that every component, from Zod-validated schemas to OpenTelemetry spans to MCP server tools, shares a single set of canonical types, so you can compose dataset loaders, metric calculators, judge engines, and gate checkers into one pipeline without adapter code.

packages: 8
updated: 18 days ago

context-window-planner

reaatech/context-window-planner

★ 0

These packages give you a deterministic engine for deciding what content to include, summarize, or drop when packing prompts into an LLM's context window. You'd adopt them to avoid overflowing a model's token budget, replacing ad-hoc truncation with a configurable planner that enforces budgets, reserves space for generation, and emits structured warnings about every decision. Its most distinctive feature is the pluggable strategy system: you can swap between priority-greedy, sliding-window, summarize-and-replace, and RAG relevance selection strategies, or compose custom ones, all while using typed context item primitives and tokenizer adapters for different model families.
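As an illustration of what a priority-greedy strategy might look like, the sketch below packs items in descending priority until the budget (context window minus the space reserved for generation) is exhausted, and records a warning for each dropped item. The names `ContextItem`, `Plan`, `estimateTokens`, and `planPriorityGreedy` are hypothetical, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer adapter.

```typescript
// Hypothetical shapes for illustration; not the context-window-planner API.
interface ContextItem {
  id: string;
  text: string;
  priority: number; // higher = more important
}

interface Plan {
  included: string[];
  dropped: string[];
  warnings: string[];
  tokensUsed: number;
}

// Crude stand-in tokenizer: roughly 4 characters per token. A real planner
// would use a tokenizer matched to the target model family.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function planPriorityGreedy(
  items: ContextItem[],
  contextWindow: number,
  reservedForGeneration: number
): Plan {
  const budget = contextWindow - reservedForGeneration;
  const plan: Plan = { included: [], dropped: [], warnings: [], tokensUsed: 0 };
  // Sort a copy so the caller's array is left untouched.
  const ordered = items.slice().sort((a, b) => b.priority - a.priority);
  for (const item of ordered) {
    const cost = estimateTokens(item.text);
    if (plan.tokensUsed + cost <= budget) {
      plan.included.push(item.id);
      plan.tokensUsed += cost;
    } else {
      plan.dropped.push(item.id);
      plan.warnings.push(
        `dropped "${item.id}": needs ${cost} tokens, ${budget - plan.tokensUsed} left`
      );
    }
  }
  return plan;
}

// Made-up items: 40, 400, and 200 characters, i.e. 10, 100, and 50 estimated tokens.
const items: ContextItem[] = [
  { id: "system", text: "s".repeat(40), priority: 10 },
  { id: "history", text: "h".repeat(400), priority: 5 },
  { id: "docs", text: "d".repeat(200), priority: 7 },
];

// 150-token window with 50 reserved for generation leaves a budget of 100.
const plan = planPriorityGreedy(items, 150, 50);
console.log(plan.included, plan.dropped, plan.warnings);
```

Because the loop is a pure function of its inputs, the same items and budget always yield the same plan, which is the property that makes this kind of planner deterministic.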

packages: 2
updated: 18 days ago

llm-judge-toolkit

reaatech/llm-judge-toolkit

★ 0

These packages give you a complete system for using LLMs to evaluate generated text, with built-in prompt templates for five criteria (faithfulness, relevance, coherence, safety, tool-use), multi-provider support (OpenAI, Anthropic, local endpoints), and a judgment engine that handles retries, caching, and rate limiting. You'd adopt them to replace ad-hoc LLM evaluation scripts with a structured pipeline that includes statistical calibration against human labels, bias detection (position, length, style), and multi-judge consensus strategies. The packages are designed as independently installable modules that share a common type system and pluggable interfaces: the engine, providers, templates, cache backends, and bias detectors each implement a shared contract, so you can swap implementations or use only the pieces you need without pulling in the rest.
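One common way to probe for position bias in a pairwise judge is to ask it the same comparison twice with the answers swapped: a position-independent judge prefers the same answer both times, so its positional verdict flips. The sketch below assumes a hypothetical synchronous `PairwiseJudge` function and two stub judges; it is not the llm-judge-toolkit API, and a real LLM-backed judge would be asynchronous.

```typescript
// Hypothetical types for illustration; not the llm-judge-toolkit API.
type Verdict = "first" | "second";
type PairwiseJudge = (prompt: string, first: string, second: string) => Verdict;

interface PairCase {
  prompt: string;
  a: string;
  b: string;
}

interface BiasReport {
  total: number;
  inconsistent: number; // cases where the same position won in both orders
  firstPositionWins: number; // inconsistent cases that favored position one
  positionBiasRate: number; // inconsistent / total
}

function detectPositionBias(judge: PairwiseJudge, cases: PairCase[]): BiasReport {
  let inconsistent = 0;
  let firstPositionWins = 0;
  for (const c of cases) {
    const forward = judge(c.prompt, c.a, c.b);
    const swapped = judge(c.prompt, c.b, c.a);
    // Same positional verdict in both orders means the judge preferred a
    // position rather than an answer.
    if (forward === swapped) {
      inconsistent++;
      if (forward === "first") firstPositionWins++;
    }
  }
  return {
    total: cases.length,
    inconsistent,
    firstPositionWins,
    positionBiasRate: cases.length === 0 ? 0 : inconsistent / cases.length,
  };
}

// Stub judges standing in for real LLM calls.
const alwaysFirst: PairwiseJudge = () => "first";
const longerWins: PairwiseJudge = (_prompt, first, second) =>
  first.length >= second.length ? "first" : "second";

const cases: PairCase[] = [
  { prompt: "Explain TCP.", a: "A short answer.", b: "A considerably longer answer." },
  { prompt: "Explain UDP.", a: "Another fairly long answer here.", b: "Brief." },
];

const biased = detectPositionBias(alwaysFirst, cases);
const unbiased = detectPositionBias(longerWins, cases);
console.log(biased.positionBiasRate, unbiased.positionBiasRate);
```

Note that the `longerWins` stub is position-independent but would show up under a length-bias check instead, which is why position, length, and style are usually tested separately.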

packages: 10
updated: 14 days ago

prompt-version-control

reaatech/prompt-version-control

★ 0

These packages give you Git-like version control for AI prompts: an API server, TypeScript SDK, CLI, and MCP server that let you track prompt changes, gate promotions on eval results, and serve A/B deployments. You'd adopt them to manage prompt iterations across development, staging, and production without manual copy-pasting or ad-hoc versioning. Its most distinctive feature is that the entire lifecycle, from creating a draft to promoting it to production, is gated by evaluation harness results, with AI agents able to pull managed prompts at runtime via the MCP server.

packages: 5
updated: 11 days ago

rag-eval-pack

reaatech/rag-eval-pack

★ 0

These packages give you a full RAG evaluation pipeline: heuristic scorers for faithfulness, relevance, context precision, and context recall, plus an LLM-as-judge with multi-provider support, cost tracking with budget enforcement, and CI quality gates that can fail a build. You'd adopt them to catch regressions in a RAG system before deployment, whether as a pre-commit smoke check or a nightly regression suite. The distinctive design choice is that every metric can run at three fidelity levels (free lexical scoring, embedding-based semantic scoring, or LLM judging), so you can trade cost for accuracy per use case without changing the evaluation interface.
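To show what the cheapest fidelity level could look like, here is a minimal sketch of a lexical faithfulness scorer: the fraction of answer tokens that also appear in the retrieved contexts. The `FaithfulnessScorer` type, `tokenize`, and `lexicalFaithfulness` are hypothetical names for illustration, not the rag-eval-pack API, and real lexical scorers are usually more careful about stop words and stemming.

```typescript
// Hypothetical scorer signature for illustration; not the rag-eval-pack API.
// An embedding-based or LLM-judge scorer could expose the same shape
// (most likely as an async function) so callers can swap fidelity levels.
type FaithfulnessScorer = (answer: string, contexts: string[]) => number;

// Lowercase and split on anything that is not a letter or digit.
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) || [];

// Fraction of answer tokens that appear somewhere in the retrieved contexts.
const lexicalFaithfulness: FaithfulnessScorer = (answer, contexts) => {
  const contextTokens = new Set(tokenize(contexts.join(" ")));
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const supported = answerTokens.filter((t) => contextTokens.has(t)).length;
  return supported / answerTokens.length;
};

const contexts = ["Paris is the capital of France"];

// Every answer token occurs in the context.
const grounded = lexicalFaithfulness("The capital is Paris", contexts);
// "lyon" is unsupported, so 3 of 4 tokens match.
const ungrounded = lexicalFaithfulness("The capital is Lyon", contexts);
console.log(grounded, ungrounded);
```

A score like this costs nothing to compute, which suits a pre-commit smoke check. Its weakness is that common words still match: the unsupported answer above scores 0.75 despite naming the wrong city, which is the kind of gap the semantic and LLM-judge levels are meant to close.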

packages: 10
updated: 18 days ago
