# xAI Grok Agent Eval Harness for SMB Support QA
 
> Continuously evaluate your xAI Grok-powered customer support agents to catch regressions before they affect customers.
 
Small businesses that run customer support agents on xAI Grok often have no automated way to verify response quality across prompt changes, model updates, and conversation scenarios. Manual spot-checks miss regressions, which leads to incorrect answers, safety issues, and lost customer trust. This recipe builds a CLI-based evaluation harness that pairs the REAA agent-eval-harness packages (`@reaatech/agent-eval-harness-*`) with xAI Grok, which serves as both the system under test and the LLM judge.
 
## Features
 
- **YAML-driven golden test suites** — define metrics, judge model, budget limits, and parallel workers in a single YAML config
- **Grok-as-LLM-judge** — scores agent responses on faithfulness, relevance, tool correctness, and overall quality using xAI Grok via `@ai-sdk/xai`
- **CI regression gates** — quality, relevance, faithfulness, tool correctness, and pass-rate thresholds block bad deployments (supports GitHub Actions annotations + JUnit XML)
- **OTel observability with Langfuse** — OpenTelemetry tracing, 7 pre-configured metrics, structured logging, and an in-memory dashboard with trend analysis and alerting
- **Run comparison** — statistical comparison between baseline and candidate runs with regression detection
 
## Quick Start
 
```bash
pnpm install
# Requires XAI_API_KEY in the environment (see Configuration)
pnpm eval --golden ./golden-tests/ --format markdown
pnpm gate --results ./results.json --preset standard
pnpm test            # vitest run with coverage
```
 
## CLI Commands
 
| Command | Description |
|---------|-------------|
| `eval` | Run an evaluation suite against golden trajectories |
| `gate` | Evaluate results against quality thresholds |
| `results` | List stored evaluation results |
| `compare` | Compare two evaluation runs |
 
### eval
 
```bash
npx xai-grok-eval eval --golden <path> [--format json|junit|csv|markdown] [--output <dir>] [--concurrency <n>]
```
 
### gate
 
```bash
npx xai-grok-eval gate --results <path> [--preset standard|strict|lenient] [--baseline <path>] [--output <dir>]
```
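
To make the gating idea concrete, here is a minimal TypeScript sketch of a threshold check. The metric names follow the Features list above. The `Metrics` shape, the numeric thresholds, and the `checkGate` function are hypothetical illustrations and are not taken from `@reaatech/agent-eval-harness-gate`; the real presets may use different values and rules.

```typescript
// Hypothetical shapes for illustration; not the real gate package API.
type MetricName = "quality" | "relevance" | "faithfulness" | "toolCorrectness" | "passRate";
type Metrics = Record<MetricName, number>;

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Assumed example thresholds on a 0..1 scale; the real presets may differ.
const exampleThresholds: Metrics = {
  quality: 0.8,
  relevance: 0.8,
  faithfulness: 0.85,
  toolCorrectness: 0.9,
  passRate: 0.95,
};

// A metric fails the gate when its aggregate score is below its threshold.
function checkGate(actual: Metrics, thresholds: Metrics): GateResult {
  const failures: string[] = [];
  for (const name of Object.keys(thresholds) as MetricName[]) {
    if (actual[name] < thresholds[name]) {
      failures.push(`${name}: ${actual[name]} < ${thresholds[name]}`);
    }
  }
  return { passed: failures.length === 0, failures };
}

// Example run: every metric clears its threshold except faithfulness.
const gateResult = checkGate(
  { quality: 0.9, relevance: 0.82, faithfulness: 0.8, toolCorrectness: 0.95, passRate: 0.97 },
  exampleThresholds,
);
console.log(gateResult.passed ? "gate passed" : `gate failed: ${gateResult.failures.join("; ")}`);
```

A CI wrapper built on this idea would typically exit with a non-zero status when `passed` is false, which is what blocks the deployment.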
 
### results
 
```bash
npx xai-grok-eval results [--format json|markdown]
```
 
### compare
 
```bash
npx xai-grok-eval compare --baseline <path> --candidate <path>
```
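
This README does not say which statistical method `compare` uses. As a rough illustration of regression detection only, the TypeScript sketch below compares the mean score of two runs and flags a regression when the candidate drops by more than a tolerance. The `compareRuns` function, the `Comparison` shape, and the default tolerance of `0.02` are all assumptions.

```typescript
// Hypothetical comparison sketch; the real suite's statistics are not described in this README.
interface Comparison {
  baselineMean: number;
  candidateMean: number;
  delta: number; // candidate mean minus baseline mean
  regressed: boolean;
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("cannot average an empty run");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Flags a regression when the candidate mean drops by more than `tolerance`.
function compareRuns(baseline: number[], candidate: number[], tolerance = 0.02): Comparison {
  const baselineMean = mean(baseline);
  const candidateMean = mean(candidate);
  const delta = candidateMean - baselineMean;
  return { baselineMean, candidateMean, delta, regressed: delta < -tolerance };
}

// Example: per-test quality scores from a baseline run and a candidate run.
const comparison = compareRuns([0.9, 0.8, 1.0], [0.7, 0.8, 0.9]);
console.log(comparison.regressed ? "regression detected" : "no regression");
```

A plain mean difference ignores variance and sample size, so a production comparison would more likely rely on a significance test or confidence interval.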
 
## API
 
### GET /api/results
 
Returns stored evaluation runs as JSON. Supports `?runId=<id>` for single-run lookup.
 
### GET /api/results?format=markdown
 
Returns results formatted as a Markdown table.
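
The two query parameters can be exercised from a small TypeScript client, sketched below. The base URL is an assumption, since this README does not say where the API is served, and the helper names `buildResultsUrl` and `fetchResults` are hypothetical. Whether `runId` and `format` can be combined in one request is also not stated here.

```typescript
// Builds the results URL for the two documented query parameters.
// `baseUrl` is an assumption: the README does not say where the API is served.
function buildResultsUrl(
  baseUrl: string,
  opts: { runId?: string; format?: "markdown" } = {},
): string {
  const url = new URL("/api/results", baseUrl);
  if (opts.runId) url.searchParams.set("runId", opts.runId);
  if (opts.format) url.searchParams.set("format", opts.format);
  return url.toString();
}

// Fetches stored runs as parsed JSON (global fetch is available in Node 18+).
async function fetchResults(baseUrl: string, runId?: string): Promise<unknown> {
  const response = await fetch(buildResultsUrl(baseUrl, runId ? { runId } : {}));
  if (!response.ok) {
    throw new Error(`GET /api/results failed with status ${response.status}`);
  }
  return response.json();
}

// Hypothetical local address and run ID, for illustration only.
console.log(buildResultsUrl("http://localhost:3000", { runId: "run-42" }));
```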
 
## CI Integration
 
Example GitHub Actions workflow:
 
```yaml
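name: Agent Evaluation Gates
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      # Assumes the key is stored as a repository secret named XAI_API_KEY
      XAI_API_KEY: ${{ secrets.XAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      # Run the golden suite, then gate on the results it wrote
      - run: npx xai-grok-eval eval --golden ./golden-tests/ --output results/
      - run: npx xai-grok-eval gate --results results/results.json --preset standard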
```
 
## Configuration
 
| Env Var | Description |
|---------|-------------|
| `XAI_API_KEY` | xAI Grok API key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key |
| `LANGFUSE_HOST` | Langfuse host URL |
| `LANGFUSE_OTLP_ENDPOINT` | Langfuse OTLP collector endpoint for traces |
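
A missing variable is easier to diagnose at startup than in the middle of an evaluation run. The TypeScript sketch below checks an environment object against the variable names in the table above. Treating only `XAI_API_KEY` as mandatory and the Langfuse variables as optional is an assumption made for this sketch, and `checkEnv` is a hypothetical helper rather than part of the harness.

```typescript
// Variable names come from the Configuration table. Which ones are mandatory
// is an assumption made for this sketch.
const REQUIRED_VARS = ["XAI_API_KEY"] as const;
const LANGFUSE_VARS = [
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
  "LANGFUSE_HOST",
  "LANGFUSE_OTLP_ENDPOINT",
] as const;

type Env = Record<string, string | undefined>;

interface EnvReport {
  missingRequired: string[];
  missingOptional: string[];
}

// Reports variables that are unset or contain only whitespace.
function checkEnv(env: Env): EnvReport {
  const isUnset = (name: string): boolean => (env[name] ?? "").trim() === "";
  return {
    missingRequired: REQUIRED_VARS.filter(isUnset),
    missingOptional: LANGFUSE_VARS.filter(isUnset),
  };
}

// Example with placeholder values; a real CLI would pass `process.env` instead.
const envReport = checkEnv({
  XAI_API_KEY: "xai-example",
  LANGFUSE_HOST: "https://langfuse.example",
});
console.log("missing required:", envReport.missingRequired);
console.log("missing optional:", envReport.missingOptional);
```

A caller could abort when `missingRequired` is non-empty and only warn about `missingOptional`, on the assumption that tracing can be skipped when Langfuse is not configured.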
 
## Packages
 
| Package | Role |
|---------|------|
| `@reaatech/agent-eval-harness-suite` | Orchestrated batch evaluation runner with results aggregation and run comparison |
| `@reaatech/agent-eval-harness-judge` | Provider-agnostic LLM-as-judge engine with calibration and multi-model consensus |
| `@reaatech/agent-eval-harness-gate` | CI/CD regression gates with JUnit XML, GitHub annotations, and pass/fail summaries |
| `@reaatech/agent-eval-harness-observability` | OTel tracing, metrics, structured logging, and in-memory dashboards |
 
## License
 
MIT — see [LICENSE](./LICENSE).