# LangChain Agent Eval Harness for Small Business Reliability
**Continuous evaluation of your AI agents using LangChain and REAA's eval harness suite to ensure reliable business outcomes.**
SMBs deploying AI agents have no systematic way to test whether updates or new prompts break business-critical tasks, which leads to customer-facing errors and eroded trust. This CLI integrates the REAA agent eval harness suite to orchestrate test runs, manage golden datasets, run LLM-as-judge scoring, benchmark latency, and send results to monitoring platforms.
## Installation
```bash
pnpm install
```
## Usage
### Initialize a new evaluation configuration
```bash
pnpm tsx src/cli/index.ts init
```
This scaffolds an `eval.config.json` file in the current directory with sensible defaults for the scenarios, judge, latency, and observability sections (see the schema documented below).
### Run an evaluation suite
```bash
pnpm tsx src/cli/index.ts run --config ./eval.config.json
```
Executes the evaluation suite defined in the configuration file. The config specifies which scenarios to run, which judge model to use, latency budgets, and observability settings.
### Generate a report
```bash
pnpm tsx src/cli/index.ts report --output ./report.md
```
Generates a summary report from the most recent evaluation run. Supports JSON and Markdown output formats.
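For example, to produce a JSON report, point `--output` at a `.json` path. This sketch assumes the output format is inferred from the file extension; check the CLI's help output in case an explicit format flag is required instead.

```bash
# Markdown report
pnpm tsx src/cli/index.ts report --output ./report.md

# JSON report (assumes format is inferred from the .json extension)
pnpm tsx src/cli/index.ts report --output ./report.json
```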
## `eval.config.json` Schema
The configuration file defines the evaluation run:
```json
{
  "name": "my-evaluation",
  "description": "Optional description",
  "scenarios": [
    {
      "name": "example-scenario",
      "description": "Description of the scenario"
    }
  ],
  "judge": {
    "model": "claude-opus",
    "provider": "claude",
    "temperature": 0.1
  },
  "latency": {
    "preset": "moderate"
  },
  "observability": {
    "logLevel": "info",
    "logFormat": "pretty",
    "metricsEnabled": true,
    "tracingEnabled": false,
    "dashboardEnabled": true
  }
}
```
### Top-level fields
| Field | Type | Description |
| --------------- | -------------- | ----------------------------- |
| `name` | string | Name of the evaluation run |
| `description` | string? | Optional description |
| `scenarios` | Scenario[] | Array of evaluation scenarios |
| `judge` | JudgeConfig? | LLM judge configuration |
| `latency` | LatencyConfig? | Latency SLA configuration |
| `observability` | ObsConfig? | Observability configuration |
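For reference, the schema corresponds roughly to the TypeScript types below. This is an illustrative sketch inferred from the example config and the table above, not the project's actual type definitions; the optionality of fields inside nested objects is an assumption.

```typescript
// Illustrative sketch of the eval.config.json shape, inferred from the
// example config above. Not the project's actual type definitions;
// optionality of nested fields is assumed.

interface Scenario {
  name: string;
  description?: string; // assumed optional, mirroring the top-level field
}

interface JudgeConfig {
  model: string;        // e.g. "claude-opus"
  provider: string;     // e.g. "claude"
  temperature?: number; // e.g. 0.1
}

interface LatencyConfig {
  preset: string; // e.g. "moderate"
}

interface ObsConfig {
  logLevel?: string;  // e.g. "info"
  logFormat?: string; // e.g. "pretty"
  metricsEnabled?: boolean;
  tracingEnabled?: boolean;
  dashboardEnabled?: boolean;
}

interface EvalConfig {
  name: string;
  description?: string;
  scenarios: Scenario[];
  judge?: JudgeConfig;
  latency?: LatencyConfig;
  observability?: ObsConfig;
}
```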
## Environment Variables
See `.env.example` for all required environment variables. The judge engine uses these to authenticate with LLM providers:
- `OPENAI_API_KEY` - OpenAI API key
- `ANTHROPIC_API_KEY` - Anthropic API key
- `GEMINI_API_KEY` - Google Gemini API key
- `OPENROUTER_API_KEY` - OpenRouter API key
- `EVAL_OBSERVABILITY_URL` - Observability endpoint URL
- `LOG_LEVEL` - Logging level (default: info)
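A minimal `.env` sketch with placeholder values (the authoritative list lives in `.env.example`); presumably only the key for the provider configured in `eval.config.json` is needed for judging:

```bash
# Placeholder values; copy from .env.example and fill in real keys
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
OPENROUTER_API_KEY=your-openrouter-key
EVAL_OBSERVABILITY_URL=https://observability.example.com
LOG_LEVEL=info
```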
## License
MIT