# LangChain Agent Eval Harness for Small Business Reliability
**Continuous evaluation of your AI agents using LangChain and REAA's eval harness suite to ensure reliable business outcomes.**
SMBs deploying AI agents have no systematic way to test whether updates or new prompts break business-critical tasks, which leads to customer-facing errors and eroded trust. This CLI integrates the REAA agent eval harness suite to orchestrate test runs, manage golden datasets, run LLM-as-judge scoring, benchmark latency, and send results to monitoring platforms.
## Installation
```bash
pnpm install
```
## Usage
### Initialize a new evaluation configuration
```bash
pnpm tsx src/cli/index.ts init
```
This scaffolds an `eval.config.json` file in the current directory with sensible defaults for the scenarios, judge, latency, and observability sections (see the schema documented below).
### Run an evaluation suite
```bash
pnpm tsx src/cli/index.ts run --config ./eval.config.json
```
Executes the evaluation suite defined in the configuration file. The config specifies which scenarios to run, which judge model to use, latency budgets, and observability settings.
### Generate a report
```bash
pnpm tsx src/cli/index.ts report --output ./report.md
```
Generates a summary report from the most recent evaluation run. Supports JSON and Markdown output formats.
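For example, to produce a JSON report, point `--output` at a `.json` path. This sketch assumes the output format is inferred from the file extension; check the CLI's help output in case an explicit format flag is required instead.

```bash
# Markdown report
pnpm tsx src/cli/index.ts report --output ./report.md

# JSON report (assumes format is inferred from the .json extension)
pnpm tsx src/cli/index.ts report --output ./report.json
```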
## `eval.config.json` Schema
The configuration file defines the evaluation run:
```json
{
  "name": "my-evaluation",
  "description": "Optional description",
  "scenarios": [
    {
      "name": "example-scenario",
      "description": "Description of the scenario"
    }
  ],
  "judge": {
    "model": "claude-opus",
    "provider": "claude",
    "temperature": 0.1
  },
  "latency": {
    "preset": "moderate"
  },
  "observability": {
    "logLevel": "info",
    "logFormat": "pretty",
    "metricsEnabled": true,
    "tracingEnabled": false,
    "dashboardEnabled": true
  }
}
```
### Top-level fields
| Field | Type | Description |
| --------------- | -------------- | ----------------------------- |
| `name` | string | Name of the evaluation run |
| `description` | string? | Optional description |
| `scenarios` | Scenario[] | Array of evaluation scenarios |
| `judge` | JudgeConfig? | LLM judge configuration |
| `latency` | LatencyConfig? | Latency SLA configuration |
| `observability` | ObsConfig? | Observability configuration |
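For reference, the schema corresponds roughly to the TypeScript types below. This is an illustrative sketch inferred from the example config and the table above, not the project's actual type definitions; the optionality of fields inside nested objects is an assumption.

```typescript
// Illustrative sketch of the eval.config.json shape, inferred from the
// example config above. Not the project's actual type definitions;
// optionality of nested fields is assumed.

interface Scenario {
  name: string;
  description?: string; // assumed optional, mirroring the top-level field
}

interface JudgeConfig {
  model: string;        // e.g. "claude-opus"
  provider: string;     // e.g. "claude"
  temperature?: number; // e.g. 0.1
}

interface LatencyConfig {
  preset: string; // e.g. "moderate"
}

interface ObsConfig {
  logLevel?: string;  // e.g. "info"
  logFormat?: string; // e.g. "pretty"
  metricsEnabled?: boolean;
  tracingEnabled?: boolean;
  dashboardEnabled?: boolean;
}

interface EvalConfig {
  name: string;
  description?: string;
  scenarios: Scenario[];
  judge?: JudgeConfig;
  latency?: LatencyConfig;
  observability?: ObsConfig;
}
```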
## Environment Variables
See `.env.example` for all required environment variables. The judge engine uses these to authenticate with LLM providers:
- `OPENAI_API_KEY` - OpenAI API key
- `ANTHROPIC_API_KEY` - Anthropic API key
- `GEMINI_API_KEY` - Google Gemini API key
- `OPENROUTER_API_KEY` - OpenRouter API key
- `EVAL_OBSERVABILITY_URL` - Observability endpoint URL
- `LOG_LEVEL` - Logging level (default: info)
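A minimal `.env` sketch with placeholder values (the authoritative list lives in `.env.example`); presumably only the key for the provider configured in `eval.config.json` is needed for judging:

```bash
# Placeholder values; copy from .env.example and fill in real keys
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
OPENROUTER_API_KEY=your-openrouter-key
EVAL_OBSERVABILITY_URL=https://observability.example.com
LOG_LEVEL=info
```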
## License
MIT