@reaatech/classifier-evals-cli

npm v0.1.0

Provides a CLI for executing classifier evaluations, comparing model performance, enforcing regression gates, and running LLM-as-judge workflows. It outputs results in JSON, HTML, or JUnit formats and is designed for integration into CI pipelines.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

CLI tool for classifier evaluation. Run evaluations, compare models, check regression gates, run LLM-as-judge, and export results — all from the command line. Built on Commander.js and the full @reaatech/classifier-evals-* ecosystem.

Installation

terminal
npm install -g @reaatech/classifier-evals-cli
# or, with pnpm (adds the package to the current project rather than installing it globally)
pnpm add @reaatech/classifier-evals-cli

Quick Start

terminal
# Run a full evaluation
classifier-evals eval --dataset test-set.csv --format json --output results.json
 
# Compare two models
classifier-evals compare --baseline results/v1.json --candidate results/v2.json
 
# Check regression gates
classifier-evals gates --results results/latest.json --gates gates.yaml
 
# Run LLM-as-judge
classifier-evals judge --dataset test-set.csv --model claude-opus --budget 10.00
 
# Export a report
classifier-evals export --results results/latest.json --format html --output report.html

Commands

eval

Run a full classifier evaluation against a dataset.

terminal
classifier-evals eval --dataset test-set.csv [options]

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --dataset | string | (required) | Path to dataset file (CSV, JSON, JSONL) |
| --format | "json" \| "html" | json | Output format |
| --output | string | | Output file path (writes to stdout if omitted) |
| --name | string | | Dataset display name (defaults to filename) |

Loads the dataset, computes the confusion matrix and all 14 classification metrics, builds an EvalRun, and exports the results.
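
To make the first two steps concrete, here is a minimal sketch of how a confusion matrix, accuracy, and per-class F1 can be computed from labeled predictions. It is an illustration only, not the package's implementation: the `Sample` shape and the function names are hypothetical, and the CLI's actual dataset schema and its full set of 14 metrics are not shown in this README.

```typescript
// Hypothetical sample shape, assumed for this sketch only.
interface Sample {
  label: string; // ground-truth class
  predicted: string; // classifier output
}

// Build a confusion matrix as nested maps: matrix.get(actual).get(predicted) = count.
function confusionMatrix(samples: Sample[]): Map<string, Map<string, number>> {
  const matrix = new Map<string, Map<string, number>>();
  for (const { label, predicted } of samples) {
    const row = matrix.get(label) ?? new Map<string, number>();
    row.set(predicted, (row.get(predicted) ?? 0) + 1);
    matrix.set(label, row);
  }
  return matrix;
}

// Accuracy = correct predictions / all predictions (0 for an empty dataset).
function accuracy(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  const correct = samples.filter((s) => s.label === s.predicted).length;
  return correct / samples.length;
}

// Per-class F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the class never appears.
function f1ForClass(samples: Sample[], cls: string): number {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const s of samples) {
    if (s.predicted === cls && s.label === cls) tp++;
    else if (s.predicted === cls) fp++;
    else if (s.label === cls) fn++;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

const demo: Sample[] = [
  { label: "spam", predicted: "spam" },
  { label: "spam", predicted: "ham" },
  { label: "ham", predicted: "ham" },
  { label: "ham", predicted: "ham" },
];

console.log(accuracy(demo)); // 0.75
console.log(f1ForClass(demo, "ham")); // 0.8
```

In the demo data, one "spam" sample is misclassified as "ham", so the matrix cell for actual "spam" and predicted "ham" holds a count of 1.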

compare

Compare two model evaluation results with statistical significance.

terminal
classifier-evals compare --baseline results/v1.json --candidate results/v2.json

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --baseline | string | (required) | Path to baseline evaluation results JSON |
| --candidate | string | (required) | Path to candidate evaluation results JSON |
| --output | string | | Output file path (writes to stdout if omitted) |

Computes accuracy difference, McNemar’s test p-value, Cohen’s d effect size, and per-class F1 comparison.
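
For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) form of it, which only looks at samples where the two models disagree. This README does not state which variant the CLI uses (exact or chi-squared, with or without continuity correction), so treat this as an illustration of the idea rather than the tool's method; both function names are hypothetical.

```typescript
// Count discordant pairs between two models evaluated on the same samples.
// b = baseline correct, candidate wrong; c = baseline wrong, candidate correct.
function discordantCounts(
  labels: string[],
  baseline: string[],
  candidate: string[],
): { b: number; c: number } {
  let b = 0;
  let c = 0;
  for (let i = 0; i < labels.length; i++) {
    const baseOk = baseline[i] === labels[i];
    const candOk = candidate[i] === labels[i];
    if (baseOk && !candOk) b++;
    if (!baseOk && candOk) c++;
  }
  return { b, c };
}

// Exact two-sided McNemar p-value: under the null hypothesis, each discordant
// pair is equally likely to favor either model, so min(b, c) ~ Binomial(b + c, 0.5).
// Intended for small counts; very large b + c would need a log-space version.
function mcnemarExactPValue(b: number, c: number): number {
  const n = b + c;
  if (n === 0) return 1; // no disagreements, so no evidence of a difference
  const k = Math.min(b, c);
  let coeff = 1; // binomial coefficient C(n, 0)
  let tail = 0;
  for (let i = 0; i <= k; i++) {
    tail += coeff * Math.pow(0.5, n);
    coeff = (coeff * (n - i)) / (i + 1); // advance to C(n, i + 1)
  }
  return Math.min(1, 2 * tail); // double the lower tail, capped at 1
}

console.log(mcnemarExactPValue(0, 5)); // 0.0625
```

A small p-value suggests the two models' error patterns differ by more than chance; it says nothing about which model is better, which is why the accuracy difference is reported alongside it.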

gates

Evaluate regression gates against evaluation results.

terminal
classifier-evals gates --results results/latest.json --gates gates.yaml [options]

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --results | string | (required) | Path to evaluation results JSON |
| --gates | string | (required) | Path to gate configuration YAML |
| --baseline | string | | Path to baseline results for comparison gates |
| --output | string | | Output file path |
| --format | "text" \| "junit" | text | Output format (text or JUnit XML) |

Exits with code 1 if any gate fails, 0 if all pass. Suitable for CI pipelines.

judge

Run LLM-as-judge on samples with cost tracking.

terminal
classifier-evals judge --dataset test-set.csv --model claude-opus [options]

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --dataset | string | (required) | Path to dataset file |
| --model | string | (required) | LLM model for judging |
| --consensus | number | 1 | Number of judges for consensus voting |
| --budget | number | 50.00 | Maximum budget in USD |
| --template | string | classification-eval | Prompt template name |
| --output | string | | Output file path |

Evaluates each sample, applies consensus voting if --consensus > 1, tracks costs in real-time, and exports the judged results.
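
The two mechanisms mentioned above, consensus voting and budget enforcement, can be sketched roughly as follows. This is a hypothetical illustration: `majorityVote` and `BudgetTracker` are names invented here, and the CLI's actual tie-breaking rule and its behavior when the budget runs out are not documented in this README.

```typescript
// Majority vote across judge verdicts. On a tie, the verdict that reached the
// top count first wins (an assumption made for this sketch).
function majorityVote(verdicts: string[]): string | undefined {
  const counts = new Map<string, number>();
  let winner: string | undefined;
  let best = 0;
  for (const v of verdicts) {
    const n = (counts.get(v) ?? 0) + 1;
    counts.set(v, n);
    if (n > best) {
      best = n;
      winner = v;
    }
  }
  return winner; // undefined when there are no verdicts
}

// Tracks cumulative spend against a hard USD limit.
class BudgetTracker {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Records the cost and returns true, or returns false (recording nothing)
  // if the charge would push spending past the limit.
  tryCharge(costUsd: number): boolean {
    if (this.spentUsd + costUsd > this.limitUsd) return false;
    this.spentUsd += costUsd;
    return true;
  }

  get remainingUsd(): number {
    return this.limitUsd - this.spentUsd;
  }
}

const budget = new BudgetTracker(10);
console.log(majorityVote(["correct", "incorrect", "correct"])); // "correct"
console.log(budget.tryCharge(4), budget.tryCharge(4), budget.tryCharge(4)); // true true false
```

Checking the budget before each judge call, rather than after, keeps total spend at or below the limit even when a single call is expensive.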

export

Generate a report from evaluation results.

terminal
classifier-evals export --results results/latest.json --format html --output report.html

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --results | string | (required) | Path to evaluation results JSON |
| --format | "json" \| "html" | json | Report format |
| --output | string | | Output file path |
| --phoenix | string | | Phoenix endpoint URL |
| --langfuse | | | Export to Langfuse (uses env vars for auth) |

Supports HTML reports with inline SVGs, JSON output, Phoenix traces, and Langfuse ingestion.

CI Integration

GitHub Actions

yaml
name: Classifier Evaluation
on:
  pull_request:
    branches: [main]
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile
      - run: pnpm build
 
      - name: Run evaluation
        run: |
          mkdir -p results
          classifier-evals eval \
            --dataset datasets/examples/sample.csv \
            --format json \
            --output results/latest.json
 
      - name: Check gates
        run: |
          classifier-evals gates \
            --results results/latest.json \
            --gates datasets/examples/gates.yaml
 
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
          retention-days: 30

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success: all gates passed, or comparison completed |
| 1 | Gate failure: one or more regression gates did not pass |
| 2 | Error: invalid arguments, missing files, or runtime error |
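
When the CLI is driven from a Node script instead of a shell step, the codes above can be mapped to outcomes with a small helper. The sketch below is an assumption-laden example, not part of the package: `Outcome` and `interpretExitCode` are hypothetical names, and treating unlisted codes as errors is a choice made here.

```typescript
type Outcome = "pass" | "gate-failure" | "error";

// Map a process exit status to an outcome, following the exit code table.
// A null status (process killed by a signal) and any unlisted code are
// treated as errors in this sketch.
function interpretExitCode(status: number | null): Outcome {
  if (status === 0) return "pass";
  if (status === 1) return "gate-failure";
  return "error";
}

console.log(interpretExitCode(1)); // "gate-failure"
```

The `status` field returned by Node's `child_process.spawnSync` can be passed straight to this function, which lets a wrapper script distinguish a failed gate (a quality signal) from a broken run (an infrastructure problem).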

Usage Patterns

Full Pipeline

terminal
# 1. Evaluate
classifier-evals eval --dataset production.csv --format json --output prod.json
 
# 2. Check gates
classifier-evals gates --results prod.json --gates production-gates.yaml
 
# 3. Generate HTML report
classifier-evals export --results prod.json --format html --output report.html
 
# 4. Compare against previous release
classifier-evals compare --baseline prod-v1.json --candidate prod.json

LLM-as-Judge Pipeline

terminal
# Judge misclassifications with three-judge consensus voting
classifier-evals judge \
  --dataset errors.csv \
  --model claude-opus \
  --consensus 3 \
  --budget 25.00 \
  --output judged-results.json

License

MIT