These packages give you a complete offline evaluation harness for intent classification systems, covering dataset loading, metrics calculation, LLM-as-judge evaluation, regression quality gates, and result export. You would adopt them to run rigorous, repeatable evaluations of classifier models in CI pipelines and production workflows, catching regressions before they ship. Their most distinctive feature is that every component, from Zod-validated schemas to OpenTelemetry spans to MCP server tools, shares a single set of canonical types, so you can compose dataset loaders, metric calculators, judge engines, and gate checkers into a single pipeline without adapter code.
A CLI for running classifier evaluations, comparing models, checking regression gates, and exporting results, built on Commander.js and the `@reaatech/classifier-evals-*` ecosystem.
A dataset loading and validation utility for classifier evaluation, supporting CSV, JSON, and JSONL formats. Provides functions (`loadDataset`, `validateDataset`, `splitDataset`) for loading, schema validation, stratified train/test splitting, K-fold cross-validation, label normalization, alias resolution, and hierarchical label handling.
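To make the stratified splitting idea concrete, here is a minimal, dependency-free sketch. It is not the package's `splitDataset` implementation: the `Sample` shape, the `stratifiedSplit` name, and the `testRatio` and `seed` parameters are assumptions made for illustration. Each label group is shuffled and split separately, so both sides keep roughly the same label proportions.

```typescript
// Hypothetical sample shape for illustration; the package's schema may differ.
interface Sample {
  text: string;
  label: string;
}

// Small seeded linear congruential generator so splits are repeatable.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Split each label group separately so train and test keep similar label proportions.
function stratifiedSplit(
  samples: Sample[],
  testRatio: number,
  seed = 42
): { train: Sample[]; test: Sample[] } {
  const rand = makeRng(seed);
  const byLabel: Record<string, Sample[]> = Object.create(null);
  for (const sample of samples) {
    const group = byLabel[sample.label] ?? [];
    group.push(sample);
    byLabel[sample.label] = group;
  }

  const train: Sample[] = [];
  const test: Sample[] = [];
  // Visit labels in sorted order for a stable, repeatable result.
  for (const label of Object.keys(byLabel).sort()) {
    const shuffled = (byLabel[label] ?? []).slice();
    // Fisher-Yates shuffle within the label group.
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const a = shuffled[i] as Sample;
      const b = shuffled[j] as Sample;
      shuffled[i] = b;
      shuffled[j] = a;
    }
    const testCount = Math.round(shuffled.length * testRatio);
    test.push(...shuffled.slice(0, testCount));
    train.push(...shuffled.slice(testCount));
  }
  return { train, test };
}
```

Calling `stratifiedSplit(samples, 0.2)` on a dataset with a rare intent still places a proportional share of that intent in the test set, which a plain random split does not guarantee.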
Export classifier evaluation results as JSON, HTML, Arize Phoenix traces, or Langfuse traces. Provides four functions (`exportToJson`, `exportToHtml`, `exportToPhoenix`, `exportToLangfuse`) that accept an `EvalRun` object and format-specific options.
A gate evaluation engine that checks classifier metrics (accuracy, F1, precision, recall) against threshold, baseline-comparison, and distribution gates, returning pass/fail results that can be rendered in CI output formats (GitHub Actions annotations, JUnit XML, PR comment markdown). It provides a `createGateEngine()` function that returns an object with `evaluateGates()`, `formatForGitHubActions()`, and `formatAsJUnit()` methods, and it pairs with `@reaatech/classifier-evals-metrics` for metric calculation.
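The difference between a threshold gate and a baseline-comparison gate can be sketched as follows. This is an illustration of the concept only, not the engine returned by `createGateEngine()`: the `Gate` shapes, `checkGates`, and `toGitHubAnnotations` are hypothetical names. The only external convention relied on is the GitHub Actions `::error` workflow command.

```typescript
// Hypothetical gate shapes; the package's own gate schema may differ.
type Gate =
  | { kind: "threshold"; metric: string; min: number }
  | { kind: "baseline"; metric: string; maxDrop: number };

interface GateResult {
  metric: string;
  passed: boolean;
  message: string;
}

type Metrics = Record<string, number>;

function checkGates(current: Metrics, baseline: Metrics, gates: Gate[]): GateResult[] {
  return gates.map((gate): GateResult => {
    const metric = gate.metric;
    const value = current[metric];
    if (value === undefined) {
      return { metric, passed: false, message: `${metric} is missing from current metrics` };
    }
    if (gate.kind === "threshold") {
      // Threshold gate: the metric must reach an absolute minimum.
      const passed = value >= gate.min;
      const op = passed ? ">=" : "<";
      return { metric, passed, message: `${metric} ${value.toFixed(3)} ${op} minimum ${gate.min.toFixed(3)}` };
    }
    // Baseline gate: the metric may not fall more than maxDrop below the baseline run.
    const reference = baseline[metric];
    if (reference === undefined) {
      return { metric, passed: false, message: `${metric} has no baseline value` };
    }
    const drop = reference - value;
    const passed = drop <= gate.maxDrop;
    return {
      metric,
      passed,
      message: `${metric} drop vs baseline is ${drop.toFixed(3)} (allowed ${gate.maxDrop.toFixed(3)})`,
    };
  });
}

// Emit one GitHub Actions `::error` workflow command per failed gate.
function toGitHubAnnotations(results: GateResult[]): string[] {
  return results
    .filter((r) => !r.passed)
    .map((r) => `::error title=Gate failed::${r.message}`);
}
```

In CI, printing the annotation lines to stdout and exiting non-zero when any gate fails is enough to block a merge. A distribution gate would follow the same pattern with a comparison over label frequencies.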
A function that creates an LLM-as-judge engine for evaluating classifier outputs, supporting Anthropic and OpenAI models with configurable consensus voting, real-time cost tracking, and built-in prompt templates for classification evaluation, ambiguity detection, and error categorization.
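Consensus voting with cost tracking can be illustrated without any network calls. The sketch below assumes each judge model has already returned a verdict and a cost; the `JudgeVote` and `ConsensusResult` shapes and the `consensus` function are hypothetical and are not taken from the package. It uses a simple majority rule and reports a tie as `no_consensus`.

```typescript
// Hypothetical shapes; real votes would come from Anthropic or OpenAI API responses.
type Verdict = "correct" | "incorrect";

interface JudgeVote {
  model: string;
  verdict: Verdict;
  costUsd: number;
}

interface ConsensusResult {
  verdict: Verdict | "no_consensus";
  agreement: number; // share of votes behind the winning verdict (0 when there are no votes)
  totalCostUsd: number;
}

// Majority vote across judges, summing the per-call cost as votes are counted.
function consensus(votes: JudgeVote[]): ConsensusResult {
  let correct = 0;
  let incorrect = 0;
  let totalCostUsd = 0;
  for (const vote of votes) {
    if (vote.verdict === "correct") correct++;
    else incorrect++;
    totalCostUsd += vote.costUsd;
  }
  if (correct === incorrect) {
    // A tie (including the empty case) yields no consensus.
    return { verdict: "no_consensus", agreement: votes.length === 0 ? 0 : 0.5, totalCostUsd };
  }
  const winner: Verdict = correct > incorrect ? "correct" : "incorrect";
  return {
    verdict: winner,
    agreement: Math.max(correct, incorrect) / votes.length,
    totalCostUsd,
  };
}
```

Using an odd number of judges avoids ties, and the `agreement` value can serve as a rough confidence signal for flagging samples that deserve human review.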
An MCP server that exposes five tools (`run_eval`, `check_gates`, `compare_models`, `llm_judge`, `generate_report`) for running classifier evaluation pipelines, checking regression gates, comparing models, and generating reports, communicating over stdio transport with any MCP-compatible client.
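For readers unfamiliar with MCP, a tool invocation over stdio is a JSON-RPC 2.0 `tools/call` request written as a single line. The sketch below builds such a request for the `check_gates` tool. The argument names (`metrics`, `gates`) are hypothetical placeholders, since the tool's real input schema is not described here, and the helper names are invented for illustration.

```typescript
// JSON-RPC 2.0 request shape that MCP uses to invoke a tool.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  tool: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: tool, arguments: args } };
}

// The stdio transport carries one JSON message per line, so append a newline.
function frameMessage(message: ToolCallRequest): string {
  return JSON.stringify(message) + "\n";
}

// Hypothetical arguments: the real tool schema may use different field names.
const checkGatesCall = buildToolCall(1, "check_gates", {
  metrics: { accuracy: 0.93 },
  gates: [{ metric: "accuracy", min: 0.9 }],
});
const wireLine = frameMessage(checkGatesCall);
```

A real session would first complete the MCP `initialize` handshake, and in practice an MCP client library handles this framing for you.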
Computes confusion matrices and 14 classification metrics (including accuracy, macro/micro/weighted precision/recall/F1, MCC, and Cohen's Kappa), compares models using McNemar's test and Cohen's d, and constructs evaluation runs from classification results.
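As a rough illustration of what these calculations involve, the sketch below builds a confusion matrix, derives accuracy and macro-averaged precision, recall, and F1 from it, and computes McNemar's chi-square statistic for two models scored on the same samples. All function names are hypothetical and this is not the package's API. The McNemar variant shown uses the continuity correction, clamped at zero, which is an assumption: the package may use a different variant.

```typescript
// Build a confusion matrix: rows are true labels, columns are predicted labels.
function confusionMatrix(labels: string[], truth: string[], predicted: string[]): number[][] {
  const index: Record<string, number> = Object.create(null);
  labels.forEach((label, i) => { index[label] = i; });
  const matrix: number[][] = labels.map(() => labels.map(() => 0));
  for (let n = 0; n < truth.length; n++) {
    const row = index[truth[n] as string];
    const col = index[predicted[n] as string];
    const cells = row === undefined ? undefined : matrix[row];
    if (cells === undefined || col === undefined) continue; // skip labels outside `labels`
    cells[col] = (cells[col] ?? 0) + 1;
  }
  return matrix;
}

interface Summary {
  accuracy: number;
  macroPrecision: number;
  macroRecall: number;
  macroF1: number;
}

// Derive accuracy and macro-averaged metrics; undefined ratios count as 0.
function summarize(matrix: number[][]): Summary {
  const k = matrix.length;
  const cell = (r: number, c: number): number => (matrix[r] ?? [])[c] ?? 0;
  let total = 0, correct = 0, sumP = 0, sumR = 0, sumF1 = 0;
  for (let i = 0; i < k; i++) {
    let rowSum = 0, colSum = 0;
    for (let j = 0; j < k; j++) {
      rowSum += cell(i, j); // samples whose true label is i
      colSum += cell(j, i); // samples predicted as i
    }
    const tp = cell(i, i);
    const precision = colSum === 0 ? 0 : tp / colSum;
    const recall = rowSum === 0 ? 0 : tp / rowSum;
    sumP += precision;
    sumR += recall;
    sumF1 += precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    total += rowSum;
    correct += tp;
  }
  const avg = (sum: number): number => (k === 0 ? 0 : sum / k);
  return {
    accuracy: total === 0 ? 0 : correct / total,
    macroPrecision: avg(sumP),
    macroRecall: avg(sumR),
    macroF1: avg(sumF1),
  };
}

// McNemar's chi-square statistic with continuity correction (clamped at zero).
function mcnemarStatistic(truth: string[], predictedA: string[], predictedB: string[]): number {
  let onlyACorrect = 0;
  let onlyBCorrect = 0;
  for (let n = 0; n < truth.length; n++) {
    const aCorrect = predictedA[n] === truth[n];
    const bCorrect = predictedB[n] === truth[n];
    if (aCorrect && !bCorrect) onlyACorrect++;
    if (!aCorrect && bCorrect) onlyBCorrect++;
  }
  const discordant = onlyACorrect + onlyBCorrect;
  if (discordant === 0) return 0;
  const diff = Math.max(0, Math.abs(onlyACorrect - onlyBCorrect) - 1);
  return (diff * diff) / discordant;
}
```

The statistic is compared against a chi-square distribution with one degree of freedom. Only the discordant pairs (samples where exactly one model is right) contribute, which is why McNemar's test suits paired model comparisons on a shared test set.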