reaatech/classifier-evals

Last commit: Jun 4, 2026

These packages provide a complete offline evaluation harness for intent classification systems, covering dataset loading, metrics calculation, LLM-as-judge evaluation, regression quality gates, and result export. Adopt them to run rigorous, repeatable evaluations of classifier models in CI pipelines and production workflows, catching regressions before they ship. Their most distinctive feature is that every component, from Zod-validated schemas to OpenTelemetry spans to MCP server tools, shares a single set of canonical types, so you can compose dataset loaders, metric calculators, judge engines, and gate checkers into one pipeline without adapter code.

Packages

8 packages

@reaatech/classifier-evals

v0.1.1
Canonical TypeScript types, Zod schemas, and shared utilities (structured logging, OpenTelemetry tracing/metrics, PII redaction, hashing) for the classifier-evals evaluation ecosystem. Exports 40+ Zod-validated types and schemas covering classification results, datasets, confusion matrices, metrics, and evaluation runs, plus a Pino-based logger and OpenTelemetry instrumentation.
status: published, 2 days ago

@reaatech/classifier-evals-cli

v0.1.1
A CLI for running classifier evaluations, comparing models, checking regression gates, and exporting results, built on Commander.js and the `@reaatech/classifier-evals-*` ecosystem.
status: published, 2 days ago

@reaatech/classifier-evals-dataset

v0.1.1
A dataset loading and validation utility for classifier evaluation, supporting CSV, JSON, and JSONL formats. Provides functions (`loadDataset`, `validateDataset`, `splitDataset`) for loading, schema validation, stratified train/test splitting, K-fold cross-validation, label normalization, alias resolution, and hierarchical label handling.
status: published, 2 days ago
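
To make the stratified train/test splitting mentioned above concrete, here is a minimal standalone sketch. It does not use or mirror the package's `splitDataset` API, whose signature is not shown here; the names `Example`, `seededRandom`, and `stratifiedSplit` are hypothetical and exist only for illustration. The idea is to split each label group separately so that the test set keeps roughly the same class proportions as the full dataset.

```typescript
// Hypothetical example shape; the real package's dataset types may differ.
interface Example {
  text: string;
  label: string;
}

// Small seeded PRNG (mulberry32-style) so the split is repeatable across runs.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function stratifiedSplit(
  examples: Example[],
  testFraction: number,
  seed = 42,
): { train: Example[]; test: Example[] } {
  // Group examples by label so each class is split independently.
  const byLabel = new Map<string, Example[]>();
  for (const ex of examples) {
    const group = byLabel.get(ex.label) ?? [];
    group.push(ex);
    byLabel.set(ex.label, group);
  }

  const rand = seededRandom(seed);
  const train: Example[] = [];
  const test: Example[] = [];
  byLabel.forEach((group) => {
    // Fisher-Yates shuffle on a copy, then cut at the requested fraction.
    const shuffled = group.slice();
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const tmp = shuffled[i];
      shuffled[i] = shuffled[j];
      shuffled[j] = tmp;
    }
    const nTest = Math.round(shuffled.length * testFraction);
    test.push(...shuffled.slice(0, nTest));
    train.push(...shuffled.slice(nTest));
  });
  return { train, test };
}
```

Because the per-class test count is rounded, very small classes can end up with zero test examples; a production splitter would likely need an explicit policy for that case.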

@reaatech/classifier-evals-exporters

v0.1.1
Export classifier evaluation results as JSON, HTML, Arize Phoenix traces, or Langfuse traces. Provides four functions (`exportToJson`, `exportToHtml`, `exportToPhoenix`, `exportToLangfuse`) that accept an `EvalRun` object and format-specific options.
status: published, 2 days ago

@reaatech/classifier-evals-gates

v0.1.1
A gate evaluation engine that checks classifier metrics (accuracy, F1, precision, recall) against threshold, baseline-comparison, and distribution gates, returning pass/fail results and CI output formats (GitHub Actions annotations, JUnit XML, PR comment markdown). It provides a `createGateEngine()` function that returns an object with `evaluateGates()`, `formatForGitHubActions()`, and `formatAsJUnit()` methods, and pairs with `@reaatech/classifier-evals-metrics` for metric calculation.
status: published, 2 days ago
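
The difference between a threshold gate and a baseline-comparison gate can be sketched in a few lines. This is an illustrative standalone example, not the `createGateEngine()` API: the types `Gate` and `GateResult`, the function `checkGates`, and the field names `min` and `maxDrop` are assumptions made up for this sketch. A threshold gate compares a metric against a fixed floor, while a baseline gate limits how far a metric may fall relative to a previous run.

```typescript
// Hypothetical metric bag: metric name -> value (undefined when not computed).
type Metrics = Record<string, number | undefined>;

interface ThresholdGate {
  kind: "threshold";
  metric: string;
  min: number; // metric must be at least this value
}

interface BaselineGate {
  kind: "baseline";
  metric: string;
  maxDrop: number; // largest tolerated decrease versus the baseline run
}

type Gate = ThresholdGate | BaselineGate;

interface GateResult {
  metric: string;
  passed: boolean;
  message: string;
}

function checkGates(gates: Gate[], current: Metrics, baseline: Metrics = {}): GateResult[] {
  return gates.map((gate): GateResult => {
    const value = current[gate.metric];
    if (value === undefined) {
      // A gate on a metric that was never computed fails rather than passing silently.
      return { metric: gate.metric, passed: false, message: `${gate.metric}: missing from current run` };
    }
    if (gate.kind === "threshold") {
      return {
        metric: gate.metric,
        passed: value >= gate.min,
        message: `${gate.metric}=${value} (min ${gate.min})`,
      };
    }
    const base = baseline[gate.metric];
    if (base === undefined) {
      return { metric: gate.metric, passed: false, message: `${gate.metric}: missing from baseline run` };
    }
    const drop = base - value;
    return {
      metric: gate.metric,
      passed: drop <= gate.maxDrop,
      message: `${gate.metric}=${value} vs baseline ${base} (max drop ${gate.maxDrop})`,
    };
  });
}
```

In CI, a caller would typically fail the job when any result has `passed === false`; treating a missing metric as a failure is a design choice in this sketch, not documented behavior of the package.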

@reaatech/classifier-evals-judge

v0.1.1
A function that creates an LLM-as-judge engine for evaluating classifier outputs, supporting Anthropic and OpenAI models with configurable consensus voting, real-time cost tracking, and built-in prompt templates for classification evaluation, ambiguity detection, and error categorization.
status: published, 2 days ago
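
As a rough illustration of consensus voting, the sketch below aggregates several judge verdicts into one result by majority. It makes no LLM calls and does not reflect the package's actual engine or configuration; `Verdict`, `Consensus`, `majorityVote`, and the `minAgreement` parameter are hypothetical names chosen for this example.

```typescript
// Hypothetical verdict labels a judge model might return for one classification.
type Verdict = "correct" | "incorrect" | "ambiguous";

interface Consensus {
  verdict: Verdict | null; // null when no verdict clears the agreement bar
  agreement: number; // share of votes held by the leading verdict
}

function majorityVote(votes: Verdict[], minAgreement = 0.5): Consensus {
  if (votes.length === 0) return { verdict: null, agreement: 0 };

  const counts: Record<Verdict, number> = { correct: 0, incorrect: 0, ambiguous: 0 };
  let leader: Verdict = votes[0];
  for (const v of votes) {
    counts[v] += 1;
    // Only a strictly larger count replaces the leader, so ties keep the earlier leader.
    if (counts[v] > counts[leader]) leader = v;
  }

  const agreement = counts[leader] / votes.length;
  // Require strictly more than minAgreement, so a 1-1 split yields no consensus by default.
  return { verdict: agreement > minAgreement ? leader : null, agreement };
}
```

With the default bar, three votes of `correct`, `correct`, `incorrect` resolve to `correct`, while an even split returns `null` so the caller can escalate or re-sample.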

@reaatech/classifier-evals-mcp-server

v0.1.1
An MCP server that exposes five tools (`run_eval`, `check_gates`, `compare_models`, `llm_judge`, `generate_report`) for running classifier evaluation pipelines, checking regression gates, comparing models, and generating reports, communicating over stdio transport with any MCP-compatible client.
status: published, 2 days ago

@reaatech/classifier-evals-metrics

v0.1.1
Computes confusion matrices and 14 classification metrics (accuracy, macro/micro/weighted precision/recall/F1, MCC, Cohen's Kappa), compares models using McNemar's test and Cohen's d, and constructs evaluation runs from classification results.
status: published, 2 days ago
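
For readers unfamiliar with how a confusion matrix feeds the listed metrics, here is a small standalone sketch covering two of them, accuracy and macro-averaged F1. It is not the package's implementation or API; `ConfusionMatrix`, `buildConfusionMatrix`, `accuracy`, and `macroF1` are hypothetical names, and returning 0 when a denominator is zero is an assumption of this sketch.

```typescript
interface ConfusionMatrix {
  labels: string[]; // sorted union of actual and predicted labels
  matrix: number[][]; // matrix[i][j] = count with actual labels[i] and predicted labels[j]
}

function buildConfusionMatrix(actual: string[], predicted: string[]): ConfusionMatrix {
  if (actual.length !== predicted.length) {
    throw new Error("actual and predicted must have the same length");
  }
  const labels = Array.from(new Set(actual.concat(predicted))).sort();
  const index = new Map<string, number>();
  labels.forEach((label, i) => index.set(label, i));

  const matrix = labels.map(() => labels.map(() => 0));
  actual.forEach((label, k) => {
    // Every label was added to the index above, so the lookups cannot miss.
    matrix[index.get(label)!][index.get(predicted[k])!] += 1;
  });
  return { labels, matrix };
}

function accuracy(cm: ConfusionMatrix): number {
  let correct = 0;
  let total = 0;
  cm.matrix.forEach((row, i) =>
    row.forEach((n, j) => {
      total += n;
      if (i === j) correct += n; // diagonal cells are correct predictions
    }),
  );
  return total === 0 ? 0 : correct / total;
}

function macroF1(cm: ConfusionMatrix): number {
  const n = cm.labels.length;
  if (n === 0) return 0;
  let sum = 0;
  for (let c = 0; c < n; c++) {
    const tp = cm.matrix[c][c];
    const fn = cm.matrix[c].reduce((a, b) => a + b, 0) - tp; // row total minus diagonal
    const fp = cm.matrix.reduce((a, row) => a + row[c], 0) - tp; // column total minus diagonal
    const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
    const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
    const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    sum += f1;
  }
  return sum / n; // unweighted mean over classes
}
```

Macro averaging weights every class equally regardless of support, which is why it is often reported alongside micro and weighted variants for imbalanced intent sets.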
