reaatech/classifier-evals

★ 0 | Last commit: Jun 4, 2026

These packages provide a complete offline evaluation harness for intent classification systems, covering dataset loading, metrics calculation, LLM-as-judge evaluation, regression quality gates, and result export. Adopt them to run rigorous, repeatable evaluations of classifier models in CI pipelines and production workflows, catching regressions before they ship. Their most distinctive feature is that every component, from Zod-validated schemas to OpenTelemetry spans to MCP server tools, shares a single set of canonical types, so you can compose dataset loaders, metric calculators, judge engines, and gate checkers into a single pipeline without adapter code.

agentic-ai arize-phoenix ci-cd classifier confusion-matrix evaluation-harness intent-classification langfuse llm-as-judge llm-eval mlops observability regression-testing testing-tools typescript

Packages

8 packages

classifier-evals

@reaatech/classifier-evals

v0.1.1

Canonical TypeScript types, Zod schemas, and shared utilities (structured logging, OpenTelemetry tracing/metrics, PII redaction, hashing) for the classifier-evals evaluation ecosystem. Exports 40+ Zod-validated types and schemas covering classification results, datasets, confusion matrices, metrics, and evaluation runs, plus a Pino-based logger and OpenTelemetry instrumentation.

status: published
published: 1 month ago

classifier-evals-cli

@reaatech/classifier-evals-cli

v0.1.1

A CLI for running classifier evaluations, comparing models, checking regression gates, and exporting results, built on Commander.js and the `@reaatech/classifier-evals-*` ecosystem.

status: published
published: 1 month ago

classifier-evals-dataset

@reaatech/classifier-evals-dataset

v0.1.1

A dataset loading and validation utility for classifier evaluation, supporting CSV, JSON, and JSONL formats. Provides functions (`loadDataset`, `validateDataset`, `splitDataset`) for loading, schema validation, stratified train/test splitting, K-fold cross-validation, label normalization, alias resolution, and hierarchical label handling.
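To make the stratified splitting idea concrete, here is a minimal, self-contained sketch. It does not use the package's `splitDataset` function, whose signature is not shown here; the `Sample` shape, `seededRandom`, and `stratifiedSplit` are hypothetical names chosen for illustration. The sketch groups samples by label, shuffles each group with a seeded generator, and takes the same fraction of every label for the test set.

```typescript
// Hypothetical sample shape for illustration; not the package's actual schema.
interface Sample {
  text: string;
  label: string;
}

// Small linear congruential generator so the split is reproducible for a given seed.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Fisher-Yates shuffle on a copy of the input.
function shuffle<T>(items: readonly T[], rand: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Split each label group separately so the test set keeps the label proportions.
function stratifiedSplit(
  samples: readonly Sample[],
  testRatio: number,
  seed = 42,
): { train: Sample[]; test: Sample[] } {
  const rand = seededRandom(seed);
  const byLabel = new Map<string, Sample[]>();
  for (const sample of samples) {
    const group = byLabel.get(sample.label) ?? [];
    group.push(sample);
    byLabel.set(sample.label, group);
  }

  const train: Sample[] = [];
  const test: Sample[] = [];
  byLabel.forEach((group) => {
    const shuffled = shuffle(group, rand);
    const testCount = Math.round(shuffled.length * testRatio);
    test.push(...shuffled.slice(0, testCount));
    train.push(...shuffled.slice(testCount));
  });
  return { train, test };
}
```

Rounding per label means very small classes can end up with zero test samples; a real implementation would likely need a policy for that case.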

status: published
published: 1 month ago

classifier-evals-exporters

@reaatech/classifier-evals-exporters

v0.1.1

Export classifier evaluation results as JSON, HTML, Arize Phoenix traces, or Langfuse traces. Provides four functions (`exportToJson`, `exportToHtml`, `exportToPhoenix`, `exportToLangfuse`) that accept an `EvalRun` object and format-specific options.

status: published
published: 1 month ago

classifier-evals-gates

@reaatech/classifier-evals-gates

v0.1.1

A gate evaluation engine that checks classifier metrics (accuracy, F1, precision, recall) against threshold, baseline-comparison, and distribution gates, returning pass/fail results and CI output formats (GitHub Actions annotations, JUnit XML, PR comment markdown). It provides a `createGateEngine()` function that returns an object with `evaluateGates()`, `formatForGitHubActions()`, and `formatAsJUnit()` methods, and pairs with `@reaatech/classifier-evals-metrics` for metric calculation.
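The difference between a threshold gate and a baseline-comparison gate can be sketched in a few lines. This is an illustrative sketch only: the `Gate` shapes, `evaluateGate`, `checkGates`, and `toGitHubAnnotation` are hypothetical names, not the API returned by `createGateEngine()`. The one external convention it relies on is the GitHub Actions workflow-command syntax (`::error title=...::message`).

```typescript
// Hypothetical gate shapes for illustration; the package's real configuration may differ.
type Metrics = Record<string, number>;

type Gate =
  | { kind: "threshold"; name: string; metric: string; min: number }
  | { kind: "baseline"; name: string; metric: string; maxDrop: number };

interface GateResult {
  name: string;
  passed: boolean;
  message: string;
}

function evaluateGate(gate: Gate, current: Metrics, baseline: Metrics = {}): GateResult {
  const value = current[gate.metric];
  if (value === undefined) {
    return { name: gate.name, passed: false, message: `metric "${gate.metric}" is missing` };
  }

  if (gate.kind === "threshold") {
    // Threshold gate: the metric must reach an absolute minimum.
    return {
      name: gate.name,
      passed: value >= gate.min,
      message: `${gate.metric}=${value.toFixed(3)} (min ${gate.min})`,
    };
  }

  // Baseline gate: the metric may not drop more than maxDrop below the baseline run.
  const reference = baseline[gate.metric];
  if (reference === undefined) {
    return { name: gate.name, passed: false, message: `baseline for "${gate.metric}" is missing` };
  }
  const drop = reference - value;
  return {
    name: gate.name,
    passed: drop <= gate.maxDrop,
    message: `${gate.metric} changed by ${(-drop).toFixed(3)} vs baseline (max drop ${gate.maxDrop})`,
  };
}

// The overall check passes only if every gate passes.
function checkGates(gates: readonly Gate[], current: Metrics, baseline: Metrics = {}) {
  const results = gates.map((gate) => evaluateGate(gate, current, baseline));
  return { passed: results.every((r) => r.passed), results };
}

// Render one result as a GitHub Actions workflow command.
function toGitHubAnnotation(result: GateResult): string {
  const level = result.passed ? "notice" : "error";
  return `::${level} title=${result.name}::${result.message}`;
}
```

In CI, a caller would typically print each annotation and exit with a non-zero status when `passed` is false, so the job fails on a regression.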

status: published
published: 1 month ago

classifier-evals-judge

@reaatech/classifier-evals-judge

v0.1.1

Provides a function that creates an LLM-as-judge engine for evaluating classifier outputs. The engine supports Anthropic and OpenAI models, configurable consensus voting, real-time cost tracking, and built-in prompt templates for classification evaluation, ambiguity detection, and error categorization.
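As a rough illustration of consensus voting across several judge calls, the sketch below applies a strict-majority rule. It is one possible strategy under stated assumptions, not the package's implementation: the `Verdict` values, `JudgeVote` shape, and `majorityConsensus` function are hypothetical, and no LLM provider is called here.

```typescript
// Hypothetical verdict set and vote shape; the package's own types may differ.
type Verdict = "correct" | "incorrect" | "ambiguous";

interface JudgeVote {
  judge: string; // e.g. a model identifier
  verdict: Verdict;
}

interface Consensus {
  verdict: Verdict | null; // null when no verdict clears the agreement bar
  agreement: number; // share of votes held by the most common verdict
}

// Return the most common verdict if its vote share is strictly above minAgreement.
function majorityConsensus(votes: readonly JudgeVote[], minAgreement = 0.5): Consensus {
  if (votes.length === 0) {
    return { verdict: null, agreement: 0 };
  }

  const counts: Record<Verdict, number> = { correct: 0, incorrect: 0, ambiguous: 0 };
  for (const vote of votes) {
    counts[vote.verdict] += 1;
  }

  let top: Verdict = "correct";
  for (const verdict of Object.keys(counts) as Verdict[]) {
    if (counts[verdict] > counts[top]) {
      top = verdict;
    }
  }

  const agreement = counts[top] / votes.length;
  return { verdict: agreement > minAgreement ? top : null, agreement };
}
```

With the default bar of 0.5, a tie between two verdicts yields `null`, which a caller could treat as a sample needing human review.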

status: published
published: 1 month ago

classifier-evals-mcp-server

@reaatech/classifier-evals-mcp-server

v0.1.1

An MCP server that exposes five tools (`run_eval`, `check_gates`, `compare_models`, `llm_judge`, `generate_report`) for running classifier evaluation pipelines, checking regression gates, comparing models, and generating reports, communicating over stdio transport with any MCP-compatible client.

status: published
published: 1 month ago

classifier-evals-metrics

@reaatech/classifier-evals-metrics

v0.1.1

Computes confusion matrices and 14 classification metrics (including accuracy, macro/micro/weighted precision/recall/F1, MCC, and Cohen's Kappa), compares models using McNemar's test and Cohen's d, and constructs evaluation runs from classification results.
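For readers unfamiliar with how these metrics derive from a confusion matrix, here is a minimal sketch covering accuracy and macro F1. It is written from the standard definitions and does not call the package; `Prediction`, `confusionMatrix`, `accuracy`, and `macroF1` are hypothetical names used only for this example.

```typescript
// Hypothetical result shape for illustration; not the package's actual schema.
interface Prediction {
  expected: string;
  predicted: string;
}

// Rows are expected labels, columns are predicted labels.
function confusionMatrix(preds: readonly Prediction[], labels: readonly string[]): number[][] {
  const index = new Map<string, number>();
  labels.forEach((label, i) => index.set(label, i));
  const matrix = labels.map(() => labels.map(() => 0));

  for (const p of preds) {
    const row = index.get(p.expected);
    const col = index.get(p.predicted);
    if (row === undefined || col === undefined) {
      throw new Error(`unknown label in prediction: ${p.expected} -> ${p.predicted}`);
    }
    const cells = matrix[row] as number[];
    cells[col] = (cells[col] ?? 0) + 1;
  }
  return matrix;
}

const sum = (values: readonly number[]): number => values.reduce((a, b) => a + b, 0);

// Accuracy: diagonal total divided by the grand total.
function accuracy(matrix: readonly number[][]): number {
  const total = sum(matrix.map((row) => sum(row)));
  const correct = sum(matrix.map((row, i) => row[i] ?? 0));
  return total === 0 ? 0 : correct / total;
}

// Macro F1: unweighted mean of per-class F1 scores.
function macroF1(matrix: readonly number[][]): number {
  const perClass = matrix.map((row, k) => {
    const tp = row[k] ?? 0;
    const fn = sum(row) - tp; // expected k, predicted something else
    const fp = sum(matrix.map((r) => r[k] ?? 0)) - tp; // predicted k, expected something else
    const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
    const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
    return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  });
  return perClass.length === 0 ? 0 : sum(perClass) / perClass.length;
}
```

Classes with no true or predicted samples score 0 here; other conventions (such as excluding them from the mean) are also common, so compare numbers across tools with care.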