
@reaatech/llm-judge-cli


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Command-line interface for LLM Judge Toolkit with evaluate and calibrate subcommands. Reads JSONL input, runs judgments through any provider+template combination, and outputs scored results.

Installation

terminal
npm install @reaatech/llm-judge-cli
# or
pnpm add @reaatech/llm-judge-cli

Feature Overview

  • evaluate command for batch judgment evaluation
  • calibrate command for calibration against human labels
  • Multi-provider support (openai, anthropic, local) with env-var API keys
  • JSONL input/output format compatible with jq and shell pipelines
  • Configurable concurrency for batch processing
  • Cache support for deduplication
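Because results are line-delimited JSON, they compose with standard shell tooling. A minimal post-processing sketch, assuming each result line carries `id` and `score` fields (the output schema is not documented above, so treat the field names as assumptions):

```shell
# Stand-in results file; in practice this comes from `llm-judge evaluate`.
# The "score" field name is an assumption, not documented output.
cat > results.jsonl <<'EOF'
{"id":"1","score":0.92}
{"id":"2","score":0.41}
EOF

# Print ids of low-scoring judgments (score < 0.5). python3 stands in
# for jq here so the sketch runs without extra dependencies.
python3 - <<'EOF'
import json

with open("results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        if row["score"] < 0.5:
            print(row["id"])
EOF
```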

Quick Start

terminal
# Evaluate responses
llm-judge evaluate \
  --input ./input.jsonl \
  --output ./results.jsonl \
  --criteria faithfulness \
  --provider openai \
  --model gpt-4o-mini \
  --concurrency 5
 
# Calibrate against human labels
llm-judge calibrate \
  --input ./labeled.jsonl \
  --output ./report.json \
  --criteria faithfulness \
  --provider openai \
  --model gpt-4o-mini

Input JSONL Format

evaluate:

jsonl
{"id":"1","query":"What is X?","response":"X is...","context":"Source material..."}
{"id":"2","query":"What is Y?","response":"Y is...","context":"Source material..."}

calibrate:

jsonl
{"id":"1","query":"What is X?","response":"X is...","context":"Source...","humanLabel":0.95}
{"id":"2","query":"What is Y?","response":"Y is...","context":"Source...","humanLabel":0.40}
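Hand-writing JSONL invites escaping bugs, so it is safer to generate it from structured data. A sketch that produces the evaluate-format input shown above (the `input.jsonl` file name is just an example):

```shell
# Build the evaluate-format input shown above. json.dumps handles the
# quoting and escaping that hand-written printf would get wrong.
python3 - <<'EOF'
import json

records = [
    {"id": "1", "query": "What is X?", "response": "X is...", "context": "Source material..."},
    {"id": "2", "query": "What is Y?", "response": "Y is...", "context": "Source material..."},
]
with open("input.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
EOF

# Sanity check: every line must parse as standalone JSON.
python3 -c 'import json; [json.loads(l) for l in open("input.jsonl")]' && echo "valid JSONL"
```

For the calibrate format, add a numeric `humanLabel` field to each record.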

API Reference

evaluate command

| Option | Description |
| --- | --- |
| `--input` (`-i`) | Input JSONL file path (required) |
| `--output` (`-o`) | Output JSONL file path (default: stdout) |
| `--criteria` (`-c`) | Evaluation criteria: `faithfulness`, `relevance`, `coherence`, `safety`, `tool-use` |
| `--provider` (`-p`) | Provider: `openai`, `anthropic`, `local` (default: `openai`) |
| `--model` (`-m`) | Model name (default: `gpt-4o-mini`) |
| `--base-url` (`-b`) | Custom base URL for the API endpoint |
| `--concurrency` (`-n`) | Concurrent evaluations (default: 3) |
| `--no-cache` | Disable caching |

| Environment Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | API key for the OpenAI provider |
| `ANTHROPIC_API_KEY` | API key for the Anthropic provider |
| `LLM_JUDGE_API_KEY` | Generic API key (fallback) |
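Per the table, the CLI reads API keys from the environment, with `LLM_JUDGE_API_KEY` as a generic fallback. A pre-flight sketch for batch scripts; the CLI's actual key-resolution order is an assumption here:

```shell
# Fail fast before a long batch run if no key is exported. Mirrors the
# documented lookup for the openai provider: provider-specific key
# first, then the generic fallback.
check_key() {
  if [ -n "${OPENAI_API_KEY:-}" ] || [ -n "${LLM_JUDGE_API_KEY:-}" ]; then
    echo "ok"
  else
    echo "missing"
  fi
}
check_key
```

With a key exported, the `evaluate` invocation from Quick Start can follow the check.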

calibrate command

| Option | Description |
| --- | --- |
| `--input` (`-i`) | Input JSONL file with a `humanLabel` field (required) |
| `--output` (`-o`) | Output JSON report path (default: stdout) |
| `--criteria` (`-c`) | Evaluation criteria |
| `--provider` (`-p`) | Provider name (default: `openai`) |
| `--model` (`-m`) | Model name (default: `gpt-4o-mini`) |
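Conceptually, calibration measures agreement between judge scores and `humanLabel`. An illustrative sketch of one such metric (mean absolute error) on toy data; the `score` field name and the report's actual contents are assumptions:

```shell
python3 - <<'EOF'
import json

# Toy judge-vs-human pairs; "score" is an assumed field name.
rows = [
    {"humanLabel": 0.95, "score": 0.90},
    {"humanLabel": 0.40, "score": 0.35},
    {"humanLabel": 0.70, "score": 0.75},
]
# Mean absolute error: lower means the judge tracks human labels better.
mae = sum(abs(r["score"] - r["humanLabel"]) for r in rows) / len(rows)
print(round(mae, 3))
EOF
```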

Programmatic API

| Export | Description |
| --- | --- |
| `parseArgs()` | Parse CLI arguments into a key-value record |
| `createProvider()` | Instantiate an `LLMProvider` from parsed args |
| `createTemplate()` | Instantiate a `JudgmentTemplate` by criteria name |
| `readJsonlFile()` | Read and parse a JSONL file into an array |
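For orientation, a sketch of the kind of record `parseArgs()` produces: `--flag value` pairs become keys, and bare flags like `--no-cache` become booleans. This reimplements the idea for illustration only; it is not the package's code, and the real parser's behavior may differ:

```shell
PARSED=$(node -e '
const args = ["--input", "in.jsonl", "--provider", "openai", "--no-cache"];
const parsed = {};
for (let i = 0; i < args.length; i++) {
  if (!args[i].startsWith("--")) continue;
  const key = args[i].slice(2);
  const next = args[i + 1];
  if (next !== undefined && !next.startsWith("--")) {
    parsed[key] = next; i++;      // flag with a value
  } else {
    parsed[key] = true;           // bare boolean flag
  }
}
console.log(JSON.stringify(parsed));
')
echo "$PARSED"
```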

License

MIT