
@reaatech/llm-judge-templates

npm version License: MIT CI

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Evaluation prompt templates implementing the `JudgmentTemplate` interface. Each template (faithfulness, relevance, coherence, safety, tool-use) builds a structured prompt and parses the LLM's JSON response, with fallback parsing for malformed output.

Installation

terminal
npm install @reaatech/llm-judge-templates
# or
pnpm add @reaatech/llm-judge-templates

Feature Overview

  • Five evaluation templates — faithfulness, relevance, coherence, safety, and tool-use criteria
  • JudgmentTemplate interface — build custom templates with buildPrompt and parseResponse
  • Structured JSON output — every template returns score, reasoning, confidence, and metadata
  • Robust parsing — cleanAndParse strips markdown fences; parseFallback handles malformed output
  • Input validation — 500K character length limit and required field checks enforced per template
  • Score normalization — safeScore clamps and validates scores to 0–1, defaulting to 0.5

Quick Start

typescript
import { FaithfulnessTemplate } from "@reaatech/llm-judge-templates";
 
const template = new FaithfulnessTemplate();
 
const prompt = template.buildPrompt({
  query: "What is the capital of France?",
  response: "The capital of France is Paris.",
  context: "Paris is the capital and most populous city of France.",
});
// prompt = { system: "...", user: "..." }
 
const parsed = template.parseResponse(`{
  "score": 1.0,
  "reasoning": "All claims are supported by the source material.",
  "confidence": 0.95
}`);
 
console.log(parsed.score, parsed.confidence);
// 1.0, 0.95

API Reference

FaithfulnessTemplate

| Property | Value |
| --- | --- |
| Criteria | `faithfulness` |
| Required context | `context` (source material), `response` |
| Metadata output | `unsupportedClaims` — claims not found in source |

RelevanceTemplate

| Property | Value |
| --- | --- |
| Criteria | `relevance` |
| Required context | `query`, `response` |
| Metadata output | `missingAspects` — query aspects not addressed |

CoherenceTemplate

| Property | Value |
| --- | --- |
| Criteria | `coherence` |
| Required context | `response` |
| Metadata output | `contradictions` — logical inconsistencies detected |

SafetyTemplate

| Property | Value |
| --- | --- |
| Criteria | `safety` |
| Required context | `response` |
| Checks | Harmful content, bias/discrimination, PII leaks, misinformation |
| Metadata output | `violations` — safety violations found |

ToolUseTemplate

| Property | Value |
| --- | --- |
| Criteria | `tool-use` |
| Required context | `query`, `toolCalls` |
| Metadata output | `parameterErrors` — incorrect tool parameters |

JudgmentTemplate Interface

| Member | Type | Description |
| --- | --- | --- |
| `name` | `string` | Template identifier |
| `version` | `string` | Semantic version of the template |
| `criteria` | `EvaluationCriteria` | The evaluation criterion this template assesses |
| `buildPrompt(context)` | `(context: TemplateContext) => PromptRequest` | Build a `{ system, user }` prompt from template context |
| `parseResponse(response)` | `(response: string) => ParsedJudgment` | Parse raw LLM output into `{ score, reasoning, confidence, metadata }` |

TemplateContext

| Property | Type | Description |
| --- | --- | --- |
| `query` | `string` | The original user query |
| `response` | `string` | The generated response to evaluate |
| `context` | `string` | Source material for faithfulness checks |
| `candidates` | `Candidate[]` | Multiple response candidates for comparison |
| `toolCalls` | `ToolCall[]` | Tool calls made by the model |
| `toolOutputs` | `unknown[]` | Outputs from tool calls |
| `conversation` | `Array<{ role: 'user' \| 'assistant'; content: string }>` | Multi-turn conversation history |
| `custom` | `Record<string, unknown>` | Arbitrary custom data |
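Putting the two tables above together, a custom template can be sketched as follows. The type declarations below are approximations reconstructed from this reference (and trimmed to the fields the example uses), not the package's published declarations; a real custom template would import these types from `@reaatech/llm-judge-templates` instead. The `ConcisenessTemplate` criterion is purely illustrative.

```typescript
// Approximate shapes, reconstructed from the tables above (assumptions).
type EvaluationCriteria = string;

interface TemplateContext {
  query?: string;
  response?: string;
  context?: string;
  custom?: Record<string, unknown>;
}

interface PromptRequest {
  system: string;
  user: string;
}

interface ParsedJudgment {
  score: number;
  reasoning: string;
  confidence: number;
  metadata?: Record<string, unknown>;
}

interface JudgmentTemplate {
  name: string;
  version: string;
  criteria: EvaluationCriteria;
  buildPrompt(context: TemplateContext): PromptRequest;
  parseResponse(response: string): ParsedJudgment;
}

// A toy "conciseness" template implementing the interface.
class ConcisenessTemplate implements JudgmentTemplate {
  name = "conciseness";
  version = "0.1.0";
  criteria: EvaluationCriteria = "conciseness";

  buildPrompt(context: TemplateContext): PromptRequest {
    return {
      system:
        'You are a judge. Rate conciseness from 0 to 1 and reply with JSON: {"score": ..., "reasoning": ..., "confidence": ...}',
      user: `Response to evaluate:\n${context.response ?? ""}`,
    };
  }

  parseResponse(response: string): ParsedJudgment {
    // Minimal parse; the built-in templates add fence-stripping and fallbacks.
    const raw = JSON.parse(response) as Record<string, unknown>;
    return {
      score: typeof raw.score === "number" ? raw.score : 0.5,
      reasoning: String(raw.reasoning ?? ""),
      confidence: typeof raw.confidence === "number" ? raw.confidence : 0.5,
    };
  }
}
```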

Helpers

| Function | Signature | Description |
| --- | --- | --- |
| `safeScore` | `(value: unknown) => number` | Clamp and validate a score to 0–1; defaults to 0.5 |
| `cleanAndParse` | `(response: string) => Record<string, unknown>` | Strip markdown fences and parse JSON |
| `parseFallback` | `(response: string) => ParsedJudgment` | Regex-based fallback when JSON parsing fails (sets confidence to 0.3) |
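The documented behavior of these helpers can be pictured with a minimal sketch. This is an illustration of the semantics described in the table above, not the package's actual source; in particular, the fence-stripping and score-extraction regexes here are assumptions.

````typescript
// Sketch of the documented helper semantics (not the real implementation).

// safeScore: clamp to [0, 1]; non-numeric input falls back to 0.5.
function safeScore(value: unknown): number {
  if (typeof value !== "number" || Number.isNaN(value)) return 0.5;
  return Math.min(1, Math.max(0, value));
}

// cleanAndParse: strip leading/trailing ```json fences, then JSON.parse.
function cleanAndParse(response: string): Record<string, unknown> {
  const cleaned = response
    .replace(/^\s*```(?:json)?\s*/i, "")
    .replace(/\s*```\s*$/, "")
    .trim();
  return JSON.parse(cleaned) as Record<string, unknown>;
}

// parseFallback: regex out a score when JSON parsing fails;
// confidence is fixed at 0.3 to flag the degraded parse.
function parseFallback(response: string): {
  score: number;
  reasoning: string;
  confidence: number;
} {
  const match = response.match(/score[^0-9]*([01](?:\.\d+)?)/i);
  return {
    score: safeScore(match ? Number(match[1]) : undefined),
    reasoning: response.trim(),
    confidence: 0.3,
  };
}
````

A typical flow would try `cleanAndParse` first and fall back to `parseFallback` only when it throws.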

License

MIT