
# @reaatech/agent-eval-harness-observability


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

OpenTelemetry tracing, metrics collection, structured logging, and in-memory dashboards for agent evaluation pipelines. Provides 7 pre-configured OTel instruments, Pino-based structured logging with automatic PII redaction, and a 24-hour dashboard manager with trend analysis and alerting. The package exposes a set of singleton managers for recording metrics, tracing execution spans, and aggregating performance trends.

## Installation

```bash
npm install @reaatech/agent-eval-harness-observability
# or
pnpm add @reaatech/agent-eval-harness-observability
```

## Feature Overview

- **OTel tracing** — automatic span generation for `eval.run`, `trajectory.load`, `judge.evaluate`, and `gate.check` pipeline stages
- **7 pre-configured metrics** — `runs.total`, `trajectories.evaluated`, `judge.calls`, `judge.cost`, `gates.result`, `cost.per_task`, `latency.p99`
- **Pino structured logging** — JSON logs with automatic PII redaction (emails, phones, SSNs, API keys, tokens)
- **Tracing decorators** — `withTracing()` wrapper for adding custom spans with automatic context propagation
- **Dashboard manager** — in-memory 24-hour data retention with quality, cost, latency, and pass-rate panels and 4 alert types
- **Multiple exporters** — OTLP gRPC, Zipkin, and Console for local development

## Quick Start

```typescript
import {
  getLogger,
  getTracingManager,
  getMetricsManager,
  getDashboardManager,
} from "@reaatech/agent-eval-harness-observability";

// Structured logging with automatic PII redaction
const logger = getLogger();
logger.info({ runId: "eval-123", trajectories: 50 }, "Evaluation started");
logger.error({ err: new Error("Connection lost") }, "Judge API call failed");

// Metrics recording
const metrics = getMetricsManager();
metrics.recordRun("success", 1);
metrics.recordTrajectories("production", 50);
metrics.recordJudgeCall("claude-opus", "success");
metrics.recordJudgeCost("claude-opus", 0.0234);
metrics.recordGateResult("overall-quality", true);
metrics.recordLatencyP99("evaluation", 3200);

// Dashboard with trend analysis and alerting
const dashboard = getDashboardManager();
dashboard.recordRun({
  overallMetrics: { overallScore: 0.87, avgCostPerTask: 0.05, latencyP99: 3200 },
  summary: { totalTrajectories: 50, passRate: 92 },
  metricBreakdown: { faithfulness: { avgScore: 0.85 } },
});

console.log(`Quality trend: ${dashboard.getSummary().trends.score}`);
console.log(`Active alerts: ${dashboard.getAlerts().length}`);
```

## API Reference

### Logger

```typescript
import { getLogger, createChildLogger, setGlobalRunId, getGlobalRunId } from "@reaatech/agent-eval-harness-observability";
```

| Export | Description |
| --- | --- |
| `getLogger(config?)` | Returns the singleton Logger instance, configured lazily |
| `createChildLogger(bindings)` | Creates a child logger with additional context fields |
| `setGlobalRunId(runId)` | Sets the run ID for log correlation |
| `getGlobalRunId()` | Returns the current global run ID, or `null` |

#### LoggerConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `level` | `string` | `info` | Minimum log level (`trace`, `debug`, `info`, `warn`, `error`, `fatal`) |
| `format` | `"json" \| "pretty"` | `pretty` (dev), `json` (prod) | Log output format |
| `includeRunId` | `boolean` | `true` | Include run ID on every log line |
| `piiPatterns` | `RegExp[]` | emails, phones, SSNs, API keys, tokens | PII redaction patterns |
| `redactFields` | `string[]` | `password`, `secret`, `token`, `apiKey`, `api_key`, `authorization` | Field-level redaction |
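
The two redaction mechanisms (pattern-based over message text, field-based over log objects) can be sketched in isolation. Everything below — the pattern list, the `[REDACTED]` placeholder, and the helper names — is illustrative, not the package's actual internals:

```typescript
// Pattern-based redaction: scrub matching substrings out of free text.
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/g, // emails
  /\b\d{3}-\d{2}-\d{4}\b/g,   // SSNs
  /\bsk-[A-Za-z0-9]{16,}\b/g, // API keys (OpenAI-style prefix, assumed)
];

function redactPII(text: string): string {
  return PII_PATTERNS.reduce((out, re) => out.replace(re, "[REDACTED]"), text);
}

// Field-based redaction: replace values of sensitive keys wholesale.
const REDACT_FIELDS = new Set([
  "password", "secret", "token", "apiKey", "api_key", "authorization",
]);

function redactFields(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[key] = REDACT_FIELDS.has(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

Custom `piiPatterns` and `redactFields` passed via `LoggerConfig` extend this behavior rather than requiring you to pre-scrub log arguments.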

#### Logger Instance Methods

| Method | Description |
| --- | --- |
| `trace(msg, ...args)` | Log at trace level |
| `debug(msg, ...args)` | Log at debug level |
| `info(msg, ...args)` | Log at info level |
| `warn(msg, ...args)` | Log at warn level |
| `error(msg, ...args)` | Log at error level |
| `fatal(msg, ...args)` | Log at fatal level |
| `child(bindings)` | Create a child logger with additional context |
| `logEvalRunStart(runId, trajectoryCount, config)` | Log evaluation run start |
| `logEvalRunEnd(runId, metrics, duration)` | Log evaluation run completion |
| `logGateEvaluation(gateName, passed, reason)` | Log gate result |
| `logCost(runId, cost, breakdown)` | Log cost tracking |
| `logError(error, context?)` | Log error with optional context |

### Metrics

```typescript
import { getMetricsManager, recordMetric, incrementCounter } from "@reaatech/agent-eval-harness-observability";
```

#### MetricsConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `serviceName` | `string` | `agent-eval-harness` | Service name for OTel resource |
| `enabled` | `boolean` | `true` | Enable metrics collection |
| `exporter` | `"otlp" \| "prometheus" \| "console" \| "none"` | `console` | Metrics exporter type |
| `otlpEndpoint` | `string` | | OTLP collector endpoint |
| `prometheusPort` | `number` | | Prometheus scrape port |
| `exportInterval` | `number` | `60000` | Export interval in milliseconds |
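
The tables here don't spell out how `MetricsConfig` is supplied; assuming `getMetricsManager` accepts it on first call (mirroring `getLogger(config?)` and the `getDashboardManager(config)` call shown under Usage Patterns), an OTLP setup might look like this. The endpoint value is illustrative:

```typescript
import { getMetricsManager } from "@reaatech/agent-eval-harness-observability";

// Hypothetical OTLP configuration; adjust the endpoint to your collector.
const metrics = getMetricsManager({
  serviceName: "agent-eval-harness",
  exporter: "otlp",
  otlpEndpoint: "http://localhost:4317",
  exportInterval: 30_000,
});
await metrics.init();
```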

#### MetricsManager Instance Methods

| Method | Description |
| --- | --- |
| `init()` | Initialize metrics and register instruments |
| `recordRun(status, count?)` | Record an evaluation run as a counter |
| `recordTrajectories(dataset, count?)` | Record trajectories evaluated |
| `recordJudgeCall(model, status)` | Record a judge API call |
| `recordJudgeCost(model, cost)` | Record judge cost as a histogram |
| `recordCostPerTask(taskType, cost)` | Record cost per task |
| `recordGateResult(gateName, passed)` | Record gate pass/fail (1/0) |
| `recordLatencyP99(component, latencyMs)` | Record P99 latency |
| `recordBatchMetrics(metrics)` | Record multiple metrics in one call |
| `forceFlush()` | Force-flush pending metrics |
| `shutdown()` | Shut down the metrics provider |

#### Standalone Helpers

| Export | Description |
| --- | --- |
| `recordMetric(name, value, attributes?)` | Record a metric by name to the current provider |
| `incrementCounter(name, value?, attributes?)` | Increment a counter by name |

#### 7 Pre-Configured OTel Instruments

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| `agent_eval.runs.total` | Counter | runs | Total evaluation runs |
| `agent_eval.trajectories.evaluated` | Counter | trajectories | Trajectories processed |
| `agent_eval.judge.calls` | Counter | calls | LLM judge API calls |
| `agent_eval.judge.cost` | Histogram | USD | Judge cost per run |
| `agent_eval.gates.result` | Histogram | boolean | Gate pass/fail (1/0) |
| `agent_eval.cost.per_task` | Histogram | USD | Cost per task |
| `agent_eval.latency.p99` | Histogram | ms | P99 latency per run |
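
Note that `agent_eval.latency.p99` records a value the caller has already computed and passed to `recordLatencyP99`. As a sketch of how such a value could be derived from raw latency samples (nearest-rank method; the helper name is illustrative):

```typescript
// Nearest-rank percentile: sort ascending, take the ceil(p * n)-th sample.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [1200, 900, 3100, 2800, 1500]; // per-trajectory latencies, ms
const p99 = percentile(latencies, 0.99);
// metrics.recordLatencyP99("evaluation", p99);
```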

### Tracing

```typescript
import { getTracingManager, withTracing, addSpanAttributes } from "@reaatech/agent-eval-harness-observability";
```

#### TracingConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `serviceName` | `string` | `agent-eval-harness` | Service name for OTel resource |
| `version` | `string` | `1.0.0` | Service version |
| `enabled` | `boolean` | `true` | Enable tracing |
| `exporter` | `"otlp" \| "zipkin" \| "console" \| "none"` | `console` | Span exporter type |
| `otlpEndpoint` | `string` | | OTLP collector endpoint |
| `zipkinEndpoint` | `string` | | Zipkin collector endpoint |
| `sampleRate` | `number` | `1.0` | Sampling rate (0–1) |

#### TracingManager Instance Methods

| Method | Description |
| --- | --- |
| `init()` | Initialize the tracing provider and register exporters |
| `startEvalRunSpan(runId, config)` | Create a span for an evaluation run |
| `startTrajectoryLoadSpan(path, format)` | Create a span for trajectory loading |
| `startJudgeSpan(model, metric)` | Create a span for judge evaluation |
| `startGateSpan(gateCount)` | Create a span for gate checking |
| `endSpan(span, result?, error?)` | End a span with an optional result or error |
| `getCurrentContext()` | Get the current OTel context |
| `injectContext(headers)` | Inject context into carrier headers |
| `extractContext(headers)` | Extract context from carrier headers |
| `shutdown()` | Shut down the tracing provider |
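
`injectContext`/`extractContext` carry trace context across process boundaries through header maps, conventionally as a W3C `traceparent` header. A minimal self-contained sketch of that carrier format (not the package's implementation, which delegates to OTel propagators):

```typescript
// W3C traceparent layout: version-traceId-spanId-flags
interface SpanContext {
  traceId: string;  // 32 hex chars
  spanId: string;   // 16 hex chars
  sampled: boolean;
}

function inject(ctx: SpanContext, headers: Record<string, string>): void {
  headers["traceparent"] = `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function extract(headers: Record<string, string>): SpanContext | null {
  const tp = headers["traceparent"];
  if (!tp) return null;
  const [, traceId, spanId, flags] = tp.split("-");
  if (!traceId || !spanId) return null;
  return { traceId, spanId, sampled: flags === "01" };
}
```

In practice you would pass the outgoing HTTP request headers to `injectContext` on the client and the incoming headers to `extractContext` on the server, so child spans link back to the originating evaluation run.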

#### Standalone Helpers

| Export | Description |
| --- | --- |
| `withTracing(spanName, fn, attributes?)` | Wrap an async function with a tracing span |
| `addSpanAttributes(attributes)` | Add attributes to the current active span |

### Dashboard

```typescript
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";
```

#### DashboardConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `trendHours` | `number` | `24` | Time range for trend data in hours |
| `alertThresholds.qualityScore` | `number` | `0.8` | Alert when overall score drops below this value |
| `alertThresholds.costPerTask` | `number` | `0.05` | Alert when cost per task exceeds this value |
| `alertThresholds.latencyP99` | `number` | `5000` | Alert when P99 latency exceeds this value (ms) |
| `alertThresholds.passRate` | `number` | `0.95` | Alert when pass rate drops below this value |
| `trendWindow` | `number` | `3` | Number of data points for trend calculation |
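
`trendWindow` controls how many recent data points feed the `up`/`down`/`stable` classification. The package's exact algorithm isn't documented here; one plausible sketch compares the newest point against the mean of the preceding points in the window, with a small tolerance band for `stable`:

```typescript
type Trend = "up" | "down" | "stable";

// Classify the latest value against the average of the points before it
// within the trend window. Tolerance keeps tiny wobbles out of up/down.
function trend(points: number[], window = 3, tolerance = 0.01): Trend {
  const recent = points.slice(-window);
  if (recent.length < 2) return "stable";
  const prior = recent.slice(0, -1);
  const avg = prior.reduce((a, b) => a + b, 0) / prior.length;
  const latest = recent[recent.length - 1];
  if (latest > avg + tolerance) return "up";
  if (latest < avg - tolerance) return "down";
  return "stable";
}
```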

#### DashboardManager Instance Methods

| Method | Description |
| --- | --- |
| `recordRun(results)` | Record evaluation metrics from `AggregatedResults` |
| `getMetrics()` | Get all metric series data |
| `getAlerts()` | Get current alert messages |
| `getTrendData(metric, points?)` | Get trend data for a specific metric |
| `getSummary()` | Get dashboard summary with trends and alerts |
| `generateDashboard()` | Generate full dashboard panels |

#### 4 Alert Types

| Alert Metric | Condition | When |
| --- | --- | --- |
| `quality_drop` | `overallScore < alertThresholds.qualityScore` | Quality falls below threshold |
| `cost_spike` | `avgCostPerTask > alertThresholds.costPerTask` | Cost exceeds threshold |
| `latency_spike` | `latencyP99 > alertThresholds.latencyP99` | P99 latency exceeds threshold |
| `pass_rate_drop` | `passRate / 100 < alertThresholds.passRate` | Pass rate falls below threshold |
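
The four conditions above can be checked directly against a run's metrics. Note the unit mismatch the table encodes: `passRate` arrives as a 0–100 percentage while its threshold is a 0–1 fraction, hence the division. The function and interface names below are illustrative:

```typescript
interface Thresholds {
  qualityScore: number;
  costPerTask: number;
  latencyP99: number;
  passRate: number; // 0–1 fraction
}

interface RunStats {
  overallScore: number;
  avgCostPerTask: number;
  latencyP99: number;
  passRate: number; // 0–100 percentage
}

function activeAlerts(run: RunStats, t: Thresholds): string[] {
  const alerts: string[] = [];
  if (run.overallScore < t.qualityScore) alerts.push("quality_drop");
  if (run.avgCostPerTask > t.costPerTask) alerts.push("cost_spike");
  if (run.latencyP99 > t.latencyP99) alerts.push("latency_spike");
  if (run.passRate / 100 < t.passRate) alerts.push("pass_rate_drop"); // percentage vs fraction
  return alerts;
}
```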

#### 4 Dashboard Panels

| Panel | Type | Metrics Tracked |
| --- | --- | --- |
| Quality | chart | `overall_score`, `pass_rate` |
| Performance | chart | `latency_p99`, `cost_per_task` |
| Key Statistics | stat | Current score and pass rate with trend direction |
| Alerts | alert | Active alert messages with values and thresholds |

#### DashboardSummary

```typescript
interface DashboardSummary {
  totalRuns: number;
  currentScore: number | null;
  currentPassRate: number | null;
  currentCostPerTask: number | null;
  currentLatencyP99: number | null;
  activeAlerts: number;
  trends: {
    score: "up" | "down" | "stable";
    passRate: "up" | "down" | "stable";
  };
}
```

## Usage Patterns

### Custom Spans with `withTracing`

Wrap any async operation to automatically create, time, and finalize an OTel span:

```typescript
import { withTracing, addSpanAttributes } from "@reaatech/agent-eval-harness-observability";

const result = await withTracing(
  "custom_validation",
  async (span) => {
    // Span is active throughout this block
    addSpanAttributes({ validation_type: "schema", schema_version: "2.1" });

    const isValid = await validateInput(payload);
    return { isValid, timestamp: Date.now() };
  },
  { "custom.attribute": "value" },
);

// Span automatically ends — success status on return, error on throw
```

### Dashboards and Alerting

Record evaluation runs to populate the dashboard, then query trends and alerts:

```typescript
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

const dashboard = getDashboardManager({
  alertThresholds: { qualityScore: 0.85, costPerTask: 0.03, latencyP99: 3000, passRate: 0.90 },
});

// Record a run
dashboard.recordRun({
  overallMetrics: { overallScore: 0.82, avgCostPerTask: 0.04, latencyP99: 4500 },
  summary: { totalTrajectories: 100, passRate: 88 },
  metricBreakdown: {},
});

// Check summary
const summary = dashboard.getSummary();
console.log(`Runs: ${summary.totalRuns}`);
console.log(`Score trend: ${summary.trends.score}`);
console.log(`Active alerts: ${summary.activeAlerts}`);

// Inspect alerts
for (const alert of dashboard.getAlerts()) {
  console.log(`[${alert.level}] ${alert.metric}: ${alert.message}`);
}

// Generate full dashboard panels (chart, stat, alert)
const panels = dashboard.generateDashboard();
for (const panel of panels) {
  console.log(`${panel.title} (${panel.type}): ${panel.metrics.length} metrics`);
}
```

### Structured Logging with Context

```typescript
import { getLogger, createChildLogger, setGlobalRunId } from "@reaatech/agent-eval-harness-observability";

const logger = getLogger();
setGlobalRunId("eval-run-42");

// All subsequent log lines include run_id: "eval-run-42"
logger.info({ taskType: "password-reset" }, "Starting evaluation");

// Create per-component child loggers
const judgeLogger = createChildLogger({ component: "judge" });
judgeLogger.info({ model: "claude-opus", metric: "faithfulness" }, "Judge evaluating");

// Errors carry stack traces
try {
  await doWork();
} catch (err) {
  logger.logError(err as Error, { taskId: "task-7" });
}
```

### Metrics Batching

```typescript
import { getMetricsManager } from "@reaatech/agent-eval-harness-observability";

const metrics = getMetricsManager();

metrics.recordBatchMetrics({
  runs: { status: "success" },
  trajectories: { dataset: "production" },
  judgeCalls: { model: "claude-opus", status: "success" },
  judgeCost: { model: "claude-opus", cost: 0.0234 },
  costPerTask: { taskType: "password-reset", cost: 0.0045 },
  gateResult: { gateName: "overall-quality", passed: true },
  latencyP99: { component: "evaluation", latencyMs: 3200 },
});
```

## License

MIT