# @reaatech/agent-eval-harness-observability

> **Status:** Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

OpenTelemetry tracing, metrics collection, structured logging, and in-memory dashboards for agent evaluation pipelines. Provides 7 pre-configured OTel instruments, Pino-based structured logging with automatic PII redaction, and a 24-hour dashboard manager with trend analysis and alerting.
## Installation

```bash
npm install @reaatech/agent-eval-harness-observability
# or
pnpm add @reaatech/agent-eval-harness-observability
```
## Feature Overview

- **OTel tracing** — automatic span generation for `eval.run` → `trajectory.load` → `judge.evaluate` → `gate.check` pipelines
- **7 pre-configured metrics** — `runs.total`, `trajectories.evaluated`, `judge.calls`, `judge.cost`, `gates.result`, `cost.per_task`, `latency.p99`
- **Pino structured logging** — JSON logs with automatic PII redaction (emails, phones, SSNs, API keys, tokens)
- **Tracing decorators** — `withTracing()` wrapper for adding custom spans with automatic context propagation
- **Dashboard manager** — in-memory 24-hour data retention with quality, cost, latency, and pass-rate panels and 4 alert types
- **Multiple exporters** — OTLP gRPC, Zipkin, and Console for local development
## Quick Start

```ts
import {
  getLogger,
  getTracingManager,
  getMetricsManager,
  getDashboardManager,
} from "@reaatech/agent-eval-harness-observability";

// Structured logging with automatic PII redaction
const logger = getLogger();
logger.info({ runId: "eval-123", trajectories: 50 }, "Evaluation started");
logger.error({ err: new Error("Connection lost") }, "Judge API call failed");

// Metrics recording
const metrics = getMetricsManager();
metrics.recordRun("success", 1);
metrics.recordTrajectories("production", 50);
metrics.recordJudgeCall("claude-opus", "success");
metrics.recordJudgeCost("claude-opus", 0.0234);
metrics.recordGateResult("overall-quality", true);
metrics.recordLatencyP99("evaluation", 3200);

// Dashboard with trend analysis and alerting
const dashboard = getDashboardManager();
dashboard.recordRun({
  overallMetrics: { overallScore: 0.87, avgCostPerTask: 0.05, latencyP99: 3200 },
  summary: { totalTrajectories: 50, passRate: 92 },
  metricBreakdown: { faithfulness: { avgScore: 0.85 } },
});

console.log(`Quality trend: ${dashboard.getSummary().trends.score}`);
console.log(`Active alerts: ${dashboard.getAlerts().length}`);
```
## API Reference

### Logger

```ts
import { getLogger, createChildLogger, setGlobalRunId, getGlobalRunId } from "@reaatech/agent-eval-harness-observability";
```

| Export | Description |
| --- | --- |
| `getLogger(config?)` | Returns the singleton `Logger` instance, configured lazily |
| `createChildLogger(bindings)` | Creates a child logger with additional context fields |
| `setGlobalRunId(runId)` | Sets the run ID for log correlation |
| `getGlobalRunId()` | Returns the current global run ID, or `null` |
#### LoggerConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `level` | `string` | `info` | Minimum log level (`trace`, `debug`, `info`, `warn`, `error`, `fatal`) |
| `format` | `"json" \| "pretty"` | `pretty` (dev), `json` (prod) | Log output format |
| `includeRunId` | `boolean` | `true` | Include run ID on every log line |
| `piiPatterns` | `RegExp[]` | emails, phones, SSNs, API keys, tokens | PII redaction patterns |
| `redactFields` | `string[]` | `password`, `secret`, `token`, `apiKey`, `api_key`, `authorization` | Field-level redaction |
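To illustrate what `redactFields` means in practice, here is a minimal, self-contained sketch of field-level redaction. This is an assumption about the behavior, not the package's internal code; the function name `redact` is illustrative only.

```typescript
// Conceptual sketch of field-level redaction (not the library's implementation):
// any top-level key listed in redactFields has its value replaced before the
// log line is written.
const redactFields = ["password", "secret", "token", "apiKey", "api_key", "authorization"];

function redact(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[key] = redactFields.includes(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

For example, `redact({ user: "ada", apiKey: "sk-123" })` keeps `user` intact but replaces the `apiKey` value with `"[REDACTED]"`.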
#### Logger Instance Methods

| Method | Description |
| --- | --- |
| `trace(msg, ...args)` | Log at trace level |
| `debug(msg, ...args)` | Log at debug level |
| `info(msg, ...args)` | Log at info level |
| `warn(msg, ...args)` | Log at warn level |
| `error(msg, ...args)` | Log at error level |
| `fatal(msg, ...args)` | Log at fatal level |
| `child(bindings)` | Create a child logger with additional context |
| `logEvalRunStart(runId, trajectoryCount, config)` | Log evaluation run start |
| `logEvalRunEnd(runId, metrics, duration)` | Log evaluation run completion |
| `logGateEvaluation(gateName, passed, reason)` | Log gate result |
| `logCost(runId, cost, breakdown)` | Log cost tracking |
| `logError(error, context?)` | Log error with optional context |
### Metrics

```ts
import { getMetricsManager, recordMetric, incrementCounter } from "@reaatech/agent-eval-harness-observability";
```
#### MetricsConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `serviceName` | `string` | `agent-eval-harness` | Service name for the OTel resource |
| `enabled` | `boolean` | `true` | Enable metrics collection |
| `exporter` | `"otlp" \| "prometheus" \| "console" \| "none"` | `console` | Metrics exporter type |
| `otlpEndpoint` | `string` | — | OTLP collector endpoint |
| `prometheusPort` | `number` | — | Prometheus scrape port |
| `exportInterval` | `number` | `60000` | Export interval in milliseconds |
#### MetricsManager Instance Methods

| Method | Description |
| --- | --- |
| `init()` | Initialize metrics and register instruments |
| `recordRun(status, count?)` | Record an evaluation run as a counter |
| `recordTrajectories(dataset, count?)` | Record trajectories evaluated |
| `recordJudgeCall(model, status)` | Record a judge API call |
| `recordJudgeCost(model, cost)` | Record judge cost as a histogram |
| `recordCostPerTask(taskType, cost)` | Record cost per task |
| `recordGateResult(gateName, passed)` | Record gate pass/fail (1/0) |
| `recordLatencyP99(component, latencyMs)` | Record P99 latency |
| `recordBatchMetrics(metrics)` | Record multiple metrics in one call |
| `forceFlush()` | Force-flush pending metrics |
| `shutdown()` | Shut down the metrics provider |
#### Standalone Helpers

| Export | Description |
| --- | --- |
| `recordMetric(name, value, attributes?)` | Record a metric by name to the current provider |
| `incrementCounter(name, value?, attributes?)` | Increment a counter by name |
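To make the name-keyed semantics of `incrementCounter(name, value?, attributes?)` concrete, here is a self-contained sketch of a counter registry. This mimics the helper's accumulation behavior only; the real helper records through the active OTel meter provider, and `incrementCounterSketch` is a name invented for this illustration.

```typescript
// Illustrative registry of counters keyed by metric name. Each call adds
// `value` (default 1) to the named counter and returns the running total.
const counters = new Map<string, number>();

function incrementCounterSketch(name: string, value = 1): number {
  const next = (counters.get(name) ?? 0) + value;
  counters.set(name, next);
  return next;
}
```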
#### 7 Pre-Configured OTel Instruments

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| `agent_eval.runs.total` | Counter | runs | Total evaluation runs |
| `agent_eval.trajectories.evaluated` | Counter | trajectories | Trajectories processed |
| `agent_eval.judge.calls` | Counter | calls | LLM judge API calls |
| `agent_eval.judge.cost` | Histogram | USD | Judge cost per run |
| `agent_eval.gates.result` | Histogram | boolean | Gate pass/fail (1/0) |
| `agent_eval.cost.per_task` | Histogram | USD | Cost per task |
| `agent_eval.latency.p99` | Histogram | ms | P99 latency per run |
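Recording boolean gate results into a histogram as 1/0 has a useful property: the mean of the series is the pass rate. The sketch below illustrates that idea in isolation; it is an assumption about why the instrument is shaped this way, not the library's internals, and `BooleanHistogram` is a name invented here.

```typescript
// Sketch: a 1/0 histogram for gate results. Aggregating the recorded values
// as a mean directly yields the pass rate for that gate.
class BooleanHistogram {
  private values: number[] = [];

  record(passed: boolean): void {
    this.values.push(passed ? 1 : 0);
  }

  passRate(): number {
    if (this.values.length === 0) return 0;
    return this.values.reduce((a, b) => a + b, 0) / this.values.length;
  }
}
```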
### Tracing

```ts
import { getTracingManager, withTracing, addSpanAttributes } from "@reaatech/agent-eval-harness-observability";
```
#### TracingConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `serviceName` | `string` | `agent-eval-harness` | Service name for the OTel resource |
| `version` | `string` | `1.0.0` | Service version |
| `enabled` | `boolean` | `true` | Enable tracing |
| `exporter` | `"otlp" \| "zipkin" \| "console" \| "none"` | `console` | Span exporter type |
| `otlpEndpoint` | `string` | — | OTLP collector endpoint |
| `zipkinEndpoint` | `string` | — | Zipkin collector endpoint |
| `sampleRate` | `number` | `1.0` | Sampling rate (0–1) |
#### TracingManager Instance Methods

| Method | Description |
| --- | --- |
| `init()` | Initialize the tracing provider and register exporters |
| `startEvalRunSpan(runId, config)` | Create a span for an evaluation run |
| `startTrajectoryLoadSpan(path, format)` | Create a span for trajectory loading |
| `startJudgeSpan(model, metric)` | Create a span for judge evaluation |
| `startGateSpan(gateCount)` | Create a span for gate checking |
| `endSpan(span, result?, error?)` | End a span with an optional result or error |
| `getCurrentContext()` | Get the current OTel context |
| `injectContext(headers)` | Inject context into carrier headers |
| `extractContext(headers)` | Extract context from carrier headers |
| `shutdown()` | Shut down the tracing provider |
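Context injection and extraction typically carry the W3C `traceparent` header (`version-traceId-spanId-flags`) so that downstream services can join the trace. The sketch below shows that header format in isolation; it is an assumption about the wire format (OTel's default propagator), not this package's guaranteed behavior, and the `inject`/`extract` helpers here are illustrative stand-ins for `injectContext`/`extractContext`.

```typescript
// Sketch of W3C trace-context propagation over a headers carrier.
interface TraceContext {
  traceId: string; // 32 hex characters
  spanId: string;  // 16 hex characters
  sampled: boolean;
}

function inject(ctx: TraceContext, headers: Record<string, string>): void {
  headers["traceparent"] = `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function extract(headers: Record<string, string>): TraceContext | null {
  const raw = headers["traceparent"];
  if (!raw) return null;
  const [, traceId, spanId, flags] = raw.split("-");
  return { traceId, spanId, sampled: flags === "01" };
}
```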
#### Standalone Helpers

| Export | Description |
| --- | --- |
| `withTracing(spanName, fn, attributes?)` | Wrap an async function with a tracing span |
| `addSpanAttributes(attributes)` | Add attributes to the current active span |
### Dashboard

```ts
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";
```
#### DashboardConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `trendHours` | `number` | `24` | Time range for trend data, in hours |
| `alertThresholds.qualityScore` | `number` | `0.8` | Alert when the overall score drops below this value |
| `alertThresholds.costPerTask` | `number` | `0.05` | Alert when cost per task exceeds this value |
| `alertThresholds.latencyP99` | `number` | `5000` | Alert when P99 latency exceeds this value (ms) |
| `alertThresholds.passRate` | `number` | `0.95` | Alert when the pass rate drops below this value |
| `trendWindow` | `number` | `3` | Number of data points used for trend calculation |
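To show how a `trendWindow` of data points can collapse into an `"up" | "down" | "stable"` direction, here is a self-contained sketch. It compares the newest point in the window against the oldest, with a small tolerance so noise reads as stable. This is one plausible algorithm, stated as an assumption; the package's exact trend logic may differ, and `computeTrend` and `tolerance` are names introduced here.

```typescript
// Sketch: derive a trend direction from the last `trendWindow` data points.
type Trend = "up" | "down" | "stable";

function computeTrend(points: number[], trendWindow = 3, tolerance = 0.01): Trend {
  const window = points.slice(-trendWindow);
  if (window.length < 2) return "stable";
  const delta = window[window.length - 1] - window[0];
  if (delta > tolerance) return "up";
  if (delta < -tolerance) return "down";
  return "stable";
}
```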
#### DashboardManager Instance Methods

| Method | Description |
| --- | --- |
| `recordRun(results)` | Record evaluation metrics from `AggregatedResults` |
| `getMetrics()` | Get all metric series data |
| `getAlerts()` | Get current alert messages |
| `getTrendData(metric, points?)` | Get trend data for a specific metric |
| `getSummary()` | Get a dashboard summary with trends and alerts |
| `generateDashboard()` | Generate the full set of dashboard panels |
#### 4 Alert Types

| Alert | Metric | Condition | When |
| --- | --- | --- | --- |
| `quality_drop` | `overallScore` | `< alertThresholds.qualityScore` | Quality falls below threshold |
| `cost_spike` | `avgCostPerTask` | `> alertThresholds.costPerTask` | Cost exceeds threshold |
| `latency_spike` | `latencyP99` | `> alertThresholds.latencyP99` | P99 latency exceeds threshold |
| `pass_rate_drop` | `passRate / 100` | `< alertThresholds.passRate` | Pass rate falls below threshold |
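The four conditions above can be sketched as one pure function. Note the normalization in the last check: `passRate` is recorded as a percentage (e.g. `92`) while the threshold is a fraction (e.g. `0.95`). This is an illustration of the table's semantics, not the manager's code (which also attaches levels and messages); `RunSnapshot`, `Thresholds`, and `activeAlerts` are names invented for this sketch.

```typescript
// Sketch: evaluate the four dashboard alert conditions against thresholds.
interface RunSnapshot {
  overallScore: number;
  avgCostPerTask: number;
  latencyP99: number;
  passRate: number; // percentage, e.g. 92
}

interface Thresholds {
  qualityScore: number;
  costPerTask: number;
  latencyP99: number;
  passRate: number; // fraction, e.g. 0.95
}

function activeAlerts(run: RunSnapshot, t: Thresholds): string[] {
  const alerts: string[] = [];
  if (run.overallScore < t.qualityScore) alerts.push("quality_drop");
  if (run.avgCostPerTask > t.costPerTask) alerts.push("cost_spike");
  if (run.latencyP99 > t.latencyP99) alerts.push("latency_spike");
  if (run.passRate / 100 < t.passRate) alerts.push("pass_rate_drop");
  return alerts;
}
```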
#### 4 Dashboard Panels

| Panel | Type | Metrics Tracked |
| --- | --- | --- |
| Quality | chart | `overall_score`, `pass_rate` |
| Performance | chart | `latency_p99`, `cost_per_task` |
| Key Statistics | stat | Current score and pass rate with trend direction |
| Alerts | alert | Active alert messages with values and thresholds |
#### DashboardSummary

```ts
interface DashboardSummary {
  totalRuns: number;
  currentScore: number | null;
  currentPassRate: number | null;
  currentCostPerTask: number | null;
  currentLatencyP99: number | null;
  activeAlerts: number;
  trends: {
    score: "up" | "down" | "stable";
    passRate: "up" | "down" | "stable";
  };
}
```
## Usage Patterns

### Custom Spans with withTracing

Wrap any async operation to automatically create, time, and finalize an OTel span:

```ts
import { withTracing, addSpanAttributes } from "@reaatech/agent-eval-harness-observability";

const result = await withTracing(
  "custom_validation",
  async (span) => {
    // Span is active throughout this block
    addSpanAttributes({ validation_type: "schema", schema_version: "2.1" });
    const isValid = await validateInput(payload);
    return { isValid, timestamp: Date.now() };
  },
  { "custom.attribute": "value" },
);
// Span automatically ends — success status on return, error on throw
```
### Dashboards and Alerting

Record evaluation runs to populate the dashboard, then query trends and alerts:

```ts
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

const dashboard = getDashboardManager({
  alertThresholds: { qualityScore: 0.85, costPerTask: 0.03, latencyP99: 3000, passRate: 0.90 },
});

// Record a run
dashboard.recordRun({
  overallMetrics: { overallScore: 0.82, avgCostPerTask: 0.04, latencyP99: 4500 },
  summary: { totalTrajectories: 100, passRate: 88 },
  metricBreakdown: {},
});

// Check summary
const summary = dashboard.getSummary();
console.log(`Runs: ${summary.totalRuns}`);
console.log(`Score trend: ${summary.trends.score}`);
console.log(`Active alerts: ${summary.activeAlerts}`);

// Inspect alerts
for (const alert of dashboard.getAlerts()) {
  console.log(`[${alert.level}] ${alert.metric}: ${alert.message}`);
}

// Generate full dashboard panels (chart, stat, alert)
const panels = dashboard.generateDashboard();
for (const panel of panels) {
  console.log(`${panel.title} (${panel.type}): ${panel.metrics.length} metrics`);
}
```
### Structured Logging with Context

```ts
import { getLogger, createChildLogger, setGlobalRunId } from "@reaatech/agent-eval-harness-observability";

const logger = getLogger();
setGlobalRunId("eval-run-42");

// All subsequent log lines include run_id: "eval-run-42"
logger.info({ taskType: "password-reset" }, "Starting evaluation");

// Create per-component child loggers
const judgeLogger = createChildLogger({ component: "judge" });
judgeLogger.info({ model: "claude-opus", metric: "faithfulness" }, "Judge evaluating");

// Errors carry stack traces
try {
  await doWork();
} catch (err) {
  logger.logError(err as Error, { taskId: "task-7" });
}
```
### Metrics Batching

```ts
import { getMetricsManager } from "@reaatech/agent-eval-harness-observability";

const metrics = getMetricsManager();
metrics.recordBatchMetrics({
  runs: { status: "success" },
  trajectories: { dataset: "production" },
  judgeCalls: { model: "claude-opus", status: "success" },
  judgeCost: { model: "claude-opus", cost: 0.0234 },
  costPerTask: { taskType: "password-reset", cost: 0.0045 },
  gateResult: { gateName: "overall-quality", passed: true },
  latencyP99: { component: "evaluation", latencyMs: 3200 },
});
```
## License

MIT