Cohere Document Pipeline for HR Policy Compliance

Extract, structure, and monitor HR policy compliance automatically from employee handbooks, sick leave rules, and state mandates.

cohere document-pipeline hr-compliance nextjs express typescript policy-extraction structured-output

The problem

Small businesses waste hours manually cross-referencing PDF and Word policy documents to stay compliant with changing regulations, risking fines and employee disputes when policies are outdated or contradictory.

Built from

Intro

This tutorial walks you through building an automated HR policy compliance pipeline. You’ll create a Next.js application that ingests PDF and DOCX policy documents, extracts structured policy clauses with Cohere’s Command A model (command-a-03-2025), repairs malformed LLM output into schema-valid JSON (or fails with an explicit error when the output cannot be repaired), identifies compliance gaps against the regulations of a given jurisdiction, and surfaces the results in a searchable dashboard. The pipeline uses REAA’s @reaatech/media-pipeline-mcp-core for orchestration, @reaatech/media-pipeline-mcp-doc-extraction for document extraction, and @reaatech/structured-repair-core to enforce Zod schemas on LLM output.
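As a rough mental model, the stages described above can be chained as plain async functions. The sketch below is only an illustration: the names PipelineStages, ExtractedClause, ComplianceGap, and runPipelineSketch are hypothetical, and the real project wires these stages through the REAA packages rather than a hand-rolled chain.

```typescript
// Hypothetical shapes for illustration; not the project's actual types.
interface ExtractedClause {
  clause_text: string;
  clause_type: string;
}

interface ComplianceGap {
  requirement: string;
  severity: "high" | "medium" | "low";
}

// Each stage is injected, so any of them can be swapped or stubbed in tests.
interface PipelineStages {
  parse: (file: Uint8Array) => Promise<string>; // PDF/DOCX bytes -> plain text
  extract: (text: string) => Promise<string>; // LLM call -> raw model output
  repair: (raw: string) => Promise<ExtractedClause[]>; // raw output -> validated clauses
  findGaps: (clauses: ExtractedClause[], jurisdiction: string) => Promise<ComplianceGap[]>;
}

// Runs the stages in order and returns what a dashboard would need to display.
async function runPipelineSketch(
  stages: PipelineStages,
  file: Uint8Array,
  jurisdiction: string,
): Promise<{ clauses: ExtractedClause[]; gaps: ComplianceGap[] }> {
  const text = await stages.parse(file);
  const raw = await stages.extract(text);
  const clauses = await stages.repair(raw);
  const gaps = await stages.findGaps(clauses, jurisdiction);
  return { clauses, gaps };
}
```

Keeping the repair stage between the LLM call and everything downstream means later stages only ever see validated data.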

Prerequisites

Node.js >=22 and pnpm 10 installed
A running PostgreSQL instance (local or remote) with the pgvector extension available
Cohere API key — sign up at dashboard.cohere.com and create a trial key
Langfuse account (optional, for observability) — sign up at langfuse.com
Familiarity with TypeScript, Next.js App Router, and basic SQL

Step 1: Configure environment variables

The project reads several environment variables at runtime. Copy the example file and fill in your values.

terminal

cp .env.example .env

Open .env and replace each placeholder with your real credentials.
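A small startup check can catch a missing value before the first request fails. The sketch below is an assumption-laden example: the variable names CO_API_KEY (which, to my knowledge, the Cohere SDK falls back to when no token is passed) and DATABASE_URL are guesses, so check .env.example for the real list; missingEnvVars and assertEnv are hypothetical helpers, not part of the project.

```typescript
// Assumed variable names; replace with the ones listed in .env.example.
const REQUIRED_ENV_VARS: readonly string[] = ["CO_API_KEY", "DATABASE_URL"];

type EnvSource = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env: EnvSource, required: readonly string[] = REQUIRED_ENV_VARS): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Throws one readable error listing every missing variable.
function assertEnv(env: EnvSource): void {
  const missing = missingEnvVars(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}
```

Calling assertEnv(process.env) once at startup would surface a misconfigured .env immediately rather than deep inside a request handler.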

Example artifact

A complete, working implementation of this recipe, downloadable as a zip or browsable file by file. Generated by our build pipeline and tested before publishing.


186 kB · 90 tests · 92.1% coverage · vitest passing

SHA-256: 172a49989fb5a0864b8f0a87683a2c6afc67ed3b21339f6f92c01db58bd76942

import { CohereClientV2, CohereError, CohereTimeoutError } from "cohere-ai";

export class CohereLlmError extends Error {
  code = "COHERE_LLM_ERROR" as const;
  statusCode?: number;

  constructor(message: string, statusCode?: number) {
    super(message);
    this.name = "CohereLlmError";
    this.statusCode = statusCode;
  }
}

export const cohere = new CohereClientV2({});

function extractTextContent(response: {
  message: { content?: Array<{ type: string; text?: string }> };
}): string {
  const content = response.message.content;
  if (!content) return "";
  const textBlock = content.find((b) => b.type === "text");
  return textBlock?.text ?? "";
}

export async function extractPolicyClauses(text: string): Promise<string> {
  try {
    const response = await cohere.chat({
      model: "command-a-03-2025",
      messages: [
        {
          role: "system",
          content:
            "You are an HR compliance analyst. Extract all policy clauses as a JSON array. Each clause must have: clause_text (string), clause_type ('sick_leave'|'paternity_leave'|'overtime'|'termination'|'anti_discrimination'|'other'), compliance_status ('compliant'|'at_risk'|'non_compliant'), confidence (number 0-1).",
        },
        { role: "user", content: text },
      ],
    });
    return extractTextContent(response);
  } catch (error) {
    if (error instanceof CohereError) {
      throw new CohereLlmError(error.message, error.statusCode);
    }
    if (error instanceof CohereTimeoutError) {
      throw new CohereLlmError(error.message);
    }
    throw error;
  }
}

export async function identifyComplianceGaps(
  clauses: { clause_text: string; clause_type: string }[],
  jurisdiction: string,
): Promise<string> {
  try {
    const response = await cohere.chat({
      model: "command-a-03-2025",
      messages: [
        {
          role: "system",
          content:
            "You are an HR compliance analyst. Identify compliance gaps in the provided policy clauses for the given jurisdiction. Return a JSON array of gaps. Each gap: jurisdiction (string), requirement (string), current_policy (string), gap_description (string), severity ('high'|'medium'|'low'), recommended_action (string).",
        },
        {
          role: "user",
          content: JSON.stringify({ clauses, jurisdiction }),
        },
      ],
    });
    return extractTextContent(response);
  } catch (error) {
    if (error instanceof CohereError) {
      throw new CohereLlmError(error.message, error.statusCode);
    }
    if (error instanceof CohereTimeoutError) {
      throw new CohereLlmError(error.message);
    }
    throw error;
  }
}
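The clause shape spelled out in the system prompt can also be written down as a TypeScript type with a runtime guard. The project itself enforces Zod schemas through @reaatech/structured-repair-core; the dependency-free sketch below is only an illustration of the same contract, and the names PolicyClauseShape and isPolicyClause are hypothetical.

```typescript
// Enum values copied from the extraction prompt.
const CLAUSE_TYPES = [
  "sick_leave",
  "paternity_leave",
  "overtime",
  "termination",
  "anti_discrimination",
  "other",
] as const;
const COMPLIANCE_STATUSES = ["compliant", "at_risk", "non_compliant"] as const;

interface PolicyClauseShape {
  clause_text: string;
  clause_type: (typeof CLAUSE_TYPES)[number];
  compliance_status: (typeof COMPLIANCE_STATUSES)[number];
  confidence: number; // expected to lie in [0, 1]
}

// Runtime guard: true only when every field matches the prompt's contract.
function isPolicyClause(value: unknown): value is PolicyClauseShape {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.clause_text === "string" &&
    (CLAUSE_TYPES as readonly unknown[]).includes(v.clause_type) &&
    (COMPLIANCE_STATUSES as readonly unknown[]).includes(v.compliance_status) &&
    typeof v.confidence === "number" &&
    v.confidence >= 0 &&
    v.confidence <= 1
  );
}
```

A prompt describes the expected shape but cannot enforce it, which is why the model's text output is returned as a plain string here and validated in a separate step.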

import { NextRequest, NextResponse } from "next/server";
import { parseDocument, UnsupportedDocumentError } from "../../lib/document-parser.js";
import { db } from "../../db/index.js";
import { documents } from "../../db/schema.js";
import { runExtractionPipeline } from "../../pipeline/policy-extractor.js";
import { updateDocumentStatus } from "../../lib/compliance-service.js";
import { desc } from "drizzle-orm";

export async function POST(req: NextRequest) {
  const formData = await req.formData();
  const file = formData.get("file");
  if (!(file instanceof File)) {
    return NextResponse.json({ error: "No file provided" }, { status: 400 });
  }

  const allowedTypes = [
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ];
  if (!allowedTypes.includes(file.type)) {
    return NextResponse.json({ error: "Unsupported file type" }, { status: 400 });
  }
  if (file.size > 20 * 1024 * 1024) {
    return NextResponse.json({ error: "File too large" }, { status: 413 });
  }

  const buffer = await file.arrayBuffer();
  let parsed;
  try {
    parsed = await parseDocument({ buffer, mimeType: file.type, filename: file.name });
  } catch (error) {
    if (error instanceof UnsupportedDocumentError) {
      return NextResponse.json({ error: error.message }, { status: 400 });
    }
    throw error;
  }

  const sourceType = file.type === "application/pdf" ? "pdf" : "docx";
  const [doc] = await db
    .insert(documents)
    .values({
      filename: file.name,
      mime_type: file.type,
      content_text: parsed.text,
      title: parsed.title,
      source_type: sourceType,
      status: "pending",
    })
    .returning({ id: documents.id });

  runExtractionPipeline(doc.id, parsed.text).catch(async (error: unknown) => {
    console.error("Pipeline failed:", error);
    await updateDocumentStatus(doc.id, "failed").catch(() => {});
  });

  return NextResponse.json({ id: doc.id, status: "pending" }, { status: 202 });
}

export async function GET() {
  const docs = await db.select().from(documents).orderBy(desc(documents.created_at)).limit(50);
  return NextResponse.json(docs);
}
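The upload checks in the POST handler (MIME type allow-list, 20 MiB ceiling) can be mirrored as a pure function, for instance to reject a file in the browser before sending it. This is a sketch under assumptions: checkUpload and UploadCheck are hypothetical names, and the project may not factor the checks out this way.

```typescript
// Same allow-list and size ceiling as the route handler.
const ALLOWED_MIME_TYPES: readonly string[] = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];
const MAX_UPLOAD_BYTES = 20 * 1024 * 1024;

type UploadCheck = { ok: true } | { ok: false; status: 400 | 413; error: string };

// Mirrors the handler's order: type is checked first, then size.
function checkUpload(file: { type: string; size: number }): UploadCheck {
  if (!ALLOWED_MIME_TYPES.includes(file.type)) {
    return { ok: false, status: 400, error: "Unsupported file type" };
  }
  if (file.size > MAX_UPLOAD_BYTES) {
    return { ok: false, status: 413, error: "File too large" };
  }
  return { ok: true };
}
```

A client-side pre-check like this is a convenience only; the server-side checks remain the authority, since the browser-reported MIME type can be spoofed.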

import { describe, it, expect } from "vitest";
import { repairClauseArray, repairGapArray, repairWithDiagnostics, PolicyClauseSchema } from "../../src/lib/repair.js";
import { UnrepairableError } from "@reaatech/structured-repair-core";

describe("repair", () => {
  it("repairClauseArray produces valid typed data from clean JSON input", async () => {
    const input = JSON.stringify([{
      clause_text: "Employees are entitled to 10 sick days per year",
      clause_type: "sick_leave" as const,
      compliance_status: "compliant" as const,
      confidence: 0.95,
    }]);
    const result = await repairClauseArray(input);
    expect(result).toHaveLength(1);
    expect(result[0].clause_text).toBe("Employees are entitled to 10 sick days per year");
    expect(result[0].clause_type).toBe("sick_leave");
    expect(result[0].compliance_status).toBe("compliant");
    expect(result[0].confidence).toBe(0.95);
  });

  it("repairClauseArray fixes markdown code fences", async () => {
    const input = '```json\n[{"clause_text":"Test","clause_type":"other","compliance_status":"compliant","confidence":0.8}]\n```';
    const result = await repairClauseArray(input);
    expect(result).toHaveLength(1);
    expect(result[0].clause_text).toBe("Test");
  });

  it("repairClauseArray recovers from trailing commas and single-quoted strings", async () => {
    const input = "[{'clause_text': 'test', 'clause_type': 'other', 'compliance_status': 'compliant', 'confidence': 0.8,}]";
    const result = await repairClauseArray(input);
    expect(result).toHaveLength(1);
    expect(result[0].clause_text).toBe("test");
  });

  it("repairClauseArray throws UnrepairableError on completely invalid input", async () => {
    await expect(repairClauseArray("not json at all")).rejects.toThrow(UnrepairableError);
  });

  it("repairClauseArray strips extra hallucinated fields not in schema", async () => {
    const input = JSON.stringify([{
      clause_text: "Test",
      clause_type: "other",
      compliance_status: "compliant",
      confidence: 0.8,
      extra_field: "should be removed",
    }]);
    const result = await repairClauseArray(input);
    expect(result).toHaveLength(1);
    expect("extra_field" in result[0]).toBe(false);
  });

  it("repairClauseArray handles empty array", async () => {
    const result = await repairClauseArray("[]");
    expect(result).toEqual([]);
  });
});
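To make two of the failure modes in these tests concrete, here is a deliberately naive pre-parse cleanup that strips a markdown code fence and removes trailing commas before calling JSON.parse. It is not the algorithm used by @reaatech/structured-repair-core, whose internals are not shown in this recipe; the helper names are hypothetical, and single-quoted strings and schema validation are left out.

```typescript
// Built with repeat() so the sketch contains no literal fence marker.
const FENCE = "`".repeat(3);

// Removes a wrapping markdown code fence (with optional language tag), if present.
function stripCodeFence(raw: string): string {
  const text = raw.trim();
  if (!text.startsWith(FENCE) || !text.endsWith(FENCE)) return text;
  const firstNewline = text.indexOf("\n");
  if (firstNewline === -1) return text;
  // Drop the opening fence line and the closing fence.
  return text.slice(firstNewline + 1, text.length - FENCE.length).trim();
}

// Drops a comma followed only by whitespace and a closing bracket or brace.
// Caveat: this would also alter such a sequence inside a string value.
function removeTrailingCommas(json: string): string {
  return json.replace(/,\s*([\]}])/g, "$1");
}

// Cleans the raw model output, then parses it; throws SyntaxError if still invalid.
function naiveParse(raw: string): unknown {
  return JSON.parse(removeTrailingCommas(stripCodeFence(raw)));
}
```

The string-level caveat in removeTrailingCommas is one reason a dedicated repair library plus schema validation is preferable to regex cleanup in production.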

Intro

Prerequisites

Step 1: Configure environment variables

Step 2: Define the database schema with Drizzle ORM

Step 3: Create the database migration

Step 4: Build the document parser

Step 5: Set up the Cohere LLM client

Step 6: Create the JSON repair service

Step 7: Set up the Artifact Registry and Pipeline Executor

Step 8: Build the compliance service

Step 9: Build the extraction pipeline

Step 10: Create the API routes

Step 11: Set up observability with Langfuse

Step 12: Run the tests

Next steps