# Cohere Document Pipeline for HR Policy Compliance
 
Automatically extract and structure policy clauses from employee handbooks, sick leave rules, and state mandates, then monitor them for compliance.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build a document processing pipeline with Cohere AI and the `@reaatech/*` package family.
 
## Problem
 
Small businesses waste hours manually cross-referencing PDF and Word policy documents to stay compliant with changing regulations, risking fines and employee disputes when policies are outdated or contradictory.
 
## Architecture
 
```
Upload (PDF/DOCX) → Parse (unpdf/mammoth) → LLM Extract (Cohere) → Repair (structured-repair-core) → Store (PostgreSQL via Drizzle) → Dashboard (Next.js)
```
 
- **Document Ingestion**: PDF files parsed via `unpdf`, DOCX files via `mammoth`
- **LLM Extraction**: Cohere's `command-a-03-2025` model extracts policy clauses as structured JSON
- **JSON Repair**: `@reaatech/structured-repair-core` fixes malformed LLM output using Zod schemas
- **Pipeline Orchestration**: `@reaatech/media-pipeline-mcp-core` manages the extraction pipeline
- **Document Extraction**: `@reaatech/media-pipeline-mcp-doc-extraction` provides OCR, table extraction, and field extraction
- **Storage**: PostgreSQL with pgvector for embedding storage, accessed via Drizzle ORM
- **Observability**: Langfuse tracing for LLM calls
- **Dashboard**: Next.js App Router pages showing compliance summary and policy search
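
The flow above can be read as a chain of async stages. The following is a minimal sketch under stated assumptions, not the project's actual code: `pickParser`, `Stage`, `Clause`, and `runPipeline` are hypothetical names, and the real parsing, Cohere, and repair calls are passed in as placeholder functions. The storage and dashboard steps are left out.

```typescript
// Hypothetical sketch: the names and shapes below are illustrative assumptions,
// not the APIs of unpdf, mammoth, cohere-ai, or the @reaatech packages.

type Parser = "unpdf" | "mammoth";

// Choose a parser from the file extension, mirroring the ingestion bullet.
function pickParser(filename: string): Parser {
  const ext = filename.toLowerCase().split(".").pop();
  if (ext === "pdf") return "unpdf";
  if (ext === "docx") return "mammoth";
  throw new Error(`Unsupported file type: ${filename}`);
}

// A stage takes the previous stage's output and produces the next input.
type Stage<I, O> = (input: I) => Promise<O>;

// Assumed clause shape; the real schema is not shown in this README.
interface Clause {
  title: string;
  text: string;
}

// Run the stages in order: parse -> extract -> repair.
async function runPipeline(
  filename: string,
  parse: Stage<string, string>, // filename -> plain text
  extract: Stage<string, string>, // plain text -> raw LLM JSON string
  repair: Stage<string, Clause[]>, // raw JSON string -> validated clauses
): Promise<{ parser: Parser; clauses: Clause[] }> {
  const parser = pickParser(filename);
  const text = await parse(filename);
  const rawJson = await extract(text);
  const clauses = await repair(rawJson);
  return { parser, clauses };
}
```

Passing the stages in as functions keeps each step replaceable in tests, which is one plausible reason to structure an extraction pipeline this way.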
 
## Setup
 
```bash
cp .env.example .env
# Fill in DATABASE_URL, COHERE_API_KEY, LANGFUSE keys
pnpm install
pnpm db:push
pnpm dev
```
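
Because the app depends on several secrets, a fail-fast check at startup can save debugging time. The helper below is a hypothetical sketch: `missingEnv` is not part of this project, and the two Langfuse variable names are assumptions, so check `.env.example` for the names actually used.

```typescript
// Hypothetical helper: returns the names of required variables that are
// missing or blank. The Langfuse variable names are assumptions; check
// .env.example for the names this project actually uses.
const REQUIRED_ENV: readonly string[] = [
  "DATABASE_URL",
  "COHERE_API_KEY",
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
];

function missingEnv(
  env: Record<string, string | undefined>,
  required: readonly string[] = REQUIRED_ENV,
): string[] {
  // Treat undefined, empty, and whitespace-only values as missing.
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Example use at startup (commented out so the snippet has no side effects):
// const missing = missingEnv(process.env);
// if (missing.length > 0) throw new Error(`Missing env vars: ${missing.join(", ")}`);
```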
 
## API
 
| Endpoint | Method | Description |
|---|---|---|
| `/api/upload` | POST | Upload a PDF or DOCX policy document (multipart/form-data) |
| `/api/upload` | GET | List uploaded documents |
| `/api/documents/{id}` | GET | Get document details and extracted clauses |
| `/api/compliance` | GET | Compliance summary (total clauses, compliant %, gaps by severity) |
| `/api/search?q=` | GET | Full-text search across policy clauses |
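
The endpoints in the table can be called with plain `fetch`. This is a minimal client sketch under stated assumptions: the paths come from the table, but the base URL, the `file` form field name, and the response shapes are not documented here, so responses are typed as `unknown` and all function names are hypothetical.

```typescript
// Hypothetical client sketch. Endpoint paths come from the API table; the
// base URL, the "file" form field name, and the response shapes are assumptions.

// Build the search URL, letting URLSearchParams handle query encoding.
function searchUrl(baseUrl: string, query: string): string {
  const url = new URL("/api/search", baseUrl);
  url.searchParams.set("q", query);
  return url.toString();
}

// Build the document-details URL, escaping the id as a single path segment.
function documentUrl(baseUrl: string, id: string): string {
  return new URL(`/api/documents/${encodeURIComponent(id)}`, baseUrl).toString();
}

// GET /api/search?q=...
async function searchClauses(baseUrl: string, query: string): Promise<unknown> {
  const res = await fetch(searchUrl(baseUrl, query));
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  return res.json();
}

// POST /api/upload as multipart/form-data.
async function uploadPolicy(
  baseUrl: string,
  file: Blob,
  filename: string,
): Promise<unknown> {
  const form = new FormData();
  form.append("file", file, filename); // field name is an assumption
  const res = await fetch(new URL("/api/upload", baseUrl).toString(), {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json();
}
```

Against a local dev server this might be called as `searchClauses("http://localhost:3000", "sick leave")`, assuming the default Next.js port.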
 
## Packages Used
 
- **@reaatech/media-pipeline-mcp-doc-extraction** — OCR, table extraction, field extraction, and summarization via registered LLM providers
- **@reaatech/media-pipeline-mcp-core** — Pipeline execution engine, artifact registry, quality gate evaluation, budget enforcement
- **@reaatech/structured-repair-core** — Zod schema-driven JSON repair for malformed LLM outputs
- **cohere-ai** — Cohere Command model for policy clause extraction and compliance gap analysis
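
To illustrate what repairing malformed LLM output involves, here is a dependency-free sketch. It is a stand-in, not the `@reaatech/structured-repair-core` API, which this README does not show. The real package is described as Zod schema-driven, whereas this sketch uses a hand-written type guard, and `PolicyClause`, `isPolicyClause`, and `repairClauses` are hypothetical names with assumed fields.

```typescript
// Hypothetical stand-in for the repair step; this is NOT the
// @reaatech/structured-repair-core API. The clause fields are assumed.

interface PolicyClause {
  title: string;
  text: string;
}

// Hand-written shape check, playing the role a Zod schema would play.
function isPolicyClause(value: unknown): value is PolicyClause {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.title === "string" && typeof v.text === "string";
}

function repairClauses(raw: string): PolicyClause[] {
  // 1. Keep only the outermost JSON array, dropping surrounding prose.
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start === -1 || end <= start) throw new Error("No JSON array found");
  let candidate = raw.slice(start, end + 1);

  // 2. Remove trailing commas before a closing bracket or brace.
  //    Naive: this would also alter such sequences inside string values.
  candidate = candidate.replace(/,\s*([\]}])/g, "$1");

  // 3. Parse, then keep only entries that match the expected shape.
  const parsed: unknown = JSON.parse(candidate);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed.filter(isPolicyClause);
}
```

A schema-driven repair library would typically go further, for example by coercing types or reporting which fields failed validation; this sketch only drops entries that do not match.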
 
## Project Layout
 
```
app/                  Next.js App Router pages + API routes
src/
  db/                 Drizzle ORM schema and database client
  lib/                Business logic (document parser, Cohere client, repair, compliance service)
  pipeline/           Extraction pipeline orchestration
  instrumentation.ts  Next.js instrumentation (Langfuse)
tests/                Vitest suite (mirrors src/)
packages/             API references for every dependency
```
 
## License
 
MIT — see [LICENSE](./LICENSE).