# Cohere Document Pipeline for BigCommerce SMB Order Processing
 
> Automatically scan and process emailed purchase orders and quote requests into BigCommerce, cutting order entry time from minutes to seconds.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Problem
 
Small and medium businesses often receive purchase orders via email as PDFs, DOCX files, or images. Manually transcribing these into BigCommerce is slow, error-prone, and does not scale. This recipe automates the entire flow: document ingestion, AI-powered classification, structured data repair, and BigCommerce order creation.
 
## Architecture
 
```
Email/Upload → Fastify (port 3001) → File Detector → Document Extractor / OCR → Cohere Classification → JSON Repair → BigCommerce API → Langfuse Telemetry
                                                                                          │
                                                                            LLM Cache (OpenAI embeddings)
                                                                                          │
                                                                        Cost Telemetry (daily budgets)
```
 
### Pipeline stages
 
1. **Fastify server** — accepts multipart file uploads on `POST /api/ingest`
2. **File detection** — identifies MIME type via `file-type` (PDF, DOCX, image)
3. **Document extraction** — extracts text from PDFs (pdfjs-dist), DOCX files (mammoth), or images via OCR (tesseract.js + sharp preprocessing)
4. **Cohere classification** — sends extracted text to `command-a-03-2025` for structured order JSON output
5. **JSON repair** — repairs malformed LLM output via `@reaatech/structured-repair-core`
6. **LLM cache** — caches classification results via `@reaatech/llm-cache` (OpenAI embeddings, cosine similarity)
7. **BigCommerce API** — creates orders with retry logic via `p-retry`
8. **Telemetry** — cost tracking via `@reaatech/llm-cost-telemetry`, observability via Langfuse
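
Stage 6 relies on embedding similarity to decide whether a new document is close enough to one that was already classified. The sketch below shows the general idea in plain TypeScript. It is not the `@reaatech/llm-cache` API: `CacheEntry`, `findCached`, and the `0.95` threshold are hypothetical names and values chosen for illustration, and the number arrays stand in for OpenAI embedding vectors.

```typescript
// Illustrative only: not the @reaatech/llm-cache API.
interface CacheEntry<T> {
  embedding: number[]; // stand-in for an OpenAI embedding vector
  value: T; // cached classification result
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the most similar cached value that meets the threshold, else undefined.
function findCached<T>(
  query: number[],
  entries: CacheEntry<T>[],
  threshold = 0.95, // assumed cut-off; tune for your data
): T | undefined {
  let best: CacheEntry<T> | undefined;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= threshold && score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.value;
}
```

A lower threshold yields more cache hits (and lower cost) at the risk of reusing a classification for a document that only looks similar.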
 
## Prerequisites
 
Create a `.env` file from `.env.example` with these variables:
 
| Variable | Description |
|---|---|
| `CO_API_KEY` | Cohere API key for order classification |
| `OPENAI_API_KEY` | OpenAI API key for LLM cache embeddings |
| `BIGCOMMERCE_STORE_HASH` | BigCommerce store hash |
| `BIGCOMMERCE_API_USERNAME` | BigCommerce API username |
| `BIGCOMMERCE_API_TOKEN` | BigCommerce API token |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key (optional) |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key (optional) |
| `LANGFUSE_HOST` | Langfuse host URL (optional) |
| `PORT` | Fastify server port (default: 3001) |
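
A startup check can catch missing variables before the first request arrives. The following is a dependency-free sketch (the project itself lists `zod` for schema validation, so its real check may look different). `loadConfig` and `AppConfig` are hypothetical names; the variable names, the optional Langfuse settings, and the `3001` default come from the table above.

```typescript
// Hypothetical helper; only the variable names are taken from the table above.
interface AppConfig {
  cohereApiKey: string;
  openaiApiKey: string;
  bigcommerce: { storeHash: string; apiUsername: string; apiToken: string };
  langfuse?: { publicKey: string; secretKey: string; host?: string };
  port: number;
}

const REQUIRED_VARS = [
  "CO_API_KEY",
  "OPENAI_API_KEY",
  "BIGCOMMERCE_STORE_HASH",
  "BIGCOMMERCE_API_USERNAME",
  "BIGCOMMERCE_API_TOKEN",
] as const;

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Collect every missing variable so the error reports them all at once.
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const port = Number(env.PORT ?? "3001");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }

  // Langfuse is optional: enable it only when both keys are present (assumption).
  const langfuse =
    env.LANGFUSE_PUBLIC_KEY && env.LANGFUSE_SECRET_KEY
      ? {
          publicKey: env.LANGFUSE_PUBLIC_KEY,
          secretKey: env.LANGFUSE_SECRET_KEY,
          host: env.LANGFUSE_HOST,
        }
      : undefined;

  return {
    cohereApiKey: env.CO_API_KEY as string,
    openaiApiKey: env.OPENAI_API_KEY as string,
    bigcommerce: {
      storeHash: env.BIGCOMMERCE_STORE_HASH as string,
      apiUsername: env.BIGCOMMERCE_API_USERNAME as string,
      apiToken: env.BIGCOMMERCE_API_TOKEN as string,
    },
    langfuse,
    port,
  };
}
```

In a Node.js entry point this would be called once as `loadConfig(process.env)`, after the `.env` file has been loaded.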
 
## Quick Start
 
```bash
# Install dependencies (run from the root of the cloned repository)
pnpm install
 
# Configure environment
cp .env.example .env
# Edit .env with your API keys
 
# Start the Fastify ingest server
pnpm dev:fastify
 
# In another terminal, start the Next.js dashboard
pnpm dev
```
 
## API Reference
 
### `POST /api/ingest`
 
Upload a PDF, DOCX, or image file for processing.
 
- **Method**: `POST`
- **Content-Type**: `multipart/form-data`
- **Body**: file field named `file`
- **Size limit**: 10 MB
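
These constraints (three supported formats, a 10 MB limit) can be checked before any extraction work starts. The following is a simplified sketch, not the project's detector: the recipe uses `file-type`, whereas this version inspects a few well-known magic bytes by hand. `validateUpload` and the `FILE_TOO_LARGE` code are hypothetical, the mapping of statuses to failure cases is an assumption, and 10 MB is interpreted here as 10 × 1024 × 1024 bytes. `UNSUPPORTED_FORMAT` is the code shown in the error response example.

```typescript
type DocumentKind = "pdf" | "docx" | "image";

type UploadCheck =
  | { ok: true; kind: DocumentKind }
  | { ok: false; status: 400 | 413; code: string };

// Assumed interpretation of the documented 10 MB limit.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;

// True when `bytes` begins with the given signature.
function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  return signature.every((value, index) => bytes[index] === value);
}

function validateUpload(bytes: Uint8Array): UploadCheck {
  if (bytes.length > MAX_UPLOAD_BYTES) {
    // 413 Payload Too Large; the code string is hypothetical.
    return { ok: false, status: 413, code: "FILE_TOO_LARGE" };
  }
  // "%PDF"
  if (startsWith(bytes, [0x25, 0x50, 0x44, 0x46])) return { ok: true, kind: "pdf" };
  // DOCX is a ZIP container ("PK\x03\x04"). This matches any ZIP archive;
  // a real detector inspects the archive contents as well.
  if (startsWith(bytes, [0x50, 0x4b, 0x03, 0x04])) return { ok: true, kind: "docx" };
  // PNG signature prefix
  if (startsWith(bytes, [0x89, 0x50, 0x4e, 0x47])) return { ok: true, kind: "image" };
  // JPEG start-of-image marker
  if (startsWith(bytes, [0xff, 0xd8, 0xff])) return { ok: true, kind: "image" };
  return { ok: false, status: 400, code: "UNSUPPORTED_FORMAT" };
}
```

Checking content bytes rather than the file extension or the client-supplied `Content-Type` avoids trusting metadata that the sender controls.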
 
**Success response (200):**
```json
{
  "success": true,
  "orderId": "123456",
  "order": { ... },
  "processedByCache": false,
  "costUsd": 0.0023,
  "ingestionId": "abc123"
}
```
 
**Error response (400/413/500):**
```json
{
  "success": false,
  "error": "Unsupported file format",
  "code": "UNSUPPORTED_FORMAT",
  "ingestionId": "abc123"
}
```
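
On the client side, the two response shapes can be modelled as a discriminated union keyed on `success`. This sketch is inferred from the JSON examples above and the project's actual Zod schemas may differ. The type names and `summarize` are hypothetical, and `order` is typed as `unknown` because its shape is elided in the example.

```typescript
interface IngestSuccess {
  success: true;
  orderId: string;
  order: unknown; // shape elided in the example response
  processedByCache: boolean;
  costUsd: number;
  ingestionId: string;
}

interface IngestFailure {
  success: false;
  error: string;
  code: string;
  ingestionId: string;
}

type IngestResponse = IngestSuccess | IngestFailure;

// Summarise a parsed response body; narrowing on `success` selects the branch.
function summarize(response: IngestResponse): string {
  if (response.success) {
    const source = response.processedByCache ? "cache" : "classifier";
    return `Order ${response.orderId} created via ${source} (cost $${response.costUsd.toFixed(4)})`;
  }
  return `Ingestion ${response.ingestionId} failed: ${response.error} [${response.code}]`;
}
```

Because both shapes carry `ingestionId`, a client can log it unconditionally and use it to correlate a failure with server-side traces.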
 
### `GET /api/health`
 
Returns server health status.
 
```json
{ "status": "ok", "uptime": 1234.56 }
```
 
### `GET /api/metrics/cost`
 
Returns daily LLM cost summary.
 
```json
{
  "dailyTotal": 0.045,
  "budgetStatus": { "withinBudget": true, "dailyTotal": 0.045, "dailyLimit": 1.0 }
}
```
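
The `budgetStatus` object can be derived from a list of per-request costs and a daily limit. The sketch below illustrates that arithmetic only. It is not the `@reaatech/llm-cost-telemetry` API, `computeBudgetStatus` is a hypothetical name, and treating a total equal to the limit as within budget is an assumption.

```typescript
interface BudgetStatus {
  withinBudget: boolean;
  dailyTotal: number;
  dailyLimit: number;
}

function computeBudgetStatus(costsUsd: number[], dailyLimit: number): BudgetStatus {
  // Sum in whole micro-dollars so many small costs do not accumulate
  // floating-point drift.
  const totalMicros = costsUsd.reduce((sum, cost) => sum + Math.round(cost * 1e6), 0);
  const dailyTotal = totalMicros / 1e6;
  return { withinBudget: dailyTotal <= dailyLimit, dailyTotal, dailyLimit };
}
```

For example, costs of `0.02` and `0.025` against a limit of `1.0` produce a `dailyTotal` of `0.045` with `withinBudget` set to `true`, matching the sample response.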
 
### `GET /api/orders`
 
Returns recently processed orders. This route is served by the Next.js dashboard.
 
```json
{ "orders": [] }
```
 
### `curl` Example
 
```bash
curl -X POST http://localhost:3001/api/ingest \
  -F "file=@/path/to/purchase-order.pdf"
```
 
## Tech Stack
 
| Package | Role |
|---|---|
| `next` 16.2.9 | App Router dashboard and API routes |
| `fastify` 5.x | Ingest server (multipart file upload) |
| `@fastify/multipart` | File upload handling |
| `cohere-ai` | Order classification via `command-a-03-2025` |
| `zod` | Runtime schema validation and type inference |
| `pdfjs-dist` | PDF text extraction |
| `mammoth` | DOCX text extraction |
| `tesseract.js` | OCR for image-based documents |
| `sharp` | Image preprocessing for OCR |
| `file-type` | MIME type detection |
| `p-retry` | Retry logic for BigCommerce API calls |
| `@reaatech/llm-cache` | LLM response caching (OpenAI embeddings) |
| `@reaatech/llm-cost-telemetry` | LLM cost tracking and daily budgets |
| `@reaatech/structured-repair-core` | Malformed JSON repair |
| `@reaatech/media-pipeline-mcp-doc-extraction` | Structured field extraction |
| `langfuse` | LLM observability and tracing |
| `tsx` | TypeScript execution for Fastify server |
| `vitest` | Test runner with coverage |
 
## Running locally
 
```bash
pnpm install
pnpm test            # vitest run with coverage
pnpm dev             # next dev (dashboard)
pnpm dev:fastify     # tsx watch src/server.ts (ingest API)
```
 
## Project layout
 
```
app/                  Next.js App Router pages + API routes
src/
  schemas/            Zod schemas and inferred types
  lib/                Service wrappers (BigCommerce, cache, repair, telemetry, Langfuse)
  services/           Business logic (file detector, extractor, OCR, classifier, orchestrator)
  server.ts           Fastify server entry point
tests/                vitest suite (mirrors src/)
packages/             API references for every dependency (read these first)
DEV_PLAN.md           build plan for this recipe
```
 
## License
 
MIT — see [LICENSE](./LICENSE).