# Cohere Document Pipeline for BigCommerce SMB Order Processing
> Automatically scan and process emailed purchase orders and quote requests into BigCommerce, cutting order entry time from minutes to seconds.
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
## Problem
Small and medium businesses often receive purchase orders via email as PDFs, DOCX files, or images. Manually transcribing these into BigCommerce is slow, error-prone, and does not scale. This recipe automates the entire flow: document ingestion, AI-powered classification, structured data repair, and BigCommerce order creation.
## Architecture
```
Email/Upload → Fastify (port 3001) → File Detector → Document Extractor / OCR → Cohere Classification → JSON Repair → BigCommerce API → Langfuse Telemetry
↓
LLM Cache (OpenAI embeddings)
↓
Cost Telemetry (daily budgets)
```
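The flow in the diagram can be read as a chain of async stages, each consuming the previous stage's output. The sketch below is a minimal illustration of that idea, not the repository's orchestrator: `Stage`, `pipe`, `runPipeline`, the interfaces, and the stand-in stage bodies are all hypothetical, and the real stages would call `file-type`, the extractors, Cohere, and the repair library instead.

```typescript
// Sketch only: every name here is hypothetical, not taken from the repository.
type Stage<I, O> = (input: I) => Promise<O>;

// Chain two async stages so the output of the first feeds the second.
function pipe<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return async (input: A) => second(await first(input));
}

interface Upload { filename: string; bytes: Uint8Array }
interface Detected { mime: string; bytes: Uint8Array }
interface Extracted { text: string }
interface Classified { rawJson: string }
interface OrderDraft { customer: string }

// Stand-in detector: guesses from the filename. A real detector would
// inspect the file's leading bytes instead.
const detect: Stage<Upload, Detected> = async (upload) => ({
  mime: upload.filename.toLowerCase().endsWith(".pdf")
    ? "application/pdf"
    : "application/octet-stream",
  bytes: upload.bytes,
});

// Stand-in extractor: treats the bytes as single-byte characters.
const extract: Stage<Detected, Extracted> = async (doc) => ({
  text: Array.from(doc.bytes, (b) => String.fromCharCode(b)).join(""),
});

// Stand-in classifier: pretends the whole text is the customer name.
const classify: Stage<Extracted, Classified> = async (doc) => ({
  rawJson: JSON.stringify({ customer: doc.text.trim() }),
});

// Stand-in for the repair step: parses JSON that is already valid.
const parseOrder: Stage<Classified, OrderDraft> = async (c) =>
  JSON.parse(c.rawJson) as OrderDraft;

const runPipeline: Stage<Upload, OrderDraft> = pipe(
  pipe(pipe(detect, extract), classify),
  parseOrder,
);
```

Keeping each stage behind the same `Stage<I, O>` shape makes it straightforward to swap a stand-in for a real implementation, or to wrap a stage with caching or retries, without touching the rest of the chain.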
### Pipeline stages
1. **Fastify server** — accepts multipart file uploads on `POST /api/ingest`
2. **File detection** — identifies MIME type via `file-type` (PDF, DOCX, image)
3. **Document extraction** — extracts text from PDFs (pdfjs-dist), DOCX files (mammoth), or images via OCR (tesseract.js + sharp preprocessing)
4. **Cohere classification** — sends extracted text to `command-a-03-2025` for structured order JSON output
5. **JSON repair** — repairs malformed LLM output via `@reaatech/structured-repair-core`
6. **LLM cache** — caches classification results via `@reaatech/llm-cache` (OpenAI embeddings, cosine similarity)
7. **BigCommerce API** — creates orders with retry logic via `p-retry`
8. **Telemetry** — cost tracking via `@reaatech/llm-cost-telemetry`, observability via Langfuse
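
Stage 6 relies on cosine similarity between embeddings to decide whether a new document is close enough to a previously classified one. The following is a minimal sketch of that lookup, not the `@reaatech/llm-cache` API: `CacheEntry`, `cosineSimilarity`, `lookup`, and the 0.95 threshold are assumptions chosen for illustration, and the embeddings are plain number arrays rather than real OpenAI vectors.

```typescript
// Sketch only: not the @reaatech/llm-cache API. Names and threshold are assumptions.
interface CacheEntry<T> {
  embedding: number[];
  value: T;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the cached value whose embedding is most similar to the query,
// but only if that similarity reaches the threshold.
function lookup<T>(
  entries: CacheEntry<T>[],
  query: number[],
  threshold = 0.95,
): T | undefined {
  let best: CacheEntry<T> | undefined;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, query);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best !== undefined && bestScore >= threshold ? best.value : undefined;
}
```

A higher threshold reduces the chance of reusing a classification for a document that only looks similar, at the cost of fewer cache hits.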
## Prerequisites
Create a `.env` file from `.env.example` with these variables:
| Variable | Description |
|---|---|
| `CO_API_KEY` | Cohere API key for order classification |
| `OPENAI_API_KEY` | OpenAI API key for LLM cache embeddings |
| `BIGCOMMERCE_STORE_HASH` | BigCommerce store hash |
| `BIGCOMMERCE_API_USERNAME` | BigCommerce API username |
| `BIGCOMMERCE_API_TOKEN` | BigCommerce API token |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key (optional) |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key (optional) |
| `LANGFUSE_HOST` | Langfuse host URL (optional) |
| `PORT` | Fastify server port (default: 3001) |
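
One way to fail fast on a misconfigured `.env` is to validate these variables at startup. The sketch below is an illustration under stated assumptions: `loadConfig` and `AppConfig` are hypothetical names, every variable not marked optional in the table is treated as required, and Langfuse is considered enabled only when both keys are present.

```typescript
// Sketch only: loadConfig and AppConfig are hypothetical names.
// Assumption: variables not marked "(optional)" in the table are required.
const REQUIRED = [
  "CO_API_KEY",
  "OPENAI_API_KEY",
  "BIGCOMMERCE_STORE_HASH",
  "BIGCOMMERCE_API_USERNAME",
  "BIGCOMMERCE_API_TOKEN",
] as const;

type RequiredName = (typeof REQUIRED)[number];
type Env = Record<string, string | undefined>;

interface AppConfig {
  port: number;
  langfuseEnabled: boolean;
  required: Record<RequiredName, string>;
}

function loadConfig(env: Env): AppConfig {
  // Collect every missing name so the error reports them all at once.
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }

  // PORT falls back to 3001, the default listed in the table.
  const rawPort = env["PORT"];
  const port = rawPort ? Number(rawPort) : 3001;
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }

  const required = {} as Record<RequiredName, string>;
  for (const name of REQUIRED) {
    required[name] = env[name] as string;
  }

  // Assumption: tracing needs both Langfuse keys to be set.
  const langfuseEnabled = Boolean(
    env["LANGFUSE_PUBLIC_KEY"] && env["LANGFUSE_SECRET_KEY"],
  );

  return { port, langfuseEnabled, required };
}
```

In a server entry point this would typically be called once as `loadConfig(process.env)` before any client is constructed.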
## Quick Start
```bash
# Install dependencies (run from the cloned repository root)
pnpm install
# Configure environment
cp .env.example .env
# Edit .env with your API keys
# Start the Fastify ingest server
pnpm dev:fastify
# In another terminal, start the Next.js dashboard
pnpm dev
```
## API Reference
### `POST /api/ingest`
Upload a PDF, DOCX, or image file for processing.
- **Method**: `POST`
- **Content-Type**: `multipart/form-data`
- **Body**: file field named `file`
- **Size limit**: 10 MB
**Success response (200):**
```json
{
"success": true,
"orderId": "123456",
"order": { ... },
"processedByCache": false,
"costUsd": 0.0023,
"ingestionId": "abc123"
}
```
**Error response (400/413/500):**
```json
{
"success": false,
"error": "Unsupported file format",
"code": "UNSUPPORTED_FORMAT",
"ingestionId": "abc123"
}
```
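
Because both response shapes carry a `success` flag, a client can model them as a discriminated union and let the compiler narrow the type. The sketch below mirrors the field names from the two JSON samples. `IngestSuccess`, `IngestFailure`, `parseIngestResponse`, and `summarize` are hypothetical names, and `order` is typed as `unknown` because the sample elides its contents.

```typescript
// Field names follow the JSON samples above; type and function names are hypothetical.
interface IngestSuccess {
  success: true;
  orderId: string;
  order: unknown; // contents elided in the sample, so left opaque here
  processedByCache: boolean;
  costUsd: number;
  ingestionId: string;
}

interface IngestFailure {
  success: false;
  error: string;
  code: string;
  ingestionId: string;
}

type IngestResponse = IngestSuccess | IngestFailure;

// Parse a response body and check the fields needed to pick a variant.
function parseIngestResponse(body: string): IngestResponse {
  const data: unknown = JSON.parse(body);
  if (typeof data !== "object" || data === null) {
    throw new Error("Ingest response is not a JSON object");
  }
  const record = data as Record<string, unknown>;
  if (record["success"] === true && typeof record["orderId"] === "string") {
    return record as unknown as IngestSuccess;
  }
  if (
    record["success"] === false &&
    typeof record["error"] === "string" &&
    typeof record["code"] === "string"
  ) {
    return record as unknown as IngestFailure;
  }
  throw new Error("Unexpected ingest response shape");
}

// The `success` discriminant narrows `res` inside each branch.
function summarize(res: IngestResponse): string {
  return res.success
    ? `order ${res.orderId} created (cache hit: ${res.processedByCache})`
    : `failed with ${res.code}: ${res.error}`;
}
```

This sketch only checks the fields it needs to choose a variant; a fuller client could validate every field with a schema library such as `zod`.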
### `GET /api/health`
Returns server health status.
```json
{ "status": "ok", "uptime": 1234.56 }
```
### `GET /api/metrics/cost`
Returns daily LLM cost summary.
```json
{
"dailyTotal": 0.045,
"budgetStatus": { "withinBudget": true, "dailyTotal": 0.045, "dailyLimit": 1.0 }
}
```
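
The response shape suggests a running daily total compared against a limit. The sketch below shows one way such a summary could be derived. It is not the `@reaatech/llm-cost-telemetry` API: `summarizeCosts` is a hypothetical name, and treating a total exactly equal to the limit as within budget is an assumption.

```typescript
// Sketch only: not the @reaatech/llm-cost-telemetry API.
interface BudgetStatus {
  withinBudget: boolean;
  dailyTotal: number;
  dailyLimit: number;
}

interface CostSummary {
  dailyTotal: number;
  budgetStatus: BudgetStatus;
}

// Sum the per-call costs (in USD) recorded today and compare to the limit.
// Assumption: a total exactly equal to the limit still counts as within budget.
function summarizeCosts(callCostsUsd: number[], dailyLimit: number): CostSummary {
  const dailyTotal = callCostsUsd.reduce((sum, cost) => sum + cost, 0);
  return {
    dailyTotal,
    budgetStatus: {
      withinBudget: dailyTotal <= dailyLimit,
      dailyTotal,
      dailyLimit,
    },
  };
}
```

Floating-point sums of small USD amounts can drift slightly, so a production tracker might accumulate integer micro-dollars instead.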
### `GET /api/orders`
Returns recently processed orders. This route is served by the Next.js dashboard.
```json
{ "orders": [] }
```
### `curl` Example
```bash
curl -X POST http://localhost:3001/api/ingest \
-F "file=@/path/to/purchase-order.pdf"
```
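
The same upload can be made from TypeScript with the built-in `fetch`, `FormData`, and `Blob` globals (available in Node 18 and later). This is a minimal sketch: `buildIngestForm` and `uploadDocument` are hypothetical helper names, the default base URL matches the `curl` example, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption about how the server counts.

```typescript
// Sketch only: helper names are hypothetical.
// Assumption: "10 MB" means 10 * 1024 * 1024 bytes; the server may count differently.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;

// Build the multipart body with the file under the field name `file`.
function buildIngestForm(file: Blob, filename: string): FormData {
  if (file.size > MAX_UPLOAD_BYTES) {
    throw new Error(`File exceeds ${MAX_UPLOAD_BYTES} bytes`);
  }
  const form = new FormData();
  form.append("file", file, filename);
  return form;
}

// POST the form to the ingest endpoint and return the parsed JSON body.
async function uploadDocument(
  form: FormData,
  baseUrl = "http://localhost:3001",
): Promise<unknown> {
  // fetch sets the multipart boundary itself, so no Content-Type header is set here.
  const response = await fetch(`${baseUrl}/api/ingest`, {
    method: "POST",
    body: form,
  });
  return response.json();
}
```

Checking the size on the client avoids sending a request that the server would reject with a 413.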
## Tech Stack
| Package | Role |
|---|---|
| `next` 16.2.9 | App Router dashboard and API routes |
| `fastify` 5.x | Ingest server (multipart file upload) |
| `@fastify/multipart` | File upload handling |
| `cohere-ai` | Order classification via `command-a-03-2025` |
| `zod` | Runtime schema validation and type inference |
| `pdfjs-dist` | PDF text extraction |
| `mammoth` | DOCX text extraction |
| `tesseract.js` | OCR for image-based documents |
| `sharp` | Image preprocessing for OCR |
| `file-type` | MIME type detection |
| `p-retry` | Retry logic for BigCommerce API calls |
| `@reaatech/llm-cache` | LLM response caching (OpenAI embeddings) |
| `@reaatech/llm-cost-telemetry` | LLM cost tracking and daily budgets |
| `@reaatech/structured-repair-core` | Malformed JSON repair |
| `@reaatech/media-pipeline-mcp-doc-extraction` | Structured field extraction |
| `langfuse` | LLM observability and tracing |
| `tsx` | TypeScript execution for Fastify server |
| `vitest` | Test runner with coverage |
## Running locally
```bash
pnpm install
pnpm test # vitest run with coverage
pnpm dev # next dev (dashboard)
pnpm dev:fastify # tsx watch src/server.ts (ingest API)
```
## Project layout
```
app/ Next.js App Router pages + API routes
src/
schemas/ Zod schemas and inferred types
lib/ Service wrappers (BigCommerce, cache, repair, telemetry, Langfuse)
services/ Business logic (file detector, extractor, OCR, classifier, orchestrator)
server.ts Fastify server entry point
tests/ vitest suite (mirrors src/)
packages/ API references for every dependency (read these first)
DEV_PLAN.md build plan for this recipe
```
## License
MIT — see [LICENSE](./LICENSE).