# Vertex AI Document Pipeline for DocuSign SMB Contract Review
> Automatically extract, summarize, and validate key clauses from incoming DocuSign contracts using Vertex AI and hybrid RAG.
A tutorial-style reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
## What this builds
This recipe builds a complete contract review pipeline that:
1. Listens for DocuSign webhook events or accepts envelope IDs
2. Fetches the signed PDF from the DocuSign eSignature API
3. Extracts text via `pdf-parse` (with OCR preprocessing via `sharp` for scanned docs)
4. Chunks the text with `@reaatech/hybrid-rag-ingestion` (supports fixed-size, semantic, recursive, and sliding-window strategies)
5. Embeds each chunk with `@reaatech/hybrid-rag-embedding` (OpenAI `text-embedding-3-small`)
6. Stores embeddings in an in-memory vector store alongside a BM25 inverted index
7. Answers natural-language queries via hybrid retrieval (vector cosine similarity + BM25, fused with reciprocal rank fusion)
8. Extracts structured clauses (termination, payment, liability, etc.) using Google Gemini (`@google/genai`)
9. Repairs malformed LLM JSON with `jsonrepair`
10. Supports regression evaluation against golden trajectory datasets
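
The fusion in step 7 can be illustrated with a minimal sketch of reciprocal rank fusion. The function name `reciprocalRankFusion`, the constant `k = 60`, and the sample chunk IDs are assumptions for illustration; the recipe's own implementation in `src/lib/vector-search.ts` may differ.

```typescript
// Reciprocal rank fusion (RRF): each ranking contributes 1 / (k + rank) per item,
// with rank counted from 1. k = 60 is a commonly used default (assumed here).
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  rankings.forEach((ranking) => {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  });
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical rankings from the vector search and the BM25 index.
const vectorRanking = ["chunk-3", "chunk-0", "chunk-7"];
const bm25Ranking = ["chunk-0", "chunk-5", "chunk-3"];
const fused = reciprocalRankFusion([vectorRanking, bm25Ranking]);
console.log(fused.map((entry) => entry.id));
```

Because RRF works on ranks rather than raw scores, cosine similarities and BM25 scores never need to be normalized onto a common scale.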
## Architecture
```
DocuSign Connect (webhook) → DocuSignService → DocumentProcessor → RagStore → ClauseExtractor
                                                                      ↕
                                                           EmbeddingService + BM25
```
### Components
| Component | File | Role |
|-----------|------|------|
| `DocuSignService` | `src/services/docusign-service.ts` | Fetches envelopes and documents via the DocuSign eSignature API; parses webhook events |
| `DocumentProcessor` | `src/services/document-processor.ts` | Extracts text from PDFs; normalizes scanned documents with `sharp`; validates with `DocumentValidator` |
| `RagStore` | `src/services/rag-store.ts` | In-memory hybrid store: chunks via `ChunkingEngine`, embeds via `EmbeddingService`, indexes with BM25 |
| `ClauseExtractor` | `src/services/clause-extractor.ts` | Uses Gemini to extract, summarize, and answer questions about contract clauses |
| `EvaluationService` | `src/services/evaluation-service.ts` | Loads golden trajectories, compares candidate runs, detects regressions |
| `ContractPipelineOrchestrator` | `src/services/pipeline-orchestrator.ts` | Orchestrates the full pipeline: ingest → process → store → review → evaluate |
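
To make the keyword half of `RagStore` concrete, here is a minimal BM25 scoring sketch. The `Doc` type, the tokenizer, and the parameter defaults (`k1 = 1.2`, `b = 0.75`) are assumptions chosen for illustration, not the contents of `src/lib/bm25.ts`.

```typescript
type Doc = { id: string; text: string };

// Naive tokenizer (assumption): lowercase alphanumeric runs.
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(docs: Doc[], query: string, k1 = 1.2, b = 0.75): Map<string, number> {
  const tokenized = docs.map((d) => ({ id: d.id, terms: tokenize(d.text) }));
  const avgLen =
    tokenized.reduce((sum, d) => sum + d.terms.length, 0) / Math.max(tokenized.length, 1);

  // Document frequency: number of documents containing each term.
  const df = new Map<string, number>();
  tokenized.forEach((d) => {
    Array.from(new Set(d.terms)).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
  });

  const queryTerms = Array.from(new Set(tokenize(query)));
  const scores = new Map<string, number>();
  tokenized.forEach((d) => {
    let score = 0;
    queryTerms.forEach((term) => {
      const tf = d.terms.filter((t) => t === term).length;
      if (tf === 0) return;
      const n = df.get(term) ?? 0;
      // Smoothed IDF variant that stays non-negative.
      const idf = Math.log(1 + (tokenized.length - n + 0.5) / (n + 0.5));
      const lengthNorm = 1 - b + b * (d.terms.length / avgLen);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    });
    scores.set(d.id, score);
  });
  return scores;
}

// Hypothetical chunks for illustration.
const sampleDocs: Doc[] = [
  { id: "chunk-0", text: "Either party may terminate with 30 days notice." },
  { id: "chunk-1", text: "Payment is due within 45 days of invoice." },
];
const keywordScores = bm25Scores(sampleDocs, "terminate notice");
console.log(keywordScores);
```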
## Prerequisites
- **Node.js** >= 22
- **pnpm** 10.x
- **GCP project** with Vertex AI API enabled (for Gemini)
- **DocuSign developer account** ([free sandbox](https://go.docusign.com/o/sandbox/))
- **OpenAI API key** (for embeddings)
## Environment Variables
| Variable | Description |
|----------|-------------|
| `GOOGLE_CLOUD_PROJECT` | GCP project ID for Vertex AI |
| `GOOGLE_CLOUD_LOCATION` | GCP region (default: `us-central1`) |
| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `true` to use Vertex AI endpoint |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account JSON key |
| `DOCUSIGN_ACCESS_TOKEN` | DocuSign API access token |
| `DOCUSIGN_ACCOUNT_ID` | DocuSign account ID |
| `DOCUSIGN_BASE_URL` | DocuSign API base URL (e.g. `https://demo.docusign.net/restapi`) |
| `DOCUSIGN_HMAC_SECRET` | HMAC secret for webhook verification (optional) |
| `OPENAI_API_KEY` | OpenAI API key for embeddings |
| `API_KEY` | API key for the `/api/contracts/evaluate` endpoint |
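
When `DOCUSIGN_HMAC_SECRET` is set, incoming webhooks can be checked before processing. The sketch below assumes DocuSign Connect sends a base64-encoded HMAC-SHA256 of the raw request body in the `X-DocuSign-Signature-1` header; the function name `verifyDocuSignSignature` is hypothetical, and the recipe's actual check may be structured differently.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only if the header value matches the HMAC-SHA256 of the raw body.
// Assumption: the signature header is base64-encoded.
function verifyDocuSignSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHeader, "base64");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Example: sign a payload locally, then verify it.
const exampleBody = '{"envelopeId":"env-001","status":"completed"}';
const exampleSignature = createHmac("sha256", "example-secret")
  .update(exampleBody, "utf8")
  .digest("base64");
console.log(verifyDocuSignSignature(exampleBody, exampleSignature, "example-secret"));
```

Verification must run against the raw request body, before any JSON parsing and re-serialization, since whitespace changes alter the digest.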
## Getting Started
```bash
# Install dependencies
pnpm install
# Run tests with coverage
pnpm test
# Start development server
pnpm dev
# Type-check
pnpm typecheck
# Lint
pnpm lint
```
## API Reference
### `POST /api/contracts/ingest`
Ingest a contract from DocuSign.
```json
// Request — with envelopeId
{ "envelopeId": "env-001" }
// Request — from webhook
{ "webhookPayload": { "envelopeId": "env-001", "status": "completed" } }
```
```json
// Response 202
{
"documentId": "doc-abc123",
"chunkCount": 12,
"chunkIds": ["chunk-0", "chunk-1", ...],
"timestamp": "2024-01-15T00:00:00.000Z"
}
```
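
Since the endpoint accepts two request shapes, a handler needs to resolve both to a single envelope ID. The following is a rough sketch of that step; `resolveEnvelopeId` is a hypothetical helper, and the rule that only `completed` webhook events are ingested is an assumption rather than documented behavior.

```typescript
// Resolve an envelope ID from either request shape, or null if the body is invalid.
function resolveEnvelopeId(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return null;
  const record = body as Record<string, unknown>;

  // Shape 1: { "envelopeId": "..." }
  if (typeof record.envelopeId === "string" && record.envelopeId.length > 0) {
    return record.envelopeId;
  }

  // Shape 2: { "webhookPayload": { "envelopeId": "...", "status": "..." } }
  const payload = record.webhookPayload;
  if (typeof payload === "object" && payload !== null) {
    const { envelopeId, status } = payload as Record<string, unknown>;
    // Assumption: only completed envelopes are worth ingesting.
    if (typeof envelopeId === "string" && envelopeId.length > 0 && status === "completed") {
      return envelopeId;
    }
  }
  return null;
}

console.log(resolveEnvelopeId({ envelopeId: "env-001" }));
console.log(resolveEnvelopeId({ webhookPayload: { envelopeId: "env-001", status: "sent" } }));
```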
### `POST /api/contracts/review`
Query the ingested contract corpus.
```json
// Request
{ "query": "What are the termination terms?", "topK": 5 }
```
```json
// Response 200
{
"answer": "The contract allows either party to terminate with 30 days notice.",
"clauses": [
{ "type": "termination", "text": "...", "pageRef": 1, "confidence": 0.95 }
],
"sources": ["chunk-0", "chunk-3"],
"confidence": 0.92
}
```
### `POST /api/contracts/evaluate`
Run regression evaluation against golden trajectories. Requires `X-API-Key` header.
```json
// Request Headers
{ "X-API-Key": "your-api-key" }
// Response 200
{
"runId": "eval-run-001",
"timestamp": "2024-01-15T00:00:00.000Z",
"sampleCount": 10,
"accuracy": 87.5,
"regressions": 1
}
```
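
One way to arrive at `accuracy` and `regressions` figures like those in the response is sketched below. The `SampleResult` type and the definition of a regression (a sample that passed in the baseline run but fails in the candidate run) are assumptions for illustration; `EvaluationService` may define them differently.

```typescript
type SampleResult = { sampleId: string; passed: boolean };

// Compare a candidate run against a baseline run over the same samples.
function compareRuns(baseline: SampleResult[], candidate: SampleResult[]) {
  const candidateById = new Map<string, boolean>();
  candidate.forEach((s) => candidateById.set(s.sampleId, s.passed));

  let passedCount = 0;
  let regressions = 0;
  baseline.forEach((b) => {
    // A sample missing from the candidate run counts as a failure (assumption).
    const passed = candidateById.get(b.sampleId) ?? false;
    if (passed) passedCount += 1;
    if (b.passed && !passed) regressions += 1;
  });

  const sampleCount = baseline.length;
  const accuracy = sampleCount === 0 ? 0 : (passedCount / sampleCount) * 100;
  return { sampleCount, accuracy, regressions };
}

// Hypothetical runs: four samples, one of which regresses.
const baselineRun: SampleResult[] = [
  { sampleId: "s1", passed: true },
  { sampleId: "s2", passed: true },
  { sampleId: "s3", passed: true },
  { sampleId: "s4", passed: true },
];
const candidateRun: SampleResult[] = [
  { sampleId: "s1", passed: true },
  { sampleId: "s2", passed: false },
  { sampleId: "s3", passed: true },
  { sampleId: "s4", passed: true },
];
const evaluation = compareRuns(baselineRun, candidateRun);
console.log(evaluation);
```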
### `GET /api/contracts/health`
Health check endpoint.
```json
// Response 200
{
"status": "ok",
"timestamp": "2024-01-15T00:00:00.000Z",
"stats": {
"documentCount": 5,
"chunkCount": 47
}
}
```
## Key Design Decisions
1. **`pdf-parse` over `@unstructured-io/sdk`**: `@unstructured-io/sdk` was dropped from the npm registry. `pdf-parse` provides reliable PDF text extraction in pure TypeScript with zero native dependencies, covering the common case for digitally-signed DocuSign PDFs.
2. **`jsonrepair` over `@reaatech/structured-output-repair`**: The structured output repair package is not vendored in this recipe. `jsonrepair` handles the common failure modes of LLM JSON output (trailing commas, unquoted keys, single quotes) without added complexity.
3. **`@google/genai` over `@google-cloud/vertexai`**: Google's `@google-cloud/vertexai` SDK is deprecated in favor of the unified `@google/genai` SDK. This recipe targets the current recommended client.
4. **In-memory RAG store**: This recipe uses an in-memory vector store (`Map<string, number[]>`) and BM25 index rather than an external vector database. This keeps the reference implementation self-contained and deployable without infrastructure, while still demonstrating the full hybrid retrieval pattern. For production scale, the in-memory store can be swapped for a vector database such as Qdrant or Pinecone.
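
As an illustration of the fourth decision, a brute-force search over a `Map<string, number[]>` takes only a few lines. This is a minimal sketch; the names `cosineSimilarity` and `vectorSearch` and the two-dimensional vectors are illustrative assumptions, not the code in `src/lib/vector-search.ts`.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimensions do not match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat zero vectors as having no similarity to anything.
  return denominator === 0 ? 0 : dot / denominator;
}

// Score every stored embedding against the query and keep the topK best.
function vectorSearch(
  store: Map<string, number[]>,
  query: number[],
  topK: number,
): Array<{ id: string; score: number }> {
  return Array.from(store.entries())
    .map(([id, embedding]) => ({ id, score: cosineSimilarity(query, embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy 2-dimensional embeddings; real embeddings have far more dimensions.
const store = new Map<string, number[]>([
  ["chunk-0", [1, 0]],
  ["chunk-1", [0, 1]],
  ["chunk-2", [1, 1]],
]);
const nearest = vectorSearch(store, [1, 0.1], 2);
console.log(nearest.map((hit) => hit.id));
```

A linear scan like this costs one similarity computation per stored chunk on every query, which is acceptable for a handful of contracts and is the main reason a dedicated vector database becomes attractive at larger scale.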
## Project Structure
```
app/
  api/
    contracts/
      ingest/route.ts              POST — ingest contract
      review/route.ts              POST — query contracts
      evaluate/route.ts            POST — run evaluation
      health/route.ts              GET — health check
src/
  types/
    contract.ts                    Core domain types (ClauseType, ContractClause, etc.)
    index.ts                       Re-exports from contract.ts + @reaatech/hybrid-rag
  lib/
    errors.ts                      PipelineError hierarchy
    bm25.ts                        In-memory BM25 scorer
    vector-search.ts               Cosine similarity, vector search, fusion
    contract-parser.ts             Heuristic clause type and party extraction
  services/
    docusign-service.ts            DocuSign API client wrapper
    document-processor.ts          PDF extraction with pdf-parse + sharp
    rag-store.ts                   In-memory hybrid RAG store
    clause-extractor.ts            Gemini-based clause extraction
    evaluation-service.ts          Golden trajectory evaluation
    pipeline-orchestrator.ts       End-to-end pipeline coordinator
tests/
  factories.ts                     Test data factories
  setup.ts                         MSW server setup
  fixtures/
    sample-contract.txt            Mock contract fixture
    sample-golden.jsonl            Golden trajectory fixture
  lib/
    bm25.test.ts
    vector-search.test.ts
    contract-parser.test.ts
  services/
    docusign-service.test.ts
    document-processor.test.ts
    rag-store.test.ts
    clause-extractor.test.ts
    evaluation-service.test.ts
    pipeline-orchestrator.test.ts
  api/
    contracts-ingest.test.ts
    contracts-review.test.ts
    contracts-health.test.ts
    contracts-evaluate.test.ts
```
## License
MIT — see [LICENSE](./LICENSE).