# Vertex AI Document Pipeline for DocuSign SMB Contract Review
 
> Automatically extract, summarize, and validate key clauses from incoming DocuSign contracts using Vertex AI and hybrid RAG.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
 
## What this builds
 
This recipe builds a complete contract review pipeline that:
 
1. Listens for DocuSign webhook events or accepts envelope IDs
2. Fetches the signed PDF from the DocuSign eSignature API
3. Extracts text via `pdf-parse` (with OCR preprocessing via `sharp` for scanned docs)
4. Chunks the text with `@reaatech/hybrid-rag-ingestion` (supports fixed-size, semantic, recursive, and sliding-window strategies)
5. Embeds each chunk with `@reaatech/hybrid-rag-embedding` (OpenAI `text-embedding-3-small`)
6. Stores embeddings in an in-memory vector store alongside a BM25 inverted index
7. Answers natural-language queries via hybrid retrieval (vector cosine similarity + BM25, fused with reciprocal rank fusion)
8. Extracts structured clauses (termination, payment, liability, etc.) using Google Gemini (`@google/genai`)
9. Repairs malformed LLM JSON with `jsonrepair`
10. Supports regression evaluation against golden trajectory datasets
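
The fusion in step 7 can be illustrated with a minimal sketch of reciprocal rank fusion. It is not taken from `src/lib/vector-search.ts`: the function name, the chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions made for illustration.

```typescript
// Reciprocal rank fusion (RRF): every ranked list contributes 1 / (k + rank)
// to a document's fused score, with rank starting at 1.
// `k = 60` is a common default and an assumption here, not a value from this recipe.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fusedList: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fusedList.push({ id, score }));
  // Highest fused score first.
  return fusedList.sort((a, b) => b.score - a.score);
}

// Hypothetical chunk IDs, ordered best-first by each retriever.
const vectorRanking = ["chunk-3", "chunk-0", "chunk-7"];
const bm25Ranking = ["chunk-0", "chunk-5", "chunk-3"];

const fused = reciprocalRankFusion([vectorRanking, bm25Ranking]);
console.log(fused.map((entry) => entry.id));
```

Because RRF only looks at ranks, it does not need the cosine and BM25 scores to be on a comparable scale, which is one reason it is a popular choice for hybrid retrieval.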
 
## Architecture
 
```
DocuSign Connect (webhook) → DocuSignService → DocumentProcessor → RagStore → ClauseExtractor

                                                              EmbeddingService + BM25
```
 
### Components
 
| Component | File | Role |
|-----------|------|------|
| `DocuSignService` | `src/services/docusign-service.ts` | Fetches envelopes and documents via the DocuSign eSignature API; parses webhook events |
| `DocumentProcessor` | `src/services/document-processor.ts` | Extracts text from PDFs; normalizes scanned documents with `sharp`; validates with `DocumentValidator` |
| `RagStore` | `src/services/rag-store.ts` | In-memory hybrid store: chunks via `ChunkingEngine`, embeds via `EmbeddingService`, indexes with BM25 |
| `ClauseExtractor` | `src/services/clause-extractor.ts` | Uses Gemini to extract, summarize, and answer questions about contract clauses |
| `EvaluationService` | `src/services/evaluation-service.ts` | Loads golden trajectories, compares candidate runs, detects regressions |
| `ContractPipelineOrchestrator` | `src/services/pipeline-orchestrator.ts` | Orchestrates the full pipeline: ingest → process → store → review → evaluate |
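
To make the BM25 half of `RagStore` more concrete, here is a rough sketch of an in-memory BM25 scorer. It is an illustration under stated assumptions, not the contents of `src/lib/bm25.ts`: the tokenizer, the `Doc` type, the sample texts, and the parameter defaults `k1 = 1.2` and `b = 0.75` are all hypothetical choices.

```typescript
// Hypothetical document shape for this sketch.
type Doc = { id: string; text: string };

// Naive tokenizer (an assumption): lowercase ASCII alphanumeric runs.
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(query: string, docs: Doc[], k1 = 1.2, b = 0.75): Map<string, number> {
  const tokenized = docs.map((d) => ({ id: d.id, tokens: tokenize(d.text) }));
  const avgLen =
    tokenized.reduce((sum, d) => sum + d.tokens.length, 0) / Math.max(docs.length, 1);

  // Document frequency: number of documents that contain each term.
  const df = new Map<string, number>();
  for (const d of tokenized) {
    for (const term of Array.from(new Set(d.tokens))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = Array.from(new Set(tokenize(query)));
  const result = new Map<string, number>();
  for (const d of tokenized) {
    let score = 0;
    for (const term of queryTerms) {
      const tf = d.tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      const n = df.get(term) ?? 0;
      // The "1 +" inside the log keeps IDF non-negative for very common terms.
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // Length normalization: longer-than-average documents are penalized via b.
      const norm = tf + k1 * (1 - b + (b * d.tokens.length) / avgLen);
      score += (idf * tf * (k1 + 1)) / norm;
    }
    result.set(d.id, score);
  }
  return result;
}

// Hypothetical chunks for illustration.
const docs: Doc[] = [
  { id: "chunk-0", text: "Either party may terminate this agreement with 30 days notice." },
  { id: "chunk-1", text: "Payment is due within 45 days of invoice." },
  { id: "chunk-2", text: "Liability is capped at the fees paid." },
];
const termScores = bm25Scores("terminate notice", docs);
console.log(termScores);
```

A production index would typically precompute term frequencies in an inverted index rather than rescanning tokens per query; the scan above is kept only for brevity.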
 
## Prerequisites
 
- **Node.js** >= 22
- **pnpm** 10.x
- **GCP project** with Vertex AI API enabled (for Gemini)
- **DocuSign developer account** ([free sandbox](https://go.docusign.com/o/sandbox/))
- **OpenAI API key** (for embeddings)
 
## Environment Variables
 
| Variable | Description |
|----------|-------------|
| `GOOGLE_CLOUD_PROJECT` | GCP project ID for Vertex AI |
| `GOOGLE_CLOUD_LOCATION` | GCP region (default: `us-central1`) |
| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `true` to use Vertex AI endpoint |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account JSON key |
| `DOCUSIGN_ACCESS_TOKEN` | DocuSign API access token |
| `DOCUSIGN_ACCOUNT_ID` | DocuSign account ID |
| `DOCUSIGN_BASE_URL` | DocuSign API base URL (e.g. `https://demo.docusign.net/restapi`) |
| `DOCUSIGN_HMAC_SECRET` | HMAC secret for webhook verification (optional) |
| `OPENAI_API_KEY` | OpenAI API key for embeddings |
| `API_KEY` | API key for the `/api/contracts/evaluate` endpoint |
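
When `DOCUSIGN_HMAC_SECRET` is set, webhook verification might look roughly like the sketch below. It assumes the signature is the base64-encoded HMAC-SHA256 of the raw request body, delivered in a header such as `X-DocuSign-Signature-1`; check the DocuSign Connect documentation for the exact scheme. The function name and sample values are hypothetical and not taken from `docusign-service.ts`.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: signatureHeader = base64(HMAC-SHA256(secret, rawBody)).
// rawBody must be the exact bytes received, before any JSON parsing.
function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHeader, "base64");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Hypothetical secret and payload for illustration.
const secret = "example-secret";
const rawBody = JSON.stringify({ envelopeId: "env-001", status: "completed" });
const signature = createHmac("sha256", secret).update(rawBody, "utf8").digest("base64");

console.log(verifyWebhookSignature(rawBody, signature, secret));
```

The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.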
 
## Getting Started
 
```bash
# Install dependencies
pnpm install
 
# Run tests with coverage
pnpm test
 
# Start development server
pnpm dev
 
# Type-check
pnpm typecheck
 
# Lint
pnpm lint
```
 
## API Reference
 
### `POST /api/contracts/ingest`
 
Ingest a contract from DocuSign.
 
```json
// Request — with envelopeId
{ "envelopeId": "env-001" }
 
// Request — from webhook
{ "webhookPayload": { "envelopeId": "env-001", "status": "completed" } }
```
 
```json
// Response 202
{
  "documentId": "doc-abc123",
  "chunkCount": 12,
  "chunkIds": ["chunk-0", "chunk-1", ...],
  "timestamp": "2024-01-15T00:00:00.000Z"
}
```
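
Since the endpoint accepts two request shapes, a handler needs to resolve an envelope ID from either one. The following is a minimal sketch of that step; `resolveEnvelopeId` is a hypothetical helper, not a function from this repository, and the real route may validate differently.

```typescript
// Returns the envelope ID from either documented body shape, or null if neither fits.
function resolveEnvelopeId(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return null;
  const record = body as Record<string, unknown>;

  // Shape 1: { "envelopeId": "..." }
  const direct = record["envelopeId"];
  if (typeof direct === "string" && direct !== "") return direct;

  // Shape 2: { "webhookPayload": { "envelopeId": "...", "status": "..." } }
  const payload = record["webhookPayload"];
  if (typeof payload === "object" && payload !== null) {
    const nested = (payload as Record<string, unknown>)["envelopeId"];
    if (typeof nested === "string" && nested !== "") return nested;
  }
  return null;
}

console.log(resolveEnvelopeId({ envelopeId: "env-001" }));
console.log(
  resolveEnvelopeId({ webhookPayload: { envelopeId: "env-001", status: "completed" } }),
);
```

A `null` result would typically map to a 400 response before any DocuSign call is made.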
 
### `POST /api/contracts/review`
 
Query the ingested contract corpus.
 
```json
// Request
{ "query": "What are the termination terms?", "topK": 5 }
```
 
```json
// Response 200
{
  "answer": "The contract allows either party to terminate with 30 days notice.",
  "clauses": [
    { "type": "termination", "text": "...", "pageRef": 1, "confidence": 0.95 }
  ],
  "sources": ["chunk-0", "chunk-3"],
  "confidence": 0.92
}
```
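
The `clauses` array comes from LLM output, which the pipeline repairs with `jsonrepair` when it is malformed. A parse-then-repair fallback could be sketched as follows. The helper name is hypothetical, and the repair function is injected so the example runs without dependencies; `stripTrailingCommas` is only a stand-in for this sketch.

```typescript
// Try strict parsing first; fall back to the repair step only on failure.
function parseLlmJson<T>(raw: string, repair: (text: string) => string): T {
  try {
    return JSON.parse(raw) as T;
  } catch {
    // If the repaired text is still invalid, this second parse throws to the caller.
    return JSON.parse(repair(raw)) as T;
  }
}

// Stand-in repair step for this sketch only: drops trailing commas.
// It is naive (it would also touch commas inside string values).
const stripTrailingCommas = (text: string): string =>
  text.replace(/,\s*([}\]])/g, "$1");

const clause = parseLlmJson<{ type: string; confidence: number }>(
  '{ "type": "termination", "confidence": 0.95, }',
  stripTrailingCommas,
);
console.log(clause.type, clause.confidence);
```

In the recipe's setup, the `repair` argument would presumably be the `jsonrepair` function exported by the `jsonrepair` package, which takes a string and returns a repaired string.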
 
### `POST /api/contracts/evaluate`
 
Run regression evaluation against golden trajectories. Requires an `X-API-Key` header.
 
```json
// Request Headers
{ "X-API-Key": "your-api-key" }
 
// Response 200
{
  "runId": "eval-run-001",
  "timestamp": "2024-01-15T00:00:00.000Z",
  "sampleCount": 10,
  "accuracy": 87.5,
  "regressions": 1
}
```
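
One way the summary fields could be derived is sketched below. It assumes `accuracy` is the percentage of passing samples and that a regression is a sample that passed in a baseline run but fails in the candidate run. The `SampleResult` type and `summarizeRun` function are hypothetical, and the actual `EvaluationService` may define these metrics differently.

```typescript
// Hypothetical per-sample outcome; the real golden trajectory schema may differ.
type SampleResult = { sampleId: string; passed: boolean };

function summarizeRun(baseline: SampleResult[], candidate: SampleResult[]) {
  const baselinePassed = new Map<string, boolean>();
  for (const result of baseline) baselinePassed.set(result.sampleId, result.passed);

  const passedCount = candidate.filter((r) => r.passed).length;
  // Regression (assumed definition): passed before, fails now.
  const regressions = candidate.filter(
    (r) => !r.passed && baselinePassed.get(r.sampleId) === true,
  ).length;
  const accuracy = candidate.length === 0 ? 0 : (100 * passedCount) / candidate.length;

  return { sampleCount: candidate.length, accuracy, regressions };
}

// Hypothetical runs: eight samples, one of which starts failing.
const sampleIds = ["s-1", "s-2", "s-3", "s-4", "s-5", "s-6", "s-7", "s-8"];
const baselineRun = sampleIds.map((sampleId) => ({ sampleId, passed: true }));
const candidateRun = sampleIds.map((sampleId) => ({ sampleId, passed: sampleId !== "s-3" }));

const summary = summarizeRun(baselineRun, candidateRun);
console.log(summary);
```

Samples that were already failing in the baseline are not counted as regressions under this definition, which keeps the count focused on newly introduced failures.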
 
### `GET /api/contracts/health`
 
Health check endpoint.
 
```json
// Response 200
{
  "status": "ok",
  "timestamp": "2024-01-15T00:00:00.000Z",
  "stats": {
    "documentCount": 5,
    "chunkCount": 47
  }
}
```
 
## Key Design Decisions
 
1. **`pdf-parse` over `@unstructured-io/sdk`**: `@unstructured-io/sdk` was dropped from the npm registry. `pdf-parse` provides reliable PDF text extraction in pure TypeScript with zero native dependencies, which covers the common case of digitally signed DocuSign PDFs.
 
2. **`jsonrepair` over `@reaatech/structured-output-repair`**: The structured output repair package is not vendored in this recipe. `jsonrepair` handles the common failure modes of LLM JSON output (trailing commas, unquoted keys, single quotes) without added complexity.
 
3. **`@google/genai` over `@google-cloud/vertexai`**: Google's `@google-cloud/vertexai` SDK is deprecated in favor of the unified `@google/genai` SDK. This recipe targets the current recommended client.
 
4. **In-memory RAG store**: This recipe uses an in-memory vector store (`Map<string, number[]>`) and BM25 index rather than an external vector database. This keeps the reference implementation self-contained and deployable without extra infrastructure, while demonstrating the same hybrid retrieval pattern that a production deployment would run against an external vector database such as Qdrant or Pinecone.
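
As a rough illustration of the in-memory vector side, the sketch below runs a brute-force cosine similarity search over a `Map<string, number[]>`. The function names and the three-dimensional toy vectors are assumptions for illustration; real embeddings from `text-embedding-3-small` are much higher-dimensional.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-k: score every stored vector, sort descending, keep k.
function topK(
  store: Map<string, number[]>,
  query: number[],
  k: number,
): Array<{ id: string; score: number }> {
  const results: Array<{ id: string; score: number }> = [];
  store.forEach((vector, id) => {
    results.push({ id, score: cosineSimilarity(query, vector) });
  });
  return results.sort((a, b) => b.score - a.score).slice(0, k);
}

// Toy vectors standing in for real embeddings.
const vectorStore = new Map<string, number[]>([
  ["chunk-0", [1, 0, 0]],
  ["chunk-1", [0.6, 0.8, 0]],
  ["chunk-2", [0, 0, 1]],
]);

const hits = topK(vectorStore, [1, 0, 0], 2);
console.log(hits);
```

A linear scan like this is fine for a handful of contracts; an external vector database replaces it with an approximate nearest-neighbor index once the corpus grows.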
 
## Project Structure
 
```
app/
  api/
    contracts/
      ingest/route.ts           POST — ingest contract
      review/route.ts           POST — query contracts
      evaluate/route.ts         POST — run evaluation
      health/route.ts           GET  — health check
src/
  types/
    contract.ts                 Core domain types (ClauseType, ContractClause, etc.)
    index.ts                    Re-exports from contract.ts + @reaatech/hybrid-rag
  lib/
    errors.ts                   PipelineError hierarchy
    bm25.ts                     In-memory BM25 scorer
    vector-search.ts            Cosine similarity, vector search, fusion
    contract-parser.ts          Heuristic clause type and party extraction
  services/
    docusign-service.ts         DocuSign API client wrapper
    document-processor.ts       PDF extraction with pdf-parse + sharp
    rag-store.ts                In-memory hybrid RAG store
    clause-extractor.ts         Gemini-based clause extraction
    evaluation-service.ts       Golden trajectory evaluation
    pipeline-orchestrator.ts    End-to-end pipeline coordinator
tests/
  factories.ts                  Test data factories
  setup.ts                      MSW server setup
  fixtures/
    sample-contract.txt         Mock contract fixture
    sample-golden.jsonl         Golden trajectory fixture
  lib/
    bm25.test.ts
    vector-search.test.ts
    contract-parser.test.ts
  services/
    docusign-service.test.ts
    document-processor.test.ts
    rag-store.test.ts
    clause-extractor.test.ts
    evaluation-service.test.ts
    pipeline-orchestrator.test.ts
  api/
    contracts-ingest.test.ts
    contracts-review.test.ts
    contracts-health.test.ts
    contracts-evaluate.test.ts
```
 
## License
 
MIT — see [LICENSE](./LICENSE).