# Anthropic RAG Pipeline for Google Workspace SMB Email Knowledge Search
 
> Ask questions in plain English and get answers with citations from your entire Google Workspace email and documents – no more manual searching.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Overview
 
Small and medium businesses store critical knowledge across Gmail and Google Drive — contracts, proposals, client emails, internal documentation. Finding the right information means manually searching through hundreds of threads and files.
 
This project solves that with a retrieval-augmented generation (RAG) pipeline that ingests Google Workspace data nightly, embeds it into a PostgreSQL + pgvector vector store, and lets you ask natural-language questions through a chat interface. Answers include source citations so you can verify every response.
 
## Architecture
 
The pipeline follows five stages:
 
```
Ingestion → Embedding → Storage → Retrieval → Generation
```
 
1. **Ingestion**: `gmail-sync.ts` and `drive-sync.ts` fetch recent emails and documents via the Google APIs, then `content-parser.ts` normalizes HTML, PDF, and DOCX content to plain text.
2. **Embedding**: `embedder.ts` uses the VoyageAI client to convert text chunks into vector embeddings with `voyage-3-lite`.
3. **Storage**: `embedAndStore` writes chunk vectors to a PostgreSQL table via `pgvector`, along with source metadata.
4. **Retrieval**: `searchSimilar` performs cosine-similarity search at query time to find the most relevant chunks.
5. **Generation**: `app/api/chat/route.ts` sends retrieved context to Anthropic Claude and streams an answer with source citations.
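
To make the retrieval stage concrete, here is a minimal in-memory sketch of cosine-similarity ranking. It is not the project's `searchSimilar`: the real search runs inside PostgreSQL through pgvector (whose `<=>` operator returns cosine distance, i.e. one minus the similarity). The names `StoredChunk`, `cosineSimilarity`, and `topK` are hypothetical and exist only for illustration.

```typescript
// Hypothetical in-memory stand-in for the retrieval stage. The project
// itself performs this ranking in PostgreSQL via pgvector.
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and keep the k best matches.
function topK(
  query: number[],
  chunks: StoredChunk[],
  k: number,
): Array<{ chunk: StoredChunk; score: number }> {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this is only practical for small collections; delegating the search to pgvector lets the database do the ranking and, optionally, use an index.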
 
## Prerequisites
 
- **Node.js** 22+
- **pnpm** 10
- **PostgreSQL** with [pgvector](https://github.com/pgvector/pgvector) extension
- **Anthropic API key** (Claude)
- **VoyageAI API key** (embeddings)
- **Google Workspace service account** with Gmail and Drive API access
 
## Environment Variables
 
| Variable | Required | Default | Description |
|---|---|---|---|
| `NODE_ENV` | No | `development` | Runtime environment |
| `ANTHROPIC_API_KEY` | Yes | — | Anthropic API key for Claude |
| `VOYAGE_API_KEY` | Yes | — | VoyageAI API key for embeddings |
| `DATABASE_URL` | Yes | — | PostgreSQL connection string with pgvector |
| `GOOGLE_CLIENT_EMAIL` | Yes | — | Google Workspace service account email |
| `GOOGLE_PRIVATE_KEY` | Yes | — | Google Workspace service account private key |
| `GOOGLE_DELEGATED_USER` | No | `""` | Admin email for domain-wide delegation |
| `LANGFUSE_PUBLIC_KEY` | No | — | Langfuse observability public key |
| `LANGFUSE_SECRET_KEY` | No | — | Langfuse observability secret key |
| `NEXT_PUBLIC_APP_URL` | No | `http://localhost:3000` | Public application URL |
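
The table above can be read as a small validation contract. The project validates its environment with Zod; the sketch below expresses the same required/default rules in plain TypeScript so it runs without dependencies. `AppEnv`, `REQUIRED_KEYS`, and `parseEnv` are hypothetical names, not the project's actual config module.

```typescript
// Sketch only: mirrors the environment variable table without Zod.
// All identifiers here are illustrative assumptions.
interface AppEnv {
  NODE_ENV: string;
  ANTHROPIC_API_KEY: string;
  VOYAGE_API_KEY: string;
  DATABASE_URL: string;
  GOOGLE_CLIENT_EMAIL: string;
  GOOGLE_PRIVATE_KEY: string;
  GOOGLE_DELEGATED_USER: string;
  LANGFUSE_PUBLIC_KEY?: string;
  LANGFUSE_SECRET_KEY?: string;
  NEXT_PUBLIC_APP_URL: string;
}

const REQUIRED_KEYS = [
  "ANTHROPIC_API_KEY",
  "VOYAGE_API_KEY",
  "DATABASE_URL",
  "GOOGLE_CLIENT_EMAIL",
  "GOOGLE_PRIVATE_KEY",
] as const;

function parseEnv(raw: Record<string, string | undefined>): AppEnv {
  // Report every missing required variable at once, not just the first.
  const missing = REQUIRED_KEYS.filter((key) => !raw[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    NODE_ENV: raw["NODE_ENV"] ?? "development",
    ANTHROPIC_API_KEY: raw["ANTHROPIC_API_KEY"] as string,
    VOYAGE_API_KEY: raw["VOYAGE_API_KEY"] as string,
    DATABASE_URL: raw["DATABASE_URL"] as string,
    GOOGLE_CLIENT_EMAIL: raw["GOOGLE_CLIENT_EMAIL"] as string,
    GOOGLE_PRIVATE_KEY: raw["GOOGLE_PRIVATE_KEY"] as string,
    GOOGLE_DELEGATED_USER: raw["GOOGLE_DELEGATED_USER"] ?? "",
    LANGFUSE_PUBLIC_KEY: raw["LANGFUSE_PUBLIC_KEY"],
    LANGFUSE_SECRET_KEY: raw["LANGFUSE_SECRET_KEY"],
    NEXT_PUBLIC_APP_URL: raw["NEXT_PUBLIC_APP_URL"] ?? "http://localhost:3000",
  };
}
```

In a Node.js process this would be called once at startup as `parseEnv(process.env)`, so a misconfigured deployment fails immediately with a list of the missing variables.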
 
## Getting Started
 
Copy `.env.example` to `.env` and fill in all required values before starting the app.

```bash
cp .env.example .env
```

Then install dependencies and start the dev server.

```bash
pnpm install
pnpm dev             # next dev: starts the chat UI on http://localhost:3000
```
 
## Usage
 
**Nightly ingestion** — the `runNightlyIngestion` function fetches the last 24 hours of emails and documents, chunks them, embeds them, and stores vectors in pgvector. Deploy as a cron job or trigger manually.
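
Two steps of that ingestion run lend themselves to a short sketch: selecting the 24-hour window and splitting text into chunks. The code below is an assumption about how this could look, not the body of `runNightlyIngestion`. The helper names, the character-based chunk size, and the overlap are all hypothetical; the README does not state the project's actual chunking strategy. The query relies on Gmail's `after:` search operator, which accepts a Unix timestamp in seconds.

```typescript
// Build a Gmail search query covering the 24 hours before `now`.
// Hypothetical helper; Gmail's `after:` operator accepts epoch seconds.
function gmailQueryForLastDay(now: Date): string {
  const since = Math.floor(now.getTime() / 1000) - 24 * 60 * 60;
  return `after:${since}`;
}

// Split text into fixed-size character windows that overlap, so a sentence
// cut at one chunk boundary still appears whole in the neighboring chunk.
// Sizes are illustrative defaults, not the project's settings.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in the range [0, chunkSize)");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text, which avoids
    // emitting a trailing chunk made up only of overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each resulting chunk would then be embedded and stored together with its source metadata, as described in the Architecture section.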
 
**Query via chat** — open the UI at `http://localhost:3000` and type a question in plain English. The API retrieves relevant context and streams an answer from Claude with source citations.
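
One way to get citations into the answer is to number the retrieved chunks in the prompt and ask the model to cite those numbers. The sketch below shows that idea only; `RetrievedChunk`, `buildPrompt`, and the instruction wording are hypothetical, and the actual prompt used by `app/api/chat/route.ts` is not shown in this README.

```typescript
// Hypothetical shape of a retrieved chunk; the real metadata may differ.
interface RetrievedChunk {
  source: string; // for example an email subject or a Drive file name
  text: string;
}

// Assemble a prompt where each source gets a stable number the model can cite.
function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source})\n${chunk.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the numbered sources below.",
    "Cite sources inline as [n]. If the sources do not contain the answer, say so.",
    "",
    "Sources:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Because the numbers map back to `source`, the UI can turn each `[n]` in the streamed answer into a link to the original email or document.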
 
**View eval reports** — evaluation gate results and cost telemetry are available through the built-in reporting functions.
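
As a rough illustration of what an evaluation gate does, the sketch below compares measured metrics against minimum thresholds and reports which ones fail. It does not use or reproduce the `@reaatech/*` API; `GateResult`, `evaluateGate`, and the metric names in the usage note are hypothetical.

```typescript
// Hypothetical gate result; not the @reaatech/* package API.
interface GateResult {
  passed: boolean;
  failures: string[];
}

// A gate passes only if every metric with a minimum is present and meets it.
function evaluateGate(
  metrics: Record<string, number>,
  minimums: Record<string, number>,
): GateResult {
  const failures: string[] = [];
  for (const [name, minimum] of Object.entries(minimums)) {
    const value = metrics[name];
    if (value === undefined) {
      failures.push(`${name}: missing`);
    } else if (value < minimum) {
      failures.push(`${name}: ${value} < ${minimum}`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```

For example, `evaluateGate({ faithfulness: 0.9 }, { faithfulness: 0.8, recall: 0.7 })` fails because `recall` was never measured, which is usually the safer default for a release gate.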
 
## Testing
 
```bash
pnpm test            # vitest run with coverage
pnpm test -- --ui    # vitest UI mode
```
 
Unit tests live in `tests/` mirroring the `src/` structure. Integration tests are in `tests/integration/`.
 
## Project Structure
 
```
app/                  Next.js App Router pages + API routes
  api/chat/           Chat API route (POST handler)
src/                  Services, lib, and adapters
  config/             Environment variable parsing and validation
  db/                 PostgreSQL connection and schema management
  eval/               Evaluation metrics, gate engine, cost tracking
  ingestion/          Gmail sync, Drive sync, content parsing, orchestrator
  rag/                Embedding, chunking, vector pipeline, cache engine
  services/           Session management, LLM cost telemetry
  types/              TypeScript interfaces and types
tests/                Vitest suite
  api/                API route tests
  config/             Config tests
  db/                 Database tests
  eval/               Eval gate and metrics tests
  ingestion/          Ingestion unit tests
  integration/        Integration tests
  rag/                Pipeline, embedder, and cache tests
  services/           Session and cost-telemetry tests
packages/             API references for every dependency (read these first)
DEV_PLAN.md           Build plan for this recipe
```
 
## Tech Stack
 
- **Next.js 16** (App Router) — frontend and API routes
- **Anthropic Claude** (`@anthropic-ai/sdk`) — LLM generation
- **VoyageAI** (`voyageai`) — text embeddings
- **PostgreSQL + pgvector** — vector storage and similarity search
- **Google APIs** (`googleapis`) — Gmail and Drive data ingestion
- **`@reaatech/*` packages** — LLM cache, session continuity, cost telemetry, RAG eval gates
- **Zod** — environment variable validation
- **Vitest** — testing framework
 
## License
 
MIT — see [LICENSE](./LICENSE).