# Anthropic RAG Pipeline for Google Workspace SMB Email Knowledge Search
> Ask questions in plain English and get answers with citations from your entire Google Workspace email and documents – no more manual searching.
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
## Overview
Small and medium businesses store critical knowledge across Gmail and Google Drive — contracts, proposals, client emails, internal documentation. Finding the right information means manually searching through hundreds of threads and files.
This project solves that with a retrieval-augmented generation (RAG) pipeline that ingests Google Workspace data nightly, embeds it into a PostgreSQL + pgvector vector store, and lets you ask natural-language questions through a chat interface. Answers include source citations so you can verify every response.
## Architecture
The pipeline follows five stages:
```
Ingestion → Embedding → Storage → Retrieval → Generation
```
1. **Ingestion** — `gmail-sync.ts` and `drive-sync.ts` fetch recent emails and documents via the Google APIs, then `content-parser.ts` normalizes HTML, PDF, and DOCX content to plain text.
2. **Embedding** — `embedder.ts` uses the VoyageAI client to convert text chunks into vector embeddings with `voyage-3-lite`.
3. **Storage** — `embedAndStore` writes chunk vectors to a PostgreSQL table via `pgvector`, along with source metadata.
4. **Retrieval** — `searchSimilar` performs cosine-similarity search at query time to find the most relevant chunks.
5. **Generation** — `app/api/chat/route.ts` sends retrieved context to Anthropic Claude and streams an answer with source citations.
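To make the retrieval stage concrete, here is a minimal in-memory sketch of cosine-similarity ranking. It is illustrative only: in this project the search runs inside PostgreSQL through pgvector, and the names `StoredChunk`, `cosineSimilarity`, and `topK` are hypothetical, not taken from the codebase.

```typescript
// Illustrative sketch: the real pipeline delegates this ranking to pgvector.
// `StoredChunk`, `cosineSimilarity`, and `topK` are hypothetical names.
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query embedding and keep the k best matches.
function topK(
  query: number[],
  chunks: StoredChunk[],
  k: number,
): Array<StoredChunk & { score: number }> {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this is only practical for small collections; pushing the comparison into the database lets an index do the work instead.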
## Prerequisites
- **Node.js** 22+
- **pnpm** 10
- **PostgreSQL** with [pgvector](https://github.com/pgvector/pgvector) extension
- **Anthropic API key** (Claude)
- **VoyageAI API key** (embeddings)
- **Google Workspace service account** with Gmail and Drive API access
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `NODE_ENV` | No | `development` | Runtime environment |
| `ANTHROPIC_API_KEY` | Yes | — | Anthropic API key for Claude |
| `VOYAGE_API_KEY` | Yes | — | VoyageAI API key for embeddings |
| `DATABASE_URL` | Yes | — | PostgreSQL connection string with pgvector |
| `GOOGLE_CLIENT_EMAIL` | Yes | — | Google Workspace service account email |
| `GOOGLE_PRIVATE_KEY` | Yes | — | Google Workspace service account private key |
| `GOOGLE_DELEGATED_USER` | No | `""` | Admin email for domain-wide delegation |
| `LANGFUSE_PUBLIC_KEY` | No | — | Langfuse observability public key |
| `LANGFUSE_SECRET_KEY` | No | — | Langfuse observability secret key |
| `NEXT_PUBLIC_APP_URL` | No | `http://localhost:3000` | Public application URL |
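The table above can be expressed as a small validation step that fails fast when a required variable is missing. The project validates its environment with Zod; the sketch below is a dependency-free stand-in, and `loadConfig`, `AppConfig`, and `REQUIRED` are hypothetical names chosen for illustration.

```typescript
// Hypothetical, dependency-free stand-in for the project's Zod-based validation.
type Env = Record<string, string | undefined>;

const REQUIRED = [
  "ANTHROPIC_API_KEY",
  "VOYAGE_API_KEY",
  "DATABASE_URL",
  "GOOGLE_CLIENT_EMAIL",
  "GOOGLE_PRIVATE_KEY",
] as const;

type RequiredKey = (typeof REQUIRED)[number];

type AppConfig = Record<RequiredKey, string> & {
  NODE_ENV: string;
  GOOGLE_DELEGATED_USER: string;
  NEXT_PUBLIC_APP_URL: string;
  LANGFUSE_PUBLIC_KEY: string | undefined;
  LANGFUSE_SECRET_KEY: string | undefined;
};

function loadConfig(env: Env): AppConfig {
  // Collect every missing required key so the error reports them all at once.
  const missing = REQUIRED.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const required = {} as Record<RequiredKey, string>;
  for (const key of REQUIRED) {
    required[key] = env[key] as string;
  }
  return {
    ...required,
    // Defaults mirror the "Default" column of the table.
    NODE_ENV: env["NODE_ENV"] ?? "development",
    GOOGLE_DELEGATED_USER: env["GOOGLE_DELEGATED_USER"] ?? "",
    NEXT_PUBLIC_APP_URL: env["NEXT_PUBLIC_APP_URL"] ?? "http://localhost:3000",
    LANGFUSE_PUBLIC_KEY: env["LANGFUSE_PUBLIC_KEY"],
    LANGFUSE_SECRET_KEY: env["LANGFUSE_SECRET_KEY"],
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so a misconfigured deployment stops before any API call is made.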
## Getting Started
Copy `.env.example` to `.env` and fill in all required values:
```bash
cp .env.example .env
```
Then install dependencies and start the development server:
```bash
pnpm install
pnpm dev # runs next dev and serves the chat UI on http://localhost:3000
```
## Usage
**Nightly ingestion** — the `runNightlyIngestion` function fetches the last 24 hours of emails and documents, chunks them, embeds them, and stores vectors in pgvector. Deploy as a cron job or trigger manually.
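The chunking step can be pictured as a sliding window over the normalized text. The sketch below splits by character count with a fixed overlap; the function name `chunkText` and the default sizes are assumptions for illustration, and the project's own chunker may split on tokens or sentence boundaries instead.

```typescript
// Hypothetical helper: fixed-size character chunks with overlap.
// The default sizes are illustrative, not the project's settings.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  // Collapse runs of whitespace so chunk boundaries are not wasted on padding.
  const normalized = text.replace(/\s+/g, " ").trim();
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < normalized.length; start += step) {
    chunks.push(normalized.slice(start, start + chunkSize));
    // Stop once a chunk reaches the end, so no redundant tail chunk is emitted.
    if (start + chunkSize >= normalized.length) break;
  }
  return chunks;
}
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of embedding some text twice.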
**Query via chat** — open the UI at `http://localhost:3000` and type a question in plain English. The API retrieves relevant context and streams an answer from Claude with source citations.
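One way to get citations into the answer is to number the retrieved chunks in the prompt and ask the model to refer to those numbers. The sketch below shows only the prompt assembly; `RetrievedChunk`, `buildPrompt`, and the instruction wording are assumptions, not the project's actual prompt.

```typescript
// Hypothetical shapes and wording, shown for illustration only.
interface RetrievedChunk {
  source: string; // for example, an email subject or a Drive file name
  text: string;
}

// Number each chunk so the model can cite it as [1], [2], and so on.
function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source})\n${chunk.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the numbered sources below.",
    "Cite sources inline as [n]. If the sources do not contain the answer, say so.",
    "",
    "Sources:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Because the numbering is assigned at prompt time, the same indices can be mapped back to source metadata when rendering citations in the UI.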
**View eval reports** — evaluation gate results and cost telemetry are available through the built-in reporting functions.
## Testing
```bash
pnpm test # vitest run with coverage
pnpm test -- --ui # vitest UI mode
```
Unit tests live in `tests/` mirroring the `src/` structure. Integration tests are in `tests/integration/`.
## Project Structure
```
app/                  Next.js App Router pages + API routes
  api/chat/           Chat API route (POST handler)
src/                  Services, lib, and adapters
  config/             Environment variable parsing and validation
  db/                 PostgreSQL connection and schema management
  eval/               Evaluation metrics, gate engine, cost tracking
  ingestion/          Gmail sync, Drive sync, content parsing, orchestrator
  rag/                Embedding, chunking, vector pipeline, cache engine
  services/           Session management, LLM cost telemetry
  types/              TypeScript interfaces and types
tests/                Vitest suite
  api/                API route tests
  config/             Config tests
  db/                 Database tests
  eval/               Eval gate and metrics tests
  ingestion/          Ingestion unit tests
  integration/        Integration tests
  rag/                Pipeline, embedder, and cache tests
  services/           Session and cost-telemetry tests
packages/             API references for every dependency (read these first)
DEV_PLAN.md           Build plan for this recipe
```
## Tech Stack
- **Next.js 16** (App Router) — frontend and API routes
- **Anthropic Claude** (`@anthropic-ai/sdk`) — LLM generation
- **VoyageAI** (`voyageai`) — text embeddings
- **PostgreSQL + pgvector** — vector storage and similarity search
- **Google APIs** (`googleapis`) — Gmail and Drive data ingestion
- **`@reaatech/*` packages** — LLM cache, session continuity, cost telemetry, RAG eval gates
- **Zod** — environment variable validation
- **Vitest** — testing framework
## License
MIT — see [LICENSE](./LICENSE).