# Perplexity Market Research Document Pipeline for SMBs
 
> Transform scattered market reports into a searchable research assistant powered by your documents and live market data.
 
A reference solution that demonstrates how to build a document-grounded retrieval-augmented generation (RAG) pipeline with Perplexity AI, Qdrant vector search, and the `@reaatech/*` package family. It ingests PDF/DOCX/XLSX files, indexes them for hybrid vector + BM25 search, and answers questions with automatic per-user budget enforcement and Langfuse tracing.
 
## Architecture
 
```
Upload (PDF/DOCX/XLSX) ──▶ Ingestion ──▶ Text Extraction ──▶ Chunking ──▶ Qdrant Index
                                            │                              │
                                            │                              ▼
User Query ──▶ Budget Check ──▶ Hybrid Retrieval ──▶ Context Assembly ──▶ Perplexity API ──▶ Answer
                    │                                                                    │
                    └──── Langfuse Tracing (retrieval + generation) ──────────────────────┘
```
 
## Quick Start
 
```bash
pnpm install
pnpm test            # vitest run with coverage
pnpm dev             # next dev (localhost:3000)
```
 
## Environment Variables
 
| Variable | Description | Default |
|----------|-------------|---------|
| `QDRANT_URL` | Qdrant vector database URL | `http://localhost:6333` |
| `QDRANT_API_KEY` | Qdrant API key (optional) | — |
| `PERPLEXITY_API_KEY` | Perplexity AI API key | required |
| `OPENAI_API_KEY` | OpenAI API key for embeddings | required |
| `LANGFUSE_PUBLIC_KEY` | Langfuse tracing public key (optional) | — |
| `LANGFUSE_SECRET_KEY` | Langfuse tracing secret key (optional) | — |
| `LANGFUSE_BASEURL` | Langfuse base URL (optional) | Langfuse cloud |
| `BUDGET_LIMIT_DEFAULT` | Default budget limit in USD per user | `10.0` |
 
Copy `.env.example` to `.env.local` and fill in the values.
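
The table above can be read as a small typed config loader. The following is a minimal sketch, not the project's actual config module: the `AppConfig` shape, `loadConfig`, and `requireVar` are hypothetical names, and treating Langfuse as enabled only when both keys are present is an assumption.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface AppConfig {
  qdrantUrl: string;
  qdrantApiKey?: string;
  perplexityApiKey: string;
  openaiApiKey: string;
  langfuseEnabled: boolean;
  langfuseBaseUrl?: string;
  budgetLimitDefault: number;
}

type Env = Record<string, string | undefined>;

// Return a required variable or fail fast with a readable message.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Apply the documented defaults and validate the budget limit.
function loadConfig(env: Env): AppConfig {
  const rawLimit = env["BUDGET_LIMIT_DEFAULT"] ?? "10.0";
  const budgetLimitDefault = Number(rawLimit);
  if (!Number.isFinite(budgetLimitDefault) || budgetLimitDefault < 0) {
    throw new Error(`BUDGET_LIMIT_DEFAULT must be a non-negative number, got "${rawLimit}"`);
  }
  return {
    qdrantUrl: env["QDRANT_URL"] ?? "http://localhost:6333",
    qdrantApiKey: env["QDRANT_API_KEY"],
    perplexityApiKey: requireVar(env, "PERPLEXITY_API_KEY"),
    openaiApiKey: requireVar(env, "OPENAI_API_KEY"),
    // Assumption: tracing is on only when both Langfuse keys are set.
    langfuseEnabled: Boolean(env["LANGFUSE_PUBLIC_KEY"] && env["LANGFUSE_SECRET_KEY"]),
    langfuseBaseUrl: env["LANGFUSE_BASEURL"],
    budgetLimitDefault,
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so a missing key fails early rather than on the first request.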
 
## API Endpoints
 
### `POST /api/ingest`
Upload a document (PDF, DOCX, XLSX) for ingestion. Accepts `multipart/form-data` with a `file` field.
 
### `POST /api/chat`
Send a query with context retrieval. Accepts JSON body: `{ "query": string, "userId": string }`.
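
For illustration, the JSON body above can be checked with a small type guard before any retrieval work starts. This is a sketch under assumptions: `ChatRequestBody` and `isChatRequestBody` are hypothetical names, and rejecting empty or whitespace-only strings is an assumption rather than documented behavior.

```typescript
// Shape of the documented /api/chat JSON body.
interface ChatRequestBody {
  query: string;
  userId: string;
}

// Type guard: true only when both fields are non-empty strings.
// Assumption: blank values are treated as invalid.
function isChatRequestBody(value: unknown): value is ChatRequestBody {
  if (typeof value !== "object" || value === null) return false;
  const { query, userId } = value as { query?: unknown; userId?: unknown };
  return (
    typeof query === "string" &&
    query.trim().length > 0 &&
    typeof userId === "string" &&
    userId.trim().length > 0
  );
}
```

A route handler could call the guard on the parsed JSON and return a client error when it fails, so malformed requests never reach the budget check or Perplexity.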
 
### `POST /api/evaluate`
Run retrieval evaluation against a dataset. Accepts JSON body: `{ "datasetPath": string }`.
 
## Supported File Formats
 
- **PDF** — text extraction via `unpdf`
- **DOCX** — text extraction via `mammoth`
- **XLSX** — text extraction via `xlsx`
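
One way to route an upload to the right extractor is to key off the file extension. The sketch below is illustrative only: `detectFormat` and `EXTRACTOR_BY_FORMAT` are hypothetical names, it does not call the extraction libraries themselves, and the actual ingestion service may detect formats differently (for example by MIME type).

```typescript
type SupportedFormat = "pdf" | "docx" | "xlsx";

const SUPPORTED_FORMATS: readonly SupportedFormat[] = ["pdf", "docx", "xlsx"];

// Library named in the list above for each format.
const EXTRACTOR_BY_FORMAT: Record<SupportedFormat, string> = {
  pdf: "unpdf",
  docx: "mammoth",
  xlsx: "xlsx",
};

// Map a filename to a supported format, or null when it is unsupported.
function detectFormat(filename: string): SupportedFormat | null {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return null; // no extension at all
  const ext = filename.slice(dot + 1).toLowerCase();
  return SUPPORTED_FORMATS.find((format) => format === ext) ?? null;
}
```

Returning `null` for anything else lets the caller reject unsupported uploads before any parsing is attempted.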
 
## Testing
 
```bash
pnpm test
```
 
The test suite covers all services, API routes, budget enforcement, pricing, and the Langfuse integration. External dependencies are mocked.
 
## Project Structure
 
```
app/                    Next.js App Router pages + API routes
  api/ingest/route.ts   Document upload and ingestion
  api/chat/route.ts     Chat with RAG + Perplexity + budget
  api/evaluate/route.ts Evaluation endpoint
  actions.ts            Server actions for frontend
  page.tsx              Chat UI with file upload
src/                    Source code
  services/             Domain services (ingestion, retrieval, budget, chat, etc.)
  lib/                  Utilities, config, types, schemas, pricing
tests/                  Vitest test suite (unit + integration)
  unit/                 Service-level tests
  integration/          API route integration tests
packages/               Package API references
datasets/               Evaluation datasets
```
 
## How It Works
 
1. **Ingestion Pipeline**: PDF/DOCX/XLSX files are parsed, the text is extracted and preprocessed, and the result is split with a semantic chunking strategy. The chunks are indexed into Qdrant via `@reaatech/hybrid-rag-ingestion` and `@reaatech/hybrid-rag-retrieval`.
 
2. **Retrieval Pipeline**: User queries retrieve top-k chunks through hybrid search (vector + BM25 with RRF fusion) via `HybridRetriever`.
 
3. **Budget Enforcement**: Before calling Perplexity, the `BudgetController` checks the user's spend against a configurable limit. If the limit is exceeded, the request is rejected with an HTTP 402 response.
 
4. **Generation**: Perplexity API generates answers grounded in the retrieved context. Usage and cost are recorded to the budget tracker.
 
5. **Tracing**: Langfuse traces each retrieval and generation round (gracefully disabled when keys are absent).
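
Steps 2 and 3 above can be sketched as two plain functions. This is a standalone illustration of reciprocal rank fusion (RRF) and a spend check, not the `HybridRetriever` or `BudgetController` implementation: `rrfFuse` and `checkBudget` are hypothetical names, `k = 60` is a commonly used RRF constant assumed here, and blocking once spend reaches the limit (rather than only after it is passed) is also an assumption.

```typescript
// Reciprocal rank fusion: every ranked list adds 1 / (k + rank) to a document's score.
function rrfFuse(rankings: string[][], k = 60, topK = 5): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, topK)
    .map(([id]) => id);
}

interface BudgetDecision {
  allowed: boolean;
  status: 200 | 402;
  remainingUsd: number;
}

// Gate a generation call on the user's recorded spend.
// Assumption: a user whose spend has reached the limit is blocked.
function checkBudget(spentUsd: number, limitUsd: number): BudgetDecision {
  const allowed = spentUsd < limitUsd;
  return {
    allowed,
    status: allowed ? 200 : 402,
    remainingUsd: Math.max(0, limitUsd - spentUsd),
  };
}

// Example: fuse a vector ranking with a BM25 ranking of chunk IDs.
const fused = rrfFuse([
  ["a", "b", "c"], // vector search order
  ["b", "d", "a"], // BM25 order
]);
```

RRF only needs ranks, not raw scores, which is why it combines vector similarity and BM25 results without any score normalization. In the example, `b` ranks first because it sits near the top of both lists.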
 
## License
 
MIT — see [LICENSE](./LICENSE).