# Perplexity Market Research Document Pipeline for SMBs
> Transform scattered market reports into a searchable research assistant powered by your documents and live market data.
This is a reference solution that demonstrates how to build a document-grounded RAG pipeline using Perplexity AI, Qdrant vector search, and the `@reaatech/*` package family. It ingests PDF, DOCX, and XLSX files, indexes them into a hybrid vector + BM25 search index, and answers questions with automatic budget enforcement and Langfuse tracing.
## Architecture
```
Upload (PDF/DOCX/XLSX) ──▶ Ingestion ──▶ Text Extraction ──▶ Chunking ──▶ Qdrant Index
│ │
│ ▼
User Query ──▶ Budget Check ──▶ Hybrid Retrieval ──▶ Context Assembly ──▶ Perplexity API ──▶ Answer
│ │
└──── Langfuse Tracing (retrieval + generation) ──────────────────────┘
```
## Quick Start
```bash
pnpm install
pnpm test # vitest run with coverage
pnpm dev # next dev (localhost:3000)
```
## Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `QDRANT_URL` | Qdrant vector database URL | `http://localhost:6333` |
| `QDRANT_API_KEY` | Qdrant API key (optional) | — |
| `PERPLEXITY_API_KEY` | Perplexity AI API key | required |
| `OPENAI_API_KEY` | OpenAI API key for embeddings | required |
| `LANGFUSE_PUBLIC_KEY` | Langfuse tracing public key (optional) | — |
| `LANGFUSE_SECRET_KEY` | Langfuse tracing secret key (optional) | — |
| `LANGFUSE_BASEURL` | Langfuse base URL (optional) | Langfuse cloud |
| `BUDGET_LIMIT_DEFAULT` | Default budget limit in USD per user | `10.0` |
Copy `.env.example` to `.env.local` and fill in the values.
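As a rough illustration of how the variables in the table could be read and validated at startup, here is a minimal sketch. `AppConfig`, `requireVar`, and `loadConfig` are hypothetical names and are not taken from this repository; the defaults mirror the table above.

```typescript
// Hypothetical config loader: names are illustrative, not part of this repo.
interface AppConfig {
  qdrantUrl: string;
  qdrantApiKey?: string;
  perplexityApiKey: string;
  openaiApiKey: string;
  langfuseEnabled: boolean;
  budgetLimitDefault: number;
}

type Env = Record<string, string | undefined>;

// Fails fast when a variable marked "required" in the table is missing.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

function loadConfig(env: Env): AppConfig {
  const budget = Number(env.BUDGET_LIMIT_DEFAULT || "10.0");
  if (!Number.isFinite(budget) || budget < 0) {
    throw new Error("BUDGET_LIMIT_DEFAULT must be a non-negative number");
  }
  return {
    qdrantUrl: env.QDRANT_URL || "http://localhost:6333",
    qdrantApiKey: env.QDRANT_API_KEY,
    perplexityApiKey: requireVar(env, "PERPLEXITY_API_KEY"),
    openaiApiKey: requireVar(env, "OPENAI_API_KEY"),
    // Assumption: tracing counts as enabled only when both Langfuse keys are set.
    langfuseEnabled: Boolean(env.LANGFUSE_PUBLIC_KEY && env.LANGFUSE_SECRET_KEY),
    budgetLimitDefault: budget,
  };
}
```

In a Node or Next.js process this would typically be called once as `loadConfig(process.env)`.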
## API Endpoints
### `POST /api/ingest`
Upload a document (PDF, DOCX, XLSX) for ingestion. Accepts `multipart/form-data` with a `file` field.
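A minimal client-side sketch of an upload, assuming the app runs at a base URL you supply (for example the local dev server). `buildIngestForm` and `uploadDocument` are hypothetical helper names; the response body shape is not documented here, so the sketch returns the raw `Response`.

```typescript
// Builds the multipart body; the endpoint expects the document under "file".
function buildIngestForm(file: Blob, fileName: string): FormData {
  const form = new FormData();
  form.append("file", file, fileName);
  return form;
}

// Sends the upload. Content-Type is left unset on purpose so that fetch
// can add the multipart boundary itself.
async function uploadDocument(baseUrl: string, form: FormData): Promise<Response> {
  return fetch(`${baseUrl}/api/ingest`, { method: "POST", body: form });
}
```

For instance, `uploadDocument("http://localhost:3000", buildIngestForm(blob, "report.pdf"))` would post a PDF held in `blob`.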
### `POST /api/chat`
Send a query with context retrieval. Accepts JSON body: `{ "query": string, "userId": string }`.
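A hedged sketch of a client call, including handling for the 402 status the budget check returns when a user's limit is exceeded. `buildChatRequest`, `askQuestion`, and `ChatOutcome` are hypothetical names; the success payload format is not specified in this README, so it is kept as raw text.

```typescript
interface ChatRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Serializes the documented JSON body: { query, userId }.
function buildChatRequest(query: string, userId: string): ChatRequestInit {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, userId }),
  };
}

type ChatOutcome =
  | { kind: "ok"; text: string }
  | { kind: "budget_exceeded" }
  | { kind: "error"; status: number };

async function askQuestion(baseUrl: string, query: string, userId: string): Promise<ChatOutcome> {
  const res = await fetch(`${baseUrl}/api/chat`, buildChatRequest(query, userId));
  // 402 signals that the user's budget limit was exceeded.
  if (res.status === 402) return { kind: "budget_exceeded" };
  if (!res.ok) return { kind: "error", status: res.status };
  // Assumption: the body is read as text because its shape is not documented.
  return { kind: "ok", text: await res.text() };
}
```

Returning a tagged outcome rather than throwing lets the UI show a dedicated "budget exhausted" message without parsing error strings.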
### `POST /api/evaluate`
Run retrieval evaluation against a dataset. Accepts JSON body: `{ "datasetPath": string }`.
## Supported File Formats
- **PDF** — text extraction via `unpdf`
- **DOCX** — text extraction via `mammoth`
- **XLSX** — text extraction via `xlsx`
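One way the format dispatch could look, as a sketch only: it maps a file name to the extractor library listed above without calling any of those libraries. `detectFormat`, `extractorFor`, and `EXTRACTOR_BY_FORMAT` are hypothetical names; the actual ingestion service may detect formats differently (for example by MIME type).

```typescript
type SupportedFormat = "pdf" | "docx" | "xlsx";

// Library used for each format, as listed in this README.
const EXTRACTOR_BY_FORMAT: Record<SupportedFormat, string> = {
  pdf: "unpdf",
  docx: "mammoth",
  xlsx: "xlsx",
};

// Returns the format implied by the file extension, or null if unsupported.
function detectFormat(fileName: string): SupportedFormat | null {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return null;
  const ext = fileName.slice(dot + 1).toLowerCase();
  switch (ext) {
    case "pdf":
    case "docx":
    case "xlsx":
      return ext;
    default:
      return null;
  }
}

function extractorFor(fileName: string): string {
  const format = detectFormat(fileName);
  if (format === null) throw new Error(`Unsupported file type: ${fileName}`);
  return EXTRACTOR_BY_FORMAT[format];
}
```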
## Testing
```bash
pnpm test
```
The test suite covers all services, API routes, budget enforcement, pricing, and Langfuse integration with mocked external dependencies.
## Project Structure
```
app/                      Next.js App Router pages + API routes
  api/ingest/route.ts     Document upload and ingestion
  api/chat/route.ts       Chat with RAG + Perplexity + budget
  api/evaluate/route.ts   Evaluation endpoint
  actions.ts              Server actions for frontend
  page.tsx                Chat UI with file upload
src/                      Source code
  services/               Domain services (ingestion, retrieval, budget, chat, etc.)
  lib/                    Utilities, config, types, schemas, pricing
tests/                    Vitest test suite (unit + integration)
  unit/                   Service-level tests
  integration/            API route integration tests
packages/                 Package API references
datasets/                 Evaluation datasets
```
## How It Works
1. **Ingestion Pipeline**: PDF/DOCX/XLSX files are parsed, the text is extracted and preprocessed, and the result is chunked using a semantic chunking strategy. Chunks are indexed into Qdrant via `@reaatech/hybrid-rag-ingestion` and `@reaatech/hybrid-rag-retrieval`.
2. **Retrieval Pipeline**: User queries retrieve top-k chunks through hybrid search (vector + BM25 with RRF fusion) via `HybridRetriever`.
3. **Budget Enforcement**: Before calling Perplexity, the `BudgetController` checks the user's spend against a configurable limit. If the limit is exceeded, the API returns an HTTP 402 (Payment Required) response.
4. **Generation**: The Perplexity API generates answers grounded in the retrieved context. Usage and cost are recorded to the budget tracker.
5. **Tracing**: Langfuse traces each retrieval and generation round (gracefully disabled when keys are absent).
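Steps 2 and 3 can be illustrated with a small standalone sketch. It shows generic Reciprocal Rank Fusion over two ranked ID lists and a simple spend check that maps to a 402 status. This is not the code of `HybridRetriever` or `BudgetController`: the function names, the constant `k = 60` (a commonly used RRF default), and the choice to block once spend reaches the limit are all assumptions made for illustration.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
// where rank is 1-based. Documents ranked well in several lists score highest.
function reciprocalRankFusion(rankings: string[][], k = 60, topK = 5): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

interface BudgetDecision {
  allowed: boolean;
  status: 200 | 402;
  remainingUsd: number;
}

// Assumption: a user is blocked once recorded spend reaches the limit.
function checkBudget(spentUsd: number, limitUsd: number): BudgetDecision {
  const allowed = spentUsd < limitUsd;
  return {
    allowed,
    status: allowed ? 200 : 402,
    remainingUsd: Math.max(0, limitUsd - spentUsd),
  };
}
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["b", "c", "d"]` places `b` first, because it appears near the top of both lists, while `a` and `d` each appear in only one.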
## License
MIT — see [LICENSE](./LICENSE).