Small businesses drown in market reports, PDFs, and industry analyses but lack a way to turn them into actionable insights quickly. Manually sifting through hundreds of pages is slow and often misses critical connections across internal and external data.
This recipe builds a market research pipeline for small businesses. You upload PDF, DOCX, and XLSX files containing market reports, industry analyses, and competitive data. The pipeline processes those documents, indexes them into a Qdrant vector database, and gives you a chat interface that answers questions by retrieving relevant chunks and passing them to Perplexity’s language models — with spending limits enforced per user.
Prerequisites
Node.js 22 or higher
pnpm installed
Perplexity API key from perplexity.ai
OpenAI API key (for text-embedding-3-small embeddings)
Qdrant running locally or at a remote URL (Docker: docker run -p 6333:6333 qdrant/qdrant)
Step 1: Install dependencies
The scaffold is already in your working directory. Install everything in one shot:
terminal
pnpm install
This pulls in all pinned packages including @reaatech/hybrid-rag-ingestion, @reaatech/hybrid-rag-retrieval, perplexity-sdk, langfuse, unpdf, mammoth, xlsx, and the budget engine stack.
Step 2: Configure environment variables
Put your API keys and connection settings in a .env.local file at the project root. Langfuse keys are optional; tracing gracefully falls back to a no-op when they are absent.
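A sketch of the expected file is below. PERPLEXITY_API_KEY and the Langfuse key names appear in the code later in this recipe; OPENAI_API_KEY and QDRANT_URL are assumed names for the OpenAI embedding key and the Qdrant connection, so check them against your scaffold's config.ts.
terminal
cat > .env.local <<'EOF'
PERPLEXITY_API_KEY=pplx-...
OPENAI_API_KEY=sk-...
QDRANT_URL=http://localhost:6333
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
EOF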
Step 3: Define the shared config
The src/lib/config.ts file centralizes all configuration reading. It sets up Qdrant connection details, OpenAI embedding config, BM25 parameters, fusion strategy, and budget defaults:
BM25 uses k1=1.2, b=0.75 — standard IR parameters. Fusion uses RRF (Reciprocal Rank Fusion) which handles heterogeneous score distributions without tuning.
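As a reference point, here is a minimal sketch of what config.ts can look like. The getRequiredEnv helper and config.budget.defaultLimit are imported by later steps; the QDRANT_COLLECTION and BUDGET_DEFAULT_LIMIT variable names and the exact option shapes are assumptions, not the scaffold's definitive layout.
typescript
// src/lib/config.ts (sketch): env names other than PERPLEXITY_API_KEY
// and the Langfuse keys are assumptions for illustration.
export function getRequiredEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

export const config = {
  qdrant: {
    url: process.env.QDRANT_URL ?? 'http://localhost:6333',
    collection: process.env.QDRANT_COLLECTION ?? 'market-research', // assumed name
  },
  embedding: {
    model: 'text-embedding-3-small',
    apiKey: process.env.OPENAI_API_KEY ?? '',
  },
  bm25: {
    k1: 1.2, // term-frequency saturation
    b: 0.75, // document-length normalization
  },
  fusion: {
    strategy: 'rrf' as const, // Reciprocal Rank Fusion
  },
  budget: {
    defaultLimit: Number(process.env.BUDGET_DEFAULT_LIMIT ?? 10), // USD per user
  },
};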
Step 4: Define API request schemas
Use Zod to validate incoming requests at the boundary. The src/lib/schemas.ts file defines two schemas for the chat and evaluation endpoints, plus a budget check schema used internally:
typescript
import { z } from 'zod';

export const ChatRequestSchema = z.object({
  query: z.string().min(1, 'Query is required'),
  userId: z.string().min(1, 'userId is required'),
});

export const EvaluateRequestSchema = z.object({
  datasetPath: z.string().min(1, 'datasetPath is required'),
});

export const BudgetCheckSchema = z.object({
  scopeKey: z.string().min(1),
  estimatedCost: z.number().positive(),
  modelId: z.string().min(1),
  tools: z.array(z.string()).optional(),
});
The chat route parses its JSON body through ChatRequestSchema.parse() before doing anything else.
Step 5: Implement document ingestion
The ingestion service handles three file formats: PDF via unpdf, DOCX via mammoth, and XLSX via xlsx. Raw text passes through TextPreprocessor and DocumentValidator before chunking with chunkDocument using the semantic strategy:
typescript
import { TextPreprocessor, DocumentValidator, chunkDocument } from '@reaatech/hybrid-rag-ingestion';
import { ChunkingStrategy } from '@reaatech/hybrid-rag';
import type { Chunk, ChunkingConfig } from '@reaatech/hybrid-rag';
import { getDocumentProxy, extractText } from 'unpdf';
import mammoth from 'mammoth';
import * as XLSX from 'xlsx';

const preprocessor = new TextPreprocessor({
  normalizeUnicode: true,
  normalizeWhitespace: true,
});

const validator = new DocumentValidator({
  maxFileSize: 10 * 1024 * 1024, // 10 MB
  minContentLength: 1,
});

async function extractPdfText(buffer: Uint8Array): Promise<string> {
  const pdf = await getDocumentProxy(buffer);
  const { text } = await extractText(pdf, { mergePages: true });
  return text;
}

async function extractDocxText(buffer: Uint8Array): Promise<string> {
  const result = await mammoth.extractRawText({ buffer: Buffer.from(buffer) });
  return result.value;
}

function extractXlsxText(buffer: Uint8Array): string {
  const workbook = XLSX.read(buffer, { type: 'array' });
  const texts: string[] = [];
  for (const sheetName of workbook.SheetNames) {
    const sheet = workbook.Sheets[sheetName];
    if (!sheet) continue;
    const values = XLSX.utils.sheet_to_json<string[]>(sheet, { header: 1 });
    for (const row of values) {
      if (Array.isArray(row)) {
        // Keep only non-empty string cells; numeric cells are skipped.
        texts.push(
          row
            .filter((cell): cell is string => typeof cell === 'string' && cell.length > 0)
            .join(' '),
        );
      }
    }
  }
  return texts.join('\n');
}

async function extractTextByMime(buffer: Uint8Array, mimeType: string): Promise<string> {
  switch (mimeType) {
    case 'application/pdf':
      return extractPdfText(buffer);
    case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
      return extractDocxText(buffer);
    case 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet':
      return extractXlsxText(buffer);
    default:
      throw new Error(`Unsupported MIME type: ${mimeType}`);
  }
}

export async function processDocument(
  fileBuffer: Uint8Array,
  mimeType: string,
  docId: string,
): Promise<Chunk[]> {
  const rawText = await extractTextByMime(fileBuffer, mimeType);
  const preprocessed = preprocessor.preprocess(rawText);
  const content = preprocessed.content;
  const doc = {
    id: docId,
    content,
    source: `upload-${docId}`,
    metadata: { mimeType },
  };
  const validation = validator.validate(doc);
  if (!validation.isValid) {
    throw new Error(`Document validation failed: ${validation.errors.join(', ')}`);
  }
  const chunkConfig: ChunkingConfig = {
    strategy: ChunkingStrategy.SEMANTIC,
    chunkSize: 512,
    overlap: 50,
    similarityThreshold: 0.5,
  };
  const chunks = await chunkDocument(content, docId, chunkConfig, doc.metadata);
  return chunks;
}
processDocument returns Chunk[] with deterministic IDs. Callers feed those chunks into the vector indexer.
Step 6: Index chunks into Qdrant
The vector indexer at src/services/vector-index.ts wraps HybridRetriever from @reaatech/hybrid-rag-retrieval. It lazy-initializes on first use and exposes indexChunks:
Call indexChunks after processDocument to store the vectors and the BM25 index in Qdrant.
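A sketch of the wrapper is below. The HybridRetriever constructor options and the name of its indexing method are assumptions inferred from how the recipe uses the package; the exported getRetriever helper is introduced here so the retrieval service in the next step can reuse the same lazy instance.
typescript
// src/services/vector-index.ts (sketch): constructor options and the
// retriever's indexChunks method name are assumptions, not the package's
// confirmed API.
import { HybridRetriever } from '@reaatech/hybrid-rag-retrieval';
import type { Chunk } from '@reaatech/hybrid-rag';
import { config } from '../lib/config.js';

let retriever: HybridRetriever | null = null;

// Lazy-initialize on first use so the Qdrant connection is only opened
// when a request actually needs it.
export function getRetriever(): HybridRetriever {
  if (!retriever) {
    retriever = new HybridRetriever({
      qdrantUrl: config.qdrant.url,
      collection: config.qdrant.collection,
      embedding: config.embedding,
      bm25: config.bm25,
      fusion: config.fusion,
    });
  }
  return retriever;
}

export async function indexChunks(chunks: Chunk[]): Promise<void> {
  const r = getRetriever();
  await r.indexChunks(chunks); // assumed method: stores vectors + BM25 index
}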
Step 7: Build the retrieval service
The retrieval service at src/services/retrieval.ts exposes retrieveContext which calls retriever.retrieve() with options for topK, metadata filters, and retrieval mode (hybrid, vector-only, or bm25-only):
The function defaults to hybrid retrieval mode. Override with retrievalMode: 'vector' or retrievalMode: 'bm25' for specific use cases.
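A minimal sketch, assuming retrieve() accepts an options object with topK, filters, and a mode key (the exact option names may differ in @reaatech/hybrid-rag-retrieval):
typescript
// src/services/retrieval.ts (sketch): option names on retrieve() are
// assumptions inferred from the recipe's description.
import { getRetriever } from './vector-index.js';

export interface RetrieveOptions {
  topK?: number;
  filters?: Record<string, unknown>;
  retrievalMode?: 'hybrid' | 'vector' | 'bm25';
}

export async function retrieveContext(query: string, options: RetrieveOptions = {}) {
  const retriever = getRetriever();
  return retriever.retrieve(query, {
    topK: options.topK ?? 5,
    filters: options.filters,
    mode: options.retrievalMode ?? 'hybrid', // hybrid = vector + BM25 fused via RRF
  });
}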
Step 8: Wire up the Perplexity chat service
The chat service at src/services/chat.ts initializes the Perplexity SDK client and wraps chatCompletionsPost. It builds a system message with the retrieved context and a user message with the query:
typescript
import Perplexity, {
  ChatCompletionsPostRequestModelEnum,
  ChatCompletionsPostRequestMessagesInner,
} from 'perplexity-sdk';
import { getRequiredEnv } from '../lib/config.js';

let client: ReturnType<typeof createClient> | null = null;

function createClient() {
  const p = new Perplexity({ apiKey: getRequiredEnv('PERPLEXITY_API_KEY') });
  return p.client();
}

function getClient() {
  if (!client) {
    client = createClient();
  }
  return client;
}

export async function generateAnswer(
  query: string,
  context: string,
  model?: string,
): Promise<{ answer: string; usage: { inputTokens: number; outputTokens: number } }> {
  const c = getClient();
  const systemContent = `You are a market research assistant. Use the following documents as context to answer the user's question accurately. If the context does not contain enough information, say so clearly.\n\nContext:\n${context}`;

  const systemMessage = new ChatCompletionsPostRequestMessagesInner();
  systemMessage.role = 'system';
  systemMessage.content = systemContent;

  const userMessage = new ChatCompletionsPostRequestMessagesInner();
  userMessage.role = 'user';
  userMessage.content = query;

  const result = await c.chatCompletionsPost({
    model: (model ?? ChatCompletionsPostRequestModelEnum.Mistral7bInstruct) as ChatCompletionsPostRequestModelEnum,
    messages: [systemMessage, userMessage],
  });

  const answer = result.choices?.[0]?.message?.content ?? '';
  const inputTokens = result.usage?.promptTokens ?? 0;
  const outputTokens = result.usage?.completionTokens ?? 0;

  return {
    answer,
    usage: { inputTokens, outputTokens },
  };
}
The default model is Mistral7bInstruct. The budget engine can suggest a cheaper model when spending is tight, and generateAnswer accepts that suggestion via the optional model parameter.
Step 9: Integrate the budget controller
The budget service at src/services/budget.ts uses BudgetController from @reaatech/agent-budget-engine with SpendStore from @reaatech/agent-budget-spend-tracker. It defines a wildcard user budget on startup with a 10 USD default limit, 80% soft cap, and 100% hard cap:
typescript
import { BudgetController } from '@reaatech/agent-budget-engine';
import { SpendStore } from '@reaatech/agent-budget-spend-tracker';
import { BudgetScope } from '@reaatech/agent-budget-types';
import type { BudgetCheckResult, BudgetState, SpendEntry } from '@reaatech/agent-budget-types';
import { PerplexityPricingProvider } from '../lib/pricing.js';
import { config } from '../lib/config.js';
import { nanoid } from 'nanoid';

const store = new SpendStore();
const pricing = new PerplexityPricingProvider();
const controller = new BudgetController({ spendTracker: store, pricing });

controller.defineBudget({
  scopeType: BudgetScope.User,
  scopeKey: '*',
  limit: config.budget.defaultLimit,
  policy: {
    softCap: 0.8,
    hardCap: 1.0,
    autoDowngrade: [],
    disableTools: [],
  },
});

export function checkBudget(
  scopeKey: string,
  estimatedCost: number,
  modelId: string,
  tools: string[] = [],
): BudgetCheckResult {
  return controller.check({
    scopeType: BudgetScope.User,
    scopeKey,
    estimatedCost,
    modelId,
    tools,
  });
}

export function recordSpend(
  scopeKey: string,
  cost: number,
  inputTokens: number,
  outputTokens: number,
  modelId: string,
  provider: string,
): void {
  const entry: SpendEntry = {
    requestId: nanoid(),
    scopeType: BudgetScope.User,
    scopeKey,
    cost,
    inputTokens,
    outputTokens,
    modelId,
    provider,
    timestamp: new Date(),
  };
  controller.record(entry);
}

export function getBudgetState(scopeKey: string): BudgetState | undefined {
  return controller.getState(BudgetScope.User, scopeKey);
}

export { controller };
checkBudget returns { allowed, action, suggestedModel, disabledTools }. The chat route checks allowed === false or action === EnforcementAction.HardStop and responds with 402.
Step 10: Add pricing provider
The pricing provider at src/lib/pricing.ts implements the PricingProvider interface expected by BudgetController. It defines Perplexity model pricing and estimates cost per request:
Output tokens are estimated as 25% of input tokens. Update the pricing table when Perplexity changes rates.
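The sketch below shows the shape of such a provider. The estimateCost method name is an assumption about the PricingProvider contract, and the per-token rates are illustrative placeholders rather than real Perplexity prices:
typescript
// src/lib/pricing.ts (sketch): method names and rates are placeholders.
interface ModelPricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

const PRICING: Record<string, ModelPricing> = {
  // Placeholder rates: replace with current Perplexity pricing.
  'mistral-7b-instruct': { inputPerMillion: 0.2, outputPerMillion: 0.2 },
};

export class PerplexityPricingProvider {
  estimateCost(modelId: string, inputTokens: number): number {
    const rates = PRICING[modelId];
    if (!rates) return 0;
    // Output tokens are unknown before the call; estimate them as 25% of input.
    const outputTokens = Math.ceil(inputTokens * 0.25);
    return (
      (inputTokens / 1_000_000) * rates.inputPerMillion +
      (outputTokens / 1_000_000) * rates.outputPerMillion
    );
  }
}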
Step 11: Create the ingest API route
The ingest route at app/api/ingest/route.ts accepts multipart uploads, runs processDocument, then indexChunks, and returns the document ID and chunk count:
Missing file returns 400. Oversized file returns 413. Any other error returns 400 with the error message.
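A sketch of the handler, assuming the ingestion service lives at src/services/ingestion.ts (the recipe does not name that path) and using nanoid for document IDs:
typescript
// app/api/ingest/route.ts (sketch): import paths assume the layout
// described in this recipe.
import { NextResponse } from 'next/server';
import { nanoid } from 'nanoid';
import { processDocument } from '../../../src/services/ingestion.js';
import { indexChunks } from '../../../src/services/vector-index.js';

const MAX_FILE_SIZE = 10 * 1024 * 1024; // matches the DocumentValidator limit

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('file');
  if (!(file instanceof File)) {
    return NextResponse.json({ error: 'Missing file' }, { status: 400 });
  }
  if (file.size > MAX_FILE_SIZE) {
    return NextResponse.json({ error: 'File too large' }, { status: 413 });
  }
  try {
    const documentId = nanoid();
    const buffer = new Uint8Array(await file.arrayBuffer());
    const chunks = await processDocument(buffer, file.type, documentId);
    await indexChunks(chunks);
    return NextResponse.json({ documentId, chunkCount: chunks.length });
  } catch (err) {
    const message = err instanceof Error ? err.message : 'Ingest failed';
    return NextResponse.json({ error: message }, { status: 400 });
  }
}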
Step 12: Create the chat API route
The chat route at app/api/chat/route.ts orchestrates retrieval, budget check, generation, and spend recording in sequence. It parses the request with Zod, checks budget before calling Perplexity, records spend after a successful response, and traces the generation span with Langfuse:
The response includes answer, sources (the top retrieved chunks with scores), and usage (token counts). Budget exceeded returns 402.
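The sketch below shows the orchestration order. The retrieved-chunk shape is assumed to match the sources type used by the server actions in Step 14, the cost estimate is a placeholder where the real route would call the pricing provider, and the Langfuse span is omitted for brevity:
typescript
// app/api/chat/route.ts (sketch): retrieve, check budget, generate, record.
import { NextResponse } from 'next/server';
import { ChatRequestSchema } from '../../../src/lib/schemas.js';
import { retrieveContext } from '../../../src/services/retrieval.js';
import { generateAnswer } from '../../../src/services/chat.js';
import { checkBudget, recordSpend } from '../../../src/services/budget.js';

// Assumed result shape; adjust to whatever HybridRetriever actually returns.
type RetrievedChunk = { chunkId: string; documentId: string; content: string; score: number };

export async function POST(request: Request) {
  let body: { query: string; userId: string };
  try {
    body = ChatRequestSchema.parse(await request.json());
  } catch {
    return NextResponse.json({ error: 'Invalid request body' }, { status: 400 });
  }

  // 1. Retrieve the most relevant chunks for the query.
  const results = (await retrieveContext(body.query, { topK: 5 })) as RetrievedChunk[];
  const context = results.map((r) => r.content).join('\n\n');

  // 2. Check the budget before spending anything. The cost here is a
  // placeholder; the real route derives it from the pricing provider.
  // The full implementation also checks decision.action against
  // EnforcementAction.HardStop.
  const modelId = 'mistral-7b-instruct'; // placeholder id for the default model
  const estimatedCost = 0.01;
  const decision = checkBudget(body.userId, estimatedCost, modelId);
  if (!decision.allowed) {
    return NextResponse.json({ error: 'Budget exceeded' }, { status: 402 });
  }

  // 3. Generate, honoring a cheaper model suggestion when one is given.
  const { answer, usage } = await generateAnswer(body.query, context, decision.suggestedModel);

  // 4. Record spend only after a successful response.
  recordSpend(body.userId, estimatedCost, usage.inputTokens, usage.outputTokens, modelId, 'perplexity');

  return NextResponse.json({ answer, sources: results, usage });
}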
Step 13: Add Langfuse tracing
The langfuse service at src/services/langfuse.ts initializes the client lazily when keys are present, and provides trace helpers that the route handlers call:
When LANGFUSE_PUBLIC_KEY or LANGFUSE_SECRET_KEY are absent, getTracer() returns null and all trace calls become no-ops.
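A sketch of the lazy initialization follows; traceGeneration stands in for whatever trace helpers the routes actually call:
typescript
// src/services/langfuse.ts (sketch): the helper name is an assumption.
import { Langfuse } from 'langfuse';

let tracer: Langfuse | null = null;
let initialized = false;

export function getTracer(): Langfuse | null {
  if (!initialized) {
    initialized = true;
    const publicKey = process.env.LANGFUSE_PUBLIC_KEY;
    const secretKey = process.env.LANGFUSE_SECRET_KEY;
    // Tracing is optional: without both keys, getTracer() stays null
    // and every trace helper below becomes a no-op.
    if (publicKey && secretKey) {
      tracer = new Langfuse({ publicKey, secretKey });
    }
  }
  return tracer;
}

export function traceGeneration(name: string, input: unknown, output: unknown): void {
  const t = getTracer();
  if (!t) return; // no-op when keys are absent
  const trace = t.trace({ name });
  trace.generation({ name, input, output });
}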
Step 14: Add server actions for the frontend
The app/actions.ts file exposes ingestDocument and sendMessage as server actions, which the React frontend calls directly without needing fetch URLs:
typescript
'use server';

const BASE_URL = process.env.NEXT_PUBLIC_BASE_URL || 'http://localhost:3000';

export async function ingestDocument(
  formData: FormData,
): Promise<{ documentId: string; chunkCount: number }> {
  const res = await fetch(`${BASE_URL}/api/ingest`, {
    method: 'POST',
    body: formData,
  });
  if (!res.ok) {
    const err = await res.json() as { error: string };
    throw new Error(err.error || 'Ingest failed');
  }
  return res.json() as Promise<{ documentId: string; chunkCount: number }>;
}

export async function sendMessage(
  query: string,
  userId: string,
): Promise<{
  answer: string;
  sources: Array<{ chunkId: string; documentId: string; content: string; score: number }>;
  usage: { inputTokens: number; outputTokens: number };
}> {
  const res = await fetch(`${BASE_URL}/api/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, userId }),
  });
  if (!res.ok) {
    const err = await res.json() as { error: string };
    throw new Error(err.error || 'Chat failed');
  }
  return res.json() as Promise<{
    answer: string;
    sources: Array<{ chunkId: string; documentId: string; content: string; score: number }>;
    usage: { inputTokens: number; outputTokens: number };
  }>;
}
The frontend at app/page.tsx uses these actions to handle file uploads and chat messages in a single-page interface.
Step 15: Run the tests
The test suite covers ingestion, retrieval, chat, budget, and the route handlers. Run it with:
terminal
pnpm test
Expected output: all 91 tests pass across 35 test suites, with zero failures. The unit tests mock external dependencies with vi.mock and MSW interceptors, so they run without any live network calls.
Next steps
Run evaluation against a ground-truth dataset to measure retrieval quality: create a .jsonl file with query_id, query, relevant_docs, and relevant_chunks, then POST to /api/evaluate (an example line is sketched after this list).
Swap the in-memory SpendStore for a Redis-backed store to persist budget state across restarts.
Add authentication middleware so each user gets their own budget scope instead of sharing the wildcard.
Experiment with different ChunkingStrategy values (fixed-size, recursive, sliding-window) to find the best chunking for your document types.
Connect the evaluation API to a CI pipeline so chunking and retrieval changes are validated automatically before merge.
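For the evaluation dataset in the first item, a line might look like this (the IDs and query are hypothetical):
terminal
printf '%s\n' '{"query_id":"q1","query":"What is the projected market size for segment X?","relevant_docs":["doc-abc"],"relevant_chunks":["doc-abc-chunk-3"]}' >> eval-dataset.jsonl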