Small retailers waste hours searching through scattered PDFs and spreadsheets to answer customer questions about inventory, causing delays and lost sales.
By the end of this tutorial you’ll have a fully local RAG (retrieval-augmented generation) knowledge base running on your own machine. Staff can ask natural-language questions like “Do we have Widget X in stock?” and get answers grounded in your indexed product documents. The stack uses Next.js for the UI, Ollama for the LLM, fastembed for local embeddings, and LanceDB as a serverless vector store.
When asked to proceed, type y. The command creates the project structure in the current directory.
Expected output: The command prints its progress and ends with Ready — your files are in place.
Step 2: Install dependencies
The project needs the Ollama client, the LanceDB vector store, fastembed for local embeddings, the @reaatech hybrid-rag packages, and a few supporting utilities.
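The exact dependency list lives in the generated package.json; if you are installing by hand, a representative set of packages — inferred from the imports used later in this recipe, so double-check it against the generated file — looks like this:
terminal
pnpm add ollama @lancedb/lancedb fastembed zod commander \
  @reaatech/hybrid-rag @reaatech/hybrid-rag-pipeline \
  @reaatech/hybrid-rag-embedding @reaatech/hybrid-rag-ingestion
pnpm add -D tsx vitest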
OLLAMA_HOST tells the Ollama client where to connect. LANCEDB_PATH is the directory where LanceDB stores its database. DEFAULT_MODEL is the model used for answer generation.
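For reference, a .env.local with illustrative values might look like the block below. The Ollama host matches the client's default; the LanceDB path and model name are placeholders — use whatever directory you prefer and whichever model you pulled with ollama pull.
env
OLLAMA_HOST=http://127.0.0.1:11434
LANCEDB_PATH=./data/lancedb
DEFAULT_MODEL=llama3.1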
Step 5: Configure Next.js for native modules
LanceDB and fastembed include native binaries that Next.js must not bundle. Open next.config.ts and add the serverExternalPackages setting.
ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  serverExternalPackages: ["@lancedb/lancedb", "fastembed", "@anush008/tokenizers"],
};

export default nextConfig;
The serverExternalPackages array tells Next.js not to bundle these packages for the server runtime — they are loaded from node_modules with native Node.js require, so their prebuilt binaries keep working.
Step 6: Create shared types
Create src/types.ts to hold the request/response schemas and re-export the types from @reaatech/hybrid-rag.
ChatRequestSchema is the Zod schema the API route uses to validate incoming requests. IngestionOptionsSchema validates what the CLI passes to the ingestion service.
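A minimal sketch of what the file might contain follows; the query field matches the curl request used later in this recipe, while the optional fields and the re-exported type names are assumptions to check against the generated code.
ts
import { z } from "zod";

// Request body the chat API route accepts.
export const ChatRequestSchema = z.object({
  query: z.string().min(1),
  topK: z.number().int().positive().optional(),
  retrievalMode: z.enum(["hybrid", "vector", "text"]).optional(),
});
export type ChatRequest = z.infer<typeof ChatRequestSchema>;

// Options the CLI passes to the ingestion service.
export const IngestionOptionsSchema = z.object({
  dir: z.string(),
  dbPath: z.string().optional(),
});
export type IngestionOptions = z.infer<typeof IngestionOptionsSchema>;

// Re-export the shared pipeline types.
export type { RetrievalResult, Chunk, ChunkingConfig } from "@reaatech/hybrid-rag";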
Step 7: Build the LanceDB adapter
Create src/lib/lancedb-adapter.ts. This is the core of the retrieval layer — it wraps the LanceDB SDK and implements the interface the pipeline expects.
ts
import { connect, type Connection, Table } from "@lancedb/lancedb";
import type { RetrievalResult } from "@reaatech/hybrid-rag";

export interface StoredChunk {
  id: string;
  documentId: string;
  content: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

export class LanceDBStore {
  private db: Connection | null = null;
  private table: Table | null = null;
initialize() creates the database directory and the chunks table if they don’t exist. hybridSearch() runs both vector and full-text search in parallel, then fuses the ranked results with a weighted score. The vectorWeight parameter (default 0.7) controls how much weight goes to the vector score versus the BM25 score.
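The fusion itself is conceptually simple. Here is an illustrative sketch — not the adapter's actual code, and it assumes both scores have already been normalized to [0, 1]:
ts
// Weighted fusion of a vector-similarity score and a BM25 score.
function fuseScores(
  vectorScore: number,
  textScore: number,
  vectorWeight = 0.7
): number {
  // 70% of the final score comes from vector similarity, 30% from BM25 by default.
  return vectorWeight * vectorScore + (1 - vectorWeight) * textScore;
}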
Step 8: Build the fastembed embedder
Create src/lib/embedding-provider.ts. This wraps the fastembed package to provide the embedder interface.
ts
import type { FlagEmbedding as FlagEmbeddingClass } from "fastembed";
import type { RetrievalResult } from "@reaatech/hybrid-rag";

export class FastembedEmbedder {
  private model: FlagEmbeddingClass | null = null;
  private readonly modelName: string;
  private _initialized = false;

  constructor(modelName = "BAAI/bge-small-en-v1.5") {
    this.modelName = modelName;
  }

  get dimension(): number {
    if (this.modelName.includes("small")) return 384;
    if (this.modelName.includes("base")) return 768;
    return 384;
  }

  async initialize(): Promise<void> {
    if (this._initialized) return;
    try {
      const fastembed = await import("fastembed");
      this.model = await fastembed.FlagEmbedding.init({
        model: fastembed.EmbeddingModel.BGESmallENV15,
      });
      this._initialized = true;
    } catch (error) {
      throw new Error(
        `Failed to initialize fastembed model ${this.modelName}: ${
          error instanceof Error ? error.message : String(error)
        }`
      );
    }
  }

  async embed(text: string): Promise<number[]> {
    if (!this._initialized || !this.model) {
      throw new Error("Embedder not initialized. Call initialize() first.");
    }
    return this.model.queryEmbed(text);
  }

  async embedBatch(texts: string[]): Promise<number[][]> {
    if (!this._initialized || !this.model) {
      throw new Error("Embedder not initialized. Call initialize() first.");
    }
    const allVectors: number[][] = [];
    const generator = this.model.passageEmbed(texts);
    for await (const batch of generator) {
      allVectors.push(...batch);
    }
    return allVectors;
  }
}

export interface StoredChunk {
  id: string;
  documentId: string;
  content: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

export { type RetrievalResult };
initialize() loads BAAI/bge-small-en-v1.5, a compact 384-dimension embedding model. embed() is for single queries; embedBatch() handles bulk indexing through an async generator.
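A quick usage sketch — the import path assumes you are calling it from a script under src/, and the sample passages are made up:
ts
import { FastembedEmbedder } from "./lib/embedding-provider.js";

const embedder = new FastembedEmbedder();
await embedder.initialize(); // downloads the model on first run

// Single query embedding — 384 numbers for bge-small-en-v1.5.
const queryVector = await embedder.embed("Do we have Widget X in stock?");

// Bulk passage embeddings for indexing.
const passageVectors = await embedder.embedBatch([
  "Widget X — 14 units in the back room",
  "Widget Y — discontinued as of March",
]);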
Step 9: Build the retrieval service
Create src/services/retrieval.ts. This orchestrates embedding the query and calling the store’s search methods.
ts
import type { RetrievalResult } from "@reaatech/hybrid-rag";
import { type QueryOptions, RAGPipeline } from "@reaatech/hybrid-rag-pipeline";
import { type EmbeddingService } from "@reaatech/hybrid-rag-embedding";
import { type EmbedderPort, type StorePort } from "./ingestion.js";

export type RetrievalMode = "hybrid" | "vector" | "text";

export class RetrievalService {
  private readonly embedder: EmbedderPort;
  private readonly store: StorePort;
  private readonly defaultTopK: number;

  constructor(embedder: EmbedderPort, store: StorePort, defaultTopK = 10) {
    this.embedder = embedder;
    this.store = store;
    this.defaultTopK = defaultTopK;
  }

  async retrieve(
    query: string,
    topK = this.defaultTopK,
    filter?: Record<string, unknown>,
    retrievalMode: RetrievalMode = "hybrid"
  ): Promise<RetrievalResult[]> {
    if (retrievalMode === "vector") {
      const vector = await this.embedder.embed(query);
      return this.store.vectorSearch(vector, topK, filter);
    }
    if (retrievalMode === "text") {
      return this.store.fullTextSearch(query, topK);
    }
    const vector = await this.embedder.embed(query);
    return this.store.hybridSearch(vector, query, topK, 0.7);
  }
}

export { RAGPipeline, type QueryOptions, type EmbeddingService };
By default it runs hybrid mode, which combines vector similarity and full-text search with a 0.7 weight on the vector results. Switch to "vector" or "text" by passing the retrievalMode argument.
Step 10: Build the Ollama service
Create src/services/llm.ts. This wraps the ollama npm package and handles the generate call.
ts
import { Ollama } from "ollama";

export interface OllamaConfig {
  host?: string;
  model: string;
}

export class OllamaService {
  private readonly client: Ollama;
  private readonly model: string;

  constructor(config: OllamaConfig) {
    this.model = config.model;
    this.client = new Ollama({
      host: config.host ?? "http://127.0.0.1:11434",
    });
  }

  async generateAnswer(query: string, context: string): Promise<string> {
    const systemPrompt =
      "You are a helpful retail inventory assistant. Use the following context to answer the user's question. If the context does not contain the answer, say so.\n\nContext:\n" +
      context;
    try {
      const response = await this.client.chat({
        model: this.model,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: query },
        ],
      });
      return response.message.content;
    } catch (error) {
      if (error instanceof Error) {
        if (
          error.message.includes("connection refused") ||
          error.message.includes("ECONNREFUSED")
        ) {
          throw new Error("Ollama unreachable at http://127.0.0.1:11434");
        }
        if (error.message.includes("model not found")) {
          throw new Error(`Model not found: ${this.model}`);
        }
      }
      throw error;
    }
  }

  async *streamAnswer(
    query: string,
    context: string
  ): AsyncGenerator<string> {
    const systemPrompt =
      "You are a helpful retail inventory assistant. Use the following context to answer the user's question.\n\nContext:\n" +
      context;
    const stream = await this.client.chat({
      model: this.model,
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: query },
      ],
      stream: true,
    });
    for await (const part of stream) {
      yield part.message.content;
    }
  }
}
generateAnswer() builds a system prompt that injects the retrieved context before the user’s query. streamAnswer() is an async generator if you want to stream tokens back to the client later.
Step 11: Build the ingestion service
Create src/services/ingestion.ts. This handles loading, validating, chunking, and embedding documents before writing them to the store.
ts
import {
  DocumentLoader,
  TextPreprocessor,
  DocumentValidator,
  chunkDocument,
  UnsupportedFormatError,
  FileSizeExceededError,
  DocumentParseError,
} from "@reaatech/hybrid-rag-ingestion";
import { type ChunkingConfig, type Chunk } from "@reaatech/hybrid-rag";

export interface EmbedderPort {
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
}

export interface StorePort {
ingestFile() loads one document, validates it, chunks it with recursive strategy, embeds each chunk with fastembed, and writes the results to LanceDB. ingestDirectory() globs for PDF, Markdown, plain text, and HTML files and processes them one by one, skipping files that fail to parse.
Step 12: Create the Chat API route
Create app/api/chat/route.ts. This is the endpoint the UI calls — it runs retrieval against LanceDB and then calls Ollama to generate an answer.
Services are lazily initialized on the first request and cached in module-level variables, so subsequent requests reuse the same instances. The retrieved context is truncated to 3,000 characters before it is sent to the model, to avoid exceeding the model’s context window.
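A minimal sketch of how the route might be wired is below. The import paths, the store constructor argument, the fallback model name, and the exact fields on RetrievalResult are assumptions here — compare them against the generated file.
ts
import { NextResponse } from "next/server";
import { ChatRequestSchema } from "../../../src/types";
import { LanceDBStore } from "../../../src/lib/lancedb-adapter";
import { FastembedEmbedder } from "../../../src/lib/embedding-provider";
import { RetrievalService } from "../../../src/services/retrieval";
import { OllamaService } from "../../../src/services/llm";

// Module-level cache: built on the first request, reused afterwards.
let services: { retrieval: RetrievalService; llm: OllamaService } | null = null;

async function getServices() {
  if (services) return services;
  const embedder = new FastembedEmbedder();
  await embedder.initialize();
  const store = new LanceDBStore(process.env.LANCEDB_PATH ?? "./data/lancedb");
  await store.initialize();
  services = {
    retrieval: new RetrievalService(embedder, store),
    llm: new OllamaService({
      host: process.env.OLLAMA_HOST,
      model: process.env.DEFAULT_MODEL ?? "llama3.1",
    }),
  };
  return services;
}

export async function POST(request: Request): Promise<NextResponse> {
  const { query } = ChatRequestSchema.parse(await request.json());
  const { retrieval, llm } = await getServices();

  const sources = await retrieval.retrieve(query);
  // Concatenate the retrieved chunks and cap the context at 3000 characters.
  const context = sources
    .map((s) => s.content)
    .join("\n\n")
    .slice(0, 3000);
  const answer = await llm.generateAnswer(query, context);

  return NextResponse.json({ answer, sources });
}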
Step 13: Create the Chat UI
Create app/page.tsx — a client component with a simple chat interface. It sends a POST to /api/chat and renders the answer with its source snippets.
tsx
"use client";import type { ReactElement, FormEvent } from "react";import { useState, useRef, useEffect } from "react";interface Message { role: "user" | "assistant"; content: string; sources?: { content: string; documentId: string; score: number }[];}export default function Home(): ReactElement { const [query, setQuery] = useState(""); const [messages, setMessages]
Update app/layout.tsx to set the page metadata.
tsx
import type { ReactNode } from "react";
import type { ReactElement } from "react";

export const metadata = {
  title: "Ollama RAG Knowledge Base",
  description:
    "Self-hosted knowledge base for retail inventory using natural language, fully local with Ollama.",
};

export default function RootLayout({
  children,
}: {
  children: ReactNode;
}): ReactElement {
  return (
    <html lang="en">
      <body>{children}</body>
    </html>
  );
}
Step 14: Create the CLI for indexing
Create src/cli/index.ts. This command-line tool handles batch indexing of PDF, Markdown, and text documents into LanceDB.
ts
#!/usr/bin/env node
import { Command } from "commander";
import { spawn } from "child_process";
import { fileURLToPath } from "url";
import { LanceDBStore } from "../lib/lancedb-adapter.js";
import { FastembedEmbedder } from "../lib/embedding-provider.js";

interface IndexOptions {
  dir: string;
  dbPath: string;
  model: string;
  ollamaHost: string;
}

interface StatsOptions {
  dbPath: string;
Run the CLI with node --import=tsx src/cli/index.ts. The index command initializes LanceDB, loads the embedder, and then delegates to the hybrid-rag ingest CLI. The stats command prints the database path.
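For example (the --dir flag matches the indexing command used later in this recipe; stats is shown without flags, which assumes the default database path):
terminal
node --import=tsx src/cli/index.ts index --dir ./docs
node --import=tsx src/cli/index.ts stats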
Step 15: Run the tests
The test suite covers the API route, retrieval service, and LanceDB adapter. Run it with:
terminal
pnpm test
Expected output: Vitest prints a summary for each test file — for example, Test Files 8 passed and Tests 20+ passed. The JSON coverage report is written to vitest-report.json.
Step 16: Start the dev server and test the API
Start the Next.js dev server:
terminal
pnpm dev
Expected output: The terminal prints Ready and Local: http://localhost:3000.
In another terminal, test the chat endpoint with a curl request:
terminal
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "Do we have laptops in stock?"}'
Expected output: A JSON response with an answer field and a sources array. The exact answer depends on what documents you’ve indexed.
Add documents to a ./docs directory and index them with node --import=tsx src/cli/index.ts index --dir ./docs to populate the knowledge base before asking questions.
Try different retrieval modes by passing retrievalMode: "text" in the API route to use only BM25 full-text search.
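For example, in the route handler the retrieve call from Step 9 takes the mode as its fourth argument (the other arguments shown are just the defaults):
ts
// BM25-only retrieval — skips the embedding step entirely.
const sources = await retrieval.retrieve(query, 10, undefined, "text");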
Expand the chat UI with streaming support using the streamAnswer method in src/services/llm.ts.
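One way to do that, sketched below, is to wrap streamAnswer() in a web ReadableStream and return it from a route handler. The helper name and import path are illustrative, not part of the generated project.
ts
import { OllamaService } from "../../../src/services/llm";

// Turn the async generator from streamAnswer() into a streaming HTTP response.
export function streamToResponse(
  llm: OllamaService,
  query: string,
  context: string
): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const token of llm.streamAnswer(query, context)) {
          controller.enqueue(encoder.encode(token));
        }
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}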