A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
This recipe builds an AI-powered natural-language product search for Square Online stores using a RAG (Retrieval-Augmented Generation) pipeline. You’ll index a Square Catalog into a Qdrant vector database, then serve search queries through Google Gemini — with semantic caching, cost-aware model routing, and multi-turn conversation support. By the end, you’ll have a set of Next.js API endpoints that let shoppers ask “red sneakers size 10” and get a human-like answer pointing them to the right products.
Prerequisites
Node.js >= 22 and pnpm 10 installed
Docker — to run Qdrant locally via docker run -p 6333:6333 qdrant/qdrant
Square account — create an app at developer.squareup.com, generate an OAuth token, and grant the ITEMS_READ scope
Google Cloud Platform project with the Vertex AI API enabled and a service account key with the aiplatform.user role
Langfuse account (optional) — for LLM observability and tracing
Basic familiarity with TypeScript, Next.js App Router, and vector databases
Step 1: Bootstrap the Next.js project and install dependencies
Start from an empty directory. Initialize a Next.js project with the App Router:
The scaffold creates package.json, tsconfig.json, next.config.ts, and the app/ directory at the project root. Next, create a src/ directory for your application code, then install all the dependencies this recipe needs:
Expected output: package.json now lists all exact-pinned dependencies and node_modules/ is populated.
Step 2: Configure environment variables with Zod
Create a config module that validates all environment variables at startup. This ensures you get a clear error if something is missing before any request hits.
Create src/services/config.ts:
```ts
import { z } from "zod";

const ConfigSchema = z.object({
  SQUARE_ACCESS_TOKEN: z.string(),
  SQUARE_LOCATION_ID: z.string(),
  QDRANT_URL: z.url().default("http://localhost:6333"),
  QDRANT_API_KEY: z.string().optional(),
  QDRANT_COLLECTION_NAME: z.string().default("product-catalog"),
  GOOGLE_CLOUD_PROJECT: z.string(),
  GOOGLE_CLOUD_LOCATION: z.string().default("us-central1"),
  LANGFUSE_PUBLIC_KEY: z.string().optional(),
  LANGFUSE_SECRET_KEY: z.string().optional(),
  LANGFUSE_HOST: z.string().optional(),
});

export type Config = z.infer<typeof ConfigSchema>;

export function parseConfig(): Config {
  return ConfigSchema.parse(process.env);
}
```
Now fill in .env.example with placeholder values. Every variable that parseConfig reads must be listed here so other developers know what to configure:
Copy it to .env and fill in your real credentials:
```sh
cp .env.example .env
```
Expected output: parseConfig() returns a typed Config object when all required env vars are set, and throws a Zod error with a clear message when one is missing.
Step 3: Define shared types
Create src/types.ts with all the types shared across services and route handlers:
Expected output: These types are imported by every service module and route handler. The SearchResponse doubles as the API contract for the frontend chatbot.
Step 4: Create the Square catalog client
This service talks to the Square Catalog API to fetch products. It handles the paginated listing and individual product lookups.
SquareError is caught and re-thrown as a standard Error with a descriptive prefix — this keeps error handling uniform in the route handlers.
Type guards (isItemObject, isItemVariationObject) narrow the catalog objects so you don’t have to cast at every access site.
Expected output: createSquareClient(config) returns a SquareCatalogService whose listProducts() yields an array of Product objects.
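The three ideas in this step (cursor pagination, type-guard narrowing, and re-throwing with a descriptive prefix) can be sketched without the Square SDK. Everything below is illustrative: the trimmed-down object shapes, the FetchPage callback, and listAllItems are assumptions made for this sketch, not the SDK's types or the recipe's actual service.

```typescript
// Hypothetical, trimmed-down catalog shapes; the real SDK types carry many more fields.
interface ItemObject {
  type: "ITEM";
  id: string;
  itemData?: { name?: string; description?: string };
}
interface ItemVariationObject {
  type: "ITEM_VARIATION";
  id: string;
  itemVariationData?: { name?: string };
}
interface OtherObject {
  type: "CATEGORY" | "TAX" | "DISCOUNT";
  id: string;
}
type CatalogObject = ItemObject | ItemVariationObject | OtherObject;

// Type guards: after the check, TypeScript knows which variant it is handling.
function isItemObject(obj: CatalogObject): obj is ItemObject {
  return obj.type === "ITEM";
}
function isItemVariationObject(obj: CatalogObject): obj is ItemVariationObject {
  return obj.type === "ITEM_VARIATION";
}

// One page of results plus the cursor for the next page, if any.
interface CatalogPage {
  objects: CatalogObject[];
  cursor?: string;
}
// Assumed callback that fetches a single page; a real one would call the Square API.
type FetchPage = (cursor?: string) => Promise<CatalogPage>;

// Follows cursors until the API stops returning one, keeping only ITEM objects.
async function listAllItems(fetchPage: FetchPage): Promise<ItemObject[]> {
  const items: ItemObject[] = [];
  let cursor: string | undefined;
  try {
    do {
      const page = await fetchPage(cursor);
      items.push(...page.objects.filter(isItemObject));
      cursor = page.cursor;
    } while (cursor);
  } catch (err) {
    // Re-throw as a plain Error so route handlers see one uniform error shape.
    const message = err instanceof Error ? err.message : String(err);
    throw new Error(`Square catalog error: ${message}`);
  }
  return items;
}
```

Passing the page fetcher in as a callback keeps the loop easy to test with canned pages, which is the same reason the recipe mocks Square in its test suite.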
Step 5: Create the fastembed embedding service
This service wraps the fastembed library to convert product descriptions and search queries into 384-dimensional vectors. It’s the bridge between text and vector space.
Create src/services/embedder.ts:
```ts
import { FlagEmbedding, EmbeddingModel } from "fastembed";

export class LocalEmbedder {
  private model: FlagEmbedding;

  constructor(model: FlagEmbedding) {
    this.model = model;
  }

  async *embedDocuments(
    texts: string[],
    batchSize?: number,
  ): AsyncGenerator<number[][]> {
    yield* this.model.embed(texts, batchSize ?? 256);
  }

  async embedQuery(text: string): Promise<number[]> {
    return this.model.queryEmbed(text);
  }

  async embed(text: string): Promise<number[]> {
    const generator = this.model.embed([text], 1);
    for await (const batch of generator) {
      return batch[0];
    }
    throw new Error("Embedding produced no output");
  }
}

export async function createEmbedder(): Promise<LocalEmbedder> {
  const model = await FlagEmbedding.init({
    model: EmbeddingModel.BGESmallEN,
  });
  return new LocalEmbedder(model);
}
```
Note the async generator pattern: embedDocuments streams batches of vectors so you don’t hold all embeddings in memory at once during indexing. embedQuery delegates to fastembed’s dedicated queryEmbed method, which is intended for single search queries rather than batches of documents.
Expected output: createEmbedder() loads the BAAI/bge-small-en-v1.5 model and returns a LocalEmbedder that converts text into 384-element float arrays.
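To see why the async generator keeps memory flat, here is a minimal dependency-free sketch of the same batching pattern. fakeEmbed stands in for the real model and returns 3-element vectors instead of 384; both it and the summarize helper are assumptions for illustration only.

```typescript
// Stand-in for a real embedding model (an assumption for this sketch):
// maps each text to a tiny 3-element vector instead of 384 dimensions.
function fakeEmbed(text: string): number[] {
  const words = text.split(/\s+/).filter(Boolean).length;
  return [text.length, words, 1];
}

// Yields one batch of vectors at a time, so the consumer only ever holds
// up to `batchSize` embeddings from this generator at once.
async function* embedInBatches(
  texts: string[],
  batchSize = 256,
): AsyncGenerator<number[][]> {
  for (let start = 0; start < texts.length; start += batchSize) {
    yield texts.slice(start, start + batchSize).map((text) => fakeEmbed(text));
  }
}

// Consumes the stream batch by batch, as an indexer would before each upsert.
async function summarize(
  texts: string[],
  batchSize: number,
): Promise<{ batches: number; vectors: number }> {
  let batches = 0;
  let vectors = 0;
  for await (const batch of embedInBatches(texts, batchSize)) {
    batches += 1;
    vectors += batch.length;
  }
  return { batches, vectors };
}
```

With five texts and a batch size of two, the consumer sees three batches (2, 2, and 1 vectors) and can write each one out before the next is produced.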
Step 6: Create the cache adapter for semantic deduplication
The @reaatech/llm-cache package provides a semantic cache with Qdrant-backed vector storage. Before you configure the cache engine, you need an embedder that matches its EmbeddingProvider interface.
Create src/services/cache-embedder.ts:
```ts
import type { EmbeddingProvider } from "@reaatech/llm-cache";
import type { LocalEmbedder } from "./embedder.js";

export class FastembedEmbedder implements EmbeddingProvider {
  private delegate: LocalEmbedder;

  constructor(delegate: LocalEmbedder) {
    this.delegate = delegate;
  }

  async embed(text: string): Promise<number[]> {
    return this.delegate.embed(text);
  }

  async embedBatch(texts: string[]): Promise<number[][]> {
    const results: number[][] = [];
    for await (const batch of this.delegate.embedDocuments(texts, 256)) {
      results.push(...batch);
    }
    return results;
  }

  async embedQuery(text: string): Promise<number[]> {
    return this.delegate.embedQuery(text);
  }
}

export function createCacheEmbedder(embedder: LocalEmbedder): FastembedEmbedder {
  return new FastembedEmbedder(embedder);
}
```
Now create the cache service that wires everything together. It instantiates @reaatech/llm-cache’s CacheEngine with an in-memory adapter and a Qdrant vector adapter.
Expected output: When a search query comes in, the service returned by createCacheService checks for semantically similar cached answers (cosine similarity threshold 0.85) before hitting the LLM, saving both cost and latency.
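The 0.85 threshold is easier to reason about with the arithmetic in front of you. The sketch below shows a semantic cache lookup in its simplest form: compute cosine similarity between the query vector and each cached vector, and return the best answer only if it clears the threshold. CacheEntry and findCachedAnswer are hypothetical names for this illustration; the real cache delegates the similarity search to Qdrant rather than scanning an array.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical cache entry: the embedding of an earlier query plus its answer.
interface CacheEntry {
  vector: number[];
  answer: string;
}

// Returns the answer of the most similar entry, or undefined on a cache miss.
function findCachedAnswer(
  queryVector: number[],
  entries: CacheEntry[],
  threshold = 0.85,
): string | undefined {
  let best: CacheEntry | undefined;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(queryVector, entry.vector);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.answer;
}
```

A lower threshold raises the hit rate but risks returning an answer for a different question; a higher one is safer but caches less. Treat 0.85 as a starting point to tune against real queries.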
Step 7: Create the Qdrant vector indexer
This service fetches products from Square, generates embeddings, and upserts them into a Qdrant collection. The collection is created automatically if it doesn’t exist.
Expected output: Calling runIndex() pulls all products from Square, embeds their descriptions into 384-dimensional vectors, and stores them in Qdrant’s product-catalog collection using cosine distance. getIndexStats() returns the total point count.
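The indexing flow described above (create the collection if needed, embed each product, upsert in batches) can be sketched against a small interface instead of the real Qdrant client. VectorStore, Product, Point, and runIndexSketch are all assumptions made for this illustration and do not mirror the Qdrant client's method names.

```typescript
// Hypothetical shapes for this sketch only.
interface Product {
  id: string;
  name: string;
  description: string;
}
interface Point {
  id: number;
  vector: number[];
  payload: { squareId: string; name: string };
}
// Minimal store interface; a real implementation would wrap the Qdrant client.
interface VectorStore {
  hasCollection(name: string): Promise<boolean>;
  createCollection(name: string, vectorSize: number): Promise<void>;
  upsert(name: string, points: Point[]): Promise<void>;
}

// Returns the number of points written.
async function runIndexSketch(
  products: Product[],
  embed: (text: string) => Promise<number[]>,
  store: VectorStore,
  collection = "product-catalog",
  batchSize = 64,
): Promise<number> {
  // Create the collection only on first run; 384 matches the embedding size.
  if (!(await store.hasCollection(collection))) {
    await store.createCollection(collection, 384);
  }
  let indexed = 0;
  for (let start = 0; start < products.length; start += batchSize) {
    const batch = products.slice(start, start + batchSize);
    const points = await Promise.all(
      batch.map(async (product, offset): Promise<Point> => ({
        // Positional IDs keep the sketch short; a real indexer would want
        // IDs that stay stable across re-index runs.
        id: start + offset,
        vector: await embed(`${product.name}. ${product.description}`),
        payload: { squareId: product.id, name: product.name },
      })),
    );
    await store.upsert(collection, points);
    indexed += points.length;
  }
  return indexed;
}
```

Keeping the Square ID in the payload lets the search step map a vector hit back to a catalog item without a second lookup table.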
Step 8: Create the Gemini model router with cost-aware routing
This service uses @reaatech/llm-router-engine to select between Gemini 2.5 Flash (cheap, fast) and Gemini 2.5 Pro (expensive, capable) based on query complexity. It also handles the actual LLM call via @google/genai.
createRouter from @reaatech/llm-router-engine handles model selection via a cost-optimized strategy — simple queries hit Flash ($0.15/M tokens), complex ones fall through to Pro ($1.25/M tokens).
GoogleGenAI is configured with vertexai: true (plus the project and location from config) to use Vertex AI rather than the public Gemini API.
Token counts from usageMetadata are returned so you can track costs.
Expected output: routeQuery("red sneakers") returns a RouterRouteSummary with model.id set to "gemini-2.5-flash" and result.content containing the generated answer.
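To make the cost-aware idea concrete, here is a rough sketch of a complexity heuristic and a cost estimate. It is not how @reaatech/llm-router-engine decides; pickModel, its word-count cutoff, and its keyword list are invented for illustration, and the per-million-token prices are simply the figures quoted in this step.

```typescript
type ModelId = "gemini-2.5-flash" | "gemini-2.5-pro";

// Prices per million tokens as quoted in this recipe; treat them as illustrative.
const PRICE_PER_MILLION_USD: Record<ModelId, number> = {
  "gemini-2.5-flash": 0.15,
  "gemini-2.5-pro": 1.25,
};

// Hypothetical heuristic: long or comparative questions go to the larger model.
function pickModel(query: string): ModelId {
  const words = query.trim().split(/\s+/).filter(Boolean);
  const looksComparative = /\b(compare|versus|vs|difference|recommend|why)\b/i.test(query);
  return words.length > 12 || looksComparative
    ? "gemini-2.5-pro"
    : "gemini-2.5-flash";
}

// Estimates spend from a total token count, such as the one in usageMetadata.
function estimateCostUsd(model: ModelId, totalTokens: number): number {
  return (totalTokens / 1_000_000) * PRICE_PER_MILLION_USD[model];
}
```

At these prices, one million tokens on Pro costs roughly what eight million cost on Flash, which is why keeping simple lookups such as "red sneakers" on Flash matters at volume.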
Step 9: Create the session continuity service
This service manages multi-turn conversations, keeping track of message history and enforcing token budgets so long conversations don’t exceed context limits.
Create src/services/session-service.ts:
```ts
import {
  SessionManager,
  type Session,
  type Message,
  type IStorageAdapter,
  type TokenCounter,
  type SessionManagerConfig,
  type SessionId,
  type UpdateSessionOptions,
  type SessionFilters,
  type MessageQueryOptions,
  type HealthStatus,
  type ConversationContextResult,
} from "@reaatech/session-continuity";

export class InMemoryStorageAdapter implements IStorageAdapter {
  private sessions: Map<string, Session> = new Map();
  private messages:
```
The SessionManager enforces a 4096-token budget with 500 reserved tokens. When the budget is exceeded, it compresses the conversation using a sliding window strategy to keep recent messages and drop older ones.
Expected output: createSessionService() returns a service that can create sessions, add messages, retrieve context for LLM prompts, and end sessions cleanly.
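The sliding-window behaviour is simple enough to sketch by hand. The version below walks the history from newest to oldest and keeps messages until the usable budget (4096 minus 500 reserved tokens) is spent. ChatMessage, estimateTokens, and slidingWindow are assumptions for this sketch, and the four-characters-per-token estimate is a rough heuristic, not the package's TokenCounter.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic (about 4 characters per token); real tokenizers differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the most recent messages that fit in (maxTokens - reservedTokens),
// dropping older ones first. The reserve leaves room for the model's reply.
function slidingWindow(
  messages: ChatMessage[],
  maxTokens = 4096,
  reservedTokens = 500,
): ChatMessage[] {
  const budget = maxTokens - reservedTokens;
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > budget) break; // everything older is dropped too
    kept.unshift(messages[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

Stopping at the first message that does not fit keeps the retained history contiguous, so the model never sees a reply without the question that preceded it further down the window.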
Step 10: Create the RAG search service
This is the orchestrator. It ties together the cache, embedding, Qdrant search, model routing, and session history into a single search() method.
Create src/services/search-service.ts:
```ts
import { QdrantClient } from "@qdrant/js-client-rest";
import type {
  SearchQuery,
  SearchResponse,
  SearchResult,
} from "../types.js";
import type { CatalogCacheService } from "./cache-service.js";
import type { Config } from "./config.js";
import type { GeminiRouterService } from "./router-service.js";
import type { ConversationSessionService } from "./session-service.js";
import type { LocalEmbedder } from "./embedder.js";

export class ProductSearchService {
  private cacheService: CatalogCacheService
```
Expected output: The search() method runs the full RAG pipeline: check cache → embed query → vector search → build prompt with context and history → route to LLM → cache response → return answer. A cache hit skips directly to the return, saving ~2–5 seconds per repeated query.
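The pipeline order is the important part, so here it is as a compact sketch with every dependency injected as a plain function. The SearchDeps interface, the Hit and Answer shapes, and ragSearch are hypothetical stand-ins for the recipe's services, chosen so the control flow can be read (and tested) in isolation.

```typescript
interface Hit {
  name: string;
  score: number;
}
interface Answer {
  answer: string;
  cacheHit: boolean;
}
// Hypothetical dependency bundle; each member stands in for one service.
interface SearchDeps {
  cacheGet(query: string): Promise<string | undefined>;
  cacheSet(query: string, answer: string): Promise<void>;
  embed(query: string): Promise<number[]>;
  vectorSearch(vector: number[], limit: number): Promise<Hit[]>;
  generate(prompt: string): Promise<string>;
}

async function ragSearch(
  query: string,
  history: string[],
  deps: SearchDeps,
): Promise<Answer> {
  // 1. Cache check: a hit returns immediately and skips every later step.
  const cached = await deps.cacheGet(query);
  if (cached !== undefined) return { answer: cached, cacheHit: true };

  // 2. Embed the query and 3. retrieve the nearest products.
  const vector = await deps.embed(query);
  const hits = await deps.vectorSearch(vector, 5);

  // 4. Build a prompt from the retrieved context and the conversation history.
  const prompt = [
    "Answer using only these products:",
    ...hits.map((hit) => `- ${hit.name}`),
    "Conversation so far:",
    ...history,
    `Shopper: ${query}`,
  ].join("\n");

  // 5. Generate, then 6. cache the answer for the next similar query.
  const answer = await deps.generate(prompt);
  await deps.cacheSet(query, answer);
  return { answer, cacheHit: false };
}
```

Note that this sketch keys the cache on the exact query string for brevity; the recipe's cache matches on semantic similarity instead.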
Step 11: Create the observability layer
Langfuse traces let you monitor search quality, cache hit rates, model selection decisions, and latency.
Create src/lib/observability.ts:
```ts
import { Langfuse } from "langfuse";
import type { Config } from "../services/config.js";
import type { SearchResponse, IndexingStatus } from "../types.js";

class NoopLangfuse extends Langfuse {
  constructor() {
    super({ enabled: false });
  }
}

export function initObservability(config: Config): Langfuse {
  if (config.LANGFUSE_PUBLIC_KEY && config.LANGFUSE_SECRET_KEY) {
    return new Langfuse({
      publicKey: config.LANGFUSE_PUBLIC_KEY,
      secretKey: config.LANGFUSE_SECRET_KEY,
      baseUrl: config.LANGFUSE_HOST ?? "https://cloud.langfuse.com",
    });
  }
  return new NoopLangfuse();
}

export function traceSearch(
  query: string,
  response: SearchResponse,
  langfuse: Langfuse,
): void {
  langfuse.trace({
    name: "product-search",
    input: query,
    output: response,
    metadata: { cacheHit: response.cacheHit, modelUsed: response.modelUsed },
  });
}

export function traceIndexing(
  status: IndexingStatus,
  productCount: number,
  langfuse: Langfuse,
): void {
  langfuse.trace({
    name: "product-index",
    input: { status, productCount },
    output: { status },
  });
}

export function createObservability(config: Config): {
  langfuse: Langfuse;
  traceSearch: typeof traceSearch;
  traceIndexing: typeof traceIndexing;
} {
  const langfuse = initObservability(config);
  return { langfuse, traceSearch, traceIndexing };
}
```
The NoopLangfuse stub extends the real Langfuse class with observability disabled. This means code can always call langfuse.trace(...) without guarding — when Langfuse keys aren’t configured, the calls are no-ops.
Now set up instrumentation.ts so Langfuse is initialized at Next.js startup:
Expected output: On server startup, register() runs, validates all env vars, and initializes Langfuse (or the no-op stub). Every search and indexing operation can be traced without extra setup.
Step 12: Create the API route handlers
Build the route handlers one by one; the table at the end of this step lists every endpoint they expose.
Note the lazy singleton pattern: services are initialized on the first request and cached. This avoids the cold-start cost of loading fastembed’s model on every request.
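A minimal sketch of that lazy singleton, assuming a hypothetical Services bundle and initServices function: the promise itself is cached, so concurrent first requests share one initialization instead of each loading the model.

```typescript
// Hypothetical bundle of long-lived services (embedder, clients, and so on).
interface Services {
  startedAt: number;
}

let servicesPromise: Promise<Services> | undefined;
let initCount = 0; // only here to show that initialization runs once

// Stand-in for the expensive setup, such as loading the embedding model.
async function initServices(): Promise<Services> {
  initCount += 1;
  return { startedAt: Date.now() };
}

// Route handlers call this; only the first call triggers initialization.
function getServices(): Promise<Services> {
  if (!servicesPromise) {
    servicesPromise = initServices().catch((err: unknown) => {
      // Clear the cached promise so a later request can retry after a failure.
      servicesPromise = undefined;
      throw err;
    });
  }
  return servicesPromise;
}
```

Caching the promise rather than the resolved value is the design choice that matters: two requests arriving during startup both await the same promise, so the setup code never runs twice.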
Expected output: After creating these files, pnpm dev starts Next.js on port 3000 and the following endpoints respond:
| Endpoint | Purpose |
| --- | --- |
| POST /api/search | Natural-language product search with RAG |
| POST /api/index | Trigger full product re-index |
| GET /api/index | Get index stats (point count) |
| POST /api/session | Create a new conversation session |
| GET /api/session?sessionId=... | Get session message history |
| DELETE /api/session | Delete a session |
Step 13: Write tests
The test suite covers route handlers, services, and the observability layer. All external APIs (Square, Qdrant, Gemini, Langfuse) are mocked so tests run entirely offline. Here are the key patterns.
Next steps
Add a frontend chatbot UI — create an app/page.tsx that renders a chat interface calling the POST /api/search endpoint with user messages and session IDs
Add product re-index scheduling — use cron or a Vercel Cron Job to call POST /api/index nightly so new Square products are automatically indexed
Add webhook integration: subscribe to Square’s catalog.version.updated webhook, then call the cache’s invalidateProductCache() and trigger a re-index so cached answers and the vector index stay fresh in near real time
Replace in-memory session storage — swap the InMemoryStorageAdapter for a Redis- or Postgres-backed adapter so sessions survive server restarts
Add A/B model comparison — route the same query through both Flash and Pro in a shadow mode, compare answers in Langfuse, and tune the routing strategy