# OpenAI Knowledge Agent for Confluence SMB Internal Wiki
 
> A natural-language Q&A bot that indexes Confluence spaces and delivers instant answers to employee questions.
 
A tutorialized reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Architecture
 
The system consists of two pipelines, **ingestion** and **chat**, connected by a shared Qdrant vector store.
 
### Ingestion Pipeline
 
```
Confluence REST API → HTML → Markdown (node-html-markdown) → validate (agents-markdown)
→ chunk → embed (OpenAI text-embedding-3-small) → store (Qdrant)
```
 
1. **Fetch**: `fetchAllPages` crawls the configured Confluence spaces via the REST API, authenticating with basic auth and following pagination
2. **Convert**: `htmlToMarkdown` uses `node-html-markdown` to translate Confluence's HTML storage format to Markdown
3. **Validate**: `validatePageContent` checks content quality via `@reaatech/agents-markdown`
4. **Chunk**: `chunkDocument` splits on paragraphs, then merges them into chunks of up to 512 tokens
5. **Embed**: `generateEmbedding` calls OpenAI `text-embedding-3-small` (1536 dimensions)
6. **Store**: `ensureCollection` and `upsertChunks` write the chunks to Qdrant with page metadata
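
The chunking step is the easiest one to get subtly wrong, so here is a minimal sketch of paragraph-based merging under a token budget. It is not the repository's `chunkDocument`: the names `estimateTokens` and `chunkParagraphs` are hypothetical, and the estimate of roughly four characters per token is an assumption standing in for a real tokenizer.

```typescript
// Rough token estimate: about 4 characters per token.
// This is an assumption for illustration, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split Markdown on blank lines, then greedily merge paragraphs
// until adding the next one would exceed the token budget.
function chunkParagraphs(markdown: string, maxTokens = 512): string[] {
  const paragraphs = markdown
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && estimateTokens(candidate) > maxTokens) {
      // Budget exceeded: close the current chunk and start a new one.
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Note that a single paragraph larger than the budget is emitted as its own oversized chunk in this sketch; a production version would likely split it further.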
 
### Chat Pipeline
 
```
user query → classify (confidence-router) → cache-check (llm-cache)
→ retrieve (Qdrant) → generate (OpenAI Responses API) → respond
```
 
1. **Classify**: `routeQuery` evaluates prediction confidence against thresholds
2. **Cache check**: `checkCache` looks up the exact-match (SHA-256) cache first, then the semantic (cosine similarity) cache
3. **Retrieve**: `generateEmbedding` embeds the query, then `searchChunks` fetches the top 5 chunks from Qdrant
4. **Augment**: retrieved chunks and conversation history form the LLM prompt
5. **Generate**: `generateAnswer` calls the OpenAI Responses API (`gpt-5.2`)
6. **Cache & trace**: the response is cached and a Langfuse trace is recorded
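
To make the two-tier cache check concrete, the sketch below shows one way an exact-match lookup keyed by SHA-256 can fall back to a cosine-similarity comparison. It does not use or mirror the `@reaatech/llm-cache` API: `exactKey`, `cosineSimilarity`, `lookupCached`, the in-memory `CacheEntry` array, the query normalization, and the `0.92` similarity threshold are all assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

interface CacheEntry {
  key: string;         // SHA-256 of the normalized query
  embedding: number[]; // embedding of the original query
  answer: string;
}

// Exact-match key. Trimming and lowercasing are assumptions about normalization.
function exactKey(query: string): string {
  return createHash("sha256").update(query.trim().toLowerCase()).digest("hex");
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Try the exact key first, then the most similar entry above the threshold.
function lookupCached(
  entries: CacheEntry[],
  query: string,
  queryEmbedding: number[],
  minSimilarity = 0.92, // hypothetical threshold
): string | undefined {
  const key = exactKey(query);
  const exact = entries.find((entry) => entry.key === key);
  if (exact) return exact.answer;

  let best: CacheEntry | undefined;
  let bestScore = minSimilarity;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, queryEmbedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.answer;
}
```

Checking the exact key first avoids an embedding comparison for repeated questions; the semantic pass only matters for rephrasings.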
 
### Low-Confidence Fallback
 
When the confidence router scores a query below the `fallbackThreshold`, the `@reaatech/agent-handoff` package escalates to a human search experience instead of letting the model guess.
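
The routing decision can be pictured as a simple threshold comparison. The sketch below is not the `@reaatech/confidence-router` API: `decideRoute`, the `answerThreshold` field, and the intermediate `"clarify"` band are hypothetical; only `fallbackThreshold` is named in this README.

```typescript
type RouteDecision = "answer" | "clarify" | "fallback";

interface Thresholds {
  answerThreshold: number;   // hypothetical: at or above this, answer directly
  fallbackThreshold: number; // below this, hand off instead of guessing
}

function decideRoute(confidence: number, thresholds: Thresholds): RouteDecision {
  if (confidence < thresholds.fallbackThreshold) return "fallback";
  if (confidence >= thresholds.answerThreshold) return "answer";
  // Middle band: an assumption in this sketch, e.g. ask a follow-up question.
  return "clarify";
}

// Example values, chosen only for illustration.
const exampleThresholds: Thresholds = { answerThreshold: 0.8, fallbackThreshold: 0.4 };
console.log(decideRoute(0.3, exampleThresholds)); // "fallback"
```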
 
## Packages
 
| Package | Role |
|---|---|
| `openai` v6 | OpenAI SDK — Responses API and embeddings |
| `@qdrant/js-client-rest` | Qdrant REST client — vector storage and search |
| `node-html-markdown` | Fast HTML-to-Markdown conversion |
| `zod` v4 | Runtime schema validation for environment config |
| `langfuse` | LLM observability — traces and generations |
| `@reaatech/confidence-router` | Threshold-based routing decision engine |
| `@reaatech/llm-cache` | Exact + semantic LLM response caching |
| `@reaatech/session-continuity` | Session lifecycle, token budgets, and context compression |
| `@reaatech/agent-memory-core` | Memory types, cosine similarity, retry utility |
| `@reaatech/agent-handoff` | Handoff payloads, retry, escalation types |
| `@reaatech/agents-markdown` | Validation types and shared utilities (`randomId`) |
 
## Environment Variables
 
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key |
| `CONFLUENCE_BASE_URL` | Confluence instance URL (e.g. `https://your-instance.atlassian.net/wiki`) |
| `CONFLUENCE_USERNAME` | Confluence username (email) |
| `CONFLUENCE_API_TOKEN` | Confluence API token |
| `CONFLUENCE_SPACE_KEYS` | Comma-separated space keys to index |
| `QDRANT_URL` | Qdrant server URL (e.g. `http://127.0.0.1:6333`) |
| `QDRANT_API_KEY` | Qdrant API key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key |
| `LANGFUSE_HOST` | Langfuse host (e.g. `https://cloud.langfuse.com`) |
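
The project validates its environment with `zod`; the dependency-free sketch below only illustrates the idea of failing fast on missing variables and splitting the comma-separated space keys. The helpers `loadEnv` and `parseSpaceKeys` are hypothetical, and treating every variable as required is an assumption (for example, a local Qdrant instance may not need an API key).

```typescript
const REQUIRED_VARS = [
  "OPENAI_API_KEY",
  "CONFLUENCE_BASE_URL",
  "CONFLUENCE_USERNAME",
  "CONFLUENCE_API_TOKEN",
  "CONFLUENCE_SPACE_KEYS",
  "QDRANT_URL",
  "QDRANT_API_KEY",
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
  "LANGFUSE_HOST",
] as const;

type EnvKey = (typeof REQUIRED_VARS)[number];
type AppEnv = Record<EnvKey, string>;

// Collect every missing variable so the error lists them all at once.
function loadEnv(source: Record<string, string | undefined>): AppEnv {
  const missing = REQUIRED_VARS.filter((name) => !source[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const env = {} as AppEnv;
  for (const name of REQUIRED_VARS) {
    env[name] = source[name] as string;
  }
  return env;
}

// "ENG, HR ,,OPS" -> ["ENG", "HR", "OPS"]
function parseSpaceKeys(raw: string): string[] {
  return raw
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);
}
```

In a Node.js process this would typically be called once at startup as `loadEnv(process.env)`.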
 
## Quick Start
 
```bash
pnpm install
pnpm typecheck
pnpm test
pnpm dev             # next dev — starts http://localhost:3000
```
 
### Running Ingestion
 
```bash
pnpm tsx src/jobs/ingest.ts
```
 
### Using the Chat API
 
```bash
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is our PTO policy?", "userId": "employee-42"}'
```
 
Response:
 
```json
{
  "answer": "Our PTO policy allows …",
  "sources": [{ "pageId": "page1", "score": 0.95 }],
  "sessionId": "abc-123",
  "type": "answer"
}
```
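
For callers written in TypeScript, a small typed client can guard against unexpected response shapes. This is a sketch based only on the request and response shown above: `ChatResponse`, `isChatResponse`, and `askWiki` are hypothetical names, and the `type` field is kept as a plain string because only the `"answer"` value appears in this README.

```typescript
interface ChatSource {
  pageId: string;
  score: number;
}

interface ChatResponse {
  answer: string;
  sources: ChatSource[];
  sessionId: string;
  type: string;
}

function isChatSource(value: unknown): value is ChatSource {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["pageId"] === "string" && typeof v["score"] === "number";
}

// Runtime check that a parsed JSON body matches the documented response shape.
function isChatResponse(value: unknown): value is ChatResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const sources = v["sources"];
  return (
    typeof v["answer"] === "string" &&
    typeof v["sessionId"] === "string" &&
    typeof v["type"] === "string" &&
    Array.isArray(sources) &&
    sources.every(isChatSource)
  );
}

// POSTs a question to the chat endpoint and validates the reply.
async function askWiki(
  query: string,
  userId: string,
  baseUrl = "http://localhost:3000",
): Promise<ChatResponse> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, userId }),
  });
  if (!res.ok) throw new Error(`Chat API returned HTTP ${res.status}`);
  const body: unknown = await res.json();
  if (!isChatResponse(body)) throw new Error("Unexpected chat response shape");
  return body;
}
```

With the dev server running, `await askWiki("What is our PTO policy?", "employee-42")` would issue the same request as the `curl` command above.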
 
## License
 
MIT — see [LICENSE](./LICENSE).