# @reaatech/hybrid-rag-ingestion

> **Status:** Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Multi-format document loading, preprocessing, validation, and four configurable chunking strategies for hybrid RAG systems. Supports PDF, Markdown, HTML, and plain text with deterministic chunk ID generation.
## Installation

```bash
npm install @reaatech/hybrid-rag-ingestion
# or
pnpm add @reaatech/hybrid-rag-ingestion
```
## Feature Overview

- **Multi-format loading** — PDF, Markdown, HTML, and plain text with automatic format detection
- **Text preprocessing** — Unicode normalization, whitespace normalization, special character handling
- **Document validation** — duplicate detection via content hashing, file size limits, format verification
- **Four chunking strategies** — Fixed-Size, Semantic, Recursive, Sliding Window
- **Deterministic chunk IDs** — reproducible IDs derived from the document ID and chunk index
- **Chunking benchmarks** — compare strategies on your own documents using measured quality metrics
- **Typed errors** — `UnsupportedFormatError`, `FileSizeExceededError`, `DocumentParseError`
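The deterministic-ID guarantee above can be sketched in a few lines: hashing the document ID together with the chunk index yields IDs that are stable across re-ingestion of the same document. This is an illustrative scheme only; `deterministicChunkId` is a hypothetical helper, not the package's actual implementation.

```typescript
import { createHash } from 'node:crypto';

// One plausible scheme for deterministic chunk IDs: hash "docId:index" so the
// same document chunked the same way always produces the same IDs.
function deterministicChunkId(docId: string, chunkIndex: number): string {
  return createHash('sha256')
    .update(`${docId}:${chunkIndex}`)
    .digest('hex')
    .slice(0, 16); // short, stable prefix is enough for uniqueness per document
}
```

Because the ID depends only on its inputs, re-running ingestion never orphans vectors that were indexed under a previous run's IDs.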
## Quick Start

```typescript
import {
  DocumentLoader,
  TextPreprocessor,
  DocumentValidator,
  chunkDocument,
  ChunkingStrategy,
} from '@reaatech/hybrid-rag-ingestion';

// Load a document
const loader = new DocumentLoader({ allowedFormats: ['pdf', 'md', 'html', 'txt'] });
const doc = await loader.loadFile('./docs/report.pdf');
console.log(`Loaded: ${doc.id}, ${doc.content.length} chars`);

// Validate
const validator = new DocumentValidator({ maxFileSize: 10 * 1024 * 1024 }); // 10 MB
const validation = validator.validate(doc);

// Chunk
const chunks = await chunkDocument(
  doc.content,
  doc.id,
  {
    strategy: ChunkingStrategy.SEMANTIC,
    chunkSize: 512,
    overlap: 50,
    similarityThreshold: 0.5,
  },
  doc.metadata,
);
```
## API Reference

### Document Loading

#### DocumentLoader

| Constructor Option | Type | Default | Description |
| --- | --- | --- | --- |
| `allowedFormats` | `string[]` | `['pdf', 'md', 'html', 'txt']` | Whitelist of accepted formats |

| Method | Returns | Description |
| --- | --- | --- |
| `loadFile(filePath)` | `Document` | Load and parse a single file |
| `loadDirectory(dirPath)` | `Document[]` | Load all supported files in a directory |
#### Custom Errors

| Error | When |
| --- | --- |
| `UnsupportedFormatError` | File format not in `allowedFormats` |
| `FileSizeExceededError` | File exceeds the `maxFileSize` limit |
| `DocumentParseError` | Parse failure for the detected format |
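The typed errors can be told apart with ordinary `instanceof` checks. A minimal sketch, assuming the error classes are exported from the package root (the file path is illustrative):

```typescript
import {
  DocumentLoader,
  UnsupportedFormatError,
  FileSizeExceededError,
  DocumentParseError,
} from '@reaatech/hybrid-rag-ingestion';

const loader = new DocumentLoader({ allowedFormats: ['pdf', 'md'] });

try {
  const doc = await loader.loadFile('./docs/report.pdf');
  console.log(`Loaded ${doc.id}`);
} catch (err) {
  if (err instanceof UnsupportedFormatError) {
    console.error('Format not in allowedFormats:', err.message);
  } else if (err instanceof FileSizeExceededError) {
    console.error('File too large:', err.message);
  } else if (err instanceof DocumentParseError) {
    console.error('Could not parse document:', err.message);
  } else {
    throw err; // not an ingestion error; let it propagate
  }
}
```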
### Preprocessing

#### TextPreprocessor

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `normalizeUnicode` | `boolean` | `true` | Normalize to NFC form |
| `normalizeWhitespace` | `boolean` | `true` | Collapse multiple spaces, normalize newlines |
| `removeControlChars` | `boolean` | `true` | Strip non-printable control characters |
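A usage sketch for these options. Note that only the constructor options are documented here, so the `process` method name below is an assumption for illustration; check the package's typings for the actual entry point.

```typescript
import { TextPreprocessor } from '@reaatech/hybrid-rag-ingestion';

// Options shown are the documented defaults, spelled out for clarity.
const preprocessor = new TextPreprocessor({
  normalizeUnicode: true,    // NFC normalization (e.g. "e" + combining accent → "é")
  normalizeWhitespace: true, // collapse runs of spaces, normalize newlines
  removeControlChars: true,  // strip non-printable control characters
});

// `process` is a hypothetical method name, not confirmed by this README.
const clean = preprocessor.process(rawText);
```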
### Validation

#### DocumentValidator

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxFileSize` | `number` | `10 * 1024 * 1024` | Maximum file size in bytes |
| `minContentLength` | `number` | `1` | Minimum document content length |

#### ValidationResult

| Property | Type | Description |
| --- | --- | --- |
| `valid` | `boolean` | Whether the document passed all checks |
| `errors` | `string[]` | List of validation error messages |
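Putting `DocumentValidator` and `ValidationResult` together, a minimal sketch (the file path and the tightened limits are illustrative):

```typescript
import { DocumentLoader, DocumentValidator } from '@reaatech/hybrid-rag-ingestion';

const loader = new DocumentLoader();
const validator = new DocumentValidator({
  maxFileSize: 5 * 1024 * 1024, // tighten the default 10 MB limit to 5 MB
  minContentLength: 100,        // reject near-empty documents
});

const doc = await loader.loadFile('./docs/report.pdf');
const result = validator.validate(doc);

if (!result.valid) {
  // `errors` holds a message per failed check, so log them all at once.
  console.error(`Validation failed for ${doc.id}:`, result.errors);
}
```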
### Chunking Strategies

All strategies produce `Chunk[]` with deterministic IDs.

#### Fixed-Size

Splits by token count, word count, or character count with configurable overlap.
```typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.FIXED_SIZE,
  chunkSize: 512, // tokens
  overlap: 50,
});
```

| Parameter | Description |
| --- | --- |
| `chunkSize` | Target size in tokens |
| `overlap` | Overlap between consecutive chunks in tokens |
#### Semantic

Splits at topic boundaries using sentence-level similarity. Best for long-form content.

```typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.SEMANTIC,
  chunkSize: 512,
  overlap: 50,
  similarityThreshold: 0.5,
});
```

| Parameter | Description |
| --- | --- |
| `similarityThreshold` | Minimum similarity for boundary detection (0–1) |
#### Recursive

Hierarchical splitting: headers → paragraphs → sentences. Best for structured documents.

```typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.RECURSIVE,
  chunkSize: 512,
  separators: ['\n## ', '\n', '. '],
});
```

| Parameter | Description |
| --- | --- |
| `separators` | Splitting delimiters in priority order |
#### Sliding Window

Fixed-size window moving by a configurable stride. Best for dense retrieval scenarios.

```typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.SLIDING_WINDOW,
  windowSize: 512,
  stride: 256,
});
```

| Parameter | Description |
| --- | --- |
| `windowSize` | Size of each window in tokens |
| `stride` | Step size between windows in tokens |
### Chunking Engine

#### ChunkingEngine

Orchestrator that routes to the correct strategy:

| Method | Description |
| --- | --- |
| `chunkDocument(content, docId, config, metadata?)` | Main entry point — returns `Chunk[]` |
| `chunkBatch(documents, config)` | Process multiple documents in sequence |
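`chunkBatch` pairs naturally with `loadDirectory` for whole-corpus ingestion. A sketch, assuming `ChunkingEngine` is exported from the package root and that the directory path is illustrative:

```typescript
import {
  DocumentLoader,
  ChunkingEngine,
  ChunkingStrategy,
} from '@reaatech/hybrid-rag-ingestion';

const loader = new DocumentLoader();
const engine = new ChunkingEngine();

// Load every supported file in the directory, then chunk them in sequence.
const documents = await loader.loadDirectory('./docs');
const chunks = await engine.chunkBatch(documents, {
  strategy: ChunkingStrategy.RECURSIVE,
  chunkSize: 512,
});
```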
#### ChunkingBenchmark

Compare strategies head-to-head:

```typescript
import { ChunkingBenchmark, ChunkingStrategy } from '@reaatech/hybrid-rag-ingestion';

const benchmark = new ChunkingBenchmark();
const results = await benchmark.benchmark(documents, [
  { name: 'fixed-512', config: { strategy: ChunkingStrategy.FIXED_SIZE, chunkSize: 512, overlap: 50 } },
  { name: 'semantic-512', config: { strategy: ChunkingStrategy.SEMANTIC, chunkSize: 512, overlap: 50 } },
]);

console.table(results.map(r => ({ name: r.name, chunkCount: r.chunkCount, avgTokens: r.avgTokens })));
```
## Related Packages

- `@reaatech/hybrid-rag` — Core types (`Document`, `Chunk`, `ChunkingConfig`)
- `@reaatech/hybrid-rag-retrieval` — Retrieval engines (consume chunks)
- `@reaatech/hybrid-rag-pipeline` — `RAGPipeline` orchestrator
## License

MIT