@reaatech/hybrid-rag-ingestion

npm v0.1.0

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Multi-format document loading, preprocessing, validation, and four configurable chunking strategies for hybrid RAG systems. Supports PDF, Markdown, HTML, and plain text with deterministic chunk ID generation.

Installation

terminal
npm install @reaatech/hybrid-rag-ingestion
# or
pnpm add @reaatech/hybrid-rag-ingestion

Feature Overview

  • Multi-format loading — PDF, Markdown, HTML, and plain text with automatic format detection
  • Text preprocessing — Unicode normalization, whitespace normalization, special character handling
  • Document validation — duplicate detection via content hashing, file size limits, format verification
  • Four chunking strategies — Fixed-Size, Semantic, Recursive, Sliding Window
  • Deterministic chunk IDs — reproducible IDs based on document ID + chunk index
  • Chunking benchmarks — compare strategies on your documents with measured quality
  • Typed errors — UnsupportedFormatError, FileSizeExceededError, DocumentParseError
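The deterministic-ID guarantee means re-ingesting the same document always yields the same chunk IDs, so upserts into a vector store never duplicate. The package does not document its exact scheme; a minimal sketch of the "document ID + chunk index" idea (the hypothetical `chunkId` helper below is illustrative, not the library's implementation):

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch: hash the document ID together with the chunk index
// to get a short, reproducible identifier.
function chunkId(docId: string, index: number): string {
  return createHash('sha256')
    .update(`${docId}:${index}`)
    .digest('hex')
    .slice(0, 16); // short, stable prefix is enough for uniqueness per doc
}

// Re-running ingestion over the same document yields the same IDs.
const a = chunkId('report.pdf#abc123', 0);
const b = chunkId('report.pdf#abc123', 0);
console.log(a === b); // true
```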

Quick Start

typescript
import {
  DocumentLoader,
  TextPreprocessor,
  DocumentValidator,
  chunkDocument,
  ChunkingStrategy,
} from '@reaatech/hybrid-rag-ingestion';
 
// Load a document
const loader = new DocumentLoader({ allowedFormats: ['pdf', 'md', 'html', 'txt'] });
const doc = await loader.loadFile('./docs/report.pdf');
console.log(`Loaded: ${doc.id}, ${doc.content.length} chars`);
 
// Validate
const validator = new DocumentValidator({ maxFileSize: 10 * 1024 * 1024 }); // 10MB
const validation = validator.validate(doc);
 
// Chunk
const chunks = await chunkDocument(
  doc.content,
  doc.id,
  {
    strategy: ChunkingStrategy.SEMANTIC,
    chunkSize: 512,
    overlap: 50,
    similarityThreshold: 0.5,
  },
  doc.metadata,
);

API Reference

Document Loading

DocumentLoader

| Constructor Option | Type | Default | Description |
| --- | --- | --- | --- |
| allowedFormats | string[] | ['pdf','md','html','txt'] | Whitelist of accepted formats |

| Method | Returns | Description |
| --- | --- | --- |
| loadFile(filePath) | Document | Load and parse a single file |
| loadDirectory(dirPath) | Document[] | Load all supported files in a directory |

Custom Errors

| Error | When |
| --- | --- |
| UnsupportedFormatError | File format not in allowedFormats |
| FileSizeExceededError | File exceeds maxFileSize limit |
| DocumentParseError | Parse failure for the detected format |
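Because the errors are typed classes, you can branch on them with instanceof rather than string-matching messages. A sketch, assuming the error classes are exported from the package root (the docs don't state the export path explicitly):

```typescript
import {
  DocumentLoader,
  UnsupportedFormatError,
  DocumentParseError,
} from '@reaatech/hybrid-rag-ingestion';

const loader = new DocumentLoader({ allowedFormats: ['pdf', 'md'] });

try {
  const doc = await loader.loadFile('./docs/slides.pptx');
} catch (err) {
  if (err instanceof UnsupportedFormatError) {
    // .pptx is not in allowedFormats — skip it or convert upstream
  } else if (err instanceof DocumentParseError) {
    // the file matched an allowed format but could not be parsed
  } else {
    throw err; // unknown failure: rethrow
  }
}
```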

Preprocessing

TextPreprocessor

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| normalizeUnicode | boolean | true | Normalize to NFC form |
| normalizeWhitespace | boolean | true | Collapse multiple spaces, normalize newlines |
| removeControlChars | boolean | true | Strip non-printable control characters |
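These three defaults roughly correspond to standard JavaScript string operations. A self-contained sketch of the equivalent behavior (TextPreprocessor's internals may differ in details, e.g. which control characters it strips):

```typescript
// Rough equivalent of the three default options, for illustration only.
function preprocess(text: string): string {
  return text
    .normalize('NFC')                                               // normalizeUnicode
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '') // removeControlChars (keeps \t, \n, \r)
    .replace(/\r\n?/g, '\n')                                        // normalizeWhitespace: normalize newlines
    .replace(/[ \t]+/g, ' ');                                       // normalizeWhitespace: collapse runs of spaces
}

// Decomposed "a" + combining acute becomes precomposed "á";
// double spaces collapse; CRLF becomes LF.
console.log(preprocess('a\u0301  b\r\nc')); // "á b\nc"
```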

Validation

DocumentValidator

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| maxFileSize | number | 10 * 1024 * 1024 | Max file size in bytes |
| minContentLength | number | 1 | Minimum document content length |

ValidationResult

| Property | Type | Description |
| --- | --- | --- |
| valid | boolean | Whether the document passed all checks |
| errors | string[] | List of validation error messages |
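A pipeline typically gates on valid and logs errors for rejected documents. A self-contained sketch using the documented ValidationResult shape (the gating logic here is application code, not part of the package):

```typescript
// ValidationResult as documented above.
interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// Keep only documents that passed validation; log why the rest were dropped.
function gate(results: Map<string, ValidationResult>): string[] {
  const accepted: string[] = [];
  for (const [docId, result] of results) {
    if (result.valid) {
      accepted.push(docId);
    } else {
      console.warn(`Skipping ${docId}: ${result.errors.join('; ')}`);
    }
  }
  return accepted;
}
```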

Chunking Strategies

All strategies produce Chunk[] with deterministic IDs.

Fixed-Size

Splits by token count, word count, or character count with configurable overlap.

typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.FIXED_SIZE,
  chunkSize: 512,  // tokens
  overlap: 50,
});

| Parameter | Description |
| --- | --- |
| chunkSize | Target size in tokens |
| overlap | Overlap between consecutive chunks in tokens |

Semantic

Splits at topic boundaries using sentence-level similarity. Best for long-form content.

typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.SEMANTIC,
  chunkSize: 512,
  overlap: 50,
  similarityThreshold: 0.5,
});

| Parameter | Description |
| --- | --- |
| similarityThreshold | Minimum similarity for boundary detection (0–1) |

Recursive

Hierarchical splitting: headers → paragraphs → sentences. Best for structured documents.

typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.RECURSIVE,
  chunkSize: 512,
  separators: ['\n## ', '\n', '. '],
});

| Parameter | Description |
| --- | --- |
| separators | Splitting delimiters in priority order |

Sliding Window

Fixed window moving by configurable stride. Best for dense retrieval scenarios.

typescript
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.SLIDING_WINDOW,
  windowSize: 512,
  stride: 256,
});

| Parameter | Description |
| --- | --- |
| windowSize | Size of each window in tokens |
| stride | Step size between windows in tokens |

Chunking Engine

ChunkingEngine

Orchestrator that routes to the correct strategy:

| Method | Description |
| --- | --- |
| chunkDocument(content, docId, config, metadata?) | Main entry point — returns Chunk[] |
| chunkBatch(documents, config) | Process multiple documents in sequence |
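A batch run over a directory might look like the sketch below. The exact return shape of chunkBatch (one flat array vs. one array per document) is not documented here, so treat that as an assumption:

```typescript
import { ChunkingEngine, ChunkingStrategy, DocumentLoader } from '@reaatech/hybrid-rag-ingestion';

const loader = new DocumentLoader({ allowedFormats: ['pdf', 'md', 'html', 'txt'] });
const docs = await loader.loadDirectory('./docs');

// One config applied to every document in the batch.
const engine = new ChunkingEngine();
const results = await engine.chunkBatch(docs, {
  strategy: ChunkingStrategy.RECURSIVE,
  chunkSize: 512,
});
```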

ChunkingBenchmark

Compare strategies head-to-head:

typescript
import { ChunkingBenchmark } from '@reaatech/hybrid-rag-ingestion';
 
const benchmark = new ChunkingBenchmark();
const results = await benchmark.benchmark(documents, [
  { name: 'fixed-512', config: { strategy: ChunkingStrategy.FIXED_SIZE, chunkSize: 512, overlap: 50 } },
  { name: 'semantic-512', config: { strategy: ChunkingStrategy.SEMANTIC, chunkSize: 512, overlap: 50 } },
]);
 
console.table(results.map(r => ({ name: r.name, chunkCount: r.chunkCount, avgTokens: r.avgTokens })));

License

MIT