Configurable Chunk Size And Overlap Management

1

Docling (Repository, 56/100)

via “document chunking with semantic awareness and overlap control”

IBM's document converter: turns PDFs and DOCX files into structured Markdown, with OCR and table extraction.

Unique: Implements structure-aware chunking that respects document boundaries (paragraphs, sections, tables) instead of splitting at arbitrary character positions, with configurable overlap and boundary detection, so chunks stay semantically coherent for RAG systems

vs others: Produces semantically-coherent chunks by respecting document structure, whereas naive chunking tools split at arbitrary character boundaries; improves retrieval quality in RAG systems by preserving semantic units
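
As an illustration of the structure-respecting approach described for this entry, here is a minimal Python sketch that packs whole paragraphs into chunks and carries trailing paragraphs forward as overlap. It is not Docling's API: the function `chunk_by_paragraphs` and its parameters `max_chars` and `overlap_paragraphs` are hypothetical names, and blank lines are assumed to mark paragraph boundaries.

```python
import re


def chunk_by_paragraphs(text: str, max_chars: int = 200,
                        overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks (hypothetical helper, not Docling's API).

    A paragraph is never split; one that is longer than max_chars
    is kept whole, so max_chars is a soft limit.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraphs forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "aaaa\n\nbbbb\n\ncccc"
    # The middle paragraph appears in both chunks because of the overlap.
    print(chunk_by_paragraphs(sample, max_chars=10, overlap_paragraphs=1))
```

Because the overlap is measured in paragraphs rather than characters, the repeated context always starts and ends on a structural boundary.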

2

R2R (Repository, 51/100)

via “configurable chunking strategies with semantic awareness”

SoTA production-ready AI retrieval system. Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.

Unique: Supports multiple chunking strategies (fixed, semantic, code-aware) selectable via configuration, enabling optimization for different document types without code changes. Semantic chunking uses embeddings to identify natural breakpoints, preserving semantic units better than fixed-size windows.

vs others: More flexible than LangChain's fixed-size chunking because it supports semantic and code-aware strategies; more integrated than using external chunking libraries because strategy selection is built into R2R.
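
The idea of using embeddings to find natural breakpoints can be sketched as follows. This is not R2R's implementation: a toy bag-of-words vector stands in for a real embedding model, and the names `bag_of_words`, `cosine`, `semantic_groups`, and the `threshold` parameter are assumptions made for the example. A break is placed wherever two adjacent sentences are less similar than the threshold.

```python
import math
from collections import Counter
from typing import Callable


def bag_of_words(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_groups(sentences: list[str], threshold: float = 0.3,
                    embed: Callable[[str], Counter] = bag_of_words) -> list[list[str]]:
    """Start a new group whenever adjacent sentences fall below the threshold."""
    groups: list[list[str]] = []
    for i, sentence in enumerate(sentences):
        similar = i > 0 and cosine(embed(sentences[i - 1]), embed(sentence)) >= threshold
        if similar:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return groups


if __name__ == "__main__":
    print(semantic_groups([
        "cats chase mice", "cats eat mice",
        "stocks fell today", "stocks rose today",
    ]))
```

With a real embedding model the `embed` callable would return dense vectors, and the threshold would need tuning per corpus.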

3

mcp-local-rag (MCP Server, 42/100)

via “configurable-document-chunking-with-overlap”

Local RAG MCP Server - Easy-to-setup document search with minimal configuration

Unique: Maintains rich chunk metadata, including source offsets and document references, so results can be attributed precisely and clients can retrieve the full context around a search result when needed

vs others: More configurable than fixed-size splitting and more efficient than overlapping all documents, while providing better context preservation than non-overlapping chunks
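
To make the role of source offsets concrete, the sketch below stores a start and end offset with every chunk and uses them to pull extra context around a hit. It does not reflect mcp-local-rag's actual data model: the `Chunk` dataclass, `chunk_with_offsets`, and `surrounding_context` are hypothetical names, and chunking is by characters for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str  # document reference, e.g. a file path
    start: int   # inclusive character offset in the source text
    end: int     # exclusive character offset in the source text
    text: str


def chunk_with_offsets(source: str, text: str, size: int, overlap: int) -> list[Chunk]:
    """Fixed-size character chunks that remember where they came from."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        chunks.append(Chunk(source, start, end, text[start:end]))
        if end == len(text):
            break  # the tail is already covered
    return chunks


def surrounding_context(text: str, chunk: Chunk, margin: int) -> str:
    """Return the chunk plus up to `margin` characters on each side."""
    return text[max(0, chunk.start - margin):chunk.end + margin]


if __name__ == "__main__":
    doc = "abcdefghij"
    chunks = chunk_with_offsets("example.txt", doc, size=4, overlap=1)
    print([(c.start, c.end, c.text) for c in chunks])
    print(surrounding_context(doc, chunks[1], margin=2))
```

Keeping offsets rather than copies of neighboring text means context can be widened at query time without re-indexing.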

4

RAG-chunk – A CLI to test RAG chunking strategies (CLI Tool, 38/100)

via “sliding-window chunking with configurable stride”

Show HN: RAG-chunk – A CLI to test RAG chunking strategies

Unique: Provides explicit sliding-window implementation with independent control of window size and stride, enabling fine-grained tuning of chunk overlap and coverage without code modification

vs others: More flexible than fixed-size chunking for controlling overlap, and simpler to tune than semantic chunking while providing predictable chunk sizes
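
A sliding window with independent window size and stride can be sketched in a few lines. This is an illustration only, not RAG-chunk's code or CLI: `sliding_window`, `window`, and `stride` are assumed names, and the input is a pre-tokenized list. When `stride` is smaller than `window`, consecutive chunks overlap by `window - stride` tokens; when it is larger, tokens between windows are skipped.

```python
def sliding_window(tokens: list[str], window: int, stride: int) -> list[list[str]]:
    """Return windows of up to `window` tokens, advancing `stride` tokens each time."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    tokens = list("abcdefg")
    # stride < window: overlap of window - stride = 2 tokens
    print(sliding_window(tokens, window=4, stride=2))
    # stride > window: full coverage is lost ("c" and "f" are skipped)
    print(sliding_window(tokens, window=2, stride=3))
```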

5

recursive-llm-ts (Repository, 34/100)

via “context-window-aware-chunking-with-overlap”

TypeScript bridge for recursive-llm: Recursive Language Models for unbounded context processing with structured outputs

Unique: Combines token-aware chunking with semantic boundary detection and configurable overlap, rather than naive fixed-size chunking

vs others: More sophisticated than simple character-based chunking and preserves context across boundaries, whereas most frameworks use fixed-size chunks

6

Memory-Plus (Repository, 31/100)

via “text-chunking-with-semantic-preservation”

A lightweight, local RAG memory store to record, retrieve, update, delete, and visualize persistent "memories" across sessions. Suited to developers working with multiple AI coders (like Windsurf, Cursor, or Copilot) or anyone who wants their AI to actually remember them.

Unique: Implements simple fixed-size chunking with overlap rather than sophisticated semantic splitting, prioritizing simplicity and predictability over perfect semantic preservation

vs others: Simpler than semantic chunking approaches (such as LlamaIndex's semantic splitter) because it uses fixed boundaries; this reduces complexity at the cost of sometimes splitting in the middle of a semantic unit

7

llm-splitter (Repository, 29/100)

via “configurable chunk size and overlap control”

Efficient, configurable text chunking utility for LLM vectorization. Returns rich chunk metadata.

Unique: Provides explicit, validated configuration parameters for chunk size, overlap, and strategy selection, allowing non-destructive experimentation with chunking behavior without modifying splitting logic

vs others: More flexible than fixed-strategy splitters by exposing configuration as first-class parameters, enabling easier integration into hyperparameter optimization pipelines
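
Treating chunking settings as validated, first-class parameters might look like the sketch below, which also enumerates a small grid of valid configurations for a tuning run. None of this is llm-splitter's API: `ChunkConfig`, its fields, the strategy names, and `config_grid` are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product

# Assumption: a made-up set of strategy names for the example.
KNOWN_STRATEGIES = {"fixed", "paragraph"}


@dataclass(frozen=True)
class ChunkConfig:
    chunk_size: int
    overlap: int = 0
    strategy: str = "fixed"

    def __post_init__(self) -> None:
        # Reject invalid combinations up front instead of failing mid-split.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.overlap < self.chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        if self.strategy not in KNOWN_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")


def config_grid(sizes, overlaps, strategies=("fixed",)) -> list[ChunkConfig]:
    """Enumerate every valid combination, skipping overlaps that are too large."""
    return [
        ChunkConfig(size, overlap, strategy)
        for size, overlap, strategy in product(sizes, overlaps, strategies)
        if overlap < size
    ]


if __name__ == "__main__":
    for cfg in config_grid(sizes=[100, 200], overlaps=[0, 150]):
        print(cfg)
```

Because the configuration object is immutable and validated at construction, each trial in a search loop can be logged and reproduced from the config alone.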

8

@memberjunction/ai-vectordb (Repository, 28/100)

via “document-chunking-and-embedding-strategy”

MemberJunction: AI Vector Database Module

Unique: Provides multiple chunking strategies (fixed, semantic, sliding-window) with configurable overlap and automatic metadata propagation, enabling optimization of chunk granularity for downstream retrieval quality

vs others: More flexible than simple fixed-size splitting by supporting semantic chunking and overlap configuration, while remaining simpler than specialized document parsing libraries

9

llm-chunk (Repository, 26/100)

via “configurable-chunk-size-and-overlap-management”

A super simple text splitter for LLM

Unique: Provides an explicit, user-controlled overlap parameter rather than fixed or automatic overlap strategies, giving developers direct control over the trade-off between overlap redundancy (context preserved across chunk boundaries) and storage cost, without hidden heuristics

vs others: More transparent and predictable than LangChain's overlap implementation because parameters are explicit and not abstracted behind document-type detection, but requires more manual tuning
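
The redundancy-versus-storage trade-off can be quantified with a little arithmetic. Assuming a plain fixed-size character splitter that advances by `size - overlap` each step and stops once a chunk reaches the end of the text (an assumption for this sketch, not a description of llm-chunk's internals), the number of chunks and the total characters stored follow directly. The function names are hypothetical.

```python
import math


def chunk_count(length: int, size: int, overlap: int) -> int:
    """Number of chunks a fixed-size splitter emits for `length` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if length <= 0:
        return 0
    if length <= size:
        return 1
    return math.ceil((length - size) / (size - overlap)) + 1


def stored_chars(length: int, size: int, overlap: int) -> int:
    """Total characters stored across all chunks.

    Every chunk after the first repeats `overlap` characters,
    so the total is length + (n - 1) * overlap.
    """
    n = chunk_count(length, size, overlap)
    return length + max(0, n - 1) * overlap


if __name__ == "__main__":
    # 1000 characters, 200-character chunks, 50 characters of overlap
    print(chunk_count(1000, 200, 50))   # 7
    print(stored_chars(1000, 200, 50))  # 1300
```

In this example a 25% overlap costs 30% more stored text than non-overlapping chunks, which would store exactly 1000 characters in 5 chunks.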

10

Private GPT (Product, 25/100)

via “document-chunking-with-overlap”

Tool for private interaction with your documents

Unique: Implements structure-aware chunking that respects paragraph and section boundaries rather than naive token-based splitting, combined with configurable overlap to preserve context, and attaches rich metadata for source attribution

vs others: More sophisticated than simple fixed-size chunking used in basic RAG implementations; comparable to LangChain's recursive character splitter but with tighter integration to Private GPT's embedding and retrieval pipeline
