Recursive Hierarchical Chunking With Fallback

1

Docling | Repository | 56/100

via “document chunking for rag with semantic awareness”

IBM's document converter: turns PDFs and DOCX files into structured markdown, with OCR and table extraction.

Unique: Uses document structure (headings, sections, paragraphs) detected during layout analysis to create semantically coherent chunks rather than naive character-count splitting, preserving heading hierarchy and section context in chunk metadata

vs others: More semantically aware than simple character-count chunking (LangChain's RecursiveCharacterTextSplitter) because it respects document structure; more flexible than fixed-size chunking because it adapts to variable section lengths
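As a rough illustration of the structure-aware chunking described for Docling, the sketch below splits already-converted markdown at headings and records each chunk's heading path as metadata. It is not Docling's API: `Chunk`, `chunk_by_headings`, and the heading regex are hypothetical names for this example, and it assumes `#`-style markdown headings.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

@dataclass
class Chunk:
    text: str
    headings: list = field(default_factory=list)  # heading path, outermost first

def chunk_by_headings(markdown: str) -> list:
    """Emit one chunk per section body, tagged with the headings above it."""
    chunks = []
    path = []   # stack of headings currently in scope
    body = []   # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(text=text, headings=list(path)))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push the new one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Because section lengths vary, chunk sizes vary too; a size limit would need a second splitting pass inside each section.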

2

RAG_Techniques | Repository | 54/100

via “hierarchical-index-construction-and-traversal”

This repository showcases various advanced techniques for Retrieval-Augmented Generation (RAG) systems. Each technique has a detailed notebook tutorial.

Unique: Implements recursive document summarization to build multi-level hierarchies that enable top-down retrieval traversal, reducing embedding computations and improving efficiency for large collections — a structural approach to retrieval efficiency rather than algorithmic optimization

vs others: More efficient than flat indices for large collections because it reduces embeddings computed per query, and more effective than simple filtering because it uses semantic hierarchies rather than metadata-based pruning
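The top-down traversal described for RAG_Techniques can be sketched as a two-level index: score document summaries first, then score chunks only inside the best-matching documents. This is a toy under stated assumptions, not the repository's notebook code: `tokens`, `similarity`, `summarize`, `build_index`, and `search` are hypothetical names, word-set overlap stands in for embeddings, and the "summary" is just the first sentence of each chunk where a real system would call an LLM.

```python
def tokens(text: str) -> set:
    """Toy stand-in for an embedding: the set of lowercase words."""
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def summarize(chunks: list) -> str:
    """Placeholder summarizer: first sentence of every chunk."""
    return " ".join(chunk.split(".")[0] for chunk in chunks)

def build_index(docs: dict) -> dict:
    """Two levels: one summary vector per document, plus per-chunk vectors."""
    return {
        name: {
            "summary": tokens(summarize(chunks)),
            "chunks": [(chunk, tokens(chunk)) for chunk in chunks],
        }
        for name, chunks in docs.items()
    }

def search(index: dict, query: str, top_docs: int = 1, top_chunks: int = 1) -> list:
    q = tokens(query)
    # Level 1: rank documents by how well their summary matches the query.
    ranked = sorted(index, key=lambda n: similarity(q, index[n]["summary"]), reverse=True)
    # Level 2: score chunks only inside the selected documents.
    candidates = [pair for n in ranked[:top_docs] for pair in index[n]["chunks"]]
    candidates.sort(key=lambda pair: similarity(q, pair[1]), reverse=True)
    return [text for text, _ in candidates[:top_chunks]]
```

With N documents of M chunks each, a query compares against N summaries plus `top_docs * M` chunks instead of all N * M chunks, at the cost of missing chunks whose document summary scored poorly.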

3

LlamaIndex | Framework | 47/100

via “intelligent document chunking and node splitting”

A data framework for building LLM applications over external data.

Unique: Implements a node-tree abstraction that preserves document hierarchy and enables parent-document retrieval patterns. Supports multiple splitting strategies (recursive, semantic, code-aware) with pluggable custom splitters, and automatically propagates metadata through the node tree.

vs others: More sophisticated than LangChain's text splitters because it preserves hierarchical relationships and supports semantic splitting; better for complex document structures than simple character-based splitting.

4

rag-memory-epf-mcp | MCP Server | 46/100

via “semantic chunking with context preservation”

Project-local RAG memory MCP server — knowledge graph + multilingual vector + FTS5 in a single SQLite file. Per-project isolation, 30 MCP tools, codepoint-safe chunking (Korean/CJK/emoji).

Unique: Implements semantic chunking as part of the indexing pipeline, preserving code block and paragraph boundaries to ensure retrieved chunks are coherent units rather than arbitrary text splits, improving RAG quality

vs others: Better retrieval quality than fixed-size chunking for structured documents, and more maintainable than custom chunking logic because boundaries are detected automatically based on document structure

5

agentic-rag-for-dummies | Repository | 45/100

via “hierarchical parent-child document chunking with dual-embedding indexing”

A modular Agentic RAG built with LangGraph — learn Retrieval-Augmented Generation Agents in minutes.

Unique: Implements explicit parent-child chunk relationships with dual-embedding (dense + sparse BM25) indexing in a single Qdrant instance, rather than maintaining separate indices or flattening chunks. The VectorDatabaseManager and ParentStoreManager classes coordinate retrieval to return child chunks for ranking but parent context for generation, a pattern not standard in LangChain's default RecursiveCharacterTextSplitter.

vs others: Outperforms naive chunking strategies by reducing context loss (vs flat chunks) and retrieval latency (vs separate vector stores) while maintaining both semantic and keyword search capabilities in one index.
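The parent-child pattern described above (rank on small child chunks, hand the larger parent to the generator) can be sketched without any vector database. The code below is an assumption-laden toy, not the repository's `VectorDatabaseManager` or `ParentStoreManager`: `split_children`, `build_stores`, `overlap_score`, and `retrieve_parents` are hypothetical names, and a plain keyword-overlap count replaces the dense plus BM25 scoring.

```python
def split_children(parent: str, size: int) -> list:
    """Cut a parent chunk into child chunks of at most `size` words."""
    words = parent.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_stores(parents: list, child_size: int = 8):
    """Return (parent_store, child_index); each child remembers its parent id."""
    parent_store = dict(enumerate(parents))  # parent_id -> full text
    child_index = [
        (child, pid)
        for pid, text in parent_store.items()
        for child in split_children(text, child_size)
    ]
    return parent_store, child_index

def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query: str, parent_store: dict, child_index: list, k: int = 2) -> list:
    # Rank children, because small chunks match a query more precisely.
    ranked = sorted(child_index, key=lambda item: overlap_score(query, item[0]), reverse=True)
    parent_ids = []
    for _, pid in ranked[:k]:
        if pid not in parent_ids:  # dedupe while keeping rank order
            parent_ids.append(pid)
    # Return parents, because the generator benefits from the wider context.
    return [parent_store[pid] for pid in parent_ids]
```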

6

RAG-chunk – A CLI to test RAG chunking strategies | CLI Tool | 38/100

Show HN: RAG-chunk – A CLI to test RAG chunking strategies

Unique: Implements recursive chunking with explicit fallback hierarchy and structure preservation, enabling intelligent splitting that respects document semantics while enforcing size constraints

vs others: Better than fixed-size chunking for structured documents, and more predictable than pure semantic chunking while maintaining semantic coherence
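Recursive chunking with a fallback hierarchy can be sketched as follows: try the coarsest separator first, greedily pack the resulting pieces up to the size limit, and recurse with the next separator only for pieces that are still too long. This is a minimal sketch, not RAG-chunk's implementation; the function name `recursive_chunk` and the separator order (paragraph, line, sentence, word, raw characters) are assumptions made for illustration.

```python
# Assumed fallback order; the empty string means "hard cut by characters".
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]

def recursive_chunk(text: str, max_len: int, separators: list = SEPARATORS) -> list:
    """Split `text` into chunks of at most `max_len` characters."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: no structure left to respect, cut at fixed offsets.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the open chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # Too large even on its own: fall back to the next separator.
            chunks.extend(recursive_chunk(piece, max_len, rest))
    if current:
        chunks.append(current)
    return chunks
```

Every chunk respects the size limit, and a boundary only degrades to a finer separator where the coarser one could not satisfy it. Note that this sketch drops the separator at each chunk boundary, so the chunks cannot be concatenated back into the exact original text.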

7

llama-index-core | Framework | 34/100

via “hierarchical document chunking with semantic awareness”

Interface between LLMs and your data

Unique: Implements multiple chunking strategies (simple, recursive, semantic, hierarchical) with automatic parent-child relationship tracking, enabling retrieval systems to fetch full context by traversing node relationships. SemanticSplitter uses embedding-based boundary detection rather than token counting.

vs others: More sophisticated than LangChain's text splitters by preserving document hierarchy and supporting semantic boundaries; enables context-aware retrieval that recovers full sections rather than isolated chunks.
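Embedding-based boundary detection, as opposed to token counting, can be illustrated by comparing each sentence with the next and starting a new chunk wherever similarity drops. The sketch below does not reproduce LlamaIndex's splitter: `embed`, `cosine`, `semantic_split`, and the `threshold` value are hypothetical, and a bag-of-words count vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_split(text: str, threshold: float = 0.2) -> list:
    """Group consecutive sentences; break where adjacent similarity < threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])    # topic shift: start a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the open chunk
    return [" ".join(group) for group in groups]
```

Chunk sizes here are driven entirely by content, so a production pipeline would usually pair this with a maximum-size guard.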

8

llama-index | Framework | 34/100

via “intelligent document chunking with semantic-aware node parsing”

Interface between LLMs and your data

Unique: Offers pluggable NodeParser strategies including semantic-aware splitting that respects document boundaries and language-specific parsing for code/markdown, with automatic metadata propagation through the node hierarchy

vs others: More sophisticated than LangChain's text splitters by preserving document hierarchy and offering semantic-aware chunking; supports language-specific parsing without external dependencies

9

DocMason – Agent Knowledge Base for local complex office files | Repository | 34/100

via “chunking and semantic segmentation of document content”

I think everyone has already read Karpathy's post about LLM knowledge bases. For the past few weeks I have been working on an agent-native knowledge base for complex research (DocMason), and it runs purely in Codex/Claude Code. I call this paradigm "The repo is the app." Codex is…

Unique: Uses structure-aware chunking that respects document hierarchy (sections, tables, lists) and creates overlapping chunks with full provenance metadata, rather than naive token-count splitting that destroys semantic boundaries

vs others: More sophisticated than LangChain's RecursiveCharacterTextSplitter because it understands document structure semantics and preserves table/section integrity, while simpler than enterprise solutions like Unstructured.io that require additional dependencies

10

Vectorize | MCP Server | 34/100

via “intelligent text chunking with semantic awareness”

[Vectorize](https://vectorize.io) MCP server for advanced retrieval, Private Deep Research, Anything-to-Markdown file extraction, and text chunking.

Unique: Implements semantic-aware chunking strategies that preserve document structure and meaning, rather than naive token-based splitting, with configurable overlap to maintain context across chunk boundaries

vs others: More sophisticated than LangChain's RecursiveCharacterTextSplitter because it considers semantic boundaries and document structure, producing higher-quality chunks for retrieval
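Configurable overlap is the simplest of these ideas to show in code: each chunk repeats the last few tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. This is a generic sliding-window sketch, not Vectorize's implementation; `chunk_with_overlap` and its parameters are hypothetical names.

```python
def chunk_with_overlap(tokens: list, size: int, overlap: int) -> list:
    """Slide a window of `size` tokens, repeating `overlap` tokens between chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks
```

For example, `chunk_with_overlap("one two three four five six seven".split(), size=3, overlap=1)` yields three chunks in which `three` and `five` each appear twice. Larger overlap preserves more context but stores and embeds more duplicate text.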

11

@convex-dev/rag | Repository | 34/100

via “document chunking and recursive text splitting”

A RAG component for Convex.

Unique: Integrates chunking directly into the Convex RAG pipeline with automatic metadata propagation, so chunks are stored with full lineage information enabling direct retrieval of source documents without separate lookup queries

vs others: Simpler than LangChain's text splitters (no external dependencies), but less sophisticated than semantic chunking approaches that use embeddings to identify natural boundaries

12

Unstructured | MCP Server | 33/100

via “semantic chunking with configurable chunk boundaries”

Set up and interact with your unstructured data processing workflows in [Unstructured Platform](https://unstructured.io)

Unique: Implements boundary-aware chunking that respects document semantics (sentences, paragraphs, table cells) rather than naive token-count splitting. Maintains bidirectional traceability between chunks and source elements, enabling citation and source attribution in downstream RAG applications.

vs others: Superior to size-bounded character splitting (as done by LangChain's RecursiveCharacterTextSplitter) because it preserves semantic units and provides element-level traceability; more flexible than document-level chunking because it handles large documents efficiently.

13

llama-parse | CLI Tool | 30/100

via “semantic document chunking with context preservation”

Parse files into RAG-Optimized formats.

Unique: Preserves document hierarchy and semantic structure in chunks through vision-language model understanding of content relationships, enabling context-aware retrieval and maintaining chunk provenance for citation and ranking

vs others: Produces semantically coherent chunks that improve LLM reasoning compared to fixed-size splitting, and maintains provenance metadata for citation and source tracking unlike generic chunking libraries

14

Needle | MCP Server | 30/100

via “chunking-strategy-for-semantic-coherence”

Production-ready RAG out of the box to search and retrieve data from your own documents.

Unique: unknown — insufficient architectural detail on chunking algorithm, boundary detection method, or configurable chunk size parameters

vs others: Likely uses semantic-aware chunking rather than fixed-size windows, improving retrieval quality compared to naive splitting strategies

15

unstructured | Repository | 28/100

via “intelligent document chunking with semantic boundaries”

A library that prepares raw documents for downstream ML tasks.

Unique: Chunks at element boundaries (paragraph, table, section) rather than character counts, preserving semantic units and enabling overlap strategies that maintain context for embedding models

vs others: Respects document structure during chunking unlike simple token-count approaches, reducing semantic fragmentation in RAG systems

16

@memberjunction/ai-vectordb | Repository | 28/100

via “document-chunking-and-embedding-strategy”

MemberJunction: AI Vector Database Module

Unique: Provides multiple chunking strategies (fixed, semantic, sliding-window) with configurable overlap and automatic metadata propagation, enabling optimization of chunk granularity for downstream retrieval quality

vs others: More flexible than simple fixed-size splitting by supporting semantic chunking and overlap configuration, while remaining simpler than specialized document parsing libraries

17

llm-chunk | Repository | 26/100

via “recursive-text-chunking-with-delimiter-hierarchy”

A super simple text splitter for LLM

Unique: Uses a simple recursive delimiter-hierarchy approach (newline → space → character) rather than ML-based semantic segmentation or token-counting libraries, making it lightweight and dependency-free while trading off semantic precision for simplicity and speed

vs others: Simpler and faster than LangChain's RecursiveCharacterTextSplitter for basic use cases due to minimal dependencies, but lacks token-aware splitting and language-specific optimizations that more mature libraries provide
