doctor
MCP Server · Free
Doctor is a tool for discovering, crawling, and indexing websites so they can be exposed as an MCP server for LLM agents.
Capabilities (11)
asynchronous web crawling with job queue orchestration
Medium confidence
Doctor implements a distributed crawling system using crawl4ai for HTML fetching paired with Redis-backed job queuing. The Web Service accepts crawl requests via REST API, enqueues them to Redis, and the Crawl Worker processes jobs asynchronously, enabling non-blocking crawl operations at scale. This microservice architecture decouples request handling from resource-intensive crawling, allowing the system to handle multiple concurrent crawl jobs without blocking client requests.
Uses Redis message queue to decouple crawl requests from processing, enabling true asynchronous job management with persistent queue state rather than in-memory task scheduling. Integrates crawl4ai as the crawling engine, providing modern browser-based content extraction.
Faster than synchronous crawlers for multi-site indexing because job queuing allows parallel processing across multiple worker instances, and more reliable than simple threading because Redis persists job state across restarts.
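A minimal sketch of the enqueue side of such a queue, using redis-py; the `crawl_jobs` key, job fields, and status hash are illustrative assumptions rather than Doctor's actual schema:

```python
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def enqueue_crawl(url: str) -> str:
    """Create a crawl job, push it onto the queue, and return its id."""
    job_id = str(uuid.uuid4())
    job = {"id": job_id, "url": url, "status": "queued"}
    # RPUSH keeps FIFO order; workers consume with blocking BLPOP.
    r.rpush("crawl_jobs", json.dumps(job))
    # Mirror status in a hash so clients can poll without scanning the queue.
    r.hset(f"job:{job_id}", mapping=job)
    return job_id

print(enqueue_crawl("https://example.com"))
```

Because the job lives in Redis rather than process memory, a restart of the web service loses nothing: workers pick up where the queue left off.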
semantic text chunking with configurable splitting strategies
Medium confidence
The Crawl Worker uses langchain_text_splitters to break extracted HTML text into semantically meaningful chunks before embedding. This capability supports multiple splitting strategies (character-based, token-based, recursive) to optimize chunk size for downstream embedding models, ensuring that semantic boundaries are preserved and chunks fit within embedding model token limits. The chunking strategy is configurable per crawl job, allowing optimization for different content types and embedding models.
Leverages langchain_text_splitters for configurable chunking strategies rather than naive fixed-size splitting, enabling semantic-aware chunk boundaries. Supports recursive splitting to handle nested document structures and preserves chunk overlap for context continuity.
More flexible than fixed-size chunking because it adapts to content structure and supports multiple splitting strategies; more efficient than sentence-level chunking because it respects token limits of embedding models.
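A short example with the recursive splitter from `langchain_text_splitters`; the chunk size and overlap values are illustrative, not Doctor's defaults:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive splitting tries separators in order ("\n\n", "\n", " ", ""),
# so boundaries fall on paragraph or sentence breaks where possible.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # illustrative: max characters per chunk
    chunk_overlap=200,  # illustrative: shared context between adjacent chunks
)

text = "First paragraph of extracted page text.\n\nSecond paragraph." * 50
chunks = splitter.split_text(text)
print(len(chunks), "chunks")
```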
configuration-driven system setup with environment variables
Medium confidence
Doctor uses environment variables and configuration files to control system behavior (embedding provider, Redis connection, DuckDB path, crawl parameters). This configuration-driven approach allows deployment-time customization without code changes, supporting different environments (dev, staging, production) with different settings. Configuration covers embedding model selection, database paths, queue settings, and crawl parameters like timeout and retry logic.
Implements configuration-driven setup using environment variables and config files, enabling deployment-time customization of embedding providers, database paths, and crawl parameters without code modification.
More flexible than hardcoded settings because configuration can be changed per deployment; more maintainable than scattered config logic because all settings are centralized.
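A sketch of what centralized, environment-driven settings can look like; the variable names and defaults below are assumptions, not Doctor's documented configuration:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Every field reads from the environment, with a dev-friendly fallback.
    redis_url: str = os.getenv("REDIS_URL", "redis://localhost:6379/0")
    duckdb_path: str = os.getenv("DUCKDB_PATH", "doctor.duckdb")
    embedding_model: str = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
    crawl_timeout_s: int = int(os.getenv("CRAWL_TIMEOUT_S", "30"))

settings = Settings()
print(settings.redis_url, settings.embedding_model)
```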
multi-provider embedding generation with litellm abstraction
Medium confidence
Doctor abstracts embedding generation through litellm, enabling support for multiple embedding providers (OpenAI, Anthropic, local models) without changing core code. The Crawl Worker generates vector embeddings for each text chunk using the configured provider, storing both the chunk text and its vector representation in DuckDB. This abstraction allows switching embedding providers by configuration change, supporting cost optimization and model selection without code modification.
Uses litellm as an abstraction layer over embedding providers, enabling provider-agnostic embedding generation. This allows configuration-driven provider selection without code changes, supporting OpenAI, Anthropic, and local models through a unified interface.
More flexible than hardcoded OpenAI embeddings because it supports provider switching via configuration; more maintainable than custom provider adapters because litellm handles provider-specific API differences.
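A minimal example of provider-agnostic embedding through litellm; the model names are examples, and the model string alone determines which provider handles the call:

```python
from litellm import embedding

# litellm routes the request based on the model string, so switching
# providers is a configuration change rather than a code change.
response = embedding(
    model="text-embedding-3-small",  # or e.g. "ollama/nomic-embed-text"
    input=["first chunk of text", "second chunk of text"],
)
vectors = [item["embedding"] for item in response.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```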
vector-backed semantic search with duckdb vss
Medium confidence
Doctor stores text chunks and their vector embeddings in DuckDB with vector search support (VSS), enabling semantic similarity search across indexed content. The system computes vector similarity between query embeddings and stored chunk embeddings, returning ranked results based on cosine similarity. This capability allows LLM agents to retrieve contextually relevant information from indexed websites using natural language queries, without requiring keyword matching.
Leverages DuckDB's native vector search support (VSS extension) for in-process semantic search without external vector database dependency. This eliminates the need for separate vector stores like Pinecone or Weaviate, reducing operational complexity and latency.
Simpler deployment than Pinecone/Weaviate because vector search is co-located with data in DuckDB; faster than external vector databases for small-to-medium collections because there's no network round-trip for search queries.
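A minimal sketch of DuckDB's VSS extension in use; the table layout, 384-dimension vectors, and index settings are assumptions for illustration, not Doctor's actual schema:

```python
import duckdb

con = duckdb.connect("doctor.duckdb")
con.execute("INSTALL vss; LOAD vss;")
# HNSW index persistence for on-disk databases is still experimental.
con.execute("SET hnsw_enable_experimental_persistence = true;")
con.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id INTEGER,
        text VARCHAR,
        embedding FLOAT[384]  -- dimensionality depends on the embedding model
    )
""")
# HNSW accelerates approximate nearest-neighbor search over the embeddings.
con.execute("CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks USING HNSW (embedding)")

query_vec = [0.0] * 384  # embedding of the user's query
rows = con.execute(
    """
    SELECT text, array_distance(embedding, ?::FLOAT[384]) AS dist
    FROM chunks
    ORDER BY dist
    LIMIT 5
    """,
    [query_vec],
).fetchall()
```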
mcp server integration for llm agent tool access
Medium confidence
Doctor exposes its search and crawl capabilities through the Model Context Protocol (MCP), enabling LLM agents to discover, crawl, and search indexed websites as native tools. The MCP server translates agent tool calls into Doctor API requests, allowing agents to autonomously trigger crawls, search indexed content, and retrieve specific documents. This integration enables LLM agents to extend their knowledge beyond training data by accessing live web content through a standardized protocol.
Implements MCP server to expose Doctor capabilities as native LLM tools, enabling agents to autonomously trigger crawls and search without leaving the agent execution context. This standardized protocol integration allows compatibility with any MCP-supporting LLM.
More seamless than REST API integration because agents can call tools natively without custom HTTP logic; more standardized than custom agent plugins because MCP is a protocol-level standard supported by multiple LLM providers.
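A sketch of the tool-exposure pattern using the reference MCP Python SDK (FastMCP); the tool name and signature are illustrative, not Doctor's actual tool surface:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("doctor")

@mcp.tool()
def search_docs(query: str, limit: int = 5) -> list[str]:
    """Semantic search over indexed pages."""
    # Doctor would embed the query and run a DuckDB VSS search here;
    # stubbed out for brevity.
    return [f"result {i} for {query!r}" for i in range(limit)]

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

The decorator publishes the function's name, docstring, and typed parameters as the tool schema, which is what lets agents call it without custom HTTP glue.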
rest api for document search and retrieval
Medium confidence
Doctor exposes a REST API for querying indexed documents, allowing applications to search crawled content and retrieve specific chunks by semantic similarity or metadata filters. The API accepts search queries, executes vector similarity search against the DuckDB index, and returns ranked results with source URLs and chunk content. This capability enables non-agent applications to access indexed web content programmatically.
Provides REST API endpoints for semantic search and document retrieval, enabling non-agent applications to query indexed content. The API directly interfaces with DuckDB VSS, returning ranked results with full chunk content and metadata.
Simpler than building custom search UI because API returns structured results ready for display; more flexible than hardcoded search because API supports arbitrary semantic queries without predefined indexes.
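A hypothetical client call against such a search endpoint; the host, port, path, and response fields are assumptions for illustration, not the documented API contract:

```python
import requests

resp = requests.get(
    "http://localhost:8000/search",  # assumed endpoint
    params={"query": "how are retries configured?", "limit": 5},
    timeout=10,
)
resp.raise_for_status()
# Assumed response shape: {"results": [{"url": ..., "text": ...}, ...]}
for hit in resp.json().get("results", []):
    print(hit.get("url"), "-", (hit.get("text") or "")[:80])
```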
crawl job lifecycle management with status tracking
Medium confidence
Doctor provides REST API endpoints for creating, monitoring, and managing crawl jobs with persistent status tracking. Jobs are enqueued to Redis with metadata (URL, status, progress, error messages), and clients can poll job status endpoints to track progress from queued → processing → completed/failed. The system stores job metadata in DuckDB, enabling historical tracking and error diagnosis. This capability allows applications to manage long-running crawl operations and handle failures gracefully.
Implements persistent job lifecycle tracking using Redis queue for state and DuckDB for metadata storage, enabling clients to monitor crawl progress and diagnose failures. Job status is queryable via REST API, providing visibility into asynchronous operations.
More reliable than in-memory job tracking because Redis persists queue state across restarts; more observable than fire-and-forget crawling because status endpoints provide real-time progress visibility.
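A hypothetical polling loop over the lifecycle described above; the endpoint path and field names are assumptions:

```python
import time

import requests

def wait_for_job(base_url: str, job_id: str, interval_s: float = 2.0) -> dict:
    """Poll until the job reaches a terminal state, then return its record."""
    while True:
        job = requests.get(f"{base_url}/jobs/{job_id}", timeout=10).json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval_s)

job = wait_for_job("http://localhost:8000", "1234-abcd")  # assumed endpoint/id
print(job["status"], job.get("error"))
```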
html-to-text extraction with content cleaning
Medium confidence
The Crawl Worker extracts plain text from crawled HTML using content extraction logic that removes boilerplate (navigation, ads, scripts), preserving main content. This extraction happens after crawl4ai fetches the page, converting raw HTML into clean text suitable for chunking and embedding. The extraction strategy balances content preservation with noise removal, ensuring that extracted text is semantically meaningful without excessive markup or irrelevant elements.
Integrates content extraction as part of the crawl pipeline, removing boilerplate and noise before text chunking. Uses crawl4ai's extraction capabilities combined with custom cleaning logic to produce semantically clean text.
More effective than regex-based HTML stripping because it understands content structure; more efficient than keeping raw HTML because extracted text is smaller and more relevant for embedding.
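A minimal crawl4ai fetch showing the extraction starting point; Doctor's actual crawler configuration and downstream cleaning steps are not shown here:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def fetch_clean_text(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        # result.markdown holds the boilerplate-reduced text representation
        # that downstream chunking and embedding would consume.
        return result.markdown

print(asyncio.run(fetch_clean_text("https://example.com"))[:500])
```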
distributed crawl worker scaling with redis queue
Medium confidence
Doctor's architecture supports horizontal scaling of crawl workers by adding multiple worker instances that consume jobs from the same Redis queue. Each worker independently processes crawl jobs, extracts text, generates embeddings, and stores results in the shared DuckDB instance. The Redis queue ensures job distribution across workers without duplication, enabling linear scaling of crawl throughput by adding more worker instances.
Implements worker pool pattern with Redis queue for job distribution, enabling multiple crawl workers to process jobs concurrently without coordination overhead. Workers are stateless and can be added/removed dynamically.
More scalable than single-threaded crawling because workers process jobs in parallel; more reliable than shared memory queues because Redis persists queue state across worker failures.
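A sketch of a stateless worker loop; because BLPOP pops atomically, any number of workers can share one queue without receiving duplicate jobs (the key names are assumed, as in the enqueue sketch above):

```python
import json

import redis

r = redis.Redis(decode_responses=True)

def worker_loop() -> None:
    while True:
        _, raw = r.blpop("crawl_jobs")  # blocks until a job arrives
        job = json.loads(raw)
        r.hset(f"job:{job['id']}", "status", "processing")
        try:
            # crawl, extract, chunk, embed, store ... (elided)
            r.hset(f"job:{job['id']}", "status", "completed")
        except Exception as exc:
            r.hset(f"job:{job['id']}", mapping={"status": "failed", "error": str(exc)})

if __name__ == "__main__":
    worker_loop()
```

Scaling out is then a matter of starting more copies of this process against the same Redis instance.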
duckdb-based persistent document storage with metadata indexing
Medium confidence
Doctor uses DuckDB as the primary data store for crawled content, storing raw text, text chunks, vector embeddings, and job metadata in structured tables. The database schema includes tables for jobs (status, metadata), documents (raw content, source URL), and chunks (text, embeddings, chunk index). This persistent storage enables long-term retention of indexed content and supports both semantic search (via VSS) and metadata-based queries (URL filtering, date ranges).
Uses DuckDB with VSS extension as the primary data store, eliminating the need for separate vector databases. Combines structured metadata tables with vector search in a single database, simplifying deployment and reducing operational complexity.
Simpler than separate vector DB + metadata store because all data is in one system; more cost-effective than managed vector databases because DuckDB is self-hosted and open-source.
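An illustrative DuckDB schema matching the three table groups described above; Doctor's real column names and types may differ:

```python
import duckdb

con = duckdb.connect("doctor.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id UUID PRIMARY KEY,
        url VARCHAR,
        status VARCHAR,   -- queued / processing / completed / failed
        error VARCHAR,
        created_at TIMESTAMP
    )
""")
con.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id UUID PRIMARY KEY,
        job_id UUID,
        source_url VARCHAR,
        raw_text VARCHAR
    )
""")
con.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id UUID PRIMARY KEY,
        document_id UUID,
        chunk_index INTEGER,
        text VARCHAR,
        embedding FLOAT[384]  -- dimensionality depends on the embedding model
    )
""")
```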
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with doctor, ranked by overlap. Discovered automatically through the match graph.
Crawl4AI
AI-optimized web crawler — clean markdown extraction, JS rendering, structured output for RAG.
WebDataSource
Web Crawler for AI Agents. Supercharge your AI agents with an MCP-ready web crawler that delivers real-time insights from the web and your private knowledge bases.
n8n-no-code-web-scraper
No-code web scraper built with n8n and ScrapingBee for AI-powered data extraction and automated web scraping workflows without writing code.
Supadata
Official MCP server for [Supadata](https://supadata.ai) - YouTube, TikTok, X and Web data for makers.
firecrawl-mcp
MCP server for Firecrawl web scraping integration. Supports both cloud and self-hosted instances. Features include web scraping, search, batch processing, structured data extraction, and LLM-powered content analysis.
BabyCatAGI
BabyCatAGI is a mod of BabyBeeAGI
Best For
- ✓ LLM application builders needing to index dynamic web content
- ✓ Teams building knowledge bases from live websites
- ✓ Developers integrating web discovery into AI agents
- ✓ Builders creating RAG systems with semantic search
- ✓ Teams optimizing embedding quality for domain-specific content
- ✓ Developers fine-tuning chunk size for specific embedding models
- ✓ Teams deploying Doctor across multiple environments
- ✓ Builders requiring flexible provider selection
Known Limitations
- ⚠ Redis dependency required for job queue — no built-in fallback to in-memory queuing
- ⚠ Crawl4ai may struggle with JavaScript-heavy sites requiring full browser rendering
- ⚠ No built-in rate limiting or politeness delays — requires external configuration to avoid overwhelming target servers
- ⚠ Chunking happens post-extraction — no awareness of original HTML structure or semantic markup
- ⚠ No built-in handling of code blocks or structured data — treats all text uniformly
- ⚠ Chunk overlap configuration is static per job, not adaptive based on content type
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
Repository Details
Last commit: May 24, 2025
About
Doctor is a tool for discovering, crawling, and indexing websites so they can be exposed as an MCP server for LLM agents.