rule-less web page structured data extraction via computer vision
Automatically extracts structured data from arbitrary web pages without requiring CSS selectors, regex patterns, or manual rules. Uses computer vision to identify and classify page elements (text blocks, tables, images, metadata) and NLP to map them to domain-specific schemas (articles, products, organizations, events, discussions). Processes one page per API call, consuming 1 credit per extraction or 2 credits when routed through datacenter proxies for geo-spoofing or IP rotation.
Unique: Uses computer vision (image analysis) + NLP jointly to identify page structure without CSS selectors or regex, enabling extraction from pages with dynamic or non-standard HTML. Automatically detects content type (article vs. product vs. organization) and applies type-specific schema extraction in a single API call.
vs alternatives: Faster to deploy than Selenium/Puppeteer + regex pipelines because it requires no rule maintenance; more flexible than CSS-selector-based tools (Scrapy, Beautiful Soup) when page structure varies across domains.
web crawling and bulk extraction across site hierarchies
Crawlbot spiders websites across 50 to 50,000+ URLs, automatically following links and discovering pages within a domain or URL pattern. Applies the Extract API to each crawled page, returning structured data for all discovered pages. Crawling itself consumes zero credits; only the extraction of crawled pages consumes credits (1 per page). Supports configurable crawl depth, URL filtering, and crawl scheduling via the dashboard or API.
Unique: Decouples crawling (free) from extraction (paid), allowing users to discover site structure without cost and then selectively extract high-value pages. Combines web spidering with rule-less extraction, eliminating the need to maintain separate crawl rules and extraction rules.
vs alternatives: More cost-efficient than Scrapy + regex pipelines for large sites because crawling is free and extraction is pay-per-page; more maintainable than custom crawlers because extraction rules adapt automatically to page structure changes.
multi-language and multi-region knowledge graph indexing
Knowledge Graph indexes entities (organizations, articles, products, discussions, events) across multiple languages and regions. Article/News index (1.6B+ records) includes content from global news sources in multiple languages. Organization index (246M+ records) includes companies from multiple regions with localized data (e.g., revenue in local currency, regional employee counts). Product index (3M+ records) includes products from global e-commerce sites. No explicit documentation of supported languages or regions, but scale suggests broad coverage.
Unique: Knowledge Graph indexes 1.6B+ articles in multiple languages and 246M+ organizations across regions, enabling global entity search without requiring separate language-specific APIs or manual translation.
vs alternatives: More comprehensive than single-language APIs (e.g., English-only news APIs) because it covers global content; more cost-effective than building separate language-specific crawlers because data is pre-indexed.
entity and relationship extraction from unstructured text via nlp
Natural Language API extracts named entities (people, organizations, locations, products), relationships between entities (e.g., 'person works at organization'), and topic-level sentiment from raw text documents (1–10,000 characters). Uses NLP models to identify entity types, resolve entity references, and infer relationships without requiring labeled training data or custom entity definitions. Each document consumes 1 credit regardless of length (within the 1–10k character range).
Unique: Combines entity extraction, relationship inference, and sentiment analysis in a single API call without requiring separate models or training data. Automatically links extracted entities to Diffbot's 10B+ entity Knowledge Graph for entity resolution and enrichment.
vs alternatives: Simpler to integrate than spaCy + custom relationship extraction models because it requires no training data or model fine-tuning; more comprehensive than regex-based entity extraction because it infers relationships and resolves entity references.
knowledge graph search and entity lookup across 10b+ indexed entities
Knowledge Graph API provides query access to Diffbot's pre-indexed database of 10B+ entities across six types: Organizations (246M+ records with 50+ fields), Articles/News (1.6B+ records), Products (3M+ pre-crawled retail products), Discussions (forum/review data with entity matching), Events (23k+ normalized records), and People (scale unknown). Queries use Diffbot Query Language (DQL), a custom SQL-like syntax. Each entity record export consumes 25 credits. Supports filtering, sorting, and aggregation across entity types.
Unique: Pre-indexed 10B+ entity database with cross-entity relationships (e.g., people linked to organizations, organizations linked to news articles and funding events) enables multi-hop queries without requiring external knowledge base construction. DQL query language provides SQL-like filtering and aggregation without requiring REST API pagination loops.
vs alternatives: More comprehensive than single-source APIs (e.g., LinkedIn API for people, Crunchbase for companies) because it integrates data across news, products, discussions, and events; cheaper than building custom web crawlers to index equivalent data, though per-entity export cost is high for bulk operations.
person and organization data enrichment from knowledge graph
Enhance API enriches existing person or organization records by querying the Knowledge Graph and appending additional fields (revenue, locations, employees, funding, executives for organizations; employment history, education, social profiles for people). Input is a person name/email or organization name/domain; output is enriched record with 50+ fields for organizations or equivalent for people. Each enrichment consumes 1 credit (same as Natural Language API). Integrations available via Excel, Google Sheets, and Zapier for non-technical users.
Unique: Provides low-code enrichment via Excel/Sheets/Zapier integrations, enabling non-technical users to enrich datasets without API integration. Leverages pre-indexed Knowledge Graph to avoid real-time web scraping, providing faster enrichment with consistent data quality.
vs alternatives: Faster and cheaper than building custom web scrapers for company intelligence; more comprehensive than single-source APIs (e.g., Clearbit, Hunter) because it aggregates data across news, funding, products, and discussions; easier to integrate for non-technical users via Sheets/Excel.
credit-based pay-per-use api billing with tiered rate discounts
Diffbot uses a credit-based billing model where each API operation consumes a fixed number of credits: Extract (1 credit), Extract with proxy (2 credits), Natural Language (1 credit), Knowledge Graph export (25 credits), Enhance (1 credit). Monthly plans (Free, Startup, Plus, Enterprise) provide credit allotments at different per-credit rates ($0.001–$0.0009). Overage charges apply at the plan's per-credit rate. Free tier (10,000 credits/month, 5 calls/min) is perpetual with no trial expiration. No long-term contracts required; monthly billing.
Unique: Credit-based model decouples API operations from pricing, allowing different operations (Extract, Natural Language, Knowledge Graph export) to have different credit costs. Perpetual free tier with no trial expiration or credit card requirement lowers barrier to entry for small projects.
vs alternatives: More transparent than per-request pricing because credit costs are fixed and documented; more flexible than subscription-only models because overage charges allow usage to scale beyond monthly allotment without contract renegotiation.
low-code data enrichment via excel and google sheets integrations
Diffbot provides native integrations with Microsoft Excel and Google Sheets, allowing non-technical users to enrich datasets without API integration. Excel integration includes a visual query editor for Knowledge Graph searches and data enrichment. Google Sheets integration supports custom Diffbot Query Language (DQL) formulas for entity lookups and enrichment. Zapier integration enables trigger-based enrichment workflows (e.g., enrich new Salesforce leads with company data). All integrations consume credits at the same rate as direct API calls.
Unique: Brings Knowledge Graph enrichment to non-technical users via familiar tools (Excel, Sheets) without requiring API integration or custom code. Visual query editor in Excel abstracts DQL syntax, lowering barrier to entry for business users.
vs alternatives: More accessible than direct API integration for non-technical users; faster to deploy than building custom Python/Node.js scripts; integrates with existing Zapier workflows for teams already using no-code automation.
+3 more capabilities