llm-powered query refinement for dark web search optimization
Transforms raw user investigation queries into optimized search terms by routing them through a pluggable multi-provider LLM layer (OpenAI, Anthropic, Google, Ollama). The system uses prompt engineering to expand queries with domain-specific dark web terminology, synonyms, and alternative phrasings that improve hit rates across heterogeneous dark web search engines. Implementation delegates to llm.refine_query() which constructs a system prompt contextualizing the dark web domain, then streams the LLM response to generate semantically richer search queries.
Unique: Integrates domain-specific prompt engineering for dark web terminology expansion rather than generic query expansion; supports four LLM providers via unified abstraction layer (llm_utils.get_llm()) enabling provider switching without code changes, and contextualizes refinement within OSINT investigation workflows rather than generic search
vs alternatives: Outperforms generic query expansion tools (e.g., Elasticsearch query DSL) by leveraging LLM semantic understanding of dark web marketplace conventions, payment tracking terminology, and threat actor naming patterns specific to OSINT investigations
multi-engine concurrent dark web search with result aggregation
Queries multiple dark web search engines (Torch, Ahmia, Candle, etc.) concurrently using a thread-pooled orchestration pattern implemented in search.py:get_search_results(). Each search engine query is wrapped in a timeout-protected thread to prevent hanging on slow .onion sites; results are aggregated into a unified list of URLs and titles. The system handles search engine-specific response formats through adapter patterns, normalizing heterogeneous HTML/JSON responses into a common data structure for downstream LLM filtering.
Unique: Implements thread-pooled concurrent search across heterogeneous dark web search engines with timeout protection and adapter-based response normalization, rather than sequential queries or single-engine reliance; integrates Tor SOCKS5 proxy routing at the HTTP client level to ensure anonymity across all search engine queries
vs alternatives: Faster than sequential dark web search tools by parallelizing queries across 4+ engines simultaneously; more comprehensive than single-engine tools (e.g., Torch-only searches) by aggregating results across multiple indices with different indexing patterns and coverage
configuration management via environment variables and config files
Manages Robin configuration through a two-tier system: environment variables for sensitive credentials (API keys, Tor proxy address) and YAML/JSON config files for operational settings (model selection, timeout values, search engine whitelist). The system reads environment variables first (highest priority), then falls back to config file values, then uses hardcoded defaults. Configuration is loaded at startup in main.py and passed through the investigation pipeline. This approach enables secure credential management (via environment variables in Docker/Kubernetes) while allowing flexible operational configuration (via config files for different investigation types).
Unique: Implements two-tier configuration (environment variables + config files) with environment variable priority, enabling secure credential management while allowing flexible operational configuration; supports multiple config file formats (YAML, JSON) for flexibility
vs alternatives: More secure than hardcoded credentials by using environment variables; more flexible than single-tier configuration by supporting both sensitive (credentials) and operational (parameters) settings; more portable than system-specific config locations by supporting multiple formats
llm-based intelligent result filtering with relevance scoring
Filters dark web search results using LLM-powered relevance scoring implemented in llm.py:filter_results(). The system constructs a prompt containing the original investigation query and candidate search results, then uses the LLM to score each result's relevance to the investigation objective. Results are ranked by LLM-assigned relevance scores and filtered to retain only high-confidence matches, reducing noise from off-topic .onion pages. This approach captures semantic relevance beyond keyword matching — e.g., identifying a marketplace listing as relevant to 'ransomware payment tracking' even if it doesn't contain the exact phrase.
Unique: Uses LLM semantic understanding to score relevance rather than keyword matching or TF-IDF, enabling detection of conceptually related pages that don't contain exact query terms; integrates with the multi-provider LLM abstraction to allow filtering with different models and comparing their scoring patterns
vs alternatives: More semantically accurate than regex/keyword-based filtering (e.g., grep-based result filtering) because it understands synonyms and contextual relevance; faster than manual review but slower than simple keyword filtering, trading latency for recall/precision improvements
tor-routed anonymous content scraping from .onion sites
Extracts HTML content from dark web .onion sites by routing HTTP requests through a Tor SOCKS5 proxy (127.0.0.1:9050) implemented in scrape.py:scrape_multiple(). The system uses a thread-pooled architecture to scrape multiple URLs concurrently with per-request timeout protection (default 30 seconds) to prevent hanging on slow/offline sites. Responses are parsed with BeautifulSoup to extract text content, and failures (connection timeouts, 404s, Tor circuit failures) are gracefully handled with fallback retry logic. The implementation maintains request anonymity by routing all HTTP traffic through Tor and rotating user agents to avoid fingerprinting.
Unique: Implements thread-pooled concurrent scraping with per-request timeout protection and Tor SOCKS5 proxy routing at the HTTP client level, ensuring anonymity across all requests; integrates graceful failure handling with retry logic rather than blocking on slow/offline sites, enabling large-scale scraping without manual intervention
vs alternatives: Faster than sequential scraping by parallelizing requests across 5-10 threads; more reliable than naive Tor scraping by implementing timeout protection and retry logic; more anonymous than direct HTTP scraping by routing all traffic through Tor and rotating user agents
structured osint report generation from raw dark web content
Synthesizes raw scraped content, search results, and metadata into structured intelligence reports using LLM-powered summarization implemented in llm.py:generate_summary(). The system constructs a prompt containing the investigation query, filtered search results, and scraped page content, then uses the LLM to extract key findings, identify threat indicators (IOCs), and organize information into a structured report with sections like 'Threat Overview', 'Key Findings', 'Indicators of Compromise', and 'Recommendations'. The report is formatted as JSON or markdown for downstream consumption by SIEM systems, threat intelligence platforms, or human analysts.
Unique: Implements LLM-powered synthesis of heterogeneous dark web content (marketplace listings, forum posts, leaked data) into structured OSINT reports with explicit IOC extraction, rather than simple text summarization; integrates with the multi-provider LLM abstraction to allow report generation with different models and comparing output quality
vs alternatives: More actionable than generic summarization tools because it extracts structured IOCs and threat indicators; faster than manual report writing by automating synthesis of 20+ pages into a structured format; more flexible than template-based reporting by using LLM to adapt report structure to investigation context
multi-provider llm abstraction with unified interface
Provides a pluggable abstraction layer for multiple LLM providers (OpenAI, Anthropic, Google, Ollama) implemented in llm_utils.py:get_llm(). The system uses a factory pattern to instantiate the appropriate LLM client based on environment variables or configuration, enabling seamless provider switching without modifying downstream code. Each provider is wrapped with a consistent interface supporting streaming responses, token counting, and error handling. Configuration is managed through environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) and a config file, allowing users to specify model selection, temperature, and max tokens per provider.
Unique: Implements a unified factory pattern abstraction across four distinct LLM providers (OpenAI, Anthropic, Google, Ollama) with consistent interface for streaming, error handling, and configuration, rather than provider-specific client code scattered throughout the codebase; enables on-premises execution via Ollama while maintaining API compatibility with cloud providers
vs alternatives: More flexible than provider-locked tools (e.g., OpenAI-only OSINT tools) by supporting multiple providers; more maintainable than conditional provider logic throughout codebase by centralizing provider instantiation; enables cost optimization by allowing provider switching based on query complexity
six-stage investigation pipeline orchestration
Orchestrates a complete dark web OSINT investigation workflow through a six-stage pipeline implemented in main.py:cli(). The pipeline sequentially executes: (1) LLM initialization, (2) query refinement, (3) multi-engine search, (4) result filtering, (5) content scraping, and (6) report generation. Each stage is implemented as a modular function with clear input/output contracts, enabling easy insertion of custom stages or modification of existing ones. The orchestration layer handles error propagation, logging, and progress reporting across stages, with optional checkpointing to resume interrupted investigations.
Unique: Implements a six-stage investigation pipeline with clear modular boundaries and unified orchestration in main.py, enabling easy extension and customization; integrates all Robin capabilities (query refinement, search, filtering, scraping, synthesis) into a cohesive workflow rather than exposing individual functions
vs alternatives: More comprehensive than single-purpose tools (e.g., search-only or scrape-only tools) by automating the entire investigation workflow; more maintainable than monolithic scripts by decomposing the pipeline into modular stages with clear contracts
+3 more capabilities