Have I Been Trained?
Product: Check if your image has been used to train popular AI art models.
Capabilities (6 decomposed)
reverse-image-lookup-against-training-datasets
Medium confidence: Accepts an image file and performs reverse-lookup queries against indexed snapshots of popular AI art model training datasets (LAION, Stable Diffusion, Midjourney, DALL-E, etc.) using perceptual hashing and semantic embedding matching. The system likely maintains pre-computed hash tables and vector indices of known training data, then compares incoming images against these indices to detect matches or near-duplicates, returning provenance metadata if found.
Specializes in detecting whether images appear in AI model training datasets by maintaining indexed snapshots of LAION, Stable Diffusion, and other public training corpora, using perceptual hashing to match images even after compression or minor modifications, rather than generic reverse-image search
More targeted than Google Images reverse search because it specifically indexes AI training datasets rather than the general web, and more comprehensive than individual model documentation because it aggregates multiple training sources in one query
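The hash-table probe described above can be sketched as a simple exact-match lookup. The index layout, hash values, and provenance fields below are illustrative assumptions, not the service's actual schema:

```python
# Sketch of a reverse lookup against a pre-computed perceptual-hash index.
# Keys are hex-encoded perceptual hashes; values are provenance records
# for every indexed training image that produced that hash.
INDEX = {
    "a3f1c0de99b2e471": [
        {"dataset": "LAION-5B", "source_url": "https://example.com/cat.jpg"},
    ],
}

def lookup(phash: str) -> list[dict]:
    """Return provenance records whose stored hash exactly matches the query.

    A production system would also probe near-duplicate hashes (see the
    tolerance-matching capability below); this shows only the index probe.
    """
    return INDEX.get(phash, [])
```

In practice the exact-match table would be paired with an approximate-nearest-neighbor index over embeddings, so that the probe degrades gracefully when the query hash is a few bits off.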
multi-model-training-dataset-aggregation
Medium confidence: Maintains a unified index across multiple popular generative AI model training datasets (Stable Diffusion, DALL-E, Midjourney, etc.) and exposes a single query interface to check an image against all indexed datasets simultaneously. This likely involves periodic crawling or partnership access to dataset metadata, normalization of dataset schemas, and a federated search architecture that queries multiple indices in parallel and aggregates results.
Aggregates training dataset indices from multiple competing generative AI models into a single queryable interface, rather than requiring users to check each model's dataset separately or use disparate tools
Broader coverage than checking individual model documentation or using model-specific tools, and more efficient than manual searches across multiple platforms
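A federated search of this kind can be sketched as parallel fan-out over per-dataset backends with merged results. The backend functions here are hypothetical stand-ins for real index queries:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-dataset query backends; each takes a perceptual hash
# and returns a list of match records in a normalized schema.
def query_laion(phash: str) -> list[dict]:
    return [{"dataset": "LAION-5B", "hash": phash}]

def query_stable_diffusion(phash: str) -> list[dict]:
    return []  # no match in this index

BACKENDS = [query_laion, query_stable_diffusion]

def federated_search(phash: str) -> list[dict]:
    """Query every dataset index in parallel and merge the match lists."""
    with ThreadPoolExecutor() as pool:
        per_backend = pool.map(lambda q: q(phash), BACKENDS)
    return [match for results in per_backend for match in results]
```

The schema normalization step matters as much as the fan-out: each dataset publishes different metadata, so the backends must map their results into one shared record shape before aggregation.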
perceptual-image-matching-with-tolerance
Medium confidence: Uses perceptual hashing algorithms (likely pHash, dHash, or similar) to match images even when they have been slightly modified (compressed, cropped, color-shifted, watermarked). The system computes a compact hash fingerprint of the query image and compares it against pre-computed hashes of training dataset images, using a configurable similarity threshold to determine matches. This enables detection of images that are visually identical or near-identical to training data despite minor transformations.
Implements perceptual hashing with configurable tolerance thresholds to detect training dataset images even after compression, cropping, or minor modifications, rather than requiring exact pixel-level matches
More robust than cryptographic hashing (MD5, SHA) which fails on any modification, and more practical than deep learning-based similarity because it's faster and doesn't require GPU resources
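The dHash variant named above is small enough to sketch in full. This is a minimal, dependency-free version that assumes the image has already been scaled to a 9x8 grayscale grid (real implementations use an image library for that step); the threshold value is an illustrative default, not the product's:

```python
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash: one bit per horizontal neighbour comparison.

    `pixels` is a pre-scaled 9x8 grayscale grid (8 rows of 9 values),
    yielding a 64-bit fingerprint that survives compression and
    minor colour shifts.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_match(a: int, b: int, threshold: int = 10) -> bool:
    """Two images 'match' when their 64-bit hashes differ in few bits."""
    return hamming(a, b) <= threshold
```

Because a one-pixel tweak flips at most a couple of gradient bits, near-duplicates land within a small Hamming distance of the original, which is exactly what the configurable threshold trades off against false positives.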
training-dataset-provenance-reporting
Medium confidence: When a match is detected, generates a detailed report showing which dataset(s) contain the image, metadata about the dataset (size, creation date, model association), and links to source documentation or dataset repositories. The system aggregates metadata from multiple sources and formats it into a human-readable report that provides context about how the image entered the training pipeline.
Aggregates and formats provenance metadata from multiple training dataset sources into a structured report suitable for legal or research purposes, rather than just returning a binary match result
More actionable than raw dataset indices because it contextualizes matches with model associations and source documentation, and more comprehensive than individual model transparency reports
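The report-generation step can be sketched as a small formatter over normalized match records. The record fields and report layout here are assumptions about what such a report might contain:

```python
from dataclasses import dataclass

@dataclass
class Match:
    dataset: str      # e.g. "LAION-5B"
    snapshot: str     # which indexed snapshot contained the image
    model: str        # model(s) associated with the dataset
    source_url: str   # where the image was scraped from

def format_report(image_name: str, matches: list[Match]) -> str:
    """Render match records into a human-readable provenance report."""
    if not matches:
        return f"{image_name}: no matches in indexed datasets"
    lines = [f"{image_name}: found in {len(matches)} dataset(s)"]
    for m in matches:
        lines.append(
            f"  - {m.dataset} (snapshot {m.snapshot}), "
            f"associated model: {m.model}, source: {m.source_url}"
        )
    return "\n".join(lines)
```

For legal or research use, the key design choice is recording the snapshot identifier per match, so a claim can name exactly which dataset version contained the image.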
batch-image-dataset-scanning
Medium confidence: Accepts multiple images (via file upload, URL list, or API) and processes them in parallel or queued batches against the training dataset indices. The system likely implements job queuing, rate limiting, and asynchronous processing to handle multiple images without blocking, returning results as a consolidated report or per-image breakdown. This enables artists or platforms to audit large collections of images efficiently.
Implements batch processing with job queuing and asynchronous result delivery to handle multiple image scans efficiently, rather than requiring sequential single-image uploads
More scalable than manual per-image uploads for large portfolios, and more practical than building custom batch infrastructure for individual artists or small platforms
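The queued, bounded-concurrency batch flow described above can be sketched with a semaphore-gated worker pool. The scan function is a hypothetical stand-in for the real index query, and the concurrency limit plays the role of rate limiting:

```python
import asyncio

async def scan_one(image_id: str) -> dict:
    """Hypothetical single-image scan; a real system would hit the index."""
    await asyncio.sleep(0)  # stand-in for I/O-bound index lookup
    return {"image": image_id, "matched": image_id.endswith(".jpg")}

async def scan_batch(image_ids: list[str], concurrency: int = 4) -> list[dict]:
    """Run scans through a bounded pool, preserving input order in results."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(image_id: str) -> dict:
        async with sem:  # at most `concurrency` scans in flight at once
            return await scan_one(image_id)

    return await asyncio.gather(*(bounded(i) for i in image_ids))

results = asyncio.run(scan_batch(["a.jpg", "b.png"]))
```

A production version would persist the job queue and deliver results via callback or polling, but the ordering guarantee from `gather` is what makes a clean per-image breakdown easy to return.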
training-dataset-index-maintenance
Medium confidence: Periodically crawls, ingests, and updates indices of public training datasets (LAION snapshots, Stable Diffusion dataset releases, etc.) to keep the searchable corpus current. This likely involves automated pipelines that detect new dataset releases, download metadata, compute perceptual hashes for new images, and update the search indices. The system must handle versioning to track which dataset snapshot was used for each match.
Maintains versioned indices of multiple training dataset snapshots with automated update pipelines, enabling users to understand which dataset version was queried and track how training data evolves over time
More transparent than static indices because it tracks versions and update dates, and more comprehensive than relying on individual model documentation which may lag behind actual training data releases
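The versioning requirement above can be sketched as a registry that appends each ingested snapshot rather than overwriting the previous one. The snapshot identifiers and registry shape are illustrative assumptions:

```python
from datetime import date

# dataset name -> ordered list of ingested snapshot records
INDEX_VERSIONS: dict[str, list[dict]] = {}

def ingest_snapshot(dataset: str, snapshot_id: str, hashes: set[int]) -> None:
    """Append a new dataset snapshot, keeping earlier versions queryable."""
    INDEX_VERSIONS.setdefault(dataset, []).append({
        "snapshot": snapshot_id,
        "ingested": date.today().isoformat(),
        "hashes": hashes,
    })

def current_snapshot(dataset: str) -> str:
    """The snapshot a fresh query would run against (latest ingested)."""
    return INDEX_VERSIONS[dataset][-1]["snapshot"]

# Illustrative snapshot ids, not real release names:
ingest_snapshot("LAION", "laion-2b-en-2022-03", {0xA1, 0xB2})
ingest_snapshot("LAION", "laion-5b-2023-09", {0xA1, 0xB2, 0xC3})
```

Keeping old snapshots around is what lets a match report say "this image was in snapshot X, queried on date Y" instead of a claim that silently shifts as the index updates.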
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Have I Been Trained?, ranked by overlap. Discovered automatically through the match graph.
Qwen: Qwen3 VL 235B A22B Thinking
Qwen3-VL-235B-A22B Thinking is a multimodal model that unifies strong text generation with visual understanding across images and video. The Thinking model is optimized for multimodal reasoning in STEM and math....
Qwen: Qwen VL Max
Qwen VL Max is a visual understanding model with a 7,500-token context length. It excels in delivering optimal performance across a broad spectrum of complex tasks.
ShareGPT4V
1.2M image-text pairs with GPT-4V captions.
LLaVA-Instruct 150K
150K visual instruction examples for multimodal model training.
MS COCO (Common Objects in Context)
330K images with object detection, segmentation, and captions.
Best For
- ✓ artists and photographers concerned about unauthorized use in AI training
- ✓ legal teams investigating copyright violations in generative AI
- ✓ content creators wanting to audit their digital footprint across ML datasets
- ✓ artists wanting a one-stop verification tool across all major models
- ✓ legal/compliance teams needing comprehensive training data audits
- ✓ platforms building content moderation features around training data transparency
- ✓ artists verifying their work against training datasets with tolerance for compression artifacts
- ✓ copyright investigators needing to match images despite minor modifications
Known Limitations
- ⚠ Only detects images that were actually included in indexed training snapshots; cannot detect images used in private or proprietary training runs
- ⚠ Matching accuracy depends on image quality and whether the exact image or only similar variants were in the training data
- ⚠ Dataset indices are static snapshots and may not reflect real-time training data collection
- ⚠ Cannot distinguish between legitimate licensed use and unauthorized scraping
- ⚠ Coverage is limited to publicly documented or accessible training datasets; proprietary models with closed training data cannot be queried
- ⚠ Index freshness varies by dataset; some may be months or years old depending on update frequency
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Check if your image has been used to train popular AI art models.