Inter-Annotator Agreement Measurement

1

Natural Questions · Dataset · 58/100

via “multi-annotator agreement and answer quality assessment”

307K real Google Search queries answered from Wikipedia.

Unique: Includes explicit inter-annotator agreement metrics for each question, so researchers can gauge benchmark reliability and filter questions by agreement level.

vs others: More transparent about annotation quality than benchmarks that hide disagreement, so researchers can make informed decisions about evaluation methodology.

2

OpenAssistant Conversations (OASST) · Dataset · 58/100

via “human quality rating aggregation with inter-annotator agreement metrics”

161K human-written messages in 35 languages with quality ratings.

Unique: Provides raw per-annotator ratings alongside aggregates, enabling downstream systems to compute custom agreement metrics and weight examples by confidence rather than using fixed aggregation. Most datasets only expose final scores.

vs others: Richer annotation metadata than single-rater datasets (e.g., Alpaca) or datasets with binary labels, allowing nuanced quality-based filtering and confidence-weighted training.
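The confidence weighting mentioned above can be sketched as follows. This is a minimal illustration, not OASST's own aggregation: the function name `aggregate_ratings`, the message IDs, and the assumption that each rating is a float in [0, 1] are all hypothetical.

```python
from statistics import mean, pstdev


def aggregate_ratings(ratings):
    """Return (mean rating, confidence weight) for one message.

    Assumes each rating is a float in [0, 1] (an assumption of this
    sketch, not a statement about the dataset's schema).
    """
    if not ratings:
        raise ValueError("need at least one rating")
    avg = mean(ratings)
    # The population standard deviation of values in [0, 1] is at most 0.5,
    # so dividing by 0.5 maps the spread onto [0, 1].
    spread = pstdev(ratings) / 0.5
    # High spread means low agreement, so the weight is the complement.
    return avg, 1.0 - spread


# Hypothetical per-annotator ratings, not real records from the dataset.
messages = {
    "msg_a": [0.75, 0.75, 0.75],  # annotators agree
    "msg_b": [0.0, 1.0],          # annotators split
}
weights = {msg_id: aggregate_ratings(r) for msg_id, r in messages.items()}
```

A training loop could multiply each example's loss by its weight, so that messages the annotators disagreed on contribute less than messages they rated consistently.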

3

Scale AI · Platform · 57/100

via “inter-annotator agreement measurement and conflict resolution”

Enterprise AI data labeling with managed annotation workforce.

Unique: Combines automatic agreement calculation with expert adjudication routing, creating a feedback loop in which low-agreement examples are escalated rather than accepted as-is.

vs others: More rigorous than platforms that accept single-pass annotations, because it treats agreement as a quality signal and routes conflicts to experts. Crowdsourcing platforms often accept a majority vote without expert review.
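The escalation loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Scale AI's API: the function `route`, the 0.75 threshold, and the result fields are all hypothetical.

```python
from collections import Counter


def route(labels, threshold=0.75):
    """Accept the majority label if enough annotators agree, else escalate.

    `threshold` is the minimum share of annotators that must back the
    majority label; 0.75 is an arbitrary value chosen for illustration.
    """
    if not labels:
        raise ValueError("need at least one label")
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    if agreement >= threshold:
        return {"status": "accepted", "label": label, "agreement": agreement}
    # Low agreement: withhold the label and hand the item to an expert.
    return {"status": "escalated", "label": None, "agreement": agreement}


clear_case = route(["cat", "cat", "cat", "dog"])
contested_case = route(["cat", "dog", "bird"])
```

Here `clear_case` is accepted with the label "cat", while `contested_case` is escalated because no label reaches the threshold.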

4

Doccano · Repository · 56/100

via “annotation quality monitoring with inter-annotator agreement metrics”

Open-source text annotation for NLP tasks.

Unique: Implements multiple IAA metrics (Cohen's Kappa, Fleiss' Kappa, Krippendorff's Alpha) via scikit-learn, computed asynchronously via Celery and cached in the database. Metrics are filterable by label, date, and annotator pair, enabling drill-down analysis of disagreement.

vs others: More comprehensive than Prodigy (which has no IAA support) but less sophisticated than specialized quality tools like Labelbox's quality metrics; better suited to teams that need standard IAA metrics without custom analysis.
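For readers unfamiliar with the metrics named above, here is a minimal sketch of Cohen's kappa (two annotators) and Fleiss' kappa (a fixed number of annotators per item), written from their textbook definitions. It is not Doccano's code; the function names and sample labels are hypothetical. For the two-annotator case, scikit-learn's `sklearn.metrics.cohen_kappa_score` computes the same quantity.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(items):
    """Fleiss' kappa; `items` holds one list of labels per item, same length each."""
    n_raters = len(items[0])
    if n_raters < 2 or any(len(labels) != n_raters for labels in items):
        raise ValueError("every item needs the same number (>= 2) of labels")
    totals = Counter()
    p_bar = 0.0
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        # Share of agreeing annotator pairs for this item.
        pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += pairs / (n_raters * (n_raters - 1))
    p_bar /= len(items)
    total_labels = len(items) * n_raters
    p_e = sum((t / total_labels) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # only one label was ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical labels for illustration.
pair_score = cohen_kappa(["pos", "pos", "neg", "neg"],
                         ["pos", "neg", "neg", "neg"])
group_score = fleiss_kappa([["a", "a", "b"], ["a", "b", "b"]])
```

Both scores are 1 for perfect agreement, near 0 when agreement is what chance would produce, and negative when annotators agree less often than chance.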

5

label-studio · Repository · 26/100

via “inter-annotator agreement measurement and quality control”

Label Studio annotation tool

Unique: Stores agreement scores in the database alongside annotations, enabling efficient filtering and sorting without recalculation; integrates with the Data Manager UI for visual exploration of agreement patterns.

vs others: More integrated than manual agreement calculation because metrics are computed automatically; simpler than external tools like MIAOU because agreement is built into the annotation workflow.

6

Datasaur · Product

via “inter-annotator agreement measurement”

7

Kili Technology · Product

via “inter-annotator agreement metrics”

8

Labelbox · Product

via “consensus scoring and inter-annotator agreement measurement”
