SodaCL domain-specific language parsing and compilation
Parses human-readable SodaCL YAML syntax into an abstract syntax tree (AST) that represents data quality checks, then compiles these checks into executable check objects. The parser uses a configuration-driven approach where SodaCL statements are tokenized, validated against a schema, and mapped to check type implementations. This enables non-technical users to define complex data quality rules without writing SQL directly.
Unique: Uses a layered parser architecture (SodaCLParser class) that separates tokenization, validation, and compilation phases, enabling extensible check type registration and custom check implementations without modifying the core parser logic
vs alternatives: More readable than raw SQL-based quality checks (like dbt tests) and more expressive than simple threshold-based tools, but less flexible than programmatic Python-based frameworks for complex multi-table logic
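Example: the parse step described above can be sketched as follows. This is a minimal illustration, not Soda's SodaCLParser. The MetricCheck node, the regular expression, and the parse_check / parse_checks helpers are hypothetical names, and the input is assumed to be a SodaCL document that a YAML loader has already turned into a dict.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricCheck:
    """Hypothetical AST node for one check line (not a Soda class)."""
    table: str
    metric: str
    column: Optional[str]
    operator: str
    threshold: float


# metric name, optional (column), comparison operator, numeric threshold
_CHECK_RE = re.compile(
    r"^\s*(?P<metric>[a-z_]+)"
    r"(?:\((?P<column>[A-Za-z_][A-Za-z0-9_]*)\))?"
    r"\s*(?P<op>>=|<=|!=|=|>|<)\s*"
    r"(?P<threshold>-?\d+(?:\.\d+)?)\s*$"
)


def parse_check(table: str, line: str) -> MetricCheck:
    """Tokenize and validate one check line, then build its AST node."""
    match = _CHECK_RE.match(line)
    if match is None:
        raise ValueError(f"unsupported check syntax: {line!r}")
    return MetricCheck(
        table=table,
        metric=match["metric"],
        column=match["column"],
        operator=match["op"],
        threshold=float(match["threshold"]),
    )


def parse_checks(document: dict) -> list:
    """Walk a loaded document of the form {'checks for <table>': [lines]}."""
    prefix = "checks for "
    checks = []
    for key, lines in document.items():
        if not key.startswith(prefix):
            raise ValueError(f"unexpected top-level key: {key!r}")
        table = key[len(prefix):].strip()
        checks.extend(parse_check(table, line) for line in lines)
    return checks
```

Calling parse_checks({"checks for dim_customer": ["row_count > 0", "missing_count(email) = 0"]}) yields two MetricCheck nodes that a later compilation phase could map to check implementations.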
multi-dialect SQL query generation and execution
Converts compiled SodaCL checks into dialect-specific SQL queries (PostgreSQL, Snowflake, BigQuery, Redshift, Spark, Athena) by routing through data source-specific adapter packages. Each adapter implements a QueryExecutor that translates generic check logic into optimized SQL for that database's syntax and functions, then executes the query and returns results as structured data. This abstraction enables the same check definition to run across heterogeneous data platforms.
Unique: Implements a data source adapter pattern where each database (PostgreSQL, Snowflake, BigQuery, Redshift, Spark, Athena) has a dedicated package extending a QueryExecutor base class, enabling dialect-specific optimizations and native function usage without modifying core check logic
vs alternatives: More flexible than single-dialect tools, and potentially more performant than generic SQL translators because adapters can use native database functions rather than lowest-common-denominator SQL
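Example: the adapter idea can be sketched with two dialects that differ in identifier quoting and regular-expression matching. The Dialect base class and its method names are hypothetical stand-ins for illustration, not Soda's QueryExecutor interface. The sketch also assumes a trusted pattern that contains no single quotes.

```python
from abc import ABC, abstractmethod


class Dialect(ABC):
    """Hypothetical adapter base class; each subclass supplies dialect syntax."""

    @abstractmethod
    def quote(self, identifier: str) -> str:
        """Quote a table or column name for this dialect."""

    @abstractmethod
    def regex_match(self, column_sql: str, pattern: str) -> str:
        """Return a boolean SQL expression testing column_sql against pattern."""

    def invalid_count_sql(self, table: str, column: str, pattern: str) -> str:
        # Generic check logic lives here; only the syntax hooks vary.
        col = self.quote(column)
        condition = self.regex_match(col, pattern)
        return f"SELECT COUNT(*) FROM {self.quote(table)} WHERE NOT ({condition})"


class PostgresDialect(Dialect):
    def quote(self, identifier: str) -> str:
        # PostgreSQL quotes identifiers with double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    def regex_match(self, column_sql: str, pattern: str) -> str:
        # PostgreSQL uses the ~ operator for POSIX regex matching.
        return f"{column_sql} ~ '{pattern}'"


class BigQueryDialect(Dialect):
    def quote(self, identifier: str) -> str:
        # BigQuery quotes identifiers with backticks.
        return f"`{identifier}`"

    def regex_match(self, column_sql: str, pattern: str) -> str:
        # BigQuery exposes regex matching through REGEXP_CONTAINS.
        return f"REGEXP_CONTAINS({column_sql}, r'{pattern}')"
```

The same call, invalid_count_sql("orders", "zip", "^[0-9]{5}$"), produces different SQL text per dialect while the check definition stays unchanged.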
Soda Cloud integration with centralized quality monitoring
Integrates with Soda Cloud (SaaS platform) to upload scan results, enable centralized quality dashboards, configure alerts, and manage quality governance policies. The integration uses API credentials to authenticate with Soda Cloud, uploads scan results and check definitions, and enables cross-organization quality monitoring. Supports both push-based result uploads and pull-based scan scheduling from Soda Cloud.
Unique: Implements cloud integration via API-based result uploads and pull-based scan scheduling, enabling centralized quality monitoring without requiring on-premise infrastructure or custom integration code
vs alternatives: More comprehensive than standalone Soda Core because it adds centralized dashboards, alerts, and governance; more expensive than open-source alternatives because it requires a SaaS subscription
CLI-based scan execution with variable substitution and output formatting
Provides a command-line interface for executing scans with the `soda scan` command, supporting variable substitution, output format selection, and configuration overrides. The CLI parses command-line arguments, substitutes variables into SodaCL configurations, executes scans, and formats results as JSON, YAML, or text. Supports integration with CI/CD pipelines via exit codes and structured output formats.
Unique: Implements a CLI interface with variable substitution and multiple output formats, enabling easy integration into CI/CD pipelines and orchestration platforms without requiring custom wrapper scripts
vs alternatives: More user-friendly than the programmatic Python API because it doesn't require code; less flexible than the Python API because it doesn't support complex logic or conditional execution
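Example: variable substitution can be sketched as a small text transform applied before parsing. This assumes variables arrive as NAME=value pairs and are referenced as ${NAME} in the check text. The helpers parse_variables and substitute are hypothetical and do not reproduce the CLI's actual implementation.

```python
import re

# Matches ${NAME} references inside check text.
_VAR_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_variables(pairs):
    """Turn ['NAME=value', ...] into a dict, rejecting malformed pairs."""
    variables = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=value, got {pair!r}")
        variables[name] = value
    return variables


def substitute(text, variables):
    """Replace every ${NAME} in text; fail loudly on undefined names."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return variables[name]

    return _VAR_RE.sub(replace, text)
```

Failing on an undefined variable, rather than leaving the placeholder in place, is a design choice of this sketch: in a CI/CD pipeline a silent placeholder would otherwise surface later as a confusing parse error.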
custom check extension framework with pluggable check types
Enables extension of Soda with custom check types by implementing a Check base class and registering custom check implementations. The framework allows users to define custom metrics, validation logic, and result evaluation without modifying core Soda code. Custom checks are registered in the check type registry and can be used in SodaCL alongside built-in check types, enabling domain-specific quality checks tailored to specific use cases.
Unique: Implements a Check base class that enables custom check implementations to be registered in the check type registry, allowing domain-specific checks to be defined in Python and used in SodaCL without modifying core framework code
vs alternatives: More extensible than closed-source quality tools because it exposes the Check class API; requires more development effort than configuration-only tools because custom checks must be implemented in Python
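Example: a registry of pluggable check types can be sketched with a decorator. All names here (CHECK_REGISTRY, register_check, build_check, the evaluate method, and the NoNegativeValues check) are hypothetical and chosen for illustration; Soda's actual Check base class and registration mechanism may differ.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a check type name to its implementing class.
CHECK_REGISTRY = {}


def register_check(name):
    """Class decorator that records a check implementation under a name."""
    def decorator(cls):
        if name in CHECK_REGISTRY:
            raise ValueError(f"check type already registered: {name}")
        CHECK_REGISTRY[name] = cls
        return cls
    return decorator


class Check(ABC):
    """Hypothetical base class: one column in, pass/fail out."""

    def __init__(self, column):
        self.column = column

    @abstractmethod
    def evaluate(self, rows) -> bool:
        """Return True when the rows pass the check."""


@register_check("no_negative_values")
class NoNegativeValues(Check):
    """Domain-specific check: every value in the column is >= 0."""

    def evaluate(self, rows) -> bool:
        return all(row[self.column] >= 0 for row in rows)


def build_check(name, column):
    """Look up a registered check type by name and instantiate it."""
    try:
        cls = CHECK_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown check type: {name}") from None
    return cls(column)
```

New check types are added by defining and decorating a class; the lookup in build_check never needs to change.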
metric-based data quality checks with threshold evaluation
Executes metric checks that compute aggregate statistics (row count, missing values, duplicate count, valid values) over entire tables or column subsets, then evaluates results against user-defined thresholds (exact values, ranges, or percentage-based). The metric check system generates SQL aggregation queries, caches results, and compares them to threshold configurations to produce pass/fail outcomes. Supports both simple numeric thresholds and complex multi-condition rules.
Unique: Implements a metric registry pattern where each metric type (missing_count, duplicate_count, row_count, valid_count) is a pluggable check class that generates dialect-specific SQL aggregations and evaluates results against configurable thresholds, enabling extensibility without modifying core evaluation logic
vs alternatives: More comprehensive than simple row count checks because it includes missing value detection, duplicate detection, and validity checks; simpler than statistical anomaly detection tools because it uses fixed thresholds rather than learned baselines
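Example: the aggregate-then-compare flow can be sketched against an in-memory SQLite database. The SQL templates, the run_metric_check helper, and the reading of duplicate_count as "distinct values that occur more than once" are assumptions made for this illustration, not Soda's generated queries. Table and column names are interpolated directly, so the sketch assumes trusted identifiers.

```python
import operator
import sqlite3

# Comparison operators allowed in a threshold expression.
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "!=": operator.ne,
}

# Hypothetical SQL templates, one aggregation query per metric.
METRIC_SQL = {
    "row_count": "SELECT COUNT(*) FROM {table}",
    "missing_count": "SELECT COUNT(*) FROM {table} WHERE {column} IS NULL",
    "duplicate_count": (
        "SELECT COUNT(*) FROM ("
        "SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        "GROUP BY {column} HAVING COUNT(*) > 1) AS d"
    ),
}


def run_metric_check(conn, table, metric, op, threshold, column=None):
    """Compute one metric and compare it to a threshold.

    Returns (metric_value, passed).
    """
    sql = METRIC_SQL[metric].format(table=table, column=column)
    value = conn.execute(sql).fetchone()[0]
    return value, OPERATORS[op](value, threshold)
```

With a connection from sqlite3.connect(":memory:") and a populated table, run_metric_check(conn, "customers", "missing_count", "=", 0, "email") returns the computed count together with the pass/fail outcome.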
distribution-based data quality checks with reference profiles
Captures and validates the statistical distribution of column values by computing frequency distributions, quantiles, and value ranges, then comparing current distributions against stored reference profiles (DRO files). The system generates SQL queries to compute distribution statistics, stores them in YAML-based distribution reference objects, and detects distribution drift when current values deviate from historical baselines. Supports both automatic reference generation and manual threshold configuration.
Unique: Implements a distribution reference object (DRO) pattern where statistical profiles are persisted as YAML files that can be version-controlled and updated via the `soda update-dro` CLI command, enabling reproducible distribution-based quality checks without requiring external reference databases
vs alternatives: More sophisticated than simple value list validation because it captures statistical properties and detects drift; lighter-weight than full data profiling tools because it focuses on specific columns and stores profiles in version-controllable YAML rather than external databases
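Example: comparing a current distribution against a stored reference can be sketched with the population stability index (PSI) over categorical frequencies. PSI is used here as one possible drift score, not necessarily the statistic a given check is configured with. The frequencies and psi helpers, the floor value, and any drift threshold applied to the score are assumptions of this sketch.

```python
import math
from collections import Counter


def frequencies(values):
    """Relative frequency of each distinct value in a column sample."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from no values")
    return {key: count / total for key, count in counts.items()}


def psi(reference, current, floor=1e-6):
    """Population stability index between two frequency dicts.

    Categories missing on one side are floored to a small probability
    so the logarithm stays defined.
    """
    score = 0.0
    for key in set(reference) | set(current):
        expected = max(reference.get(key, 0.0), floor)
        actual = max(current.get(key, 0.0), floor)
        score += (actual - expected) * math.log(actual / expected)
    return score
```

In this sketch the reference dict plays the role of the persisted profile: it would be computed once, stored, and compared against frequencies(...) of each new sample, with drift flagged when the score exceeds a chosen threshold.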
anomaly detection using time-series statistical modeling
Detects anomalies in numeric metrics by fitting time-series models (Prophet, originally developed at Facebook) to historical metric values and identifying deviations from expected trends. The soda-scientific package extends core Soda with anomaly check types that compute metrics over time windows, train Prophet models on historical data, and flag values that fall outside predicted confidence intervals. This enables unsupervised anomaly detection without manual threshold configuration.
Unique: Integrates Facebook's Prophet time-series forecasting library as an optional extension (soda-scientific) that learns from historical metric data to detect anomalies without manual threshold configuration, enabling adaptive quality monitoring that adjusts to seasonal patterns and trends
vs alternatives: More sophisticated than fixed-threshold checks because it learns from historical data and handles seasonality; less flexible than custom ML models because it's limited to Prophet's capabilities and requires separate package installation
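Example: the "flag values outside a predicted interval" step can be sketched without Prophet. The code below deliberately swaps in a much simpler technique, a mean plus or minus z standard deviations band over the metric history, so it does not model trend or seasonality the way Prophet does. The is_anomaly helper and the default z value are hypothetical.

```python
import statistics


def is_anomaly(history, value, z=3.0):
    """Flag value if it falls outside mean +/- z * stdev of the history.

    A simplified stand-in for a forecast interval: it ignores trend and
    seasonality. Returns (flagged, (lower, upper)).
    """
    if len(history) < 2:
        raise ValueError("need at least two historical measurements")
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)
    lower, upper = mean - z * spread, mean + z * spread
    return not (lower <= value <= upper), (lower, upper)
```

As with the learned approach described above, no threshold is configured by hand: the acceptable band is derived from past measurements and widens or narrows as the history changes.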