<p align="center">
<img src="assets/ExDataCheck.svg" alt="ExDataCheck" width="150"/>
</p>
# ExDataCheck
**Data Validation and Quality Library for ML Pipelines**
[Elixir](https://elixir-lang.org)
[Erlang](https://www.erlang.org)
[License](https://github.com/North-Shore-AI/ExDataCheck/blob/main/LICENSE)
[Documentation](https://hexdocs.pm/ex_data_check)
---
A comprehensive data validation and quality assessment library for Elixir, specifically designed for machine learning workflows. ExDataCheck provides Great Expectations-style validation, data profiling, schema validation, and quality metrics to ensure your ML pipelines work with high-quality data.
## Features
- **Expectations-Based Validation**: Define declarative expectations about your data (inspired by Great Expectations)
- **Data Profiling**: Automatic statistical profiling and data characterization
- **Schema Validation**: Type checking, structure validation, and schema enforcement
- **Quality Metrics**: Comprehensive data quality scoring and reporting
- **ML-Specific Checks**: Feature distributions, data drift detection, label imbalance
- **Pipeline Integration**: Seamlessly integrate into ETL and ML pipelines
- **Streaming Support**: Validate data in real time as it flows through your pipeline
- **Rich Reporting**: Generate detailed validation reports in multiple formats
## Design Principles
1. **Declarative Expectations**: Express data requirements as clear, testable expectations
2. **Fail Fast**: Catch data quality issues early in the pipeline
3. **Comprehensive Metrics**: Track data quality across multiple dimensions
4. **ML-Aware**: Built specifically for machine learning use cases
5. **Production Ready**: Designed for high-throughput production environments
6. **Observable**: Rich logging and reporting for data quality monitoring
## Installation
Add `ex_data_check` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:ex_data_check, "~> 0.1.0"}
  ]
end
```
Or install from GitHub:
```elixir
def deps do
  [
    {:ex_data_check, github: "North-Shore-AI/ExDataCheck"}
  ]
end
```
## Quick Start
### Basic Expectations
```elixir
# Define expectations for a dataset.
# Assumption: the expect_* helpers are in scope, e.g. via `import ExDataCheck`.
dataset = [
  %{age: 25, income: 50000, score: 0.85},
  %{age: 32, income: 75000, score: 0.92},
  %{age: 28, income: 62000, score: 0.78}
]

# Helper names follow the Expectations Reference below.
expectations = [
  expect_column_values_to_be_between(:age, 18, 100),
  expect_column_to_be_of_type(:income, :integer),
  expect_column_values_to_be_between(:score, 0.0, 1.0),
  expect_column_to_exist(:age),
  expect_no_missing_values(:income)
]

result = ExDataCheck.validate(dataset, expectations)
# => %ExDataCheck.ValidationResult{
#      success: true,
#      expectations_met: 5,
#      expectations_failed: 0,
#      details: [...]
#    }
```
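To make the expectation model concrete, here is a minimal sketch in plain Elixir of how a range expectation could be evaluated over a list of maps. The `ExpectationSketch` module, its `between/3` and `run/2` functions, and the shape of the `details` entries are hypothetical illustrations, not ExDataCheck's implementation; only the top-level result keys mirror the struct shown above.

```elixir
defmodule ExpectationSketch do
  # Hypothetical helper: an expectation is a {description, check_fun} pair.
  def between(column, min, max) do
    check = fn row ->
      value = Map.get(row, column)
      is_number(value) and value >= min and value <= max
    end

    {"#{column} between #{min} and #{max}", check}
  end

  # Runs every expectation against every row and tallies the outcome.
  def run(dataset, expectations) do
    details =
      Enum.map(expectations, fn {description, check} ->
        failing = Enum.reject(dataset, check)
        %{expectation: description, success: failing == [], failing_rows: length(failing)}
      end)

    failed = Enum.count(details, &(not &1.success))

    %{
      success: failed == 0,
      expectations_met: length(details) - failed,
      expectations_failed: failed,
      details: details
    }
  end
end

dataset = [
  %{age: 25, score: 0.85},
  %{age: 32, score: 0.92},
  %{age: 17, score: 0.78}
]

result =
  ExpectationSketch.run(dataset, [
    ExpectationSketch.between(:age, 18, 100),
    ExpectationSketch.between(:score, 0.0, 1.0)
  ])

IO.inspect(result)
```

Here the third row has `age: 17`, so the age expectation fails while the score expectation passes.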
### Data Profiling
```elixir
# Profile your dataset to understand its characteristics
profile = ExDataCheck.profile(dataset)
# => %ExDataCheck.Profile{
#      row_count: 3,
#      column_count: 3,
#      columns: %{
#        age: %{type: :integer, min: 25, max: 32, mean: 28.33, ...},
#        income: %{type: :integer, min: 50000, max: 75000, ...},
#        score: %{type: :float, min: 0.78, max: 0.92, ...}
#      },
#      missing_values: %{},
#      quality_score: 0.98
#    }
```
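The per-column figures in a profile are ordinary summary statistics. As a rough sketch (the `ProfileSketch` module and its `column_stats/2` function are hypothetical, and this is not how ExDataCheck computes its profile), they can be derived from a list of maps like this:

```elixir
defmodule ProfileSketch do
  # Hypothetical helper: summarizes one numeric column of a list of maps.
  def column_stats(dataset, column) do
    values =
      dataset
      |> Enum.map(&Map.get(&1, column))
      |> Enum.reject(&is_nil/1)

    count = length(values)

    %{
      count: count,
      missing: length(dataset) - count,
      # The zero-arity fallback avoids raising on an empty column.
      min: Enum.min(values, fn -> nil end),
      max: Enum.max(values, fn -> nil end),
      mean: if(count > 0, do: Enum.sum(values) / count, else: nil)
    }
  end
end

dataset = [
  %{age: 25, income: 50000},
  %{age: 32, income: 75000},
  %{age: 28, income: nil}
]

IO.inspect(ProfileSketch.column_stats(dataset, :age))
IO.inspect(ProfileSketch.column_stats(dataset, :income))
```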
### Schema Validation
```elixir
# Define and enforce schemas
schema = ExDataCheck.Schema.new([
  {:age, :integer, required: true, min: 0, max: 150},
  {:income, :integer, required: true, min: 0},
  {:score, :float, required: true, min: 0.0, max: 1.0},
  {:name, :string, required: false}
])

{:ok, validated_data} = ExDataCheck.validate_schema(dataset, schema)
```
### Quality Metrics
```elixir
# Calculate comprehensive quality metrics
metrics = ExDataCheck.quality_metrics(dataset)
# => %ExDataCheck.QualityMetrics{
#      completeness: 1.0,    # No missing values
#      validity: 0.98,       # 98% of values pass constraints
#      consistency: 0.95,    # Cross-column consistency
#      accuracy: 0.92,       # Estimated accuracy (if ground truth available)
#      timeliness: 1.0,      # Data freshness
#      overall_score: 0.97
#    }
```
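Of these dimensions, completeness is the simplest to state precisely. The sketch below assumes completeness means the fraction of expected cells that are present and non-nil; the `CompletenessSketch` module is hypothetical and ExDataCheck's own definition may differ.

```elixir
defmodule CompletenessSketch do
  # Fraction of expected cells (rows x columns) that are present and non-nil.
  def completeness([], _columns), do: 1.0
  def completeness(_dataset, []), do: 1.0

  def completeness(dataset, columns) do
    total = length(dataset) * length(columns)

    present =
      length(for row <- dataset, column <- columns, Map.get(row, column) != nil, do: :ok)

    present / total
  end
end

dataset = [
  %{age: 25, income: 50000},
  %{age: nil, income: 75000},
  %{age: 28},
  %{age: 41, income: 62000}
]

IO.inspect(CompletenessSketch.completeness(dataset, [:age, :income]))
```

Six of the eight expected cells are populated here, so a missing key and an explicit `nil` both count against the score.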
## Expectations Reference
### Value Expectations
```elixir
# Column values must be between min and max
expect_column_values_to_be_between(:age, 0, 120)
# Column values must be in a set
expect_column_values_to_be_in_set(:country, ["US", "UK", "CA"])
# Column values must match a regex
expect_column_values_to_match_regex(:email, ~r/@/)
# Column values must not be null
expect_column_values_to_not_be_null(:user_id)
# Column values must be unique
expect_column_values_to_be_unique(:transaction_id)
```
### Statistical Expectations
```elixir
# Column mean must fall within a range
expect_column_mean_to_be_between(:age, 25, 35)
# Column standard deviation
expect_column_stdev_to_be_between(:score, 0.1, 0.3)
# Column median
expect_column_median_to_be_between(:income, 40000, 60000)
# Percentile checks
expect_column_quantile_to_be(:age, 0.95, 65)
```
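For reference, the statistics behind these checks can be computed in a few lines of plain Elixir. This is a sketch with a hypothetical `StatsSketch` module; it assumes the sample standard deviation (n - 1 denominator) and inclusive bounds, neither of which is confirmed by the library's documentation above.

```elixir
defmodule StatsSketch do
  def mean(values), do: Enum.sum(values) / length(values)

  # Sample standard deviation (n - 1 denominator), an assumption of this sketch.
  def stdev(values) do
    m = mean(values)
    sum_sq = values |> Enum.map(fn v -> (v - m) * (v - m) end) |> Enum.sum()
    :math.sqrt(sum_sq / (length(values) - 1))
  end

  # Middle value, or the average of the two middle values for even-sized input.
  def median(values) do
    sorted = Enum.sort(values)
    n = length(sorted)
    mid = div(n, 2)

    if rem(n, 2) == 1 do
      Enum.at(sorted, mid)
    else
      (Enum.at(sorted, mid - 1) + Enum.at(sorted, mid)) / 2
    end
  end

  # True when the statistic lies in the closed interval [min, max].
  def between?(value, min, max), do: value >= min and value <= max
end

ages = [25, 32, 28, 41, 36]

IO.inspect(StatsSketch.between?(StatsSketch.mean(ages), 25, 35))
IO.inspect(StatsSketch.median(ages))
IO.inspect(StatsSketch.stdev(ages))
```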
### ML-Specific Expectations
```elixir
# Feature distribution checks
expect_feature_distribution(:age, :normal, mean: 30, stdev: 10)
# Label balance
expect_label_balance(:class, min_ratio: 0.3)
# Feature correlation
expect_feature_correlation(:feature_a, :feature_b, max: 0.9)
# Data drift detection
expect_no_data_drift(:features, reference_distribution)
```
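The label balance check can be pictured as a frequency count. The sketch below assumes `min_ratio` is compared against the share of rows in the rarest class; that interpretation and the `LabelBalanceSketch` module are assumptions made for illustration, not documented ExDataCheck behavior.

```elixir
defmodule LabelBalanceSketch do
  # Share of rows that belong to the rarest class (assumed meaning of min_ratio).
  def minority_share(dataset, label_column) do
    counts =
      dataset
      |> Enum.frequencies_by(&Map.get(&1, label_column))
      |> Map.values()

    Enum.min(counts) / Enum.sum(counts)
  end
end

dataset = [
  %{class: :spam},
  %{class: :ham},
  %{class: :ham},
  %{class: :ham}
]

share = LabelBalanceSketch.minority_share(dataset, :class)
IO.inspect(share)
IO.inspect(share >= 0.3)
```

With one `:spam` row out of four, the minority share falls below a `min_ratio` of 0.3, so under this interpretation the expectation would fail.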
### Schema Expectations
```elixir
# Column must exist
expect_column_to_exist(:user_id)
# Column type check
expect_column_to_be_of_type(:age, :integer)
# Number of columns
expect_column_count_to_equal(10)
# Table row count
expect_table_row_count_to_be_between(1000, 10000)
```
## Data Profiling
ExDataCheck provides comprehensive data profiling capabilities:
```elixir
# Generate a full profile
profile = ExDataCheck.profile(dataset, detailed: true)
# Profile includes:
# - Column types and cardinality
# - Statistical summaries (min, max, mean, median, stdev)
# - Missing value analysis
# - Distribution analysis
# - Correlation matrix
# - Outlier detection
# - Data quality score
# Export profile to various formats
ExDataCheck.Profile.to_json(profile)
ExDataCheck.Profile.to_html(profile)
ExDataCheck.Profile.to_markdown(profile)
```
## Schema Validation
Define strict schemas for your data:
```elixir
schema = ExDataCheck.Schema.new([
  # Column name, type, options
  {:user_id, :integer, required: true, unique: true},
  {:email, :string, required: true, format: ~r/@/},
  {:age, :integer, required: true, min: 18, max: 100},
  {:score, :float, required: true, min: 0.0, max: 1.0},
  {:tags, {:list, :string}, required: false},
  {:metadata, :map, required: false}
])

# Validate entire dataset
case ExDataCheck.validate_schema(dataset, schema) do
  {:ok, validated_data} ->
    # All data passes schema validation
    process_data(validated_data)

  {:error, validation_errors} ->
    # Handle validation errors
    log_errors(validation_errors)
end
```
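As an illustration of what enforcing such a schema involves, the following sketch checks rows against `{column, type, opts}` tuples in plain Elixir. The `SchemaSketch` module, the error tuple shape, and the reason atoms are hypothetical; it covers only the `required`, `min`, and `max` options and a few scalar types.

```elixir
defmodule SchemaSketch do
  # Hypothetical type predicates for a handful of column types.
  defp type_ok?(value, :integer), do: is_integer(value)
  defp type_ok?(value, :float), do: is_float(value)
  defp type_ok?(value, :string), do: is_binary(value)
  defp type_ok?(value, :map), do: is_map(value)

  # Returns a list of {row_index, column, reason} tuples; empty means valid.
  def errors(dataset, schema) do
    for {row, index} <- Enum.with_index(dataset),
        {column, type, opts} <- schema,
        reason = check(Map.get(row, column), type, opts),
        reason != :ok do
      {index, column, reason}
    end
  end

  # A missing or nil value is only an error for required columns.
  defp check(nil, _type, opts) do
    if Keyword.get(opts, :required, false), do: :missing, else: :ok
  end

  defp check(value, type, opts) do
    cond do
      not type_ok?(value, type) -> :wrong_type
      is_number(value) and value < Keyword.get(opts, :min, value) -> :below_min
      is_number(value) and value > Keyword.get(opts, :max, value) -> :above_max
      true -> :ok
    end
  end
end

schema = [
  {:age, :integer, required: true, min: 0, max: 150},
  {:score, :float, required: true, min: 0.0, max: 1.0},
  {:name, :string, required: false}
]

dataset = [
  %{age: 25, score: 0.85, name: "Ada"},
  %{age: -3, score: 1.5},
  %{age: "40", score: nil}
]

IO.inspect(SchemaSketch.errors(dataset, schema))
```

Collecting every violation with its row index, rather than stopping at the first one, makes it easier to log or quarantine bad rows in the `{:error, validation_errors}` branch.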
## Pipeline Integration
Integrate ExDataCheck into your ML pipelines:
```elixir
defmodule MyMLPipeline do
  use ExDataCheck.Pipeline

  def run(data) do
    data
    |> validate_with([
      expect_column_to_exist(:features),
      expect_column_to_exist(:labels),
      expect_no_missing_values(:features),
      expect_label_balance(:labels, min_ratio: 0.2)
    ])
    |> profile(store: :pipeline_metrics)
    |> transform()
    |> validate_output([
      expect_column_count_to_equal(10),
      expect_table_row_count_to_be_between(100, 10000)
    ])
  end

  defp transform(validated_data) do
    # Your transformation logic
    validated_data
  end
end
```
## Quality Monitoring
Track data quality over time:
```elixir
# Initialize the quality monitor and add quality checks.
# Elixir data is immutable, so the piped result must be bound to `monitor`.
monitor =
  ExDataCheck.Monitor.new()
  |> ExDataCheck.Monitor.add_check(:completeness, threshold: 0.95)
  |> ExDataCheck.Monitor.add_check(:validity, threshold: 0.90)
  |> ExDataCheck.Monitor.add_check(:consistency, threshold: 0.85)

# Run checks on batches
result = ExDataCheck.Monitor.check(monitor, batch_data)

# Alert on quality degradation
if result.overall_score < 0.90 do
  alert_quality_issue(result)
end
```
## Data Drift Detection
Detect when your data distribution changes:
```elixir
# Establish baseline
baseline = ExDataCheck.Drift.create_baseline(training_data)
# Check for drift in production data
drift_result = ExDataCheck.Drift.detect(production_data, baseline)
# => %ExDataCheck.DriftResult{
#      drifted: true,
#      columns_drifted: [:age, :income],
#      drift_scores: %{age: 0.23, income: 0.45, score: 0.02},
#      method: :kolmogorov_smirnov
#    }

if drift_result.drifted do
  notify_team("Data drift detected in columns: #{inspect(drift_result.columns_drifted)}")
  trigger_retraining()
end
```
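The `:kolmogorov_smirnov` method refers to the two-sample Kolmogorov-Smirnov test, whose statistic is the largest gap between the empirical CDFs of two samples. The sketch below computes that statistic in plain Elixir with a hypothetical `KSSketch` module; whether the `drift_scores` above are this exact statistic, and what threshold marks a column as drifted, are not specified here.

```elixir
defmodule KSSketch do
  # Empirical CDF: fraction of sample values less than or equal to x.
  defp ecdf(sample, x), do: Enum.count(sample, &(&1 <= x)) / length(sample)

  # Two-sample KS statistic: the largest gap between the two empirical CDFs,
  # evaluated at every observed value (quadratic time, fine for a sketch).
  def statistic(reference, current) do
    (reference ++ current)
    |> Enum.uniq()
    |> Enum.map(fn x -> abs(ecdf(reference, x) - ecdf(current, x)) end)
    |> Enum.max()
  end
end

baseline_ages = [22, 25, 28, 31, 34, 37, 40, 43]
production_ages = [35, 38, 41, 44, 47, 50, 53, 56]

IO.inspect(KSSketch.statistic(baseline_ages, production_ages))
IO.inspect(KSSketch.statistic(baseline_ages, baseline_ages))
```

The statistic ranges from 0.0 for identical samples to 1.0 for samples that do not overlap at all, which makes it a convenient per-column drift score.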
## Reporting
Generate comprehensive validation reports:
```elixir
result = ExDataCheck.validate(dataset, expectations)
# Markdown report
markdown = ExDataCheck.Report.to_markdown(result)
File.write!("validation_report.md", markdown)
# HTML report
html = ExDataCheck.Report.to_html(result, template: :detailed)
File.write!("validation_report.html", html)
# JSON export
json = ExDataCheck.Report.to_json(result)
send_to_monitoring_system(json)
```
## Module Structure
```
lib/ex_data_check/
├── ex_data_check.ex # Main API
├── validation_result.ex # Result structs
├── expectation.ex # Expectation definitions
├── profile.ex # Data profiling
├── schema.ex # Schema validation
├── quality_metrics.ex # Quality scoring
├── pipeline.ex # Pipeline integration
├── monitor.ex # Quality monitoring
├── drift.ex # Drift detection
├── report.ex # Reporting/export
└── expectations/
    ├── value.ex          # Value-based expectations
    ├── statistical.ex    # Statistical expectations
    ├── schema.ex         # Schema expectations
    ├── ml.ex             # ML-specific expectations
    └── custom.ex         # Custom expectation framework
```
## Use Cases
### Data Pipeline Validation
```elixir
# Validate data as it enters your pipeline
defmodule DataIngestion do
  # Logger macros need an explicit require
  require Logger

  def process(raw_data) do
    expectations = [
      expect_column_to_exist(:timestamp),
      expect_column_to_exist(:user_id),
      expect_column_values_to_not_be_null(:user_id),
      expect_column_values_to_match_regex(:email, ~r/@/)
    ]

    case ExDataCheck.validate(raw_data, expectations) do
      %{success: true} ->
        {:ok, raw_data}

      %{success: false} = result ->
        Logger.error("Data validation failed: #{inspect(result.details)}")
        {:error, result}
    end
  end
end
```
### ML Feature Validation
```elixir
# Validate features before training
defmodule ModelTraining do
  def prepare_features(data) do
    expectations = [
      expect_no_missing_values(:features),
      expect_column_mean_to_be_between(:feature_1, 0.0, 1.0),
      expect_feature_correlation(:feature_1, :feature_2, max: 0.95),
      expect_label_balance(:target, min_ratio: 0.2),
      expect_table_row_count_to_be_between(1000, 1_000_000)
    ]

    ExDataCheck.validate!(data, expectations)
  end
end
```
### Production Monitoring
```elixir
# Monitor production data quality
defmodule ProductionMonitor do
  use GenServer

  def check_batch(batch) do
    profile = ExDataCheck.profile(batch)
    metrics = ExDataCheck.quality_metrics(batch)

    if metrics.overall_score < 0.85 do
      alert_ops_team(metrics)
    end

    store_metrics(profile, metrics)
  end
end
```
## Best Practices
### 1. Define Expectations Early
Define your data expectations during development:
```elixir
# Create expectation suites for different stages
training_expectations = [
  expect_no_missing_values(:features),
  expect_label_balance(:target, min_ratio: 0.3)
]

inference_expectations = [
  expect_column_to_exist(:features),
  expect_column_count_to_equal(10)
]
```
### 2. Use Profiling for Exploration
Profile your data to understand it before writing expectations:
```elixir
profile = ExDataCheck.profile(data, detailed: true)
IO.inspect(profile.columns, label: "Column Statistics")
```
### 3. Monitor Quality Trends
Track quality metrics over time:
```elixir
metrics = ExDataCheck.quality_metrics(batch)
store_in_timeseries_db(metrics, timestamp: DateTime.utc_now())
```
### 4. Handle Validation Failures Gracefully
```elixir
require Logger

case ExDataCheck.validate(data, expectations) do
  %{success: true} ->
    process_data(data)

  %{success: false, details: details} ->
    # Log failures (Logger.warn/1 is deprecated in favor of Logger.warning/1)
    Logger.warning("Validation failures: #{inspect(details)}")
    # Decide on action: reject, quarantine, or continue with warnings
    quarantine_data(data, details)
end
```
## Testing
Run the test suite:
```bash
mix test
```
Run specific tests:
```bash
mix test test/ex_data_check_test.exs
mix test test/expectations_test.exs
mix test test/profile_test.exs
```
## Performance
ExDataCheck is designed for high-throughput production use:
- Stream-based processing for large datasets
- Lazy evaluation of expectations
- Configurable sampling for profiling
- Minimal memory overhead
```elixir
# Process large datasets efficiently
large_dataset
|> Stream.chunk_every(1000)
|> Stream.map(&ExDataCheck.validate(&1, expectations))
|> Enum.reduce(%{}, &aggregate_results/2)
```
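The `aggregate_results/2` reducer in the snippet above is left to the caller. One possible shape is sketched below, assuming each per-chunk result exposes the `success`, `expectations_met`, and `expectations_failed` fields shown in the Quick Start. The `AggregateSketch` module, the `:chunks` counter, and the `fake_validate` stand-in for `ExDataCheck.validate/2` are hypothetical and exist only so the example runs on its own.

```elixir
defmodule AggregateSketch do
  # Folds one per-chunk result into a running summary map.
  def aggregate_results(result, acc) do
    acc
    |> Map.update(:chunks, 1, &(&1 + 1))
    |> Map.update(:expectations_met, result.expectations_met, &(&1 + result.expectations_met))
    |> Map.update(:expectations_failed, result.expectations_failed, &(&1 + result.expectations_failed))
    |> Map.update(:success, result.success, &(&1 and result.success))
  end
end

# Stand-in validator with a single expectation: every age is at least 18.
fake_validate = fn chunk ->
  ok? = Enum.all?(chunk, fn row -> row.age >= 18 end)

  %{
    success: ok?,
    expectations_met: if(ok?, do: 1, else: 0),
    expectations_failed: if(ok?, do: 0, else: 1)
  }
end

summary =
  1..10
  |> Stream.map(fn i -> %{age: 15 + i} end)
  |> Stream.chunk_every(4)
  |> Stream.map(fake_validate)
  |> Enum.reduce(%{}, &AggregateSketch.aggregate_results/2)

IO.inspect(summary)
```

Because the pipeline is built from `Stream` functions, only one chunk is held in memory at a time, and the summary map is the only state carried between chunks.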
## Roadmap
See [docs/roadmap.md](docs/roadmap.md) for the complete implementation roadmap.
### Phase 1: Core Validation (Current)
- Basic expectations framework
- Schema validation
- Simple profiling
### Phase 2: ML Features
- Data drift detection
- Feature correlation analysis
- Distribution comparison
### Phase 3: Advanced Monitoring
- Quality trend analysis
- Anomaly detection
- Real-time alerting
### Phase 4: Enterprise Features
- Multi-dataset validation
- Expectation versioning
- Advanced reporting
## Contributing
This is part of the North Shore AI Research Infrastructure. Contributions are welcome!
Please ensure all tests pass and code follows the project style guide.
## License
MIT License. See the [LICENSE](https://github.com/North-Shore-AI/ExDataCheck/blob/main/LICENSE) file for details.
## Related Projects
- [crucible_bench](https://github.com/North-Shore-AI/crucible_bench) - Statistical testing framework for AI research
- Great Expectations (Python) - Inspiration for expectations-based validation