# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.5.1] - 2025-12-28

### Changed
- Updated `crucible_framework` dependency from `~> 0.5.0` to `~> 0.5.2`

## [0.5.0] - 2025-12-27

### Changed
- **Normalized `describe/1` to canonical schema format** - The `describe/1` callback now returns a schema conforming to the Crucible Stage contract specification v1.0.
  - Changed `:stage` key to `:name` key (atom value)
  - Added `__schema_version__: "1.0.0"` marker for schema evolution
  - Added `required` field (list of required option keys)
  - Added `optional` field (list of optional option keys)
  - Added `types` field (type specifications for all options)
  - Added `defaults` field (default values for optional options)
  - Moved `metrics` list to `__extensions__.fairness.supported_metrics`
  - Added `data_sources` and `output_location` to extensions
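
As an illustration, the canonical shape listed above could look like the map below. Only the key names come from this changelog. Every value (the stage name, the option keys, the type specs, the defaults) and the placement of `data_sources` and `output_location` under `__extensions__.fairness` are hypothetical placeholders, not the actual output of `describe/1`.

```elixir
# Hypothetical sketch of a canonical describe/1 schema.
# Key names follow the list above; all values are illustrative placeholders.
schema = %{
  __schema_version__: "1.0.0",
  name: :fairness,
  required: [],
  optional: [:metrics, :threshold],
  types: %{metrics: {:list, :atom}, threshold: :float},
  defaults: %{metrics: [:demographic_parity], threshold: 0.1},
  __extensions__: %{
    fairness: %{
      supported_metrics: [:demographic_parity, :equalized_odds],
      data_sources: [:outputs],
      output_location: [:fairness]
    }
  }
}

# The metric list now lives under the extensions namespace.
IO.inspect(schema.__extensions__.fairness.supported_metrics)
```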

### Added
- **Conformance tests** - New `test/ex_fairness/conformance_test.exs` validates Stage contract compliance
- **Extended `describe/1` tests** - Comprehensive tests for canonical schema format

### Dependencies
- Updated `crucible_framework` dependency to `~> 0.5.0` (required for new `describe/1` contract)

## [0.4.0] - 2025-12-25

### Added
- **ExFairness.CrucibleStage** implementing `Crucible.Stage` for crucible_framework pipelines.
- Environment-specific config files (`config/*.exs`) to disable the CrucibleFramework repo by default.
- Documentation snapshot and gap analysis in `docs/20251225/`.
- Crucible stage integration test suite.

### Changed
- Refactored `ExFairness.evaluate/5` into smaller helpers for metric computation and violations.
- Improved chi-square computation structure in `ExFairness.Utils.StatisticalTests`.
- Updated project logo in `assets/ExFairness.svg`.
- Dependencies: added `crucible_framework`, updated `crucible_ir` to `~> 0.2.0`, and added `ecto_sql` and `postgrex`.

## [0.3.1] - 2025-11-26

### Added - CrucibleIR Integration

**Pipeline Stage:**
- **ExFairness.Stage** - Pipeline stage for Crucible framework integration
  - Seamless integration with CrucibleIR experiment orchestration
  - Accepts `CrucibleIR.Reliability.Fairness` configuration
  - Extracts predictions, labels, and sensitive attributes from model outputs
  - Supports all ExFairness metrics (demographic parity, equalized odds, equal opportunity, predictive parity, calibration)
  - Configurable threshold and fail-on-violation behavior
  - Comprehensive error handling and validation
  - Returns structured fairness results with violations tracking

**Main API Enhancement:**
- **ExFairness.evaluate/5** - New function for CrucibleIR config-based evaluation
  - Direct evaluation using `CrucibleIR.Reliability.Fairness` struct
  - Optional probabilities parameter for calibration metrics
  - Returns structured results with metrics, violations, and overall pass/fail status
  - Conditionally compiled (only when `crucible_ir` is available)

### Configuration Support

The integration supports the following `CrucibleIR.Reliability.Fairness` structure:

```elixir
%CrucibleIR.Reliability.Fairness{
  enabled: true,                    # Enable/disable fairness evaluation
  metrics: [:demographic_parity, :equalized_odds, :equal_opportunity, :predictive_parity, :calibration],
  group_by: :gender,                # Sensitive attribute field name
  threshold: 0.1,                   # Maximum acceptable disparity
  fail_on_violation: false,         # Whether to fail on violations
  options: %{}                      # Additional metric-specific options
}
```

### Testing

**New Test Suite:**
- ExFairness.StageTest - 15 comprehensive tests
  - Stage description validation
  - Disabled fairness pass-through
  - Single and multiple metric evaluation
  - Calibration with/without probabilities
  - Custom threshold configuration
  - Violation detection and reporting
  - Fail-on-violation behavior
  - Invalid context handling
  - Unknown metric handling
  - Custom options pass-through

**Test Coverage:** 174 (v0.3.0) → 189 (v0.3.1) = +15 tests (+8.6%)

### Dependencies

**New Dependencies:**
- `{:crucible_ir, "~> 0.1.1"}` - CrucibleIR configuration structs

### Documentation

**Updated Documentation:**
- mix.exs - Added ExFairness.Stage to Pipeline module group
- README.md - Added Stage usage examples and CrucibleIR integration guide
- ExFairness.Stage - Comprehensive module documentation with examples
- ExFairness.evaluate/5 - Full API documentation with examples

### Quality Metrics

- **Zero compilation warnings** (enforced via `warnings_as_errors`)
- **Zero Dialyzer errors** (type-safe)
- **All tests passing** (189 total tests)
- **Backward compatible:** All v0.3.0 code works without modification

### Integration Benefits

1. **Seamless Crucible Integration**: ExFairness can now be used as a pipeline stage in Crucible experiments
2. **Standardized Configuration**: Uses CrucibleIR configuration structs for consistency
3. **Experiment Orchestration**: Fairness evaluation can be automated as part of experiment pipelines
4. **Flexible Violation Handling**: Choose whether fairness violations should fail experiments
5. **Comprehensive Results**: Structured output suitable for experiment reporting

### Example Usage

```elixir
# Configure fairness evaluation
config = %CrucibleIR.Reliability.Fairness{
  enabled: true,
  metrics: [:demographic_parity, :equalized_odds],
  group_by: :gender,
  threshold: 0.1,
  fail_on_violation: false
}

# In a Crucible pipeline
context = %{
  experiment: %{reliability: %{fairness: config}},
  outputs: model_outputs  # List of maps with :prediction, :label, :gender
}

{:ok, result_context} = ExFairness.Stage.run(context)
# result_context.fairness contains fairness evaluation results

# Or use the direct evaluation API
result = ExFairness.evaluate(predictions, labels, sensitive_attr, config)
# Returns %{metrics: ..., overall_passes: ..., violations: ...}
```

### Breaking Changes

**None** - This is a backward compatible release. All existing code continues to work unchanged.

### Migration from v0.3.0

No code changes required. The new CrucibleIR integration is opt-in and does not affect existing usage patterns.

## [0.3.0] - 2025-11-25

### Added - Statistical Inference and Calibration

**Statistical Inference Framework:**
- **ExFairness.Utils.Bootstrap** - Bootstrap confidence interval computation
  - Stratified bootstrap to preserve group proportions
  - Parallel and sequential computation modes
  - Percentile and basic bootstrap methods
  - Configurable number of samples (default: 1000)
  - Reproducible with seed parameter
  - GPU-accelerated metric computation via Nx.Defn
- **ExFairness.Utils.StatisticalTests** - Hypothesis testing for fairness metrics
  - Two-proportion Z-test for demographic parity
  - Chi-square test for equalized odds
  - Permutation test for any fairness metric (non-parametric)
  - Cohen's h effect size computation
  - Configurable significance levels (default: α=0.05)
  - Statistical interpretation generation
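
To make the hypothesis-testing items above concrete, here is a minimal sketch of a two-proportion Z-test and Cohen's h using plain Elixir numbers. The module and function names are hypothetical and this is not the code in `ExFairness.Utils.StatisticalTests`; it assumes the pooled proportion is strictly between 0 and 1, otherwise the standard error is zero and the division raises.

```elixir
defmodule ZTestSketch do
  @moduledoc "Illustrative two-proportion Z-test and Cohen's h (hypothetical names)."

  # x1/n1 and x2/n2 are positive-prediction counts and group sizes.
  def two_proportion_z(x1, n1, x2, n2) do
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (x1 + x2) / (n1 + n2)
    se = :math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = :math.erfc(abs(z) / :math.sqrt(2))
    %{z: z, p_value: p_value}
  end

  # Cohen's h: difference of arcsine-transformed proportions.
  def cohens_h(p1, p2) do
    2 * :math.asin(:math.sqrt(p1)) - 2 * :math.asin(:math.sqrt(p2))
  end
end

# 60/100 positives in one group versus 40/100 in the other.
result = ZTestSketch.two_proportion_z(60, 100, 40, 100)
IO.inspect(result.p_value < 0.05)
```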

**Calibration Fairness Metric:**
- **ExFairness.Metrics.Calibration** - Calibration fairness for probability predictions
  - Expected Calibration Error (ECE) computation
  - Maximum Calibration Error (MCE) computation
  - Uniform and quantile binning strategies
  - Configurable number of bins (default: 10)
  - Group-wise calibration comparison
  - Validation for probability ranges [0, 1]
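
As a rough sketch of the ECE item above with uniform binning: each bin contributes its share of the samples times the gap between mean predicted probability and observed positive rate. The module name is hypothetical, inputs are plain lists rather than Nx tensors, and this is not the implementation in `ExFairness.Metrics.Calibration`.

```elixir
defmodule ECESketch do
  # Expected Calibration Error with uniform-width bins.
  # probs: predicted probabilities in [0, 1]; labels: 0/1 outcomes.
  def ece(probs, labels, n_bins \\ 10) do
    n = length(probs)

    probs
    |> Enum.zip(labels)
    # A probability of exactly 1.0 is clamped into the last bin.
    |> Enum.group_by(fn {p, _y} -> min(trunc(p * n_bins), n_bins - 1) end)
    |> Enum.map(fn {_bin, pairs} ->
      size = length(pairs)
      confidence = Enum.sum(Enum.map(pairs, &elem(&1, 0))) / size
      accuracy = Enum.sum(Enum.map(pairs, &elem(&1, 1))) / size
      size / n * abs(accuracy - confidence)
    end)
    |> Enum.sum()
  end
end

# Group-wise comparison: compute ECE per group and take the absolute difference.
ece_a = ECESketch.ece([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
ece_b = ECESketch.ece([1.0, 0.0], [1, 0])
IO.inspect(abs(ece_a - ece_b))
```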

### Enhanced - Existing Metrics

All existing fairness metrics can now optionally include:
- Bootstrap confidence intervals
- Statistical hypothesis tests
- Effect size measures
- Enhanced interpretations with statistical significance

Example usage:
```elixir
result = ExFairness.demographic_parity(predictions, sensitive_attr,
  include_ci: true,              # NEW: Bootstrap CI
  statistical_test: :z_test,     # NEW: Hypothesis testing
  bootstrap_samples: 1000,       # NEW: Configurable bootstrap
  confidence_level: 0.95         # NEW: CI level
)
# Returns enhanced result with :confidence_interval and :p_value
```
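
For intuition about what a stratified percentile bootstrap does, the sketch below resamples each group separately and reads the interval off the sorted differences in positive rate. The module, function, and option names are hypothetical, it works on plain lists, and it is not the code in `ExFairness.Utils.Bootstrap`.

```elixir
defmodule BootstrapSketch do
  # Percentile bootstrap CI for the difference in positive rates between two groups.
  # Resampling each group separately keeps group sizes fixed (stratified).
  def percentile_ci(group_a, group_b, opts \\ []) do
    n_samples = Keyword.get(opts, :n_samples, 1000)
    level = Keyword.get(opts, :confidence_level, 0.95)
    seed = Keyword.get(opts, :seed, 42)
    # Seeds the calling process's random generator so results are reproducible.
    :rand.seed(:exsss, {seed, seed, seed})

    diffs =
      1..n_samples
      |> Enum.map(fn _ -> rate(resample(group_a)) - rate(resample(group_b)) end)
      |> Enum.sort()

    alpha = 1 - level
    lower = Enum.at(diffs, floor(alpha / 2 * (n_samples - 1)))
    upper = Enum.at(diffs, ceil((1 - alpha / 2) * (n_samples - 1)))
    {lower, upper}
  end

  # Draw length(xs) elements with replacement.
  defp resample(xs) do
    tuple = List.to_tuple(xs)
    n = tuple_size(tuple)
    for _ <- 1..n, do: elem(tuple, :rand.uniform(n) - 1)
  end

  defp rate(xs), do: Enum.sum(xs) / length(xs)
end

{lower, upper} = BootstrapSketch.percentile_ci([1, 1, 0, 1, 0, 1], [0, 0, 1, 0, 0, 1], n_samples: 200)
IO.inspect({lower, upper})
```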

### Testing

**New Test Suites:**
- ExFairness.Utils.BootstrapTest - 11 comprehensive tests
  - Bootstrap interval validation
  - Stratified sampling verification
  - Method comparison (percentile vs basic)
  - Reproducibility testing
  - Parallel vs sequential equivalence
- ExFairness.Utils.StatisticalTestsTest - 14 comprehensive tests
  - Two-proportion Z-test validation
  - Chi-square test verification
  - Permutation test correctness
  - Effect size computation
  - P-value range validation
- ExFairness.Metrics.CalibrationTest - 15 comprehensive tests
  - ECE/MCE computation validation
  - Binning strategy verification
  - Probability range validation
  - Edge case handling

**Total Tests:** 134 (v0.2.0) → 174 (v0.3.0) = +40 tests (+30%)

### Documentation

**Design Documentation:**
- docs/20251125/enhancements_design.md (comprehensive 8-week implementation plan)
  - Statistical inference algorithms and formulas
  - Calibration metric mathematical foundation
  - Implementation roadmap and success criteria
  - API examples and migration guide
  - Research citations (10+ additional papers)

**Updated Documentation:**
- mix.exs - Version updated to 0.3.0, new modules added to docs
- README.md - Version badge and installation instructions updated
- CHANGELOG.md - Complete v0.3.0 release notes

### Quality Metrics

- **Zero compilation warnings** (enforced via `warnings_as_errors`)
- **Zero Dialyzer errors** (type-safe)
- **Test coverage target:** >90% (expected)
- **Backward compatible:** All v0.2.0 code works without modification

### Performance

- Bootstrap: ~1-2 seconds for 1000 samples on standard metrics
- Permutation test: ~2-3 seconds for 10,000 permutations
- Parallel bootstrap: 4-8x speedup on multi-core systems
- Calibration: <100ms for typical datasets

### Research Foundations

**New Academic Citations:**
- Efron, B., & Tibshirani, R. J. (1994). "An introduction to the bootstrap." CRC press.
- Davison, A. C., & Hinkley, D. V. (1997). "Bootstrap methods and their application."
- Good, P. (2013). "Permutation tests: A practical guide to resampling methods."
- Agresti, A. (2018). "Statistical methods for the social sciences."
- Cohen, J. (1988). "Statistical power analysis for the behavioral sciences."
- Pleiss, G., et al. (2017). "On fairness and calibration." NeurIPS.
- Guo, C., et al. (2017). "On calibration of modern neural networks." ICML.

### Breaking Changes

**None** - This is a backward compatible release. All existing code continues to work unchanged.

### Migration from v0.2.0

No code changes required. All new features are opt-in via additional parameters.

See docs/20251125/enhancements_design.md for detailed migration examples.

## [0.2.0] - 2025-10-20

### Added - Comprehensive Technical Documentation
- **future_directions.md (1,941 lines)** - Complete roadmap to v1.0.0
  - Detailed specifications for statistical inference
  - Calibration metric with complete algorithm
  - Intersectional analysis implementation plan
  - Threshold optimization algorithm
  - 6-month development timeline
  - 12+ additional research citations
- **implementation_report.md (1,288 lines)** - Technical implementation details
  - Module-by-module analysis of all 14 modules
  - Algorithm documentation with pseudocode
  - Design decisions and rationale
  - Performance characteristics
  - Code statistics and metrics
- **testing_and_qa_strategy.md (1,220 lines)** - QA methodology
  - TDD philosophy and evidence
  - Complete test coverage matrix (134 tests)
  - Edge case testing strategy
  - Future testing enhancements (property testing, integration testing)
  - Quality gates and CI/CD specifications

### Enhanced - README.md
- Expanded from ~660 to 1,437 lines (+118%)
- Added **Mathematical Foundations** section (200+ lines)
  - Complete mathematical definitions for all 4 metrics
  - Formal probability notation
  - Disparity measures
  - Comprehensive citations with DOI numbers
- Added **Theoretical Background** section (300+ lines)
  - Types of fairness (group, individual, causal)
  - Measurement problem discussion
  - Impossibility theorem with proof intuition
  - Fairness-accuracy tradeoff analysis
- Added **Advanced Usage** section (200+ lines)
  - Axon integration example (neural networks)
  - Scholar integration example (classical ML)
  - Batch fairness analysis
  - Production monitoring with GenServer
- Expanded **Research Foundations** (150+ lines)
  - 15+ peer-reviewed papers with full bibliographic details
  - DOI numbers for all citations
  - Framework comparisons (AIF360, Fairlearn, etc.)
- Added **API Reference** section
- Updated real-world use cases with legal compliance checks

### Documentation
- Total documentation: ~9,120 lines
- Academic citations: 27+ peer-reviewed papers
- Working code examples: 20+
- Integration patterns documented

## [0.1.0] - 2025-10-20

### Added - Core Implementation

**Infrastructure:**
- `ExFairness.Error` - Custom exception handling with type safety
- `ExFairness.Validation` - Comprehensive input validation
  - Binary tensor validation
  - Shape matching validation
  - Multiple groups requirement (min 2 groups)
  - Sufficient samples validation (default: 10 per group)
  - Helpful error messages with actionable suggestions
- `ExFairness.Utils` - GPU-accelerated tensor operations
  - `positive_rate/2` - Positive prediction rate with masking
  - `create_group_mask/2` - Binary mask generation
  - `group_count/2` - Sample counting per group
  - `group_positive_rates/2` - Batch rate computation
- `ExFairness.Utils.Metrics` - Classification metrics
  - `confusion_matrix/3` - TP, FP, TN, FN with masking
  - `true_positive_rate/3` - TPR/Recall
  - `false_positive_rate/3` - FPR
  - `positive_predictive_value/3` - PPV/Precision
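
The masked classification metrics above can be illustrated on plain lists. This is a sketch only: the module name and the map keys are hypothetical, the real functions operate on Nx tensors, and the rate helpers assume non-zero denominators.

```elixir
defmodule ConfusionSketch do
  # Counts TP/FP/TN/FN over the samples where mask == 1; other samples are ignored.
  def confusion_matrix(predictions, labels, mask) do
    initial = %{true_pos: 0, false_pos: 0, true_neg: 0, false_neg: 0}

    [predictions, labels, mask]
    |> Enum.zip()
    |> Enum.reduce(initial, fn
      {1, 1, 1}, acc -> Map.update!(acc, :true_pos, &(&1 + 1))
      {1, 0, 1}, acc -> Map.update!(acc, :false_pos, &(&1 + 1))
      {0, 0, 1}, acc -> Map.update!(acc, :true_neg, &(&1 + 1))
      {0, 1, 1}, acc -> Map.update!(acc, :false_neg, &(&1 + 1))
      _masked_out, acc -> acc
    end)
  end

  # TPR (recall) = TP / (TP + FN)
  def true_positive_rate(cm), do: cm.true_pos / (cm.true_pos + cm.false_neg)

  # FPR = FP / (FP + TN)
  def false_positive_rate(cm), do: cm.false_pos / (cm.false_pos + cm.true_neg)
end

cm = ConfusionSketch.confusion_matrix([1, 0, 1, 1, 0], [1, 1, 0, 1, 0], [1, 1, 1, 0, 0])
IO.inspect(ConfusionSketch.true_positive_rate(cm))
```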

**Fairness Metrics:**
- `ExFairness.Metrics.DemographicParity` - P(Ŷ=1|A=0) = P(Ŷ=1|A=1)
  - Configurable threshold (default: 0.1)
  - Plain language interpretations
  - Citations: Dwork et al. (2012), Feldman et al. (2015)
- `ExFairness.Metrics.EqualizedOdds` - Equal TPR and FPR across groups
  - Both error rates checked
  - Combined pass/fail determination
  - Citations: Hardt et al. (2016)
- `ExFairness.Metrics.EqualOpportunity` - Equal TPR across groups
  - Relaxed version of equalized odds
  - Focus on false negative parity
  - Citations: Hardt et al. (2016)
- `ExFairness.Metrics.PredictiveParity` - Equal PPV across groups
  - Precision parity
  - Consistent prediction meaning
  - Citations: Chouldechova (2017)
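
As a small illustration of the demographic parity check, the sketch below compares positive prediction rates for two groups against the 0.1 threshold mentioned above. The module name and result keys are hypothetical, inputs are plain 0/1 lists rather than Nx tensors, and both groups are assumed to be non-empty.

```elixir
defmodule DemographicParitySketch do
  # predictions and sensitive are equal-length lists of 0/1 values.
  def check(predictions, sensitive, threshold \\ 0.1) do
    pairs = Enum.zip(predictions, sensitive)
    rate_a = positive_rate(pairs, 0)
    rate_b = positive_rate(pairs, 1)
    disparity = abs(rate_a - rate_b)

    %{
      group_a_rate: rate_a,
      group_b_rate: rate_b,
      disparity: disparity,
      passes: disparity <= threshold
    }
  end

  # Positive prediction rate among samples belonging to the given group.
  defp positive_rate(pairs, group) do
    preds = for {pred, ^group} <- pairs, do: pred
    Enum.sum(preds) / length(preds)
  end
end

result = DemographicParitySketch.check([1, 1, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1])
IO.inspect(result.passes)
```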

**Detection Algorithms:**
- `ExFairness.Detection.DisparateImpact` - EEOC 80% rule
  - Legal standard for adverse impact
  - 4/5ths rule implementation
  - Legal interpretation with EEOC context
  - Citations: EEOC (1978), Biddle (2006)
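
The 4/5ths rule reduces to a ratio of selection rates. A minimal sketch, assuming the ratio is taken as the lower rate over the higher rate (the library may instead fix which group is the reference); the module name and result keys are hypothetical, and the higher rate is assumed to be non-zero.

```elixir
defmodule DisparateImpactSketch do
  # Ratio of the lower selection rate to the higher one.
  # The 80% rule flags ratios below 0.8 as potential adverse impact.
  def check(rate_a, rate_b) do
    ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    %{ratio: ratio, passes_80_percent_rule: ratio >= 0.8}
  end
end

# Selection rates of 30% and 50% give a ratio of 0.6, which fails the rule.
IO.inspect(DisparateImpactSketch.check(0.3, 0.5))
```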

**Mitigation Techniques:**
- `ExFairness.Mitigation.Reweighting` - Sample weighting for fairness
  - Supports demographic parity and equalized odds targets
  - Formula: w(a,y) = P(Y=y) / P(A=a,Y=y)
  - Normalized weights (mean = 1.0)
  - GPU-accelerated via Nx.Defn
  - Citations: Kamiran & Calders (2012)
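
The weighting formula above can be sketched on plain lists as follows. This follows the formula exactly as written in this changelog; the formulation in Kamiran & Calders (2012) additionally multiplies the numerator by P(A=a). The module and function names are hypothetical, and this is not the Nx-based code in `ExFairness.Mitigation.Reweighting`.

```elixir
defmodule ReweightingSketch do
  # Per-sample weights from w(a, y) = P(Y=y) / P(A=a, Y=y),
  # rescaled so that the mean weight is 1.0.
  def weights(labels, sensitive) do
    n = length(labels)
    pairs = Enum.zip(sensitive, labels)
    joint_counts = Enum.frequencies(pairs)
    label_counts = Enum.frequencies(labels)

    raw =
      Enum.map(pairs, fn {_a, y} = cell ->
        label_counts[y] / n / (joint_counts[cell] / n)
      end)

    mean = Enum.sum(raw) / n
    Enum.map(raw, &(&1 / mean))
  end
end

# Rare (group, label) combinations receive larger weights.
IO.inspect(ReweightingSketch.weights([1, 1, 0, 0], [0, 0, 0, 1]))
```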

**Reporting System:**
- `ExFairness.Report` - Multi-metric fairness assessment
  - Aggregate pass/fail counts
  - Overall assessment generation
  - Markdown export (human-readable)
  - JSON export (machine-readable)

**Main API:**
- `ExFairness.demographic_parity/3` - Convenience function
- `ExFairness.equalized_odds/4` - Convenience function
- `ExFairness.equal_opportunity/4` - Convenience function
- `ExFairness.predictive_parity/4` - Convenience function
- `ExFairness.fairness_report/4` - Comprehensive reporting

### Testing
- 134 total tests (102 unit tests + 32 doctests)
- 100% pass rate
- Comprehensive edge case coverage
- Strict TDD approach (Red-Green-Refactor)
- All tests async (parallel execution)

### Quality Gates
- Zero compiler warnings (enforced)
- Zero Dialyzer errors (type-safe)
- Credo strict mode configured
- Code formatting enforced (100 char lines)
- ExCoveralls configured for coverage reports

### Documentation
- Comprehensive README.md with examples
- Complete module documentation (@moduledoc)
- Complete function documentation (@doc)
- Working examples (verified by doctests)
- Research citations in all metrics
- Mathematical definitions included

### Dependencies
- Production: `nx ~> 0.7` (only production dependency)
- Development: `ex_doc`, `dialyxir`, `excoveralls`, `credo`, `stream_data`, `jason`

---

[Unreleased]: https://github.com/North-Shore-AI/ExFairness/compare/v0.5.1...HEAD
[0.5.1]: https://github.com/North-Shore-AI/ExFairness/compare/v0.5.0...v0.5.1
[0.5.0]: https://github.com/North-Shore-AI/ExFairness/compare/v0.4.0...v0.5.0
[0.4.0]: https://github.com/North-Shore-AI/ExFairness/compare/v0.3.1...v0.4.0
[0.3.1]: https://github.com/North-Shore-AI/ExFairness/compare/v0.3.0...v0.3.1
[0.3.0]: https://github.com/North-Shore-AI/ExFairness/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/North-Shore-AI/ExFairness/compare/v0.1.0...v0.2.0
[0.1.0]: https://github.com/North-Shore-AI/ExFairness/releases/tag/v0.1.0