# Unified gRPC Bridge Implementation

This document describes the implementation of the unified gRPC bridge for DSPex, covering Stages 0, 1, and 2.

## Overview

The unified gRPC bridge provides a high-performance, protocol-based communication layer between Elixir (DSPex) and Python (DSPy) components. The implementation follows a staged approach with clear architectural boundaries.

## Stage 0: Protocol Foundation

**Status**: ✅ Complete

### What Was Implemented
- Core gRPC service definition (`BridgeService`) in `priv/proto/snakepit_bridge.proto`
- Protocol buffer message definitions for all operations
- Elixir gRPC server implementation in `lib/snakepit/grpc/bridge_server.ex`
- Python gRPC client/server in `priv/python/snakepit_bridge/`
- Basic RPC handlers: Ping, InitializeSession, CleanupSession

### Key Files
- `snakepit/priv/proto/snakepit_bridge.proto` - Protocol definition
- `snakepit/lib/snakepit/grpc/bridge_server.ex` - Elixir server
- `snakepit/priv/python/snakepit_bridge/grpc_server.py` - Python server

### Recent Updates (Stage 2 Compliance)
- Fixed service name from `SnakepitBridge` to `BridgeService`
- Added missing `GetSession` and `Heartbeat` RPCs
- Updated all references across codebases

## Stage 1: Core Variables & Tools

**Status**: ✅ Complete

### What Was Implemented
- `SessionStore` - Centralized state management (`lib/snakepit/bridge/session_store.ex`)
- Variable CRUD operations with type validation
- Batch operations for performance
- TTL-based session cleanup
- Type system with constraints (`lib/snakepit/bridge/variables/types/`)
- Serialization layer for cross-language compatibility
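The session lifecycle above (per-session variables, batch writes, TTL-based cleanup) can be sketched in a few lines of Python. This is an illustration only, not the real `SessionStore`, which is an Elixir GenServer. The class `TinySessionStore`, its method names, the injectable clock, and the 3600-second default TTL are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Session:
    """One session's variables plus the bookkeeping needed for TTL expiry."""
    ttl_seconds: float
    last_accessed: float
    variables: Dict[str, Any] = field(default_factory=dict)


class TinySessionStore:
    """Hypothetical in-memory store; not the Snakepit SessionStore API."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._sessions: Dict[str, Session] = {}

    def create_session(self, session_id: str, ttl_seconds: float = 3600.0) -> None:
        # The 3600 s default is an assumed value for this sketch.
        self._sessions[session_id] = Session(ttl_seconds, self._clock())

    def set_variables(self, session_id: str, updates: Dict[str, Any]) -> None:
        """Batch write: one lookup and one timestamp refresh for many variables."""
        session = self._sessions[session_id]
        session.variables.update(updates)
        session.last_accessed = self._clock()

    def get_variable(self, session_id: str, name: str) -> Any:
        session = self._sessions[session_id]
        session.last_accessed = self._clock()
        return session.variables[name]

    def cleanup_expired(self) -> List[str]:
        """Drop sessions idle longer than their TTL and return the removed ids."""
        now = self._clock()
        expired = [
            sid for sid, s in self._sessions.items()
            if now - s.last_accessed > s.ttl_seconds
        ]
        for sid in expired:
            del self._sessions[sid]
        return expired
```

In this sketch every read or write refreshes the session's timestamp, so the TTL measures idle time rather than total lifetime. Whether the real implementation measures it the same way is not stated here.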

### Key Components
- **SessionStore**: GenServer-based state management with ETS backing
- **Type System**: Float, Integer, String, Boolean with validation and constraints
- **Serialization**: JSON-based encoding for protobuf Any type
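To make "validation and constraints" concrete, here is a minimal Python sketch of a validator for the four types listed above. The function name `validate`, the constraint keys (`min`, `max`, `max_length`), and the `("ok", value)` / `("error", reason)` return shape are assumptions for illustration. The actual rules live in the Elixir modules under `lib/snakepit/bridge/variables/types/`.

```python
from typing import Any, Dict, Optional, Tuple

# Python types accepted for each bridge type name (type names from the list above).
_ACCEPTED = {
    "float": (int, float),
    "integer": (int,),
    "string": (str,),
    "boolean": (bool,),
}


def validate(type_name: str, value: Any,
             constraints: Optional[Dict[str, Any]] = None) -> Tuple[str, Any]:
    """Return ("ok", normalized_value) or ("error", reason). Hypothetical API."""
    constraints = constraints or {}
    accepted = _ACCEPTED.get(type_name)
    if accepted is None:
        return ("error", f"unknown type: {type_name}")
    # bool is a subclass of int in Python, so reject it for non-boolean types.
    if type_name != "boolean" and isinstance(value, bool):
        return ("error", f"expected {type_name}, got boolean")
    if not isinstance(value, accepted):
        return ("error", f"expected {type_name}, got {type(value).__name__}")
    if type_name == "float":
        value = float(value)  # normalize integers passed as floats
    if type_name in ("float", "integer"):
        if "min" in constraints and value < constraints["min"]:
            return ("error", f"{value} is below min {constraints['min']}")
        if "max" in constraints and value > constraints["max"]:
            return ("error", f"{value} is above max {constraints['max']}")
    if type_name == "string":
        limit = constraints.get("max_length")
        if limit is not None and len(value) > limit:
            return ("error", f"length {len(value)} exceeds max_length {limit}")
    return ("ok", value)
```

For example, `validate("float", 1.5, {"max": 1.0})` yields an error tuple, while `validate("float", 1)` normalizes the integer to `1.0`.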

### Recent Updates (Stage 2 Compliance)
- Fixed double-encoding issue in serialization
- Centralized type system to avoid duplication
- Updated tests to use new Serialization module
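The details of the double-encoding fix are not given here. The sketch below shows the general bug pattern in Python, under the assumption that it looked like the common case: a value that is already a JSON string gets JSON-encoded a second time before being placed in the payload. The variable names and the `decode` helper are hypothetical.

```python
import json

value = {"temperature": 0.7}

# Correct: encode once, then place the resulting bytes in the payload.
encoded_once = json.dumps(value).encode("utf-8")

# Bug pattern: JSON-encoding a string that is already JSON.
encoded_twice = json.dumps(json.dumps(value)).encode("utf-8")


def decode(payload: bytes):
    """Decode a UTF-8 JSON payload exactly once."""
    return json.loads(payload.decode("utf-8"))


# A single decode recovers the original value only for the single-encoded payload.
assert decode(encoded_once) == value
# The double-encoded payload decodes to a str that still contains JSON text.
assert isinstance(decode(encoded_twice), str)
```

A receiver that decodes once sees a string instead of a map, which is why this class of bug tends to show up as a type mismatch on the other side of the bridge.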

## Stage 2: Cognitive Layer & DSPex Integration

**Status**: ✅ Complete

### What Was Implemented
- `DSPex.Context` - High-level API for variable management
- Dual backend architecture:
  - `LocalState` - Pure Elixir for fast operations
  - `BridgedState` - gRPC bridge for Python integration
- Automatic backend switching based on requirements
- State migration between backends
- Full StateProvider behavior compliance

### Key Components
- **DSPex.Context**: Main user-facing API (`lib/dspex/context.ex`)
- **LocalState**: In-memory backend (`lib/dspex/bridge/state/local.ex`)
- **BridgedState**: SessionStore-backed backend (`lib/dspex/bridge/state/bridged.ex`)
- **StateProvider**: Common behavior for backends
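The relationship between these components can be sketched in Python: a context starts on a local backend, and when bridging is required it copies its state into a bridged backend and switches over. This is a rough model under stated assumptions, not the Elixir implementation. The method names (`export_state`, `import_state`, `ensure_bridged`) are hypothetical, and a plain dict stands in for the remote SessionStore.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class StateProvider(ABC):
    """Common interface that both backends implement."""

    @abstractmethod
    def set_variable(self, name: str, value: Any) -> None: ...

    @abstractmethod
    def get_variable(self, name: str) -> Any: ...

    @abstractmethod
    def export_state(self) -> Dict[str, Any]: ...

    @abstractmethod
    def import_state(self, state: Dict[str, Any]) -> None: ...


class _DictBackedState(StateProvider):
    """Shared storage logic for the two illustrative backends."""

    def __init__(self, storage: Dict[str, Any]) -> None:
        self._storage = storage

    def set_variable(self, name: str, value: Any) -> None:
        self._storage[name] = value

    def get_variable(self, name: str) -> Any:
        return self._storage[name]

    def export_state(self) -> Dict[str, Any]:
        return dict(self._storage)

    def import_state(self, state: Dict[str, Any]) -> None:
        self._storage.update(state)


class LocalState(_DictBackedState):
    """In-process backend: no bridge involved."""

    def __init__(self) -> None:
        super().__init__({})


class BridgedState(_DictBackedState):
    """Stand-in for the bridged backend; the dict plays the remote store."""

    def __init__(self, remote_store: Dict[str, Any]) -> None:
        super().__init__(remote_store)


class Context:
    """Starts local and migrates to the bridged backend on demand."""

    def __init__(self) -> None:
        self.backend: StateProvider = LocalState()

    def set_variable(self, name: str, value: Any) -> None:
        self.backend.set_variable(name, value)

    def get_variable(self, name: str) -> Any:
        return self.backend.get_variable(name)

    def ensure_bridged(self, remote_store: Dict[str, Any]) -> None:
        if isinstance(self.backend, BridgedState):
            return  # already bridged; nothing to migrate
        bridged = BridgedState(remote_store)
        bridged.import_state(self.backend.export_state())  # state migration
        self.backend = bridged
```

Because callers only touch `Context`, the switch is invisible to them: variables set before the migration are still readable afterwards.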

### Recent Updates (Stage 2 Compliance)
- Removed duplicated type system from LocalState
- Refactored BridgedState to use SessionStore API directly
- Fixed test warnings with proper log capture

## Architecture

```
┌─────────────┐     ┌─────────────┐
│   DSPex     │     │   Python    │
│  Context    │     │   DSPy      │
└──────┬──────┘     └──────┬──────┘
       │                    │
┌──────┴──────┐     ┌──────┴──────┐
│  LocalState │     │   Bridge    │
│  (Elixir)   │     │   Client    │
└──────┬──────┘     └──────┬──────┘
       │                    │
       └────────┬───────────┘
                │
        ┌───────┴────────┐
        │  SessionStore  │
        │   (GenServer)  │
        └───────┬────────┘
                │
        ┌───────┴────────┐
        │  gRPC Server   │
        │  (Port 50051)  │
        └────────────────┘
```

## Reliability updates (v0.6.6)

- **Worker port persistence** – `Snakepit.GRPCWorker` now replaces the placeholder `0` with the OS-selected port before publishing registry metadata, ensuring BridgeServer can always reach the correct address (`test/unit/grpc/grpc_worker_ephemeral_port_test.exs`).
- **Channel reuse & cleanup** – BridgeServer asks workers for their cached `GRPC.Stub` and only creates a short-lived channel as a fallback, closing it after each call (`test/snakepit/grpc/bridge_server_test.exs`).
- **Defensive parameter decoding** – Malformed JSON payloads and unexpected protobuf envelopes now return `{:error, {:invalid_parameter, key, reason}}` without ever touching the worker.
- **Protected state stores** – SessionStore, ToolRegistry, and ProcessRegistry expose `:protected` ETS tables and keep DETS handles private, blocking external mutation attempts (`test/unit/pool/process_registry_security_test.exs`).
- **Session quotas** – Configurable caps on session counts and program storage prevent runaway growth and surface actionable errors (`test/unit/bridge/session_store_test.exs`).
- **Log redaction helpers** – The new logger redaction summary keeps secrets and large blobs out of logs (`test/unit/logger/redaction_test.exs`).
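The defensive decoding behavior can be illustrated with a small Python sketch: every parameter is decoded before any work is dispatched, and the first failure is reported with the offending key. The function name `decode_parameters` and the assumption that parameters arrive as a map of UTF-8 JSON bytes are hypothetical. The tuple shape mirrors the Elixir `{:error, {:invalid_parameter, key, reason}}` term quoted above.

```python
import json
from typing import Any, Dict, Tuple


def decode_parameters(raw: Dict[str, bytes]) -> Tuple[str, Any]:
    """Decode all parameters up front; stop at the first malformed one.

    Returns ("ok", decoded) or ("error", ("invalid_parameter", key, reason)).
    """
    decoded: Dict[str, Any] = {}
    for key, payload in raw.items():
        try:
            decoded[key] = json.loads(payload.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            # Report which parameter failed and why; the worker is never called.
            return ("error", ("invalid_parameter", key, str(exc)))
    return ("ok", decoded)
```

A caller would check the first element of the result and forward the request to a worker only on `"ok"`.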

## Testing

Comprehensive test coverage across all components:

### Test Files
- Protocol tests: `test/snakepit/grpc/`
- SessionStore tests: `test/snakepit/bridge/session_store_test.exs`
- Type system tests: `test/snakepit/bridge/variables/types_test.exs`
- Property-based tests: `test/snakepit/bridge/property_test.exs`
- Integration tests: `test/snakepit/bridge/integration_test.exs`
- Test runner: `test/run_bridge_tests.exs`
- Worker port/channel regression: `test/unit/grpc/grpc_worker_ephemeral_port_test.exs`, `test/snakepit/grpc/bridge_server_test.exs`
- Registry hardening & logging: `test/unit/pool/process_registry_security_test.exs`, `test/unit/logger/redaction_test.exs`

### Running Tests
```bash
# Run all tests
mix test

# Run unified test suite
mix run test/run_bridge_tests.exs --all

# Run specific test types
mix test --include property
mix test --include integration
mix test --include performance

# Run with test runner options
mix run test/run_bridge_tests.exs --property --integration --verbose
```

### Test Types
1. **Unit Tests**: Individual component testing in isolation
2. **Property-Based Tests**: Invariant verification with generated data using StreamData
3. **Integration Tests**: Full-stack Python-Elixir communication testing
4. **Performance Tests**: Benchmark operations against targets

## Performance Characteristics

### Operation Latency
- **LocalState**: Microsecond operations (pure Elixir)
- **BridgedState**: 1-5ms operations (includes gRPC overhead)
- **Batch operations**: Per-call overhead is amortized across the operations in a batch
- **Session cleanup**: Automatic TTL-based expiration

### Binary Serialization
- **Automatic optimization**: Data > 10KB uses binary encoding
- **Performance gains**: 5-10x faster for large tensors/embeddings
- **Size reduction**: 3-5x smaller message size
- **Supported types**: `tensor` and `embedding` variables
- **Threshold**: 10,240 bytes (10KB)
- **Format**: Erlang External Term Format (ETF) on Elixir, pickle on Python
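The selection rule described above can be sketched on the Python side as follows. This is a minimal illustration, assuming the size check is made against the JSON-encoded payload (how the real bridge measures size is not specified here). The function names and the `("json" | "binary", bytes)` return shape are hypothetical.

```python
import json
import pickle
from typing import Any, Tuple

BINARY_THRESHOLD_BYTES = 10_240            # 10 KB threshold quoted above
BINARY_TYPES = {"tensor", "embedding"}     # types eligible for binary encoding


def encode(var_type: str, value: Any) -> Tuple[str, bytes]:
    """Pick JSON or pickle based on variable type and encoded size."""
    json_bytes = json.dumps(value).encode("utf-8")
    if var_type in BINARY_TYPES and len(json_bytes) > BINARY_THRESHOLD_BYTES:
        return ("binary", pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
    return ("json", json_bytes)


def decode(fmt: str, payload: bytes) -> Any:
    """Inverse of encode for the two formats."""
    if fmt == "binary":
        # Only unpickle data from a trusted peer: pickle can run arbitrary code.
        return pickle.loads(payload)
    return json.loads(payload.decode("utf-8"))
```

In this sketch a large `embedding` takes the binary path, while a large `string` stays on JSON because its type is not in the eligible set.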

### Benchmarks
| Operation | Small Data (<10KB) | Large Data (>10KB) |
|-----------|-------------------|-------------------|
| Variable Set | 2ms (JSON) | 3ms (Binary) |
| Variable Get | 1.5ms (JSON) | 2ms (Binary) |
| Serialization | 0.5ms | 0.1ms (5x faster) |
| Network Transfer | 1ms | 0.3ms (3x faster) |

## Future Work

Low-priority items for future consideration:
- Benchmark suite for performance regression testing
- Stage 3: Streaming and real-time updates
- Stage 4: Advanced features (optimization, dependencies)

## References

- [Main README](README.md)
- [Testing Guide](README_TESTING.md)
- [Process Management](README_PROCESS_MANAGEMENT.md)
- [gRPC Communication](README_GRPC.md)