priv/python/BINARY_SERIALIZATION.md

Select File:
priv/python/BINARY_SERIALIZATION.md

# Binary Serialization Guide

## Overview

The Snakepit bridge automatically uses binary serialization for large tensor and embedding data to improve performance. This guide explains how it works and how to use it effectively.

## Automatic Binary Encoding

When variable data exceeds 10KB (10,240 bytes), the system automatically switches from JSON to binary encoding:

- **Small data (<10KB)**: Uses JSON for readability and debugging
- **Large data (>10KB)**: Uses binary for performance

## Supported Types

Binary serialization is currently available for:
- `tensor`: Multi-dimensional arrays with shape information
- `embedding`: Vector representations (1D arrays)

## Python Usage

### Basic Example

```python
from snakepit_bridge import SessionContext
import numpy as np

class MLAdapter:
    def process_large_data(self, ctx: SessionContext):
        # Create a large tensor (>10KB)
        large_tensor = np.random.randn(100, 100).tolist()  # ~80KB
        
        # This automatically uses binary serialization
        ctx.register_variable("my_tensor", "tensor", {
            "shape": [100, 100],
            "data": large_tensor
        })
        
        # Retrieval is transparent - handles binary automatically
        retrieved = ctx["my_tensor"]
        print(f"Shape: {retrieved['shape']}")
```

### Performance Example

```python
import time

def benchmark_serialization(ctx: SessionContext):
    # Small data - uses JSON
    small_data = list(range(100))  # ~800 bytes
    start = time.time()
    ctx.register_variable("small", "embedding", small_data)
    print(f"Small data: {time.time() - start:.3f}s")
    
    # Large data - uses binary
    large_data = list(range(10000))  # ~80KB
    start = time.time()
    ctx.register_variable("large", "embedding", large_data)
    print(f"Large data: {time.time() - start:.3f}s")
    # Typically 5-10x faster!
```

## Technical Details

### Serialization Format

- **Python side**: Uses `pickle` with highest protocol
- **Elixir side**: Uses Erlang Term Format (ETF)
- **Wire format**: Protocol Buffers with separate binary field

### Threshold Calculation

The 10KB threshold is based on the estimated serialized size:
- For tensors: `len(data) * 8` bytes (assuming float64)
- For embeddings: `len(array) * 8` bytes

### Binary Message Structure

When binary serialization is used:

1. **Metadata** (in `value` field):
   ```json
   {
     "shape": [100, 100],
     "dtype": "float32",
     "binary_format": "pickle",
     "type": "tensor"
   }
   ```

2. **Binary Data** (in `binary_value` field):
   - Pickled Python object containing the actual data

## Best Practices

1. **Use Appropriate Types**: Always use `tensor` or `embedding` types for numerical data
2. **Batch Large Operations**: Group multiple large variables in batch updates
3. **Monitor Memory**: Binary data is held in memory during transfer
4. **Profile Your Data**: Check if your typical data sizes benefit from binary encoding

## Limitations

1. **Type Support**: Only `tensor` and `embedding` types use binary serialization
2. **Debugging**: Binary data is not human-readable in logs
3. **Compatibility**: Binary format is internal - use JSON for external APIs

## Configuration

Currently, the 10KB threshold is not configurable. This value was chosen as optimal for most workloads:
- Keeps small data human-readable
- Maximizes performance for large data
- Minimizes overhead of format detection