# OWN_GIL Mode Internals
## Overview
OWN_GIL mode provides true parallel Python execution using Python 3.14+ per-interpreter GIL (`PyInterpreterConfig_OWN_GIL`). Each OWN_GIL context runs in a dedicated pthread with its own subinterpreter and GIL.
**Note**: OWN_GIL requires Python 3.14+ due to C extension global state bugs in earlier versions (e.g., `_decimal`, `numpy`). On Python 3.12/3.13, use the default `worker` mode — contexts share the main interpreter but each owns a dedicated pthread.
## Quick Start
```erlang
%% Create an OWN_GIL context (requires Python 3.14+)
{ok, Ctx} = py_context:start_link(1, owngil),
%% Basic operations work the same as other modes
{ok, 4.0} = py_context:call(Ctx, math, sqrt, [16], #{}),
ok = py_context:exec(Ctx, <<"x = 42">>),
{ok, 42} = py_context:eval(Ctx, <<"x">>),
%% True parallelism: multiple OWN_GIL contexts execute simultaneously
{ok, Ctx2} = py_context:start_link(2, owngil),
%% Ctx and Ctx2 run in parallel with independent GILs
%% Process-local environments for namespace isolation
{ok, Env} = py_context:create_local_env(Ctx),
CtxRef = py_context:get_nif_ref(Ctx),
ok = py_nif:context_exec(CtxRef, <<"my_var = 'isolated'">>, Env),
%% Cleanup
py_context:stop(Ctx),
py_context:stop(Ctx2).
```
## Feature Compatibility
All major erlang_python features work with OWN_GIL mode:
| Feature | Status | Notes |
|---------|--------|-------|
| `py_context:call/5` | Full | Function calls |
| `py_context:eval/2` | Full | Expression evaluation |
| `py_context:exec/2` | Full | Statement execution |
| Channels (`py_channel`) | Full | Bidirectional messaging |
| Buffers (`py_buffer`) | Full | Zero-copy streaming |
| Callbacks (`erlang.call`) | Partial | Uses thread_worker, not re-entrant |
| PIDs (`erlang.Pid`) | Full | Round-trip serialization |
| Send (`erlang.send`) | Full | Fire-and-forget messaging |
| Reactor (`erlang.reactor`) | Full | FD-based protocols |
| Async Tasks | Full | `py_event_loop:create_task` |
| Asyncio | Full | `asyncio.sleep`, `gather`, etc. |
| Process-local envs | Full | Namespace isolation |
## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                           Erlang VM                           │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   Process A                         Process B                 │
│   py_context:call(Ctx1, ...)        py_context:call(Ctx2, ...)│
│        │                                 │                    │
│        ▼                                 ▼                    │
│  ┌─────────────┐                   ┌─────────────┐            │
│  │ Dirty Sched │                   │ Dirty Sched │            │
│  └──────┬──────┘                   └──────┬──────┘            │
│         │                                 │                   │
└─────────┼─────────────────────────────────┼───────────────────┘
          │                                 │
          │ dispatch_to_owngil_thread       │
          ▼                                 ▼
┌──────────────────────┐         ┌──────────────────────┐
│   OWN_GIL Thread 1   │         │   OWN_GIL Thread 2   │
│  ┌────────────────┐  │         │  ┌────────────────┐  │
│  │ Subinterpreter │  │         │  │ Subinterpreter │  │
│  │   (own GIL)    │  │         │  │   (own GIL)    │  │
│  └────────────────┘  │         │  └────────────────┘  │
│ Parallel Execution!  │         │ Parallel Execution!  │
└──────────────────────┘         └──────────────────────┘
```
## Comparison with Other Modes
| Mode | Python Version | Thread Model | GIL | Parallelism |
|------|----------------|--------------|-----|-------------|
| `worker` | Any | Dedicated pthread per context | Main interpreter GIL | True parallel on free-threaded 3.13t+ |
| `owngil` | 3.14+ | Dedicated pthread per context | Per-interpreter GIL | True parallel |
**Why OWN_GIL requires Python 3.14+**: C extensions like `_decimal`, `numpy` have global state that crashes in OWN_GIL sub-interpreters on Python 3.12/3.13. Python 3.14 includes fixes for these issues (see [cpython#106078](https://github.com/python/cpython/issues/106078)).
## Key Data Structures
### py_context_t (OWN_GIL fields)
```c
typedef struct {
    // ... common fields ...

    bool uses_own_gil;                   // OWN_GIL mode flag
    pthread_t own_gil_thread;            // Dedicated pthread
    PyThreadState *own_gil_tstate;       // Thread state
    PyInterpreterState *own_gil_interp;  // Interpreter state

    // IPC synchronization
    pthread_mutex_t request_mutex;
    pthread_cond_t request_ready;        // Signal: request available
    pthread_cond_t response_ready;       // Signal: response ready

    // Request/response state
    int request_type;                    // CTX_REQ_* enum
    ErlNifEnv *shared_env;               // Zero-copy term passing
    ERL_NIF_TERM request_term;
    ERL_NIF_TERM response_term;
    bool response_ok;

    // Process-local env support
    void *local_env_ptr;                 // py_env_resource_t*

    // Lifecycle
    _Atomic bool thread_running;
    _Atomic bool shutdown_requested;
} py_context_t;
```
### Request Types
```c
typedef enum {
    CTX_REQ_CALL,             // Call Python function
    CTX_REQ_EVAL,             // Evaluate expression
    CTX_REQ_EXEC,             // Execute statements
    CTX_REQ_REACTOR_READ,     // Reactor on_read_ready
    CTX_REQ_REACTOR_WRITE,    // Reactor on_write_ready
    CTX_REQ_REACTOR_INIT,     // Reactor init_connection
    CTX_REQ_CALL_WITH_ENV,    // Call with process-local env
    CTX_REQ_EVAL_WITH_ENV,    // Eval with process-local env
    CTX_REQ_EXEC_WITH_ENV,    // Exec with process-local env
    CTX_REQ_CREATE_LOCAL_ENV, // Create process-local env dicts
    CTX_REQ_SHUTDOWN          // Shutdown thread
} ctx_request_type_t;
```
## Request Flow
### 1. Context Creation
```
nif_context_create(env, "owngil")
  └── owngil_context_init(ctx)
        ├── Initialize mutex/condvars
        ├── Create shared_env
        └── pthread_create(owngil_context_thread_main)
              └── owngil_context_thread_main(ctx)
                    ├── Py_NewInterpreterFromConfig(OWN_GIL)
                    ├── Initialize globals/locals
                    ├── Register py_event_loop module
                    └── Enter request loop
```
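The `Py_NewInterpreterFromConfig(OWN_GIL)` step above maps onto CPython's public embedding API. A minimal sketch of that step, not the project's actual code; the field values mirror CPython's documented "isolated" configuration and are an assumption about this project's exact settings:

```c
// Sketch of the interpreter-creation step inside owngil_context_thread_main.
// OWN_GIL requires use_main_obmalloc = 0 and check_multi_interp_extensions = 1.
PyInterpreterConfig config = {
    .use_main_obmalloc = 0,
    .allow_fork = 0,
    .allow_exec = 0,
    .allow_threads = 1,
    .allow_daemon_threads = 0,
    .check_multi_interp_extensions = 1,
    .gil = PyInterpreterConfig_OWN_GIL,
};
PyThreadState *tstate = NULL;
PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);
if (PyStatus_Exception(status)) {
    // Creation failed; the thread must report the error and exit.
}
// tstate is now the current thread state; it and its interpreter are what
// get saved into ctx->own_gil_tstate / ctx->own_gil_interp.
```

`check_multi_interp_extensions = 1` is also what makes incompatible C extensions fail loudly at import time instead of corrupting shared state.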
### 2. Request Dispatch
```
nif_context_call(env, ctx, module, func, args, kwargs)
  │
  ├── [ctx->uses_own_gil == true]
  │     └── dispatch_to_owngil_thread(env, ctx, CTX_REQ_CALL, request)
  │           ├── pthread_mutex_lock(&ctx->request_mutex)
  │           ├── Copy request term to shared_env
  │           ├── Set ctx->request_type = CTX_REQ_CALL
  │           ├── pthread_cond_signal(&ctx->request_ready)
  │           ├── pthread_cond_wait(&ctx->response_ready)   // Block
  │           ├── Copy response from shared_env
  │           └── pthread_mutex_unlock(&ctx->request_mutex)
  │
  └── [ctx->uses_own_gil == false]
        └── Direct execution with GIL (worker mode)
```
### 3. Request Processing (OWN_GIL Thread)
```
owngil_context_thread_main(ctx)
  while (!shutdown_requested) {
      pthread_cond_wait(&ctx->request_ready)
      owngil_execute_request(ctx)
          switch (ctx->request_type) {
              case CTX_REQ_CALL: owngil_execute_call(ctx); break;
              case CTX_REQ_EVAL: owngil_execute_eval(ctx); break;
              case CTX_REQ_EXEC: owngil_execute_exec(ctx); break;
              // ... other cases
          }
      pthread_cond_signal(&ctx->response_ready)
  }
```
## Process-Local Environments
OWN_GIL contexts support process-local environments for namespace isolation:
```
Erlang Process A             Erlang Process B
        │                            │
        ▼                            ▼
┌───────────────┐            ┌───────────────┐
│ py_env_res_t  │            │ py_env_res_t  │
│   globals_A   │            │   globals_B   │
│   locals_A    │            │   locals_B    │
└───────┬───────┘            └───────┬───────┘
        │                            │
        └─────────────┬──────────────┘
                      ▼
           ┌─────────────────────┐
           │   OWN_GIL Context   │
           │  (shared context,   │
           │   isolated envs)    │
           └─────────────────────┘
```
### Creating Process-Local Env
```
py_context:create_local_env(Ctx)
  └── nif_create_local_env(CtxRef)
        └── dispatch_create_local_env_to_owngil(env, ctx, res)
              └── owngil_execute_create_local_env(ctx)
                    ├── res->globals = PyDict_New()
                    ├── res->locals = PyDict_New()
                    └── res->interp_id = ctx->interp_id
```
### Using Process-Local Env
```erlang
{ok, Env} = py_context:create_local_env(Ctx),
CtxRef = py_context:get_nif_ref(Ctx),
ok = py_nif:context_exec(CtxRef, <<"x = 1">>, Env),
{ok, 1} = py_nif:context_eval(CtxRef, <<"x">>, #{}, Env).
```
## Thread Lifecycle
### Startup
1. `Py_NewInterpreterFromConfig` with `PyInterpreterConfig_OWN_GIL`
2. Save thread state and interpreter state
3. Initialize `__builtins__` in globals
4. Register `py_event_loop` module for reactor callbacks
5. Release GIL and enter request loop
### Request Loop
```c
while (!shutdown_requested) {
    pthread_mutex_lock(&request_mutex);
    while (!request_pending && !shutdown_requested) {
        pthread_cond_wait(&request_ready, &request_mutex);
    }
    if (shutdown_requested) {
        pthread_mutex_unlock(&request_mutex);
        break;
    }
    // Process request (GIL already held within subinterpreter)
    owngil_execute_request(ctx);
    pthread_cond_signal(&response_ready);
    pthread_mutex_unlock(&request_mutex);
}
```
### Shutdown
1. Set `shutdown_requested = true`
2. Signal `request_ready` to wake thread
3. Thread exits loop, acquires GIL
4. Call `Py_EndInterpreter` to destroy subinterpreter
5. pthread terminates
## Memory Management
### Shared Environment
- `ctx->shared_env` is used for zero-copy term passing
- Request terms copied into shared_env by caller
- Response terms created in shared_env by OWN_GIL thread
- Caller copies response back to their env
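In `erl_nif` terms, the bullet points above correspond roughly to the following (a sketch, not the exact implementation; the variable names are assumed from the struct shown earlier):

```c
// Caller side (before signaling request_ready): copy the request into shared_env.
ctx->request_term = enif_make_copy(ctx->shared_env, request);

// OWN_GIL thread: build the response directly in shared_env, e.g.
//   ctx->response_term = enif_make_atom(ctx->shared_env, "ok");

// Caller side (after response_ready fires): copy the response out, then
// clear shared_env so its terms are reclaimed before the next request.
ERL_NIF_TERM result = enif_make_copy(env, ctx->response_term);
enif_clear_env(ctx->shared_env);
```

Because `shared_env` is a process-independent environment owned by the context, the terms stay valid across the thread handoff without going through the Erlang message queue.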
### Process-Local Env Cleanup
```c
py_env_resource_dtor(env, res) {
    if (res->pool_slot >= 0) {
        // Shared-GIL subinterpreter: DECREF with pool GIL
    } else if (res->interp_id != 0) {
        // OWN_GIL subinterpreter: skip DECREF;
        // Py_EndInterpreter cleans up all objects
    } else {
        // Worker mode: DECREF with main GIL
    }
}
```
## Reactor / Event Loop Integration
OWN_GIL contexts support the reactor pattern for I/O-driven protocols. The `py_event_loop` module is registered in each OWN_GIL subinterpreter during startup.
### Why Event Loop Registration Matters
Each Python subinterpreter has its own module namespace. The `py_event_loop` module provides:
- `erlang.reactor` protocol callbacks (`on_read_ready`, `on_write_ready`, `init_connection`)
- Per-interpreter state for cached function references
- Module state isolation between interpreters
### Reactor Request Flow
```
┌──────────────────────────────────────────────────────────────────┐
│                              Erlang                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  py_reactor_context                                              │
│    │                                                             │
│    │ {select, FdRes, Ref, ready_input}                           │
│    ▼                                                             │
│  handle_info                                                     │
│    │                                                             │
│    ├── Read data from fd into ReactorBuffer                      │
│    │                                                             │
│    └── py_nif:reactor_on_read_ready(CtxRef, Fd)                  │
│          │                                                       │
└──────────┼───────────────────────────────────────────────────────┘
           │
           │ [ctx->uses_own_gil == true]
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ dispatch_reactor_read_to_owngil(env, ctx, fd, buffer_ptr)        │
│   │                                                              │
│   ├── ctx->reactor_buffer_ptr = buffer_ptr                       │
│   ├── ctx->request_type = CTX_REQ_REACTOR_READ                   │
│   ├── pthread_cond_signal(&request_ready)                        │
│   └── pthread_cond_wait(&response_ready)                         │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│                          OWN_GIL Thread                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  owngil_execute_reactor_read(ctx)                                │
│    │                                                             │
│    ├── Create ReactorBuffer Python object                        │
│    │                                                             │
│    ├── Get module state (per-interpreter reactor cache)          │
│    │     state = get_module_state()                              │
│    │     ensure_reactor_cached_for_interp(state)                 │
│    │                                                             │
│    └── Call Python: state->reactor_on_read(fd, buffer)           │
│          │                                                       │
│          ▼                                                       │
│        erlang.reactor.on_read_ready(fd, data)                    │
│          │                                                       │
│          ▼                                                       │
│        Protocol.data_received(data)                              │
│          │                                                       │
│          └── Returns action: "continue" | "write_pending" | ...  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
### Module State Per-Interpreter
Each OWN_GIL subinterpreter maintains its own cached references:
```c
typedef struct {
    PyObject *reactor_module;    // erlang.reactor module
    PyObject *reactor_on_read;   // Cached on_read_ready function
    PyObject *reactor_on_write;  // Cached on_write_ready function
    PyObject *reactor_init_conn; // Cached init_connection function
    // ...
} py_event_loop_module_state_t;
```
The `ensure_reactor_cached_for_interp()` function lazily imports `erlang.reactor` and caches the callback functions on first use within each interpreter.
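A sketch of what that lazy caching typically looks like (real CPython calls; error handling is trimmed, and the exact attribute names are assumptions based on the callbacks listed above):

```c
// Lazily import erlang.reactor and cache its callbacks in this
// interpreter's module state; a no-op on every call after the first.
static int ensure_reactor_cached_for_interp(py_event_loop_module_state_t *state) {
    if (state->reactor_on_read != NULL)
        return 0;  // already cached in this interpreter
    state->reactor_module = PyImport_ImportModule("erlang.reactor");
    if (state->reactor_module == NULL)
        return -1;
    state->reactor_on_read   = PyObject_GetAttrString(state->reactor_module, "on_read_ready");
    state->reactor_on_write  = PyObject_GetAttrString(state->reactor_module, "on_write_ready");
    state->reactor_init_conn = PyObject_GetAttrString(state->reactor_module, "init_connection");
    return (state->reactor_on_read && state->reactor_on_write &&
            state->reactor_init_conn) ? 0 : -1;
}
```

Because the cache lives in per-module, per-interpreter state rather than C globals, each subinterpreter resolves its own copies of the callbacks.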
### Reactor Request Types
| Request Type | Dispatch Function | Execute Function |
|--------------|-------------------|------------------|
| `CTX_REQ_REACTOR_READ` | `dispatch_reactor_read_to_owngil` | `owngil_execute_reactor_read` |
| `CTX_REQ_REACTOR_WRITE` | `dispatch_reactor_write_to_owngil` | `owngil_execute_reactor_write` |
| `CTX_REQ_REACTOR_INIT` | `dispatch_reactor_init_to_owngil` | `owngil_execute_reactor_init` |
### Buffer Handling
For read operations, the `ReactorBuffer` (zero-copy buffer) is passed through:
1. `py_reactor_context` reads data into a `reactor_buffer_resource_t`
2. Buffer pointer stored in `ctx->reactor_buffer_ptr`
3. OWN_GIL thread wraps it in a Python `ReactorBuffer` object
4. Python protocol receives data via buffer protocol (zero-copy)
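The project wraps the buffer in its own `ReactorBuffer` type; the zero-copy idea can be illustrated with the public buffer-protocol API instead (a sketch; the `data`/`len` field names are assumptions):

```c
// Expose the C buffer to Python without copying the bytes. The real code
// constructs a ReactorBuffer; a read-only memoryview over the same memory
// demonstrates the same buffer-protocol handoff.
reactor_buffer_resource_t *buf = ctx->reactor_buffer_ptr;
PyObject *view = PyMemoryView_FromMemory((char *)buf->data,
                                         (Py_ssize_t)buf->len,
                                         PyBUF_READ);
// `view` (or the real ReactorBuffer) is what ends up passed to
// state->reactor_on_read(fd, view).
```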
### Example: TCP Echo Server with OWN_GIL
```erlang
%% Start OWN_GIL context for protocol handling
{ok, Ctx} = py_context:start_link(1, owngil),
%% Define protocol in Python
py_context:exec(Ctx, <<"
import erlang.reactor as reactor

class EchoProtocol(reactor.Protocol):
    def data_received(self, data):
        self.write(data)  # Echo back
        return 'write_pending'
">>),
%% Start reactor with the context
{ok, Reactor} = py_reactor_context:start_link(#{
    context => Ctx,
    protocol_class => <<"EchoProtocol">>
}).
```
## Performance Characteristics
| Operation | Worker (shared GIL) | OWN_GIL |
|-----------|--------------------|---------|
| Call overhead | ~2.5μs | ~10μs |
| Throughput (single context) | ~400K/s | ~100K/s |
| Parallelism (N contexts) | GIL-bound | Linear up to N cores |
| Resource usage | One pthread per context | One pthread + one subinterpreter per context |
## Pros and Cons
### Pros
- **True CPU parallelism.** Each context owns its GIL, so N contexts run on N cores at once. Worker mode serialises on the main GIL unless Python is built free-threaded (3.13t+).
- **Crash isolation.** A C-level fault in one subinterpreter leaves the others alive. Worker mode shares the main interpreter, so a corrupt module state can take everything down.
- **Clean namespace per context.** Each subinterpreter has its own `sys.modules`, so module-level state cannot bleed between contexts. Useful when running adversarial or untrusted code paths side by side.
- **Dedicated execution thread.** Each context's Python work runs on its own pinned pthread, handed off via mutex/condvar IPC, so one context's long-running job never queues behind another context's Python work.
### Cons
- **Python 3.14+ only.** Earlier versions have C-extension global-state bugs (`_decimal`, `numpy`, etc.) that crash inside subinterpreters. See [cpython#106078](https://github.com/python/cpython/issues/106078).
- **Higher per-call latency.** ~4x the round-trip cost of worker mode (~10μs vs ~2.5μs) because every call crosses a mutex/condvar handoff to the dedicated thread.
- **Higher memory.** Each subinterpreter imports its own copy of every module. A 50 MB module set across 8 contexts is ~400 MB resident, not 50 MB.
- **C-extension compatibility is not universal.** Extensions must opt in via the multi-phase init protocol (PEP 489) and `Py_mod_multiple_interpreters`. Pure-Python and well-behaved C extensions work; older ones fail at import inside the subinterpreter.
- **No shared Python state.** Module globals, class definitions, and cached objects are per-interpreter. Use `py:state_store/2` (ETS-backed) or `erlang.send` for cross-context data.
- **Callback re-entry is restricted.** When Python in an OWN_GIL context calls `erlang.call`, the callback runs on a thread worker, not back on the OWN_GIL thread (which cannot suspend). Re-entrant Python -> Erlang -> *same* OWN_GIL context calls will not work; use a different context for the nested call, or use `erlang.async_call` from asyncio code.
- **Process-local envs do not span interpreters.** A `py_env_resource_t` is bound to the interpreter that created it. Reusing one across contexts returns `{error, env_wrong_interpreter}`.
### When to Use Each
Use **OWN_GIL** when:
- The workload is CPU-bound Python (ML inference, numpy/torch compute, parsing, codecs) and you want true N-way parallelism across cores on a single BEAM node.
- You can pin the per-context memory budget and the modules in use are subinterpreter-safe.
- You are on Python 3.14+.
Use **worker** (default) when:
- You are on Python 3.12 or 3.13.
- Calls are short and frequent (every microsecond of overhead matters).
- You are running modules that are not subinterpreter-safe (some scientific stacks, older C extensions).
- You are already running free-threaded Python (3.13t+); worker mode gets parallelism for free without the per-interpreter memory cost.
### Common Pitfalls
- **Importing once is not enough.** Imports happen per subinterpreter. Pre-warming a worker context will not pre-warm the OWN_GIL contexts; do it inside each `py_context`.
- **Sharing Python objects across contexts.** Passing a `PyObject*` reference (via `py_state` or otherwise) between OWN_GIL contexts is undefined behaviour. Round-trip through Erlang terms or ETS-backed state.
- **Long-running tasks block the dispatcher.** A single OWN_GIL context processes one request at a time. If you have a 30-second compute job, parallelise across contexts; do not queue everything onto context 1.
- **Callback storms.** Heavy `erlang.call` use inside an OWN_GIL context routes to thread workers, which is fine, but the round-trip cost is then worker-style on top of OWN_GIL dispatch. For tight callback loops, prefer worker mode end-to-end.
## Benchmarking
Run the benchmark to compare modes on your system:
```bash
rebar3 compile && escript examples/bench_owngil.erl
```
Example output:
```
========================================================
              OWN_GIL vs Worker Benchmark
========================================================

System Information
------------------
Erlang/OTP: 27
Schedulers: 8
Python:     3.14.0
OWN_GIL:    true

1. Single Context Latency (1000 calls to math.sqrt)

   Mode     us/call   calls/sec
   ----     -------   ---------
   worker       2.5      400000
   owngil      10.2       98000

2. Parallel Throughput (4 contexts, 10000 calls each)

   Mode     total_ms   calls/sec
   ----     --------   ---------
   worker      100.5      398000
   owngil       28.3     1415000   <- 3.5x faster

3. CPU-Bound Speedup (fibonacci(30) x 4 contexts)

   Mode     total_ms   speedup
   ----     --------   -------
   worker      800.2      1.0x
   owngil      205.1      3.9x    <- near-linear scaling
```
## Safety Mechanisms
### Interpreter ID Validation
Process-local environments (`py_env_resource_t`) store the Python interpreter ID when created. Before execution, OWN_GIL functions validate that the env belongs to the current interpreter:
```c
PyInterpreterState *current_interp = PyInterpreterState_Get();
if (current_interp != NULL &&
    penv->interp_id != PyInterpreterState_GetID(current_interp)) {
    // Return {error, env_wrong_interpreter}
}
```
This prevents dangling pointer access when an env resource outlives its interpreter.
### Lock Ordering (ABBA Deadlock Prevention)
Lock ordering must be consistent to prevent deadlocks:
**Correct order: GIL first, then namespaces_mutex**
Normal execution path:
```
PyGILState_Ensure()      // 1. Acquire GIL
pthread_mutex_lock()     // 2. Acquire mutex
// ... work ...
pthread_mutex_unlock()   // 3. Release mutex
PyGILState_Release()     // 4. Release GIL
```
Cleanup paths (`event_loop_down`, `event_loop_destructor`) follow the same order:
```c
// For main interpreter: GIL first, then mutex
PyGILState_STATE gstate = PyGILState_Ensure();
pthread_mutex_lock(&loop->namespaces_mutex);
// ... cleanup with Py_XDECREF ...
pthread_mutex_unlock(&loop->namespaces_mutex);
PyGILState_Release(gstate);
```
For subinterpreters (where `PyGILState_Ensure` cannot be used), cleanup skips `Py_DECREF`; the objects are freed when the interpreter is destroyed.
### Callback Re-entry Limitation
OWN_GIL contexts do not support the suspension/resume protocol used for `erlang.call()` callbacks. When Python code in an OWN_GIL context calls `erlang.call()`:
1. The call is routed to `thread_worker_call()` (not the OWN_GIL thread)
2. The call executes on a thread worker, not the calling OWN_GIL context
3. Re-entrant calls back to the same OWN_GIL context are not supported
This is because the OWN_GIL thread cannot be suspended - it owns its GIL and must remain responsive to process requests.
## Files
| File | Description |
|------|-------------|
| `c_src/py_nif.h` | Structure definitions, request types |
| `c_src/py_nif.c` | Thread main, dispatch, execute functions |
| `c_src/py_callback.c` | Callback handling, thread worker dispatch |
| `c_src/py_event_loop.c` | Event loop and namespace management |
| `src/py_context.erl` | Erlang API for context management |
| `test/py_owngil_features_SUITE.erl` | Test suite |