docs/TECHNICAL_DESIGN.md

Select File:
docs/TECHNICAL_DESIGN.md

# DuckDB Elixir Port - Technical Design Document

## Overview

This document provides the comprehensive technical design for porting the DuckDB Python client to Elixir. This is a **100% exact port** of the Python client functionality to Elixir, maintaining API parity while following Elixir conventions and idioms.

## Source Reference

**Primary Reference**: `duckdb-python/` directory in this repository

All implementation decisions should reference the Python implementation for:
- API surface area and function signatures
- Behavior and semantics
- Error handling patterns
- Type conversions
- Edge cases and special handling

## Architecture Overview

### Current Implementation (CLI)

DuckdbEx currently uses the DuckDB CLI in JSON mode managed via `erlexec`. The
`DuckdbEx.Port` GenServer owns the CLI process and parses ordered JSON results
into tuple rows. Connections are PIDs; cursors are lightweight wrappers around
the connection PID to avoid extra file locks in the CLI backend.

Environment configuration:
- `DUCKDB_PATH` to locate the CLI binary.
- `DUCKDB_EX_EXEC_AS_ROOT` to force root execution when required in containers.
- `mix duckdb_ex.install` installs the CLI into the project `priv/duckdb/duckdb` and is picked up automatically.

### Core Components

The DuckDB Elixir client mirrors the Python architecture with the following main components:

1. **Connection** (`DuckdbEx.Connection`)
   - Maps to `DuckDBPyConnection` (pyconnection.hpp)
   - Handles database connection lifecycle
   - Manages transactions
   - Executes queries and statements

2. **Relation** (`DuckdbEx.Relation`)
   - Maps to `DuckDBPyRelation` (pyrelation.hpp)
   - Lazy query builder pattern
   - Chainable query operations
   - Result materialization

3. **Result** (`DuckdbEx.Result`)
   - Maps to `DuckDBPyResult` (pyresult.hpp)
   - Query result container
   - Multiple fetch modes
   - Type conversions

4. **Type System** (`DuckdbEx.Type`)
   - Maps to `DuckDBPyType` and related type classes
   - Type creation and introspection
   - Custom type support (enum, decimal, struct, etc.)

5. **Exceptions** (`DuckdbEx.Exceptions`)
   - All DuckDB exception types
   - Maps Python exception hierarchy to Elixir

## Future NIF Layer Design (Not Implemented)

The sections below describe a potential Rustler-based NIF architecture. The
current implementation does **not** use NIFs; it uses the CLI-based port
described above.

### Technology Stack

- **Rustler**: Elixir NIF framework using Rust
- **DuckDB Rust bindings**: Interface to DuckDB C++ library
- **Architecture**: Rust NIF layer between Elixir and DuckDB

### Why Rustler?

1. **Safety**: Memory-safe bindings to C++
2. **Performance**: Near-native performance
3. **Ergonomics**: Better developer experience than C NIFs
4. **Compatibility**: Excellent DuckDB Rust bindings exist

### NIF Module Structure

```
native/duckdb_nif/
├── src/
│   ├── lib.rs              # NIF entry point
│   ├── connection.rs       # Connection resource and methods
│   ├── relation.rs         # Relation resource and methods
│   ├── result.rs           # Result handling
│   ├── types.rs            # Type system
│   ├── exceptions.rs       # Exception mapping
│   ├── conversions.rs      # Elixir ↔ DuckDB type conversions
│   ├── arrow.rs            # Arrow integration
│   └── filesystem.rs       # Filesystem integration
└── Cargo.toml
```

## Module Structure

### Elixir Module Hierarchy

```
lib/duckdb_ex/
├── connection.ex           # Main connection module
├── relation.ex             # Relation/query builder
├── result.ex               # Result handling
├── exceptions.ex           # Exception definitions
├── port.ex                 # DuckDB CLI process management
├── default_connection.ex   # Default connection holder
├── cursor.ex               # Cursor wrapper
├── parameters.ex           # SQL parameter interpolation
├── statement.ex            # Statement metadata
├── statement_type.ex       # Statement type enum
└── expected_result_type.ex # Expected result enum

# Planned modules (not yet implemented)
# - type.ex, expression/*, value/*, arrow/*, etc.
```

## API Surface

### Connection API

The Connection module provides the primary interface to DuckDB. Reference: `duckdb-python/src/duckdb_py/include/duckdb_python/pyconnection/pyconnection.hpp`

```elixir
defmodule DuckdbEx.Connection do
  # Connection Management
  @spec connect(String.t() | :memory, keyword()) :: {:ok, t()} | {:error, term()}
  @spec close(t()) :: :ok
  @spec interrupt(t()) :: :ok

  # Transactions
  @spec begin(t()) :: {:ok, t()} | {:error, term()}
  @spec commit(t()) :: {:ok, t()} | {:error, term()}
  @spec rollback(t()) :: {:ok, t()} | {:error, term()}
  @spec checkpoint(t()) :: {:ok, t()} | {:error, term()}

  # Query Execution
  @spec execute(t(), String.t(), list()) :: {:ok, t()} | {:error, term()}
  @spec execute_many(t(), String.t(), list()) :: {:ok, t()} | {:error, term()}
  @spec query(t(), String.t(), String.t(), list()) :: {:ok, Relation.t()} | {:error, term()}
  @spec sql(t(), String.t(), list()) :: {:ok, Relation.t()} | {:error, term()}

  # Data Sources
  @spec read_csv(t(), String.t() | list(String.t()), keyword()) :: {:ok, Relation.t()}
  @spec read_json(t(), String.t() | list(String.t()), keyword()) :: {:ok, Relation.t()}
  @spec read_parquet(t(), String.t() | list(String.t()), keyword()) :: {:ok, Relation.t()}
  @spec from_arrow(t(), term()) :: {:ok, Relation.t()}
  @spec from_df(t(), term()) :: {:ok, Relation.t()}
  @spec from_query(t(), String.t()) :: {:ok, Relation.t()}

  # Table/View Access
  @spec table(t(), String.t()) :: {:ok, Relation.t()}
  @spec view(t(), String.t()) :: {:ok, Relation.t()}
  @spec values(t(), list()) :: {:ok, Relation.t()}
  @spec table_function(t(), String.t(), list()) :: {:ok, Relation.t()}

  # Schema Operations
  @spec get_table_names(t(), String.t(), boolean()) :: {:ok, list(String.t())}

  # UDF Registration
  @spec create_function(t(), String.t(), (... -> term()), keyword()) :: {:ok, t()}
  @spec remove_function(t(), String.t()) :: {:ok, t()}

  # Type Creation
  @spec map_type(t(), Type.t(), Type.t()) :: Type.t()
  @spec struct_type(t(), keyword()) :: Type.t()
  @spec list_type(t(), Type.t()) :: Type.t()
  @spec array_type(t(), Type.t(), non_neg_integer()) :: Type.t()
  @spec union_type(t(), keyword()) :: Type.t()
  @spec enum_type(t(), String.t(), Type.t(), list()) :: Type.t()
  @spec decimal_type(t(), integer(), integer()) :: Type.t()
  @spec string_type(t(), String.t()) :: Type.t()
  @spec type(t(), String.t()) :: Type.t()

  # Extensions
  @spec install_extension(t(), String.t(), keyword()) :: :ok
  @spec load_extension(t(), String.t()) :: :ok

  # Object Registration
  @spec register(t(), String.t(), term()) :: {:ok, t()}
  @spec unregister(t(), String.t()) :: {:ok, t()}
  @spec append(t(), String.t(), term(), keyword()) :: {:ok, t()}

  # Filesystem
  @spec register_filesystem(t(), term()) :: :ok
  @spec unregister_filesystem(t(), String.t()) :: :ok
  @spec list_filesystems(t()) :: list(String.t())
  @spec filesystem_is_registered(t(), String.t()) :: boolean()

  # Result Fetching (for DBAPI compatibility)
  @spec fetch_one(t()) :: {:ok, tuple() | nil}
  @spec fetch_many(t(), non_neg_integer()) :: {:ok, list(tuple())}
  @spec fetch_all(t()) :: {:ok, list(tuple())}
  @spec fetch_df(t(), keyword()) :: {:ok, term()}
  @spec fetch_arrow(t(), keyword()) :: {:ok, term()}

  # Progress Tracking
  @spec query_progress(t()) :: float()

  # Statement Extraction
  @spec extract_statements(t(), String.t()) :: list(term())
end
```

### Relation API

The Relation module provides the lazy query builder interface. Reference: `duckdb-python/src/duckdb_py/include/duckdb_python/pyrelation.hpp`

```elixir
defmodule DuckdbEx.Relation do
  # Basic Operations
  @spec project(t(), list(String.t())) :: t()
  @spec filter(t(), String.t() | Expression.t()) :: t()
  @spec limit(t(), integer(), integer()) :: t()
  @spec order(t(), String.t()) :: t()
  @spec sort(t(), list(String.t())) :: t()
  @spec distinct(t()) :: t()
  @spec unique(t(), String.t()) :: t()

  # Aggregations
  @spec aggregate(t(), String.t() | list(String.t()), String.t()) :: t()
  @spec count(t(), String.t(), keyword()) :: t()
  @spec sum(t(), String.t(), keyword()) :: t()
  @spec avg(t(), String.t(), keyword()) :: t()
  @spec min(t(), String.t(), keyword()) :: t()
  @spec max(t(), String.t(), keyword()) :: t()
  @spec median(t(), String.t(), keyword()) :: t()
  @spec mode(t(), String.t(), keyword()) :: t()
  @spec stddev(t(), String.t(), keyword()) :: t()
  @spec variance(t(), String.t(), keyword()) :: t()
  @spec first(t(), String.t(), keyword()) :: t()
  @spec last(t(), String.t(), keyword()) :: t()
  @spec list(t(), String.t(), keyword()) :: t()
  @spec string_agg(t(), String.t(), String.t(), keyword()) :: t()

  # Window Functions
  @spec row_number(t(), String.t(), String.t()) :: t()
  @spec rank(t(), String.t(), String.t()) :: t()
  @spec dense_rank(t(), String.t(), String.t()) :: t()
  @spec percent_rank(t(), String.t(), String.t()) :: t()
  @spec cume_dist(t(), String.t(), String.t()) :: t()
  @spec ntile(t(), String.t(), integer(), String.t()) :: t()
  @spec lag(t(), String.t(), String.t(), keyword()) :: t()
  @spec lead(t(), String.t(), String.t(), keyword()) :: t()
  @spec first_value(t(), String.t(), keyword()) :: t()
  @spec last_value(t(), String.t(), keyword()) :: t()
  @spec nth_value(t(), String.t(), String.t(), integer(), keyword()) :: t()

  # Set Operations
  @spec union(t(), t()) :: t()
  @spec except(t(), t()) :: t()
  @spec intersect(t(), t()) :: t()

  # Joins
  @spec join(t(), t(), String.t() | Expression.t(), String.t()) :: t()
  @spec cross(t(), t()) :: t()

  # Transformations
  @spec map(t(), (term() -> term()), keyword()) :: t()

  # Execution & Fetching
  @spec execute(t()) :: t()
  @spec fetch_one(t()) :: {:ok, tuple() | nil}
  @spec fetch_many(t(), non_neg_integer()) :: {:ok, list(tuple())}
  @spec fetch_all(t()) :: {:ok, list(tuple())}
  @spec fetch_df(t(), keyword()) :: {:ok, term()}
  @spec fetch_arrow(t(), keyword()) :: {:ok, term()}
  @spec fetch_record_batch(t(), keyword()) :: {:ok, term()}

  # Export
  @spec to_csv(t(), String.t(), keyword()) :: :ok
  @spec to_parquet(t(), String.t(), keyword()) :: :ok
  @spec to_arrow_table(t(), keyword()) :: {:ok, term()}
  @spec to_arrow_capsule(t(), keyword()) :: {:ok, term()}

  # Table/View Creation
  @spec create_view(t(), String.t(), keyword()) :: t()
  @spec insert_into(t(), String.t()) :: :ok
  @spec insert(t(), list()) :: :ok
  @spec update(t(), keyword(), String.t()) :: :ok
  @spec create(t(), String.t()) :: :ok

  # Introspection
  @spec describe(t()) :: t()
  @spec length(t()) :: non_neg_integer()
  @spec shape(t()) :: {non_neg_integer(), non_neg_integer()}
  @spec columns(t()) :: list(String.t())
  @spec column_types(t()) :: list(String.t())
  @spec description(t()) :: list(tuple())
  @spec type(t()) :: String.t()
  @spec alias(t()) :: String.t()
  @spec set_alias(t(), String.t()) :: t()

  # SQL Generation
  @spec to_sql(t()) :: String.t()
  @spec explain(t(), atom()) :: String.t()

  # Display
  @spec to_string(t()) :: String.t()
  @spec print(t(), keyword()) :: :ok
end
```

### Type System

Reference: `duckdb-python/duckdb/typing/__init__.py` and type-related C++ code

```elixir
defmodule DuckdbEx.Type do
  # Type struct
  defstruct [:id, :internal_type, :metadata]

  # Standard Types (constants)
  @type standard_type ::
    :boolean | :tinyint | :smallint | :integer | :bigint |
    :hugeint | :utinyint | :usmallint | :uinteger | :ubigint |
    :float | :double | :decimal | :varchar | :blob |
    :timestamp | :timestamp_s | :timestamp_ms | :timestamp_ns |
    :date | :time | :interval | :uuid | :json

  # Composite type constructors
  @spec list(t()) :: t()
  @spec array(t(), non_neg_integer()) :: t()
  @spec map(t(), t()) :: t()
  @spec struct(keyword()) :: t()
  @spec union(keyword()) :: t()
  @spec enum(String.t(), t(), list()) :: t()
  @spec decimal(integer(), integer()) :: t()
end
```

### Exception Hierarchy

Reference: `duckdb-python/duckdb/__init__.py` exception imports

```elixir
defmodule DuckdbEx.Exceptions do
  # Base exceptions
  defmodule Error, do: defexception [:message]
  defmodule Warning, do: defexception [:message]

  # Specific exception types
  defmodule DatabaseError, do: defexception [:message]
  defmodule DataError, do: defexception [:message]
  defmodule OperationalError, do: defexception [:message]
  defmodule IntegrityError, do: defexception [:message]
  defmodule InternalError, do: defexception [:message]
  defmodule ProgrammingError, do: defexception [:message]
  defmodule NotSupportedError, do: defexception [:message]

  # DuckDB-specific exceptions
  defmodule BinderException, do: defexception [:message]
  defmodule CatalogException, do: defexception [:message]
  defmodule ConnectionException, do: defexception [:message]
  defmodule ConstraintException, do: defexception [:message]
  defmodule ConversionException, do: defexception [:message]
  defmodule DependencyException, do: defexception [:message]
  defmodule FatalException, do: defexception [:message]
  defmodule HTTPException, do: defexception [:message]
  defmodule InternalException, do: defexception [:message]
  defmodule InterruptException, do: defexception [:message]
  defmodule InvalidInputException, do: defexception [:message]
  defmodule InvalidTypeException, do: defexception [:message]
  defmodule IOException, do: defexception [:message]
  defmodule NotImplementedException, do: defexception [:message]
  defmodule OutOfMemoryException, do: defexception [:message]
  defmodule OutOfRangeException, do: defexception [:message]
  defmodule ParserException, do: defexception [:message]
  defmodule PermissionException, do: defexception [:message]
  defmodule SequenceException, do: defexception [:message]
  defmodule SerializationException, do: defexception [:message]
  defmodule SyntaxException, do: defexception [:message]
  defmodule TransactionException, do: defexception [:message]
  defmodule TypeMismatchException, do: defexception [:message]
end
```

## Data Type Mapping

### DuckDB → Elixir Type Conversions

| DuckDB Type | Elixir Type | Notes |
|-------------|-------------|-------|
| BOOLEAN | boolean() | Direct mapping |
| TINYINT/SMALLINT/INTEGER | integer() | Arbitrary precision |
| BIGINT | integer() | Arbitrary precision |
| HUGEINT | integer() | Arbitrary precision |
| FLOAT/DOUBLE | float() | IEEE 754 |
| DECIMAL | Decimal.t() | Use Decimal library |
| VARCHAR/TEXT | String.t() | UTF-8 strings |
| BLOB | binary() | Raw bytes |
| DATE | Date.t() | Elixir Date |
| TIME | Time.t() | Elixir Time |
| TIMESTAMP | DateTime.t() / NaiveDateTime.t() | With/without timezone |
| INTERVAL | DuckdbEx.Interval.t() | Custom struct |
| UUID | String.t() | String representation |
| JSON | term() | Decoded JSON via Jason |
| LIST | list() | Elixir lists |
| STRUCT | map() | Elixir maps |
| MAP | %{} | Elixir maps |
| UNION | tagged tuple | {tag, value} |
| ENUM | atom() or String.t() | Configurable |

### Parameter Binding

Support both positional and named parameters:

```elixir
# Positional
DuckdbEx.Connection.execute(conn, "SELECT * FROM users WHERE id = ?", [42])

# Named (similar to Python)
DuckdbEx.Connection.execute(conn, "SELECT * FROM users WHERE id = :id", [id: 42])
```

## Module-Level API (Default Connection)

Reference: `duckdb-python/duckdb/__init__.py` for module-level functions

```elixir
defmodule DuckdbEx do
  # Default connection management
  @spec default_connection() :: Connection.t()
  @spec set_default_connection(Connection.t()) :: :ok

  # Convenience functions using default connection
  @spec connect(String.t() | :memory, keyword()) :: {:ok, Connection.t()}
  @spec close() :: :ok
  @spec execute(String.t(), list()) :: {:ok, Connection.t()}
  @spec query(String.t(), list()) :: {:ok, Relation.t()}
  @spec sql(String.t(), list()) :: {:ok, Relation.t()}
  @spec read_csv(String.t(), keyword()) :: {:ok, Relation.t()}
  @spec read_json(String.t(), keyword()) :: {:ok, Relation.t()}
  @spec read_parquet(String.t(), keyword()) :: {:ok, Relation.t()}
  @spec table(String.t()) :: {:ok, Relation.t()}
  @spec values(list()) :: {:ok, Relation.t()}
end
```

## Resource Management

### NIF Resources

Use Rustler resources for:

1. **Connection Resource**: Wraps DuckDB connection handle
2. **Relation Resource**: Wraps DuckDB relation handle
3. **Result Resource**: Wraps query results

```rust
// Example resource definition
#[derive(NifStruct)]
#[module = "DuckdbEx.Native.Connection"]
pub struct ConnectionResource {
    // Internal DuckDB connection
    inner: Arc<Mutex<duckdb::Connection>>,
}

#[derive(NifStruct)]
#[module = "DuckdbEx.Native.Relation"]
pub struct RelationResource {
    inner: Arc<Mutex<duckdb::Relation>>,
    connection: ConnectionResource,
}
```

### Resource Lifecycle

- Resources cleaned up by BEAM GC
- Explicit close() methods for deterministic cleanup
- Connection pool support for multiple connections

## Error Handling Strategy

### NIF Error Propagation

```rust
// In Rust NIF
fn execute_query(conn: ResourceArc<ConnectionResource>, query: String)
    -> Result<RelationResource, Error> {
    conn.inner.lock()
        .execute(&query)
        .map(|rel| RelationResource::new(rel, conn.clone()))
        .map_err(|e| Error::DatabaseError(e.to_string()))
}
```

### Elixir Error Handling

```elixir
# Pattern 1: {:ok, result} | {:error, exception}
case DuckdbEx.Connection.execute(conn, "SELECT * FROM invalid") do
  {:ok, result} -> process(result)
  {:error, %DuckdbEx.Exceptions.CatalogException{} = e} ->
    Logger.error("Table not found: #{e.message}")
end

# Pattern 2: Raise on error (bang methods)
result = DuckdbEx.Connection.execute!(conn, "SELECT * FROM users")
```

## Testing Strategy

### Test Structure

Reference: `duckdb-python/tests/` for comprehensive test coverage

```
test/
├── duckdb_ex_test.exs              # Module-level API tests
├── connection_test.exs             # Connection tests
├── relation_test.exs               # Relation/query builder tests
├── result_test.exs                 # Result handling tests
├── type_test.exs                   # Type system tests
├── exception_test.exs              # Exception handling tests
├── integration/
│   ├── csv_test.exs               # CSV reading/writing
│   ├── parquet_test.exs           # Parquet integration
│   ├── arrow_test.exs             # Arrow integration
│   ├── transaction_test.exs       # Transaction tests
│   ├── udf_test.exs               # User-defined functions
│   └── filesystem_test.exs        # Filesystem integration
└── support/
    ├── test_helper.exs
    └── fixtures/
        ├── test.csv
        ├── test.parquet
        └── test.json
```

### Test Patterns

1. **Property-based testing** (StreamData)
2. **Comparison testing** against Python client
3. **Integration tests** with real DuckDB operations
4. **Concurrent access tests**
5. **Memory leak detection**

## Integration Points

### Arrow Integration

Support Apache Arrow for zero-copy data exchange:

```elixir
defmodule DuckdbEx.Arrow do
  @spec to_arrow_table(Relation.t(), keyword()) :: {:ok, term()}
  @spec from_arrow(Connection.t(), term()) :: {:ok, Relation.t()}
  @spec to_arrow_capsule(Relation.t()) :: {:ok, term()}
end
```

### Explorer Integration

Provide first-class integration with Elixir's Explorer library:

```elixir
defmodule DuckdbEx.Explorer do
  @spec to_dataframe(Relation.t()) :: Explorer.DataFrame.t()
  @spec from_dataframe(Connection.t(), Explorer.DataFrame.t()) :: Relation.t()
end
```

### Nx Integration

Support Nx tensors for numerical computing:

```elixir
defmodule DuckdbEx.Nx do
  @spec to_tensor(Relation.t()) :: Nx.Tensor.t()
  @spec from_tensor(Connection.t(), Nx.Tensor.t(), keyword()) :: Relation.t()
end
```

## Performance Considerations

### Optimization Strategies

1. **Lazy Evaluation**: Relations build query plans without execution
2. **Streaming Results**: Support for chunked result fetching
3. **Zero-Copy**: Use Arrow for data transfer where possible
4. **Connection Pooling**: DBConnection-compatible pool
5. **Prepared Statements**: Cache compiled queries
6. **Batch Operations**: Efficient bulk inserts

### Benchmarking

Compare performance against:
- Python duckdb client
- PostgreSQL (Postgrex)
- SQLite (Exqlite)

## Security Considerations

1. **SQL Injection**: Use parameterized queries exclusively
2. **Resource Limits**: Configurable memory limits
3. **Path Traversal**: Validate file paths for read_csv, etc.
4. **Extension Loading**: Optional restrictions on loading extensions

## Configuration

```elixir
# config/config.exs
config :duckdb_ex,
  default_connection: [
    database: ":memory:",
    config: [
      threads: 4,
      max_memory: "1GB",
      temp_directory: "/tmp/duckdb"
    ]
  ],
  pool: [
    size: 10,
    max_overflow: 5
  ]
```

## Migration from Python

### API Compatibility Table

| Python API | Elixir API | Notes |
|------------|------------|-------|
| `duckdb.connect()` | `DuckdbEx.connect()` | Returns {:ok, conn} tuple |
| `con.execute()` | `Connection.execute()` | Same parameters |
| `con.sql()` | `Connection.sql()` | Returns Relation |
| `rel.filter()` | `Relation.filter()` | Chainable |
| `rel.fetchall()` | `Relation.fetch_all()` | Returns {:ok, list} |
| `duckdb.read_csv()` | `DuckdbEx.read_csv()` | Same options |

### Migration Guide

Document to be created showing side-by-side examples of Python vs Elixir code.

## Documentation Requirements

1. **ExDoc**: Complete module and function documentation
2. **Guides**: Getting started, cookbook, migration guide
3. **Examples**: Real-world usage examples
4. **Changelog**: Track changes and version compatibility
5. **Type Specs**: Complete @spec for all public functions

## Future Enhancements

1. **Ecto Adapter**: Full Ecto integration
2. **Phoenix Integration**: LiveView components
3. **Telemetry**: Instrumentation for observability
4. **Distributed Queries**: Multi-node query execution
5. **Custom Extensions**: Elixir-based DuckDB extensions

## References

- DuckDB Python source: `duckdb-python/` directory
- DuckDB Documentation: https://duckdb.org/docs
- DuckDB Rust: https://github.com/duckdb/duckdb-rs
- Rustler: https://github.com/rusterlium/rustler
- DB Connection: https://github.com/elixir-ecto/db_connection