README.md

# Natch

⚑ **High-performance native ClickHouse client for Elixir**

Natch provides fast access to ClickHouse using the native TCP protocol (port 9000) via C++ NIFs. Native protocol benefits include binary columnar format, efficient compression, and reduced overhead compared to HTTP-based clients.

## Why Natch?

- πŸš€ **Native Protocol Performance** - Direct TCP connection using ClickHouse's binary protocol
- πŸ“Š **Columnar-First Design** - API designed for analytics workloads, not OLTP
- πŸ”§ **Production Ready** - 227 tests covering all ClickHouse types including complex nested structures
- πŸ’ͺ **Type Complete** - Full support for all ClickHouse types: primitives, dates, decimals, UUIDs, arrays, maps, tuples, nullables, enums, and low cardinality
- 🎯 **Zero-Copy Efficiency** - Bulk operations with minimal overhead
- πŸ”’ **Memory Safe** - Built with FINE for crash-proof NIFs

## Requirements

- **Elixir**: 1.18+ / Erlang 27+
- **ClickHouse**: Server 20.3+
- **Build**: C++17 compiler, CMake 3.15+, clickhouse-cpp dependencies

## Installation

Add `natch` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:natch, "~> 0.2.0"}
  ]
end
```

**Prebuilt binaries** are available for macOS (x86_64, ARM64) and Linux (x86_64, ARM64) and will be downloaded automatically during installation.

### Building from Source

If prebuilt binaries are not available for your platform, or if you prefer to build from source:

```bash
# Clone the repository
git clone https://github.com/Intellection/natch.git
cd natch

# Initialize the clickhouse-cpp submodule
git submodule update --init --recursive

# Build
mix deps.get
mix compile
```

**Build Requirements:**
- C++17 compiler (GCC 7+, Clang 5+, or MSVC 2017+)
- CMake 3.15+
- OpenSSL development headers
- Git (for submodule)

## Quick Start

### Local ClickHouse

```elixir
# Start a connection
{:ok, conn} = Natch.Connection.start_link(
  host: "localhost",
  port: 9000,
  database: "default"
)

# Create a table
Natch.Connection.execute(conn, """
CREATE TABLE events (
  id UInt64,
  user_id UInt32,
  event_type LowCardinality(String),
  properties Map(String, String),
  tags Array(String),
  timestamp DateTime,
  metadata Nullable(String)
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
""")

# Insert data (columnar format - optimal performance!)
columns = %{
  id: [1, 2, 3],
  user_id: [100, 101, 100],
  event_type: ["click", "view", "click"],
  properties: [
    %{"page" => "home", "referrer" => "google"},
    %{"page" => "about"},
    %{"page" => "pricing"}
  ],
  tags: [["web", "desktop"], ["mobile"], ["web"]],
  timestamp: [~U[2024-01-01 10:00:00Z], ~U[2024-01-01 10:01:00Z], ~U[2024-01-01 10:02:00Z]],
  metadata: ["extra", nil, "data"]
}

schema = [
  id: :uint64,
  user_id: :uint32,
  event_type: {:low_cardinality, :string},
  properties: {:map, :string, :string},
  tags: {:array, :string},
  timestamp: :datetime,
  metadata: {:nullable, :string}
]

:ok = Natch.insert(conn, "events", columns, schema)

# Query data
{:ok, results} = Natch.Connection.select_rows(conn, "SELECT * FROM events WHERE user_id = 100")
IO.inspect(results)
# => [
#      %{id: 1, user_id: 100, event_type: "click", ...},
#      %{id: 3, user_id: 100, event_type: "click", ...}
#    ]
```

### ClickHouse Cloud (SSL)

ClickHouse Cloud requires SSL/TLS connections on port 9440:

```elixir
{:ok, conn} = Natch.Connection.start_link(
  host: "your-instance.clickhouse.cloud",
  port: 9440,
  database: "default",
  user: "default",
  password: "your-password",
  ssl: true  # Enable SSL/TLS
)
```

**Note:** SSL support requires clickhouse-cpp to be built with OpenSSL. If you get a `Natch.OpenSSLError` saying "Library was built with no SSL support", the C++ library needs to be rebuilt with `-DWITH_OPENSSL=ON` CMake flag. This is typically handled automatically by package managers on systems with OpenSSL development libraries installed.

### Timeout Configuration

Configure socket-level timeouts to prevent operations from hanging indefinitely in production:

```elixir
{:ok, conn} = Natch.Connection.start_link(
  host: "localhost",
  port: 9000,
  connect_timeout: 5_000,   # Time to establish TCP connection (default: 5000ms)
  recv_timeout: 60_000,     # Time to receive data from server (default: 0 = infinite)
  send_timeout: 60_000      # Time to send data to server (default: 0 = infinite)
)
```

**Important:** The default `recv_timeout` is 0 (no timeout), which allows long-running analytical queries to complete. For production use, consider setting explicit timeouts based on your workload. When a timeout occurs, a `Natch.ConnectionError` is raised.

## Benchmarks

Real-world performance comparison vs Pillar (HTTP-based client) on M3 Pro, tested with 7-column schema.

**Important:** Benchmarks use `Pillar.select/2` which parses JSON responses. Using `Pillar.query/2` (which returns unparsed TSV strings) is not a fair comparison.

### INSERT Performance

| Rows | Natch | Pillar | Speedup | Memory (Natch) | Memory (Pillar) |
|------|------|--------|---------|---------------|-----------------|
| 10k | 13.5 ms | 63.9 ms | **4.7x faster** | 976 B | 45 MB |
| 100k | 184 ms | 626 ms | **3.4x faster** | 976 B | 452 MB |
| 1M | 2,094 ms | 5,545 ms | **2.6x faster** | 976 B | 4.5 GB |

**Natch uses ~4.6 million times less memory** than Pillar for inserts due to columnar format.

### SELECT Performance

| Query Type | Natch | Pillar | Speedup | Memory (Natch) | Memory (Pillar) |
|------------|------|--------|---------|---------------|-----------------|
| Aggregation | 3.6 ms | 4.9 ms | **1.4x faster** | 544 B | 17 KB |
| Filtered (10k rows) | 12 ms | 53 ms | **4.4x faster** | 128 B | 30 MB |
| Full scan (1M rows) | 802 ms | 4,980 ms | **6.2x faster** | 128 B | 3 GB |

**Natch uses ~5.5 million times less memory** than Pillar for large SELECT queries due to streaming columnar format vs materialized row-oriented maps.

### Key Takeaways

- **Native protocol is faster** - Natch's native TCP protocol with binary columnar format outperforms HTTP+JSON
- **Massive memory efficiency** - Millions of times less memory usage due to streaming and columnar format
- **Scales better** - Performance advantage increases with data size (6.2x for 1M rows vs 1.4x for aggregations)

See `bench/README.md` and `BINARY_PASSTHROUGH.md` for detailed analysis and methodology.

## Core Concepts

### Columnar Format (Recommended)

Natch uses a **columnar-first API** that matches ClickHouse's native storage format:

```elixir
# βœ… GOOD: Columnar format - 3 NIF calls for any number of rows
columns = %{
  id: [1, 2, 3, 4, 5],
  name: ["Alice", "Bob", "Charlie", "Dave", "Eve"],
  value: [100.0, 200.0, 300.0, 400.0, 500.0]
}

Natch.insert(conn, "table", columns, schema)
```

Why columnar?
- **100x faster** - M NIF calls instead of NΓ—M (rows Γ— columns)
- **Natural fit** - ClickHouse is a columnar database
- **Analytics-first** - Matches how you work with data (SUM, AVG, GROUP BY operate on columns)
- **Better compression** - Column values compressed together

### Type System

Natch supports **all ClickHouse types** with full roundtrip fidelity:

#### Primitive Types
```elixir
schema = [
  id: :uint64,           # UInt8, UInt16, UInt32, UInt64
  count: :int32,         # Int8, Int16, Int32, Int64
  price: :float64,       # Float32, Float64
  name: :string,         # String
  active: :bool          # Bool (UInt8)
]
```

#### Date and Time
```elixir
schema = [
  created: :date,        # Date (days since epoch)
  updated: :datetime,    # DateTime (seconds since epoch)
  logged: :datetime64    # DateTime64(6) - microsecond precision
]

# Works with Elixir DateTime structs or integers
columns = %{
  created: [~D[2024-01-01], ~D[2024-01-02]],
  updated: [~U[2024-01-01 10:00:00Z], ~U[2024-01-01 11:00:00Z]],
  logged: [~U[2024-01-01 10:00:00.123456Z], 1704103200123456]
}
```

#### Decimals and UUIDs
```elixir
schema = [
  amount: :decimal64,    # Decimal64(9) - fixed-point decimals
  user_id: :uuid         # UUID - 128-bit identifiers
]

columns = %{
  amount: [Decimal.new("99.99"), Decimal.new("149.50")],
  user_id: ["550e8400-e29b-41d4-a716-446655440000", "6ba7b810-9dad-11d1-80b4-00c04fd430c8"]
}
```

#### Nullable Types
```elixir
schema = [
  description: {:nullable, :string},
  count: {:nullable, :uint64}
]

columns = %{
  description: ["text", nil, "more text"],
  count: [100, nil, 200]
}
```

#### Arrays
```elixir
schema = [
  tags: {:array, :string},
  matrix: {:array, {:array, :uint64}},           # Nested arrays
  nullable_list: {:array, {:nullable, :string}}  # Arrays with nulls
]

columns = %{
  tags: [["web", "mobile"], ["desktop"], []],
  matrix: [[[1, 2], [3, 4]], [[5, 6]]],
  nullable_list: [["a", nil, "b"], [nil, "c"]]
}
```

#### Maps and Tuples
```elixir
schema = [
  properties: {:map, :string, :uint64},
  location: {:tuple, [:string, :float64, :float64]},
  metrics: {:map, :string, {:nullable, :uint64}}  # Maps with nullable values
]

columns = %{
  properties: [%{"clicks" => 10, "views" => 100}, %{"shares" => 5}],
  location: [{"NYC", 40.7128, -74.0060}, {"LA", 34.0522, -118.2437}],
  metrics: [%{"count" => 100, "missing" => nil}, %{"total" => nil}]
}
```

#### Enums and LowCardinality
```elixir
schema = [
  status: {:enum8, [{"pending", 1}, {"active", 2}, {"archived", 3}]},
  category: {:low_cardinality, :string},
  tags: {:array, {:low_cardinality, {:nullable, :string}}}  # Complex nesting!
]

columns = %{
  status: ["pending", "active", "pending"],
  category: ["news", "sports", "news"],
  tags: [["tech", nil], ["sports"], ["tech", "startup"]]
}
```

## Usage Guide

### Connection Management

```elixir
# Basic connection
{:ok, conn} = Natch.Connection.start_link(
  host: "localhost",
  port: 9000
)

# With authentication and options
{:ok, conn} = Natch.Connection.start_link(
  host: "clickhouse.example.com",
  port: 9000,
  database: "analytics",
  user: "app_user",
  password: "secret",
  compression: :lz4,
  name: MyApp.ClickHouse
)
```

Connection options:
- `:host` - Server hostname (default: `"localhost"`)
- `:port` - Native TCP port (default: `9000`)
- `:database` - Database name (default: `"default"`)
- `:user` - Username (optional)
- `:password` - Password (optional)
- `:compression` - Compression: `:lz4`, `:none` (default: `:lz4`)
- `:name` - Register connection with a name (optional)

### Executing Queries

#### DDL Operations
```elixir
# Create table
:ok = Natch.Connection.execute(conn, """
CREATE TABLE users (
  id UInt64,
  name String,
  created DateTime
) ENGINE = MergeTree()
ORDER BY id
""")

# Drop table
:ok = Natch.Connection.execute(conn, "DROP TABLE users")

# Alter table
:ok = Natch.Connection.execute(conn, "ALTER TABLE users ADD COLUMN age UInt8")
```

#### SELECT Queries

Natch provides two query formats to suit different use cases:

##### Row-Major Format (Traditional)
Returns results as a list of maps, where each map represents a row:

```elixir
# Simple query
{:ok, rows} = Natch.Connection.select_rows(conn, "SELECT * FROM users")
# => {:ok, [%{id: 1, name: "Alice"}, %{id: 2, name: "Bob"}]}

# With WHERE clause
{:ok, rows} = Natch.Connection.select_rows(conn, "SELECT * FROM users WHERE id > 100")

# Aggregations
{:ok, [result]} = Natch.Connection.select_rows(conn, """
  SELECT
    event_type,
    count() as count,
    uniqExact(user_id) as unique_users
  FROM events
  GROUP BY event_type
  ORDER BY count DESC
""")
```

##### Columnar Format (Efficient for Analytics)
Returns results as a map of column lists, ideal for large result sets and data analysis:

```elixir
# Query returns columnar format
{:ok, cols} = Natch.Connection.select_cols(conn, "SELECT * FROM users")
# => {:ok, %{id: [1, 2, 3], name: ["Alice", "Bob", "Charlie"]}}

# Perfect for analytics workflows
{:ok, data} = Natch.Connection.select_cols(conn, "SELECT user_id, value FROM events")
# => {:ok, %{user_id: [1, 2, 1, 3], value: [10.5, 20.0, 15.5, 30.0]}}

# Easy integration with data processing libraries
%{user_id: user_ids, value: values} = data
total = Enum.sum(values)
```

### Inserting Data

#### High-Level API (Recommended)
```elixir
# Columnar format - optimal performance
columns = %{
  id: [1, 2, 3],
  name: ["Alice", "Bob", "Charlie"]
}

schema = [id: :uint64, name: :string]

:ok = Natch.insert(conn, "users", columns, schema)
```

#### Low-Level API (Advanced)
```elixir
# Build block manually for maximum control
block = Natch.Native.block_create()

# Create and populate columns
id_col = Natch.Column.new(:uint64)
Natch.Column.append_bulk(id_col, [1, 2, 3])
Natch.Native.block_append_column(block, "id", id_col.ref)

name_col = Natch.Column.new(:string)
Natch.Column.append_bulk(name_col, ["Alice", "Bob", "Charlie"])
Natch.Native.block_append_column(block, "name", name_col.ref)

# Get client and insert
client_ref = GenServer.call(conn, :get_client)
Natch.Native.client_insert(client_ref, "users", block)
```

## Performance Tips

### 1. Use Columnar Format
```elixir
# ❌ BAD: Row-oriented (requires conversion)
rows = [
  %{id: 1, name: "Alice"},
  %{id: 2, name: "Bob"}
]

# βœ… GOOD: Columnar (direct insertion)
columns = %{
  id: [1, 2],
  name: ["Alice", "Bob"]
}
```

### 2. Batch Your Inserts
```elixir
# Insert in batches of 10,000-100,000 rows for optimal throughput
chunk_size = 50_000

data
|> Stream.chunk_every(chunk_size)
|> Enum.each(fn chunk ->
  columns = transpose_to_columnar(chunk)
  Natch.insert(conn, "table", columns, schema)
end)
```

### 3. Use Appropriate Types
```elixir
# βœ… GOOD: LowCardinality for repeated strings
schema = [status: {:low_cardinality, :string}]

# βœ… GOOD: Enum for known values
schema = [priority: {:enum8, [{"low", 1}, {"medium", 2}, {"high", 3}]}]

# βœ… GOOD: Use smallest integer type that fits
schema = [age: :uint8]  # Not :uint64
```

### 4. Enable Compression
```elixir
# LZ4 compression reduces bandwidth by ~70% for typical workloads
{:ok, conn} = Natch.Connection.start_link(
  host: "localhost",
  port: 9000,
  compression: :lz4  # Enabled by default
)
```

## Complex Nesting Examples

Natch supports arbitrarily complex nested types:

```elixir
# Triple-nested arrays with nullables
schema = [matrix: {:array, {:array, {:nullable, :uint64}}}]
columns = %{matrix: [[[1, nil, 3], [nil, 5]], [[10, 20], [], [nil]]]}

# Maps with array values
schema = [data: {:map, :string, {:array, :uint64}}]
columns = %{data: [%{"ids" => [1, 2, 3], "counts" => [10, 20]}]}

# Tuples with complex elements
schema = [record: {:tuple, [:string, {:array, :uint64}, {:nullable, :float64}]}]
columns = %{record: [{"Alice", [1, 2, 3], 99.9}, {"Bob", [4, 5], nil}]}

# Array of low cardinality nullable strings (triple wrapper!)
schema = [tags: {:array, {:low_cardinality, {:nullable, :string}}}]
columns = %{tags: [["tech", nil, "startup"], [nil, "news"]]}
```

All these patterns work with full INSERT→SELECT roundtrip fidelity.

## Architecture

Natch uses a three-layer architecture:

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Elixir Application Layer           β”‚
β”‚  - Natch.insert/4                    β”‚
β”‚  - Natch.Connection GenServer         β”‚
β”‚  - Idiomatic Elixir API             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  FINE NIF Layer (C++)               β”‚
β”‚  - Type conversion Elixir ↔ C++     β”‚
β”‚  - Resource management              β”‚
β”‚  - Exception handling               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  clickhouse-cpp Library             β”‚
β”‚  - Native TCP protocol              β”‚
β”‚  - Binary columnar format           β”‚
β”‚  - LZ4/ZSTD compression             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Why FINE + clickhouse-cpp?

- **Native Protocol** - Binary columnar format with efficient compression
- **Mature Library** - Leverage official ClickHouse C++ client
- **Type Safety** - FINE provides crash-proof NIFs
- **Fast Development** - 4-6 weeks vs 4-6 months for pure Elixir

## Development

### Running ClickHouse Locally

```bash
# Start ClickHouse
docker-compose up -d

# Check it's running
clickhouse-client --query "SELECT version()"
```

### Running Tests

```bash
# Run all tests
mix test

# Run with tracing
mix test --trace

# Run specific test file
mix test test/nesting_integration_test.exs
```

### Test Coverage

- βœ… **227 tests passing** (as of Phase 5E)
- βœ… All primitive types (integers, floats, strings, bools)
- βœ… All temporal types (Date, DateTime, DateTime64)
- βœ… All special types (UUID, Decimal64, Enum8/16, LowCardinality)
- βœ… All complex types (Array, Map, Tuple, Nullable)
- βœ… 14 comprehensive nesting integration tests
- βœ… Full INSERTβ†’SELECT roundtrip validation

## Roadmap

### Completed (Phase 1-5)
- βœ… Native TCP protocol support
- βœ… All ClickHouse primitive types
- βœ… All temporal types (Date, DateTime, DateTime64)
- βœ… UUID and Decimal64 support
- βœ… Nullable types
- βœ… Array types with arbitrary nesting
- βœ… Map and Tuple types
- βœ… Enum8/Enum16 types
- βœ… LowCardinality types
- βœ… Complex type nesting (Array(Map(String, Nullable(T))), etc.)
- βœ… Columnar insert API
- βœ… LZ4 compression

### Planned (Phase 6+)
- ⏳ Explorer DataFrame integration (zero-copy)
- ⏳ SSL/TLS support
- ⏳ Connection pooling
- ⏳ Async query execution
- ⏳ Prepared statements

### Not Planned
- ❌ Ecto integration (ClickHouse is OLAP, not OLTP - not a good fit)
- ❌ HTTP protocol support (use native TCP for better performance)

## Contributing

Contributions are welcome! Areas where we'd love help:

1. **Additional type support** - FixedString, IPv4/IPv6, Geo types
2. **Performance optimization** - Zero-copy paths, SIMD operations
3. **Documentation** - More examples, guides
4. **Testing** - Edge cases, stress tests

Please feel free to submit a Pull Request or open an issue.

## License

MIT License - See LICENSE file for details.

## Acknowledgments

- Built with [FINE](https://github.com/elixir-nx/fine) for crash-proof NIFs
- Powered by [clickhouse-cpp](https://github.com/ClickHouse/clickhouse-cpp) official C++ client
- Inspired by the excellent work of the ClickHouse and Elixir communities

## Resources

- [ClickHouse Documentation](https://clickhouse.com/docs)
- [ClickHouse Data Types](https://clickhouse.com/docs/en/sql-reference/data-types)
- [FINE Documentation](https://hexdocs.pm/fine)
- [Implementation Plan](FINE_PLAN.md) - Detailed architecture and design decisions