# Chunkex

A powerful Elixir tool for intelligently chunking your project's source code to support Local RAG (Retrieval-Augmented Generation) systems. Chunkex extracts functions from your Elixir codebase and creates structured chunks optimized for embedding and retrieval.

## Overview

Chunkex is designed to bridge the gap between your Elixir codebase and AI agents by creating semantic chunks that can be:

- **Embedded** into vector representations by embedding models
- **Stored** in vector databases for efficient retrieval
- **Retrieved** through hybrid search to provide precise context to AI agents
- **Used as JIT context** instead of loading entire files

This approach enables your AI agents to access relevant code snippets with high precision, improving response quality while reducing token usage.

## Features

- **Function-level chunking**: Extracts individual functions (`def`, `defp`, `defmacro`) as semantic units
- **Precise location tracking**: Maintains file paths and line ranges for each chunk
- **Rich metadata**: Includes language identification, SHA hashes, and structured text
- **Mix task integration**: Easy-to-use command-line interface
- **Error resilient**: Gracefully handles syntax errors and malformed files
- **JSONL output**: Structured format optimized for embedding pipelines

## Installation

Add `chunkex` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:chunkex, "~> 0.1.0"}
  ]
end
```

Then run `mix deps.get` to install the dependency.

## Usage

### Command Line Interface

The easiest way to use Chunkex is through the Mix task:

```bash
# Chunk the current project
mix chunkex run

# Chunk a specific directory
mix chunkex run /path/to/your/project
```

### Programmatic Usage

You can also use Chunkex directly in your Elixir code:

```elixir
# Chunk the current directory
Chunkex.chunk()

# Chunk a specific directory
Chunkex.chunk("/path/to/your/project")
```

## Output Format

Chunkex generates a `tmp/chunks.jsonl` file containing one JSON object per line. Each chunk includes:

```json
{
  "path": "lib/my_module.ex",
  "line_start": 15,
  "line_end": 25,
  "lang": "elixir",
  "text": "  def calculate_total(items) do\n    items |> Enum.sum()\n  end",
  "sha": "a1b2c3d4e5f6..."
}
```

### Field Descriptions

- **`path`**: Relative path to the source file
- **`line_start`**: Starting line number of the function
- **`line_end`**: Ending line number of the function
- **`lang`**: Language identifier (always "elixir")
- **`text`**: The actual function code with proper indentation
- **`sha`**: SHA-256 hash of the source file for change detection
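
Because the hash covers the whole source file, it can be recomputed at any time to detect changes. A minimal sketch, assuming the `sha` field is the lowercase hex-encoded SHA-256 of the full file contents (verify the exact encoding against your Chunkex version):

```elixir
# Recompute a file's sha for change detection.
# Assumes sha is the lowercase hex SHA-256 of the entire file.
recompute_sha = fn path ->
  path
  |> File.read!()
  |> then(&:crypto.hash(:sha256, &1))
  |> Base.encode16(case: :lower)
end
```

Comparing this value against the `sha` stored with a chunk tells you whether the file has changed since that chunk was generated.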

## Integration with RAG Systems

### 1. Embedding Pipeline

```elixir
# Example: process chunks for embedding.
# Assumes Jason (or another JSON library) is available for decoding.
chunks =
  "tmp/chunks.jsonl"
  |> File.stream!()
  |> Stream.map(&Jason.decode!/1)
  |> Enum.map(fn chunk ->
    %{
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      content: chunk["text"],
      metadata: %{
        file: chunk["path"],
        line_start: chunk["line_start"],
        line_end: chunk["line_end"],
        sha: chunk["sha"]
      }
    }
  end)
```

### 2. Vector Storage

Store the embedded chunks in your preferred vector database (Pinecone, Weaviate, Chroma, etc.) with the metadata for hybrid search.
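
As a stand-in for a real vector database, an in-memory index keyed by a stable chunk id illustrates the shape of the stored records. This is only a sketch: the `embedding` field is left `nil` here, where a real pipeline would fill it from your embedding model.

```elixir
# In-memory stand-in for a vector store: index prepared chunks by id,
# keeping metadata alongside the (not yet computed) embedding vector.
index_chunks = fn chunks ->
  Map.new(chunks, fn chunk ->
    {chunk.id, %{embedding: nil, content: chunk.content, metadata: chunk.metadata}}
  end)
end
```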

### 3. Retrieval for AI Agents

When your AI agent needs context about specific functionality:

```elixir
# Retrieve relevant chunks based on the query.
# `vector_db.similarity_search/1` is illustrative pseudocode; substitute
# your vector database client's search API.
relevant_chunks =
  vector_db.similarity_search(
    query: "How to calculate totals?",
    filter: %{lang: "elixir"},
    limit: 5
  )

# Use chunks as JIT context instead of entire files
context =
  relevant_chunks
  |> Enum.map(& &1.content)
  |> Enum.join("\n\n")
```

## Benefits for Local RAG

### **Precision over Recall**
- Function-level chunks provide focused context
- Reduces noise from irrelevant code sections
- Improves AI agent response accuracy

### **Token Efficiency**
- Smaller, targeted chunks reduce token usage
- Enables more context within token limits
- Cost-effective for API-based AI services

### **Incremental Updates**
- SHA hashes enable change detection
- Update only modified chunks in your vector store
- Efficient synchronization with codebase changes
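
The steps above amount to a diff against the previous run. A minimal sketch, assuming a hypothetical `stored_shas` map (path to sha, recorded when the chunks were last indexed):

```elixir
# Keep only chunks whose source file's sha differs from the stored one;
# only these need re-embedding. `stored_shas` is a hypothetical store.
stale_chunks = fn chunks, stored_shas ->
  Enum.filter(chunks, fn chunk ->
    Map.get(stored_shas, chunk["path"]) != chunk["sha"]
  end)
end
```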

### **Semantic Structure**
- Functions represent logical code units
- Natural boundaries for embedding models
- Better semantic understanding by AI agents

## Development

### Running Tests

```bash
mix test
```

### Building Documentation

```bash
mix docs
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Roadmap

- [ ] Support for additional Elixir constructs (modules, typespecs, etc.)
- [ ] Configurable chunking strategies
- [ ] Integration with popular vector databases
- [ ] Real-time file watching for incremental updates

---

**Built To Save the Tokens**