# Chunkex
A powerful Elixir tool for intelligently chunking your project's source code to support Local RAG (Retrieval-Augmented Generation) systems. Chunkex extracts functions from your Elixir codebase and creates structured chunks optimized for embedding and retrieval.
## Overview
Chunkex is designed to bridge the gap between your Elixir codebase and AI agents by creating semantic chunks that can be:
- **Embedded** into vector representations by embedding models
- **Stored** in vector databases for efficient retrieval
- **Retrieved** through hybrid search to provide precise context to AI agents
- **Used as JIT context** instead of loading entire files
This approach enables your AI agents to access relevant code snippets with high precision, improving response quality while reducing token usage.
## Features
- **Function-level chunking**: Extracts individual functions (`def`, `defp`, `defmacro`) as semantic units
- **Precise location tracking**: Maintains file paths and line ranges for each chunk
- **Rich metadata**: Includes language identification, SHA hashes, and structured text
- **Mix task integration**: Easy-to-use command-line interface
- **Error resilient**: Gracefully handles syntax errors and malformed files
- **JSONL output**: Structured format optimized for embedding pipelines
## Installation
Add `chunkex` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:chunkex, "~> 0.1.0"}
  ]
end
```
Then run `mix deps.get` to install the dependency.
## Usage
### Command Line Interface
The easiest way to use Chunkex is through the Mix task:
```bash
# Chunk the current project
mix chunkex run
# Chunk a specific directory
mix chunkex run /path/to/your/project
```
### Programmatic Usage
You can also use Chunkex directly in your Elixir code:
```elixir
# Chunk the current directory
Chunkex.chunk()
# Chunk a specific directory
Chunkex.chunk("/path/to/your/project")
```
## Output Format
Chunkex generates a `tmp/chunks.jsonl` file containing one JSON object per line. Each chunk includes:
```json
{
  "path": "lib/my_module.ex",
  "line_start": 15,
  "line_end": 25,
  "lang": "elixir",
  "text": "  def calculate_total(items) do\n    items |> Enum.sum()\n  end",
  "sha": "a1b2c3d4e5f6..."
}
```
### Field Descriptions
- **`path`**: Relative path to the source file
- **`line_start`**: Starting line number of the function
- **`line_end`**: Ending line number of the function
- **`lang`**: Language identifier (always "elixir")
- **`text`**: The actual function code with proper indentation
- **`sha`**: SHA-256 hash of the source file for change detection (see the verification sketch below)
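Because the hash covers the whole source file rather than the individual chunk, you can recompute it to check whether a chunk is stale. A minimal sketch, assuming `sha` is the lowercase hex SHA-256 digest of the file's raw contents (`ChunkFreshness` is a hypothetical helper for illustration):

```elixir
# Hypothetical helper: recompute a file's SHA-256 and compare it with a
# chunk's "sha" field. Assumes the hash is the lowercase hex digest of
# the raw file contents.
defmodule ChunkFreshness do
  def stale?(%{"path" => path, "sha" => sha}) do
    current =
      path
      |> File.read!()
      |> then(&:crypto.hash(:sha256, &1))
      |> Base.encode16(case: :lower)

    current != sha
  end
end
```

If `stale?/1` returns `true`, the file has changed since chunking and its chunks should be regenerated and re-embedded.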
## Integration with RAG Systems
### 1. Embedding Pipeline
```elixir
# Example: Process chunks for embedding
chunks =
  "tmp/chunks.jsonl"
  |> File.stream!()
  |> Stream.map(&Jason.decode!/1)
  |> Enum.map(fn chunk ->
    %{
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      content: chunk["text"],
      metadata: %{
        file: chunk["path"],
        line_start: chunk["line_start"],
        line_end: chunk["line_end"],
        sha: chunk["sha"]
      }
    }
  end)
```
### 2. Vector Storage
Store the embedded chunks in your preferred vector database (Pinecone, Weaviate, Chroma, etc.) with the metadata for hybrid search.
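Client APIs differ across databases, so the following is only a sketch: `Embedder.embed/1` and `MyVectorStore.upsert/1` are hypothetical placeholders for your embedding model and vector-database client. It reuses the `chunks` list built in step 1:

```elixir
# Hypothetical sketch: embed each chunk and upsert it together with its
# metadata. Replace Embedder and MyVectorStore with your actual clients.
for chunk <- chunks do
  MyVectorStore.upsert(%{
    id: chunk.id,
    vector: Embedder.embed(chunk.content),
    metadata: chunk.metadata
  })
end
```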
### 3. Retrieval for AI Agents
When your AI agent needs context about specific functionality:
```elixir
# Retrieve relevant chunks for a query. MyVectorStore.similarity_search/1
# is a placeholder for your vector database's search call.
relevant_chunks =
  MyVectorStore.similarity_search(
    query: "How to calculate totals?",
    filter: %{lang: "elixir"},
    limit: 5
  )

# Use the chunks as JIT context instead of loading entire files
context =
  relevant_chunks
  |> Enum.map(& &1.content)
  |> Enum.join("\n\n")
```
## Benefits for Local RAG
### Precision over Recall
- Function-level chunks provide focused context
- Reduces noise from irrelevant code sections
- Improves AI agent response accuracy
### Token Efficiency
- Smaller, targeted chunks reduce token usage
- Enables more context within token limits
- Cost-effective for API-based AI services
### Incremental Updates
- SHA hashes enable change detection
- Update only modified chunks in your vector store
- Efficient synchronization with codebase changes (see the sketch below)
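As a concrete illustration, here is a minimal sketch of change detection between two chunking runs, assuming you kept the previous `tmp/chunks.jsonl`; `ChunkDiff` is a hypothetical helper, and only files whose `sha` changed need re-embedding:

```elixir
# Hypothetical sketch: diff two chunks.jsonl files by file-level SHA to
# find which source files changed between chunking runs.
defmodule ChunkDiff do
  def changed_paths(old_jsonl, new_jsonl) do
    old_shas = shas_by_path(old_jsonl)
    new_shas = shas_by_path(new_jsonl)

    for {path, sha} <- new_shas, old_shas[path] != sha, do: path
  end

  defp shas_by_path(jsonl_path) do
    jsonl_path
    |> File.stream!()
    |> Stream.map(&Jason.decode!/1)
    |> Map.new(fn chunk -> {chunk["path"], chunk["sha"]} end)
  end
end
```

`ChunkDiff.changed_paths("tmp/chunks_old.jsonl", "tmp/chunks.jsonl")` returns the paths whose chunks need re-embedding and upserting; paths present only in the old file indicate deletions to remove from the vector store.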
### Semantic Structure
- Functions represent logical code units
- Natural boundaries for embedding models
- Better semantic understanding by AI agents
## Development
### Running Tests
```bash
mix test
```
### Building Documentation
```bash
mix docs
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Roadmap
- [ ] Support for additional Elixir constructs (modules, typespecs, etc.)
- [ ] Configurable chunking strategies
- [ ] Integration with popular vector databases
- [ ] Real-time file watching for incremental updates
---
**Built To Save the Tokens**