# Tokenizers Guide
**Updated for TantivyEx v0.2.0** - This comprehensive guide covers the new advanced tokenization system, text analysis capabilities, and multi-language support in TantivyEx.
## Quick Start
```elixir
# Start with default tokenizers (recommended)
TantivyEx.Tokenizer.register_default_tokenizers()
# List available tokenizers
tokenizers = TantivyEx.Tokenizer.list_tokenizers()
# ["default", "simple", "keyword", "whitespace", "raw", "en_stem", "fr_stem", ...]
# Test tokenization
tokens = TantivyEx.Tokenizer.tokenize_text("default", "The quick brown foxes are running!")
# ["quick", "brown", "fox", "run"] # Notice stemming: foxes -> fox, running -> run
# Create schema with tokenizers
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field_with_tokenizer("content", :text, "default")
|> TantivyEx.Schema.add_text_field_with_tokenizer("tags", :text, "whitespace")
```
## Related Documentation
- **[Schema Design Guide](schema.md)** - Choose the right tokenizers for your field types
- **[Document Operations Guide](documents.md)** - Understand how tokenizers affect document indexing
- **[Search Guide](search.md)** - Use tokenizer knowledge to write better queries
- **[Search Results Guide](search_results.md)** - Leverage tokenization for highlighting and snippets
## Table of Contents
- [Quick Start](#quick-start)
- [TantivyEx.Tokenizer Module](#tantivyextokenizer-module)
- [Understanding Tokenizers](#understanding-tokenizers)
- [Built-in Tokenizers](#built-in-tokenizers)
- [Custom Tokenizer Usage](#custom-tokenizer-usage)
- [Tokenizer Management](#tokenizer-management)
- [Tokenizer Selection Guide](#tokenizer-selection-guide)
  - [Multi-Language Support](#multi-language-support)
  - [Advanced Text Analyzers](#advanced-text-analyzers)
- [Language-Specific Tokenization](#language-specific-tokenization)
- [Performance Considerations](#performance-considerations)
- [Real-world Examples](#real-world-examples)
- [Troubleshooting](#troubleshooting)
## TantivyEx.Tokenizer Module
**New in v0.2.0:** The `TantivyEx.Tokenizer` module provides comprehensive tokenization functionality with a clean, Elixir-friendly API.
### Core Functions
#### Tokenizer Registration
```elixir
# Register all default tokenizers at once
"Default tokenizers registered successfully" = TantivyEx.Tokenizer.register_default_tokenizers()
# Registers: "default", "simple", "keyword", "whitespace", "raw"
# Plus language variants: "en_stem", "fr_stem", "de_stem", etc.
# And analyzers: "en_text" (English with stop words + stemming)
# Register specific tokenizer types
{:ok, _msg} = TantivyEx.Tokenizer.register_simple_tokenizer("my_simple")
{:ok, _msg} = TantivyEx.Tokenizer.register_whitespace_tokenizer("my_whitespace")
{:ok, _msg} = TantivyEx.Tokenizer.register_regex_tokenizer("email", "\\b[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}\\b")
{:ok, _msg} = TantivyEx.Tokenizer.register_ngram_tokenizer("fuzzy", 2, 4, false)
```
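To check what these custom tokenizers emit, pass sample text through `tokenize_text/2` once they are registered (the output shown is illustrative):

```elixir
# The "email" regex tokenizer keeps only substrings matching its pattern
TantivyEx.Tokenizer.tokenize_text("email", "Contact support@example.com or sales@example.org")
# => ["support@example.com", "sales@example.org"]

# The "fuzzy" n-gram tokenizer emits 2- to 4-character slices for substring matching
TantivyEx.Tokenizer.tokenize_text("fuzzy", "search")
# => ["se", "sea", "sear", "ea", "ear", "earc", ...]
```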
#### Text Analyzers
```elixir
# Full-featured text analyzer with all filters
{:ok, _msg} = TantivyEx.Tokenizer.register_text_analyzer(
"english_complete", # name
"simple", # base tokenizer ("simple" or "whitespace")
true, # lowercase filter
"english", # stop words language (or nil)
"english", # stemming language (or nil)
50 # max token length (or nil)
)
# Language-specific convenience functions (Elixir wrapper functions)
{:ok, _msg} = TantivyEx.Tokenizer.register_language_analyzer("french") # -> "french_text"
{:ok, _msg} = TantivyEx.Tokenizer.register_stemming_tokenizer("german") # -> "german_stem"
```
#### Tokenization Operations
```elixir
# Basic tokenization
tokens = TantivyEx.Tokenizer.tokenize_text("default", "Hello world!")
# ["hello", "world"]
# Detailed tokenization with positions
detailed = TantivyEx.Tokenizer.tokenize_text_detailed("simple", "Hello World")
# [{"hello", 0, 5}, {"world", 6, 11}]
# List all registered tokenizers
available = TantivyEx.Tokenizer.list_tokenizers()
# ["default", "simple", "keyword", "whitespace", ...]
# Process pre-tokenized text
result = TantivyEx.Tokenizer.process_pre_tokenized_text(["pre", "tokenized", "words"])
```
#### Performance Testing
```elixir
# Benchmark tokenizer performance
{final_tokens, avg_microseconds} = TantivyEx.Tokenizer.benchmark_tokenizer(
"default",
"Sample text to tokenize",
1000 # iterations
)
```
## Understanding Tokenizers
Tokenizers are the foundation of text search - they determine how your documents are broken down into searchable terms. Understanding tokenization is crucial for achieving accurate and efficient search results.
### What Tokenizers Do
The tokenization process involves several critical steps:
1. **Text Segmentation**: Breaking text into individual units (words, phrases, or characters)
2. **Normalization**: Converting text to a standard form (lowercase, Unicode normalization)
3. **Filtering**: Removing or transforming tokens (stop words, punctuation, special characters)
4. **Stemming/Lemmatization**: Reducing words to their root forms
5. **Token Generation**: Creating the final searchable terms for the index
### The Tokenization Pipeline
```text
Raw Text → Segmentation → Normalization → Filtering → Stemming → Index Terms
```
**Example transformation:**
```text
"The QUICK brown foxes are running!"
→ ["The", "QUICK", "brown", "foxes", "are", "running", "!"] # Segmentation
→ ["the", "quick", "brown", "foxes", "are", "running", "!"] # Normalization
→ ["quick", "brown", "foxes", "running"] # Filtering: stop words and punctuation removed
→ ["quick", "brown", "fox", "run"] # Stemming
```
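You can watch the end of this pipeline directly with the tokenizer API, assuming the default tokenizers are registered (tokens and offsets shown are illustrative):

```elixir
TantivyEx.Tokenizer.register_default_tokenizers()

TantivyEx.Tokenizer.tokenize_text("default", "The QUICK brown foxes are running!")
# => ["quick", "brown", "fox", "run"]

# The detailed variant also returns each token's byte offsets in the original text
TantivyEx.Tokenizer.tokenize_text_detailed("simple", "The QUICK brown")
# => [{"the", 0, 3}, {"quick", 4, 9}, {"brown", 10, 15}]
```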
### Default Behavior
When you don't specify a tokenizer, TantivyEx uses Tantivy's default tokenizer:
```elixir
# Uses default tokenizer
{:ok, schema} = Schema.add_text_field(schema, "content", :text)
# Equivalent to specifying "default" explicitly
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "content", :text, "default")
```
### Impact on Search Behavior
Different tokenizers produce different search behaviors:
```elixir
# With stemming tokenizer
# Query: "running" matches documents containing "run", "runs", "running"
# With simple tokenizer
# Query: "running" only matches documents containing exactly "running"
# With keyword tokenizer
# Query: "running" only matches if the entire field value is "running"
```
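A quick way to see these differences is to run the same text through each tokenizer (output illustrative, assuming the default tokenizers are registered):

```elixir
for tokenizer <- ["default", "simple", "keyword"] do
  tokens = TantivyEx.Tokenizer.tokenize_text(tokenizer, "The foxes are running")
  IO.puts("#{String.pad_trailing(tokenizer, 8)}: #{inspect(tokens)}")
end
# default : ["fox", "run"]                      # stop words removed, stemmed
# simple  : ["the", "foxes", "are", "running"]  # lowercased, no stemming
# keyword : ["The foxes are running"]           # whole field as one token
```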
## Built-in Tokenizers
### Default Tokenizer
The default tokenizer provides comprehensive text processing suitable for most search scenarios.
**Features:**
- Splits on whitespace and punctuation
- Converts to lowercase
- Removes common stop words
- Applies stemming
**Best for:**
- General content search
- Blog posts and articles
- Product descriptions
- Most text fields
**Example:**
```elixir
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "article_content", :text, "default")
# Input: "The Quick Brown Foxes are Running!"
# Tokens: ["quick", "brown", "fox", "run"] # Notice stemming: foxes -> fox, running -> run
```
### Simple Tokenizer
The simple tokenizer splits text on whitespace and punctuation and lowercases the result, without stop-word removal or stemming.
**Features:**
- Splits on whitespace and punctuation
- Converts to lowercase
- No stop-word removal or stemming
**Best for:**
- Fields where every word should match exactly as written
- Names, titles, and identifiers where stemming is unwanted
- Codes whose parts should be individually searchable
**Example:**
```elixir
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "product_code", :text, "simple")
# Input: "SKU-12345-A"
# Tokens: ["sku", "12345", "a"] # Split on the hyphens, lowercased
```
### Whitespace Tokenizer
The whitespace tokenizer splits only on whitespace characters.
**Features:**
- Splits only on spaces, tabs, newlines
- Preserves case
- Preserves all punctuation
**Best for:**
- Tag fields
- Category lists
- Fields where case and punctuation are significant
**Example:**
```elixir
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "tags", :text, "whitespace")
# Input: "JavaScript React.js Node.js"
# Tokens: ["JavaScript", "React.js", "Node.js"] # Case and dots preserved
```
### Keyword Tokenizer
The keyword tokenizer treats the entire input as a single token.
**Features:**
- No splitting - entire field becomes one token
- Preserves case and punctuation
- Exact matching only
**Best for:**
- Status fields
- Category hierarchies
- Exact match requirements
- Enumerated values
**Example:**
```elixir
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "status", :text, "keyword")
# Input: "In Progress - Pending Review"
# Tokens: ["In Progress - Pending Review"] # Single token, exact match required
```
## Custom Tokenizer Usage
### Basic Custom Tokenizer Setup
```elixir
alias TantivyEx.Schema
# Create schema with different tokenizers for different fields
{:ok, schema} = Schema.new()
# Article content - use default for comprehensive search
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "title", :text_stored, "default"
)
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content", :text, "default"
)
# Product codes - use simple for exact matching
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "sku", :text_stored, "simple"
)
# Tags - use whitespace to preserve structure
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "tags", :text, "whitespace"
)
# Status - use keyword for exact matching
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "status", :indexed, "keyword"
)
```
### Tokenizer Comparison Example
```elixir
defmodule MyApp.TokenizerDemo do
alias TantivyEx.{Schema, Index}
def demonstrate_tokenizers do
# Create index with different tokenizer fields
{:ok, schema} = create_demo_schema()
{:ok, index} = Index.create("/tmp/tokenizer_demo", schema)
# Add sample document
sample_text = "The Quick-Brown Fox's Email: fox@example.com"
document = %{
"default_field" => sample_text,
"simple_field" => sample_text,
"whitespace_field" => sample_text,
"keyword_field" => sample_text
}
{:ok, writer} = TantivyEx.IndexWriter.new(index)
TantivyEx.IndexWriter.add_document(writer, document)
TantivyEx.IndexWriter.commit(writer)
# Test different search behaviors
test_searches(index)
end
defp create_demo_schema do
{:ok, schema} = Schema.new()
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "default_field", :text, "default")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "simple_field", :text, "simple")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "whitespace_field", :text, "whitespace")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "keyword_field", :text, "keyword")
{:ok, schema}
end
defp test_searches(index) do
searches = [
{"fox", "Search for 'fox'"},
{"quick-brown", "Search for 'quick-brown'"},
{"fox@example.com", "Search for email"},
{"Fox's", "Search with apostrophe"},
{"The Quick-Brown Fox's Email: fox@example.com", "Exact match"}
]
Enum.each(searches, fn {query, description} ->
IO.puts("\n#{description}: '#{query}'")
test_field_searches(index, query)
end)
end
defp test_field_searches(index, query) do
fields = ["default_field", "simple_field", "whitespace_field", "keyword_field"]
searcher = TantivyEx.Searcher.new(index)
Enum.each(fields, fn field ->
field_query = "#{field}:(#{query})"
case TantivyEx.Searcher.search(searcher, field_query, 1) do
{:ok, results} ->
found = length(results) > 0
IO.puts(" #{field}: #{if found, do: "✓ Found", else: "✗ Not found"}")
{:error, _} ->
IO.puts(" #{field}: ✗ Search error")
end
end)
end
end
# Run the demo
MyApp.TokenizerDemo.demonstrate_tokenizers()
```
## Tokenizer Management
TantivyEx provides comprehensive tokenizer management capabilities through the native interface, allowing you to register, configure, and enumerate tokenizers dynamically.
### Registering Custom Tokenizers
#### Default Tokenizers
The most convenient way to set up commonly used tokenizers is with the default registration:
```elixir
# Register all default tokenizers at once
message = TantivyEx.Tokenizer.register_default_tokenizers()
IO.puts(message) # "Default tokenizers registered successfully"
# This registers: "default", "simple", "keyword", "whitespace", "raw",
# plus language-specific variants like "en_stem", "fr_stem", etc.
```
#### Individual Tokenizer Registration
For more granular control, register tokenizers individually:
```elixir
# Simple tokenizer - splits on punctuation and whitespace
{:ok, msg} = TantivyEx.Tokenizer.register_simple_tokenizer("my_simple")
# Whitespace tokenizer - splits only on whitespace
{:ok, msg} = TantivyEx.Tokenizer.register_whitespace_tokenizer("my_whitespace")
# Regex tokenizer - custom splitting pattern
{:ok, msg} = TantivyEx.Tokenizer.register_regex_tokenizer("my_regex", "\\w+")
# N-gram tokenizer - fixed-size character sequences
{:ok, msg} = TantivyEx.Tokenizer.register_ngram_tokenizer("my_ngram", 2, 3, false)
```
#### Multi-Language Text Analyzers
For sophisticated text processing with filters and stemming:
```elixir
# English text analyzer with full processing
{:ok, msg} = TantivyEx.Tokenizer.register_text_analyzer(
"english_full", # name
"simple", # base tokenizer
true, # lowercase
"english", # stop words language
"english", # stemming language
  40               # max token length (or nil)
)
# Multi-language support
languages = ["english", "french", "german", "spanish"]
for language <- languages do
TantivyEx.Tokenizer.register_text_analyzer(
"#{language}_analyzer",
"simple",
true,
language,
language,
    40
)
end
```
### Listing Available Tokenizers
**New in v0.2.0:** You can now enumerate all registered tokenizers to verify configuration or implement dynamic tokenizer selection:
```elixir
# List all currently registered tokenizers
tokenizers = TantivyEx.Tokenizer.list_tokenizers()
IO.inspect(tokenizers)
# ["default", "simple", "keyword", "whitespace", "en_stem", "fr_stem", ...]
# Check if a specific tokenizer is available
if "my_custom" in TantivyEx.Tokenizer.list_tokenizers() do
IO.puts("Custom tokenizer is available")
else
# Register it if missing
TantivyEx.Tokenizer.register_simple_tokenizer("my_custom")
end
```
### Dynamic Tokenizer Configuration
Build tokenizer configurations based on runtime requirements:
```elixir
defmodule MyApp.TokenizerConfig do
alias TantivyEx.Tokenizer
def setup_for_language(language) do
# Register default tokenizers first
Tokenizer.register_default_tokenizers()
# Check what's available
available = Tokenizer.list_tokenizers()
IO.puts("Available tokenizers: #{inspect(available)}")
# Add language-specific configuration
case language do
"en" -> setup_english_tokenizers()
"es" -> setup_spanish_tokenizers()
"multi" -> setup_multilingual_tokenizers()
_ -> :ok
end
# Verify final configuration
final_tokenizers = TantivyEx.Tokenizer.list_tokenizers()
IO.puts("Final tokenizer count: #{length(final_tokenizers)}")
end
defp setup_english_tokenizers do
    TantivyEx.Tokenizer.register_text_analyzer("en_blog", "simple", true, "english", "english", 40)
    TantivyEx.Tokenizer.register_text_analyzer("en_legal", "simple", true, "english", nil, nil)
TantivyEx.Tokenizer.register_regex_tokenizer("en_email", "[\\w\\._%+-]+@[\\w\\.-]+\\.[A-Za-z]{2,}")
end
defp setup_spanish_tokenizers do
    TantivyEx.Tokenizer.register_text_analyzer("es_content", "simple", true, "spanish", "spanish", 40)
TantivyEx.Tokenizer.register_regex_tokenizer("es_phone", "\\+?[0-9]{2,3}[\\s-]?[0-9]{3}[\\s-]?[0-9]{3}[\\s-]?[0-9]{3}")
end
defp setup_multilingual_tokenizers do
# Minimal processing for multi-language content
TantivyEx.Tokenizer.register_simple_tokenizer("multi_simple")
TantivyEx.Tokenizer.register_whitespace_tokenizer("multi_whitespace")
end
end
```
### Testing Tokenizers
Verify tokenizer behavior before using in production schemas:
```elixir
defmodule MyApp.TokenizerTester do
alias TantivyEx.Tokenizer
def test_tokenizer_suite do
# Register all tokenizers
Tokenizer.register_default_tokenizers()
test_text = "Hello World! This is a TEST email: user@example.com"
# Test each available tokenizer
Tokenizer.list_tokenizers()
|> Enum.each(fn tokenizer_name ->
IO.puts("\n--- Testing: #{tokenizer_name} ---")
      case Tokenizer.tokenize_text(tokenizer_name, test_text) do
        tokens when is_list(tokens) ->
          IO.puts("Tokens: #{inspect(tokens)}")
          IO.puts("Count: #{length(tokens)}")
        {:error, reason} ->
          IO.puts("Error: #{inspect(reason)}")
      end
end)
end
def compare_tokenizers(text, tokenizer_names) do
results =
tokenizer_names
|> Enum.map(fn name ->
try do
tokens = TantivyEx.Tokenizer.tokenize_text(name, text)
{name, tokens}
rescue
_ -> {name, :error}
end
end)
|> Enum.filter(fn {_, tokens} -> tokens != :error end)
IO.puts("\nTokenization Comparison for: \"#{text}\"")
IO.puts(String.duplicate("-", 60))
Enum.each(results, fn {name, tokens} ->
IO.puts("#{String.pad_trailing(name, 15)}: #{inspect(tokens)}")
end)
results
end
end
# Usage
MyApp.TokenizerTester.test_tokenizer_suite()
MyApp.TokenizerTester.compare_tokenizers(
"user@example.com",
["default", "simple", "whitespace", "keyword"]
)
```
### Error Handling
Handle tokenizer registration and usage errors gracefully:
```elixir
defmodule MyApp.SafeTokenizers do
  require Logger
  alias TantivyEx.Tokenizer

  def safe_register_analyzer(name, config) do
    case Tokenizer.register_text_analyzer(
           name,
           Map.get(config, :base, "simple"),
           Map.get(config, :lowercase, true),  # default to true, but keep an explicit false
           Map.get(config, :stop_words),
           Map.get(config, :stemming),
           Map.get(config, :max_token_length)
         ) do
{:ok, message} ->
{:ok, message}
{:error, reason} ->
Logger.warning("Failed to register tokenizer #{name}: #{reason}")
{:error, reason}
end
end
def ensure_tokenizer_exists(name) do
if name in TantivyEx.Tokenizer.list_tokenizers() do
:ok
else
Logger.info("Tokenizer #{name} not found, registering default")
TantivyEx.Tokenizer.register_simple_tokenizer(name)
end
end
end
```
## Tokenizer Selection Guide
### Decision Matrix
| Use Case | Tokenizer | Reason |
|----------|-----------|--------|
| Article content | `default` | Full-text search with stemming |
| Product names | `default` | Natural language search |
| Product codes/SKUs | `simple` | Each part of the code is searchable, case-insensitive |
| Email addresses | `simple` | Local part and domain become searchable terms |
| Tags/keywords | `whitespace` | Preserve individual terms |
| Status values | `keyword` | Exact matching only |
| Category paths | `keyword` | Exact hierarchy matching |
| User input/queries | `default` | Natural language processing |
| URLs | `simple` | Domain and path segments become searchable terms |
| Technical terms | `whitespace` | Preserve case and dots |
### Content Type Guidelines
#### Blog/CMS Content
```elixir
# Title and content - natural language search
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "title", :text_stored, "default")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "content", :text, "default")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "excerpt", :text_stored, "default")
# Tags - preserve individual tags
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "tags", :text, "whitespace")
# Author name - could be searched as phrase
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "author", :text_stored, "simple")
# Category - exact matching
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "category", :indexed, "keyword")
```
#### E-commerce Products
```elixir
# Product name and description - natural search
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "name", :text_stored, "default")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "description", :text, "default")
# Brand - could be searched as phrase
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "brand", :text_stored, "simple")
# SKU/model - preserve structure
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "sku", :text_stored, "simple")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "model", :text_stored, "simple")
# Product features/specs - preserve technical terms
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "features", :text, "whitespace")
```
#### User Profiles
```elixir
# Name fields - phrase search
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "full_name", :text_stored, "simple")
# Bio/description - natural language
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "bio", :text, "default")
# Username/email - split into searchable parts
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "username", :text_stored, "simple")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "email", :text_stored, "simple")
# Skills/interests - individual terms
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "skills", :text, "whitespace")
```
#### Log Analysis
```elixir
# Log message - natural language search
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "message", :text, "default")
# Service/host names - preserve structure
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "service", :indexed, "simple")
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "hostname", :indexed, "simple")
# Log level - exact matching
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "level", :indexed, "keyword")
# Request URLs - preserve structure
{:ok, schema} = Schema.add_text_field_with_tokenizer(schema, "request_url", :text, "simple")
```
### Multi-Language Support
**New in v0.2.0:** TantivyEx provides built-in support for 17+ languages with language-specific stemming and stop word filtering.
#### Supported Languages
| Language | Code | Stemming | Stop Words | Convenience Function |
|----------|------|----------|------------|---------------------|
| English | `"en"` / `"english"` | ✅ | ✅ | `register_language_analyzer("english")` |
| French | `"fr"` / `"french"` | ✅ | ✅ | `register_language_analyzer("french")` |
| German | `"de"` / `"german"` | ✅ | ✅ | `register_language_analyzer("german")` |
| Spanish | `"es"` / `"spanish"` | ✅ | ✅ | `register_language_analyzer("spanish")` |
| Italian | `"it"` / `"italian"` | ✅ | ✅ | `register_language_analyzer("italian")` |
| Portuguese | `"pt"` / `"portuguese"` | ✅ | ✅ | `register_language_analyzer("portuguese")` |
| Russian | `"ru"` / `"russian"` | ✅ | ✅ | `register_language_analyzer("russian")` |
| Arabic | `"ar"` / `"arabic"` | ✅ | ✅ | `register_language_analyzer("arabic")` |
| Danish | `"da"` / `"danish"` | ✅ | ✅ | `register_language_analyzer("danish")` |
| Dutch | `"nl"` / `"dutch"` | ✅ | ✅ | `register_language_analyzer("dutch")` |
| Finnish | `"fi"` / `"finnish"` | ✅ | ✅ | `register_language_analyzer("finnish")` |
| Greek | `"el"` / `"greek"` | ✅ | ✅ | `register_language_analyzer("greek")` |
| Hungarian | `"hu"` / `"hungarian"` | ✅ | ✅ | `register_language_analyzer("hungarian")` |
| Norwegian | `"no"` / `"norwegian"` | ✅ | ✅ | `register_language_analyzer("norwegian")` |
| Romanian | `"ro"` / `"romanian"` | ✅ | ✅ | `register_language_analyzer("romanian")` |
| Swedish | `"sv"` / `"swedish"` | ✅ | ✅ | `register_language_analyzer("swedish")` |
| Tamil | `"ta"` / `"tamil"` | ✅ | ✅ | `register_language_analyzer("tamil")` |
| Turkish | `"tr"` / `"turkish"` | ✅ | ✅ | `register_language_analyzer("turkish")` |
#### Language-Specific Usage
```elixir
# Complete language analyzer (lowercase + stop words + stemming)
TantivyEx.Tokenizer.register_language_analyzer("english")
# Creates "english_text" tokenizer
# Stemming only
TantivyEx.Tokenizer.register_stemming_tokenizer("french")
# Creates "french_stem" tokenizer
# Custom language configuration
TantivyEx.Tokenizer.register_text_analyzer(
"german_custom",
"simple",
true, # lowercase
"german", # stop words
"german", # stemming
100 # max token length
)
```
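A quick sanity check for a language analyzer is to tokenize a sample sentence and inspect the result. A sketch for German (the exact stems depend on the underlying stemmer):

```elixir
TantivyEx.Tokenizer.register_language_analyzer("german")

tokens =
  TantivyEx.Tokenizer.tokenize_text("german_text", "Die schnellen braunen Füchse laufen über das Feld")

IO.inspect(tokens, label: "german_text")
# Articles such as "die" and "das" are dropped as stop words and the remaining
# terms are lowercased and reduced to their German stems.
```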
### Advanced Text Analyzers
Text analyzers chain a base tokenizer with filters such as lowercasing, stop-word removal, and stemming (see `register_text_analyzer/6` above). The schema below shows how the tokenizer choice varies with the language and structure of each field's content:
```elixir
defmodule MyApp.LanguageAwareSchema do
  alias TantivyEx.Schema

  def create_multilingual_schema do
{:ok, schema} = Schema.new()
# English content - default tokenizer works well
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content_en", :text, "default"
)
# For languages without word separators (e.g., Chinese, Japanese)
# Simple tokenizer might be more appropriate
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content_cjk", :text, "simple"
)
# For technical content with lots of punctuation
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "technical_content", :text, "whitespace"
)
{:ok, schema}
end
end
```
### Handling Special Characters
Different tokenizers handle special characters differently:
```elixir
defmodule MyApp.SpecialCharacterDemo do
def test_special_characters do
test_cases = [
"user@example.com",
"C++ programming",
"file.extension.tar.gz",
"API-v2.1",
"$100.50",
"multi-word-identifier",
"IPv4: 192.168.1.1"
]
Enum.each(test_cases, &analyze_tokenization/1)
end
  defp analyze_tokenization(text) do
    # Assumes register_default_tokenizers/0 has been called
    IO.puts("\nAnalyzing: '#{text}'")

    Enum.each(["default", "simple", "whitespace", "keyword"], fn tokenizer ->
      tokens = TantivyEx.Tokenizer.tokenize_text(tokenizer, text)
      IO.puts("  #{String.pad_trailing(tokenizer, 10)}: #{inspect(tokens)}")
    end)
  end
end
```
### Custom Search Strategies
Different tokenizers enable different search strategies:
```elixir
defmodule MyApp.SearchStrategies do
alias TantivyEx.Index
def demonstrate_search_strategies(index) do
# Strategy 1: Fuzzy matching with default tokenizer
fuzzy_search(index, "programming")
# Strategy 2: Exact code matching with simple tokenizer
exact_code_search(index, "API-v2.1")
# Strategy 3: Tag searching with whitespace tokenizer
tag_search(index, "JavaScript")
# Strategy 4: Status filtering with keyword tokenizer
status_filter(index, "In Progress")
end
defp fuzzy_search(index, term) do
# Works well with default tokenizer due to stemming
    queries = [
      term,                                                           # Exact term
      "#{term}~",                                                     # Fuzzy search
      "#{String.slice(term, 0, max(String.length(term) - 2, 1))}*"    # Wildcard on the term's prefix
    ]
searcher = TantivyEx.Searcher.new(index)
Enum.each(queries, fn query ->
{:ok, results} = TantivyEx.Searcher.search(searcher, "content:(#{query})", 5)
IO.puts("Query '#{query}': #{length(results)} results")
end)
end
  defp exact_code_search(index, code) do
    # Best with simple tokenizer for technical identifiers
    query = "sku:(#{code})"
    searcher = TantivyEx.Searcher.new(index)
    {:ok, results} = TantivyEx.Searcher.search(searcher, query, 10)
    IO.puts("Exact code search: #{length(results)} results")
  end
defp tag_search(index, tag) do
# Whitespace tokenizer preserves individual tags
query = "tags:(#{tag})"
searcher = TantivyEx.Searcher.new(index)
{:ok, results} = TantivyEx.Searcher.search(searcher, query, 10)
IO.puts("Tag search: #{length(results)} results")
end
defp status_filter(index, status) do
# Keyword tokenizer for exact status matching
query = "status:(\"#{status}\")"
searcher = TantivyEx.Searcher.new(index)
{:ok, results} = TantivyEx.Searcher.search(searcher, query, 10)
IO.puts("Status filter: #{length(results)} results")
end
end
```
## Language-Specific Tokenization
### Multilingual Content
Handle multiple languages in your search index:
```elixir
defmodule MyApp.MultilingualTokenizer do
  alias TantivyEx.Schema

  def setup_multilingual_schema do
{:ok, schema} = Schema.new()
# English content with stemming
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content_en", :text, "en_stem"
)
# Spanish content with stemming
{:ok, schema} = Schema.add_text_field_with_tokenizer(
      schema, "content_es", :text, "es_stem"
)
# Generic multilingual field
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content_raw", :text, "simple"
)
schema
end
def index_multilingual_document(index, content, language) do
{:ok, writer} = TantivyEx.IndexWriter.new(index)
case language do
"en" ->
document = %{
"content_en" => content,
"content_raw" => content
}
TantivyEx.IndexWriter.add_document(writer, document)
"es" ->
document = %{
"content_es" => content,
"content_raw" => content
}
TantivyEx.IndexWriter.add_document(writer, document)
_ ->
document = %{"content_raw" => content}
TantivyEx.IndexWriter.add_document(writer, document)
end
TantivyEx.IndexWriter.commit(writer)
end
def search_multilingual(index, query, language \\ nil) do
searcher = TantivyEx.Searcher.new(index)
case language do
"en" -> TantivyEx.Searcher.search(searcher, "content_en:(#{query})", 10)
"es" -> TantivyEx.Searcher.search(searcher, "content_es:(#{query})", 10)
nil -> TantivyEx.Searcher.search(searcher, "content_raw:(#{query})", 10)
_ -> TantivyEx.Searcher.search(searcher, "content_raw:(#{query})", 10)
end
end
end
```
### CJK (Chinese, Japanese, Korean) Support
Handle CJK languages that don't use spaces between words:
```elixir
defmodule MyApp.CJKTokenizer do
  alias TantivyEx.Schema

  def setup_cjk_schema do
{:ok, schema} = Schema.new()
# Use simple tokenizer for CJK content
# In production, you might use specialized CJK tokenizers
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content_cjk", :text, "simple"
)
# Keep original for fallback
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content_original", :text, "raw"
)
schema
end
def preprocess_cjk_text(text) do
text
|> String.replace(~r/[[:punct:]]+/u, " ") # Replace punctuation with spaces
|> String.split() # Split on whitespace
|> Enum.join(" ") # Rejoin with single spaces
end
def index_cjk_document(index, content) do
processed_content = preprocess_cjk_text(content)
document = %{
"content_cjk" => processed_content,
"content_original" => content
}
{:ok, writer} = TantivyEx.IndexWriter.new(index)
TantivyEx.IndexWriter.add_document(writer, document)
TantivyEx.IndexWriter.commit(writer)
end
end
```
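When whitespace-based preprocessing is not enough, character n-grams are a common fallback for CJK substring matching. A sketch using the n-gram registration shown earlier (the field name and gram sizes here are illustrative choices, not a built-in configuration):

```elixir
# 2-3 character n-grams allow substring matching without word segmentation
{:ok, _msg} = TantivyEx.Tokenizer.register_ngram_tokenizer("cjk_ngram", 2, 3, false)

{:ok, schema} = TantivyEx.Schema.new()
{:ok, schema} = TantivyEx.Schema.add_text_field_with_tokenizer(
  schema, "content_cjk_ngram", :text, "cjk_ngram"
)
```

N-grams inflate the index size but avoid the need for a language-specific segmenter.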
## Performance Considerations
### Tokenizer Performance Comparison
Different tokenizers have varying performance characteristics:
```elixir
defmodule MyApp.TokenizerBenchmark do
  alias TantivyEx.{Schema, Index}

  def benchmark_tokenizers(sample_texts) do
tokenizers = ["simple", "keyword", "default", "raw"]
Enum.each(tokenizers, fn tokenizer ->
{time, _result} = :timer.tc(fn ->
benchmark_tokenizer(tokenizer, sample_texts)
end)
IO.puts("#{tokenizer}: #{time / 1000}ms")
end)
end
defp benchmark_tokenizer(tokenizer, texts) do
{:ok, schema} = Schema.new()
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content", :text, tokenizer
)
{:ok, index} = Index.create("/tmp/benchmark_#{tokenizer}", schema)
{:ok, writer} = TantivyEx.IndexWriter.new(index)
Enum.each(texts, fn text ->
TantivyEx.IndexWriter.add_document(writer, %{"content" => text})
end)
TantivyEx.IndexWriter.commit(writer)
end
end
```
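For a quicker comparison that skips index construction entirely, the built-in `benchmark_tokenizer/3` measures raw tokenization time per tokenizer (iteration count and sample text are arbitrary):

```elixir
sample = "The quick brown foxes are running through the warehouse"

for tokenizer <- ["default", "simple", "whitespace", "keyword"] do
  {_tokens, avg_us} = TantivyEx.Tokenizer.benchmark_tokenizer(tokenizer, sample, 1_000)
  IO.puts("#{String.pad_trailing(tokenizer, 12)} #{avg_us} µs/iteration")
end
```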
### Memory Usage Optimization
Monitor and optimize memory usage during tokenization:
```elixir
defmodule MyApp.TokenizerMemoryOptimizer do
def optimize_for_large_documents(index, large_documents) do
# Process documents in smaller chunks
chunk_size = 100
large_documents
|> Stream.chunk_every(chunk_size)
|> Enum.each(fn chunk ->
process_chunk(index, chunk)
# Force garbage collection between chunks
:erlang.garbage_collect()
# Optional: brief pause to prevent memory spikes
Process.sleep(10)
end)
{:ok, writer} = TantivyEx.IndexWriter.new(index)
TantivyEx.IndexWriter.commit(writer)
end
defp process_chunk(index, documents) do
{:ok, writer} = TantivyEx.IndexWriter.new(index)
Enum.each(documents, fn doc ->
# Truncate very large fields to prevent memory issues
truncated_doc = truncate_large_fields(doc)
TantivyEx.IndexWriter.add_document(writer, truncated_doc)
end)
TantivyEx.IndexWriter.commit(writer)
end
defp truncate_large_fields(document) do
max_field_size = 50_000 # 50KB per field
Enum.reduce(document, %{}, fn {key, value}, acc ->
truncated_value =
if is_binary(value) && byte_size(value) > max_field_size do
binary_part(value, 0, max_field_size)
else
value
end
Map.put(acc, key, truncated_value)
end)
end
end
```
### Index Size Optimization
Choose tokenizers based on index size requirements:
```elixir
defmodule MyApp.IndexSizeAnalyzer do
  alias TantivyEx.{Schema, Index}

  def analyze_tokenizer_impact(documents) do
tokenizers = ["simple", "keyword", "default"]
Enum.map(tokenizers, fn tokenizer ->
index_path = "/tmp/size_test_#{tokenizer}"
create_test_index(index_path, tokenizer, documents)
index_size = calculate_index_size(index_path)
%{
tokenizer: tokenizer,
size_mb: index_size,
size_per_doc: index_size / length(documents)
}
end)
end
defp create_test_index(path, tokenizer, documents) do
{:ok, schema} = Schema.new()
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content", :text, tokenizer
)
{:ok, index} = Index.create(path, schema)
{:ok, writer} = TantivyEx.IndexWriter.new(index)
Enum.each(documents, &TantivyEx.IndexWriter.add_document(writer, &1))
TantivyEx.IndexWriter.commit(writer)
end
  defp calculate_index_size(path) do
    # Sum the sizes of all files in the index directory and convert to MB
    path
    |> Path.join("**/*")
    |> Path.wildcard()
    |> Enum.filter(&File.regular?/1)
    |> Enum.map(&File.stat!(&1).size)
    |> Enum.sum()
    |> Kernel./(1024 * 1024)
  end
end
```
## Real-world Examples
### E-commerce Product Search
Optimize tokenization for product catalogs:
```elixir
defmodule MyApp.EcommerceTokenizer do
  alias TantivyEx.Schema

  def setup_product_schema do
{:ok, schema} = Schema.new()
# Product titles - stemming for better matching
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "title", :text, "default"
)
# Brand names - exact matching important
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "brand", :text, "keyword"
)
# SKUs and model numbers - no transformation
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "sku", :text, "raw"
)
# Product descriptions - full text search
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "description", :text, "default"
)
# Category paths - hierarchical
{:ok, schema} = Schema.add_facet_field(schema, "category")
schema
end
def index_product(index, product) do
document = %{
"title" => product.title,
"brand" => product.brand,
"sku" => product.sku,
"description" => product.description,
"category" => format_category_path(product.category_path)
}
{:ok, writer} = TantivyEx.IndexWriter.new(index)
TantivyEx.IndexWriter.add_document(writer, document)
TantivyEx.IndexWriter.commit(writer)
end
def search_products(index, query, filters \\ %{}) do
# Build multi-field query
search_query = """
(title:(#{query})^3 OR
brand:(#{query})^2 OR
description:(#{query}) OR
sku:#{query}^4)
"""
# Add filters
filtered_query = apply_product_filters(search_query, filters)
searcher = TantivyEx.Searcher.new(index)
TantivyEx.Searcher.search(searcher, filtered_query, 50)
end
defp format_category_path(path_list) do
"/" <> Enum.join(path_list, "/")
end
defp apply_product_filters(base_query, filters) do
filter_parts = []
filter_parts =
if Map.has_key?(filters, :brand) do
["brand:\"#{filters.brand}\"" | filter_parts]
else
filter_parts
end
filter_parts =
if Map.has_key?(filters, :category) do
["category:\"#{filters.category}/*\"" | filter_parts]
else
filter_parts
end
if length(filter_parts) > 0 do
filter_string = Enum.join(filter_parts, " AND ")
"(#{base_query}) AND (#{filter_string})"
else
base_query
end
end
end
```
### Document Management System
Handle various document types with appropriate tokenization:
```elixir
defmodule MyApp.DocumentTokenizer do
  alias TantivyEx.Schema

  def setup_document_schema do
{:ok, schema} = Schema.new()
# Document titles
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "title", :text, "default"
)
# Full document content
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content", :text, "default"
)
# Author names - minimal processing
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "author", :text, "simple"
)
    # File paths and names ("filename" is a custom tokenizer name; register it
    # before indexing, e.g. with TantivyEx.Tokenizer.register_regex_tokenizer/2)
    {:ok, schema} = Schema.add_text_field_with_tokenizer(
      schema, "filename", :text, "filename"
    )
# Tags - exact matching
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "tags", :text, "keyword"
)
schema
end
def index_document(index, doc_metadata, content) do
# Extract meaningful content based on file type
processed_content = process_by_file_type(content, doc_metadata.file_type)
document = %{
"title" => doc_metadata.title || extract_title_from_filename(doc_metadata.filename),
"content" => processed_content,
"author" => doc_metadata.author,
"filename" => doc_metadata.filename,
"tags" => format_tags(doc_metadata.tags)
}
{:ok, writer} = TantivyEx.IndexWriter.new(index)
TantivyEx.IndexWriter.add_document(writer, document)
TantivyEx.IndexWriter.commit(writer)
end
defp process_by_file_type(content, file_type) do
case file_type do
"pdf" -> clean_pdf_content(content)
"html" -> strip_html_tags(content)
"markdown" -> process_markdown(content)
"code" -> preserve_code_structure(content)
_ -> content
end
end
defp clean_pdf_content(content) do
content
|> String.replace(~r/\s+/, " ") # Normalize whitespace
|> String.replace(~r/[^\p{L}\p{N}\p{P}\p{Z}]/u, "") # Remove non-printable chars
|> String.trim()
end
defp strip_html_tags(html_content) do
html_content
|> String.replace(~r/<[^>]*>/, " ") # Remove HTML tags
|> String.replace(~r/&\w+;/, " ") # Remove HTML entities
|> String.replace(~r/\s+/, " ") # Normalize whitespace
|> String.trim()
end
defp process_markdown(markdown) do
markdown
|> String.replace(~r/#+\s*/, "") # Remove markdown headers
|> String.replace(~r/\*+([^*]+)\*+/, "\\1") # Remove emphasis
|> String.replace(~r/`([^`]+)`/, "\\1") # Remove inline code
|> String.replace(~r/\[([^\]]+)\]\([^)]+\)/, "\\1") # Extract link text
end
defp preserve_code_structure(code) do
# Preserve important code elements for search
code
|> String.replace(~r/\/\*.*?\*\//s, " ") # Remove block comments
|> String.replace(~r/\/\/.*$/m, "") # Remove line comments
|> String.replace(~r/\s+/, " ") # Normalize whitespace
end
defp extract_title_from_filename(filename) do
filename
|> Path.basename()
|> Path.rootname()
|> String.replace(~r/[_-]/, " ")
|> String.split()
|> Enum.map(&String.capitalize/1)
|> Enum.join(" ")
end
defp format_tags(tags) when is_list(tags) do
Enum.join(tags, " ")
end
defp format_tags(tags) when is_binary(tags), do: tags
defp format_tags(_), do: ""
end
```
## Troubleshooting
### Common Tokenization Issues
#### Issue: Search not finding expected results
**Problem**: Queries like "running" don't match documents containing "run"
**Solution**: Use a stemming tokenizer instead of simple:
```elixir
# Change from simple tokenizer
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content", :text, "simple"
)
# To default tokenizer (includes stemming)
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "content", :text, "default"
)
```
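You can confirm the new behavior without reindexing by tokenizing a sample directly (output illustrative):

```elixir
TantivyEx.Tokenizer.tokenize_text("simple", "running runs")
# => ["running", "runs"]  # only these exact forms will match

TantivyEx.Tokenizer.tokenize_text("default", "running runs")
# => ["run", "run"]       # both stem to "run", so either query form matches
```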
#### Issue: Case-sensitive search behavior
**Problem**: Search for "iPhone" doesn't match "iphone"
**Solution**: Ensure your tokenizer includes lowercasing:
```elixir
# Raw tokenizer preserves case
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "product_name", :text, "raw"
)
# Simple tokenizer normalizes case
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "product_name", :text, "simple"
)
```
#### Issue: Special characters causing search problems
**Problem**: Searches for "<user@example.com>" or "API-v2.1" fail
**Solution**: Use keyword tokenizer for identifiers:
```elixir
# For email addresses, URLs, version numbers
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "email", :text, "keyword"
)
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "version", :text, "keyword"
)
```
#### Issue: Poor performance with large documents
**Problem**: Indexing is slow with very large text fields
**Solution**: Consider field-specific optimization:
```elixir
defmodule MyApp.LargeDocumentOptimizer do
def optimize_large_document(document) do
# Truncate very large fields
optimized = %{
"title" => document["title"],
"summary" => extract_summary(document["content"]),
"content" => truncate_content(document["content"], 10_000),
"keywords" => extract_keywords(document["content"])
}
optimized
end
defp extract_summary(content) do
content
|> String.split(~r/\.\s+/)
|> Enum.take(3)
|> Enum.join(". ")
end
defp truncate_content(content, max_chars) do
if String.length(content) > max_chars do
String.slice(content, 0, max_chars)
else
content
end
end
defp extract_keywords(content) do
# Simple keyword extraction
content
|> String.downcase()
|> String.split(~r/\W+/)
|> Enum.frequencies()
|> Enum.sort_by(&elem(&1, 1), :desc)
|> Enum.take(20)
|> Enum.map(&elem(&1, 0))
|> Enum.join(" ")
end
end
```
### Debugging Tokenization
Test how your tokenizer processes text:
```elixir
defmodule MyApp.TokenizerDebugger do
  alias TantivyEx.{Schema, Index}

  def debug_tokenization(text, tokenizer) do
# Create a temporary index to test tokenization
{:ok, schema} = Schema.new()
{:ok, schema} = Schema.add_text_field_with_tokenizer(
schema, "test_field", :text, tokenizer
)
{:ok, index} = Index.create("/tmp/debug_tokenizer", schema)
# Add document
document = %{"test_field" => text}
{:ok, writer} = TantivyEx.IndexWriter.new(index)
TantivyEx.IndexWriter.add_document(writer, document)
TantivyEx.IndexWriter.commit(writer)
# Test search behavior
IO.puts("Original text: #{text}")
IO.puts("Tokenizer: #{tokenizer}")
# Test various search patterns
test_searches = [
String.downcase(text),
String.upcase(text),
text,
"\"#{text}\"", # Exact phrase
String.split(text) |> hd(), # First word
String.split(text) |> List.last() # Last word
]
searcher = TantivyEx.Searcher.new(index)
Enum.each(test_searches, fn query ->
case TantivyEx.Searcher.search(searcher, query, 10) do
{:ok, results} ->
found = length(results) > 0
IO.puts("Query '#{query}': #{if found, do: "FOUND", else: "NOT FOUND"}")
{:error, reason} ->
IO.puts("Query '#{query}': ERROR - #{inspect(reason)}")
end
end)
# Cleanup
File.rm_rf("/tmp/debug_tokenizer")
end
end
# Usage
MyApp.TokenizerDebugger.debug_tokenization("Hello World", "simple")
MyApp.TokenizerDebugger.debug_tokenization("user@example.com", "keyword")
```
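Before reaching for a temporary index, it is often enough to inspect the tokens and their byte offsets directly; if they already look wrong here, the problem lies in tokenization rather than in the query or the index:

```elixir
TantivyEx.Tokenizer.register_default_tokenizers()

TantivyEx.Tokenizer.tokenize_text_detailed("default", "user@example.com")
|> IO.inspect(label: "default")

TantivyEx.Tokenizer.tokenize_text_detailed("keyword", "user@example.com")
|> IO.inspect(label: "keyword")
```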
This comprehensive guide should help you understand and effectively use tokenizers in TantivyEx for optimal search performance and accuracy.