# Document Guide

This guide covers document operations in TantivyEx: creation, validation, indexing, batch processing, and best practices for working with documents.

## Table of Contents

- [Document Fundamentals](#document-fundamentals)
- [Document Creation](#document-creation)
- [Field Types and Values](#field-types-and-values)
- [Document Validation](#document-validation)
- [Indexing Operations](#indexing-operations)
- [Batch Processing](#batch-processing)
- [Advanced Topics](#advanced-topics)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)

## Document Fundamentals

### What is a Document?

In TantivyEx, a document is a collection of fields and their values that represents a single unit of data in your search index. Documents are typically represented as Elixir maps where keys are field names and values are the field data.

```elixir
# Basic document structure
document = %{
  "title" => "Introduction to Elixir",
  "content" => "Elixir is a dynamic, functional language designed for building maintainable applications.",
  "author" => "José Valim",
  "published_at" => "2011-07-11T00:00:00Z",
  "tags" => "/programming/functional/elixir",
  "price" => 49.99,
  "available" => true
}
```

### Document Schema Relationship

Documents must conform to the schema you've defined for your index. The schema determines:

- Which fields are available
- What types of data each field can contain
- How fields are indexed and stored
- Whether fields support faceting, full-text search, or fast filtering

```elixir
# Define schema first
{:ok, schema} = TantivyEx.Schema.new()
  |> TantivyEx.Schema.add_text_field("title", stored: true, indexed: true)
  |> TantivyEx.Schema.add_text_field("content", stored: true, indexed: true)
  |> TantivyEx.Schema.add_u64_field("price", stored: true, fast: true)
  |> TantivyEx.Schema.build()

# Then create documents that match the schema
document = %{
  "title" => "Sample Document",
  "content" => "This document matches our schema perfectly.",
  "price" => 1999  # Note: price as integer (u64)
}
```
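
As a rough illustration of what "conforming" means, the sketch below compares a document's keys with a list of field names. `SchemaConformance` and the plain `schema_fields` list are hypothetical stand-ins for illustration; they are not part of TantivyEx, and the real schema carries type information that this check ignores.

```elixir
defmodule SchemaConformance do
  # Hypothetical helper, not a TantivyEx API: compares document keys with a
  # plain list of field names that mirrors the schema.
  def check(document, schema_fields) when is_map(document) and is_list(schema_fields) do
    keys = Map.keys(document)

    %{
      # Keys present in the document but absent from the schema
      unknown: Enum.sort(keys -- schema_fields),
      # Schema fields the document does not provide
      missing: Enum.sort(schema_fields -- keys)
    }
  end
end

schema_fields = ["title", "content", "price"]
document = %{"title" => "Sample Document", "price" => 1999, "colour" => "red"}

IO.inspect(SchemaConformance.check(document, schema_fields))
# → %{missing: ["content"], unknown: ["colour"]}
```

Whether a missing field is an error depends on your application; this sketch only reports the difference.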

## Document Creation

### Basic Document Creation

Create documents as simple Elixir maps:

```elixir
# Text document for a blog post
blog_post = %{
  "title" => "Getting Started with TantivyEx",
  "content" => "TantivyEx brings powerful full-text search capabilities to Elixir applications...",
  "author" => "Your Name",
  "published_at" => "2024-01-15T10:30:00Z",
  "category" => "/blog/tutorials/elixir"
}

# Product document for e-commerce
product = %{
  "name" => "Wireless Headphones",
  "description" => "High-quality wireless headphones with noise cancellation",
  "price" => 19999,  # Price in cents
  "brand" => "AudioTech",
  "category" => "/electronics/audio/headphones",
  "in_stock" => true,
  "release_date" => "2024-01-01T00:00:00Z"
}

# User document for search
user = %{
  "username" => "john_doe",
  "email" => "john@example.com",
  "full_name" => "John Doe",
  "bio" => "Software developer with 10 years of experience",
  "location" => "San Francisco, CA",
  "joined_at" => "2023-06-15T14:22:00Z",
  "is_verified" => true
}
```

### Dynamic Document Creation

Build documents programmatically:

```elixir
def create_article_document(article, author) do
  %{
    "title" => article.title,
    "content" => article.body,
    "author" => author.name,
    "author_id" => author.id,
    "published_at" => DateTime.to_iso8601(article.published_at),
    "word_count" => String.split(article.body) |> length(),
    "category" => "/articles/#{article.category}",
    "tags" => Enum.join(article.tags, " "),
    "featured" => article.featured?,
    "views" => article.view_count
  }
end

# Usage
document = create_article_document(article, author)
```

## Field Types and Values

### Text Fields

Text fields support full-text search and tokenization:

```elixir
# Simple text
%{"title" => "Machine Learning Basics"}

# Long text content
%{
  "content" => """
  Machine learning is a method of data analysis that automates analytical
  model building. It is a branch of artificial intelligence based on the
  idea that systems can learn from data, identify patterns and make
  decisions with minimal human intervention.
  """
}

# Multiple language support
%{
  "title_en" => "Hello World",
  "title_es" => "Hola Mundo",
  "title_fr" => "Bonjour le Monde"
}
```

### Numeric Fields

Numeric fields support range queries and sorting:

```elixir
# Unsigned 64-bit integers (u64)
%{
  "price" => 2999,        # Price in cents
  "views" => 1024,        # View count
  "likes" => 42           # Social engagement
}

# Signed 64-bit integers (i64)
%{
  "temperature" => -15,   # Can be negative
  "elevation" => 2847,    # Altitude in meters
  "score_diff" => -5      # Score difference
}

# Floating-point numbers (f64)
%{
  "rating" => 4.7,        # Star rating
  "longitude" => -122.4194, # GPS coordinates
  "latitude" => 37.7749,
  "price_usd" => 29.99    # Decimal price
}
```
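
Elixir integers have arbitrary precision, so a value can be a valid integer and still not fit a 64-bit field. The following is a minimal sketch of a range check you might run before indexing; `NumericRange` is a hypothetical helper, not a TantivyEx function.

```elixir
defmodule NumericRange do
  # Hypothetical helper: checks that an integer fits the target field type.
  @u64_max Integer.pow(2, 64) - 1
  @i64_max Integer.pow(2, 63) - 1
  @i64_min Integer.pow(2, 63) * -1

  def fits?(:u64, n) when is_integer(n), do: n >= 0 and n <= @u64_max
  def fits?(:i64, n) when is_integer(n), do: n >= @i64_min and n <= @i64_max
  # Anything else (floats, strings, unknown types) does not fit
  def fits?(_type, _value), do: false
end

IO.inspect(NumericRange.fits?(:u64, 2999))
# → true
IO.inspect(NumericRange.fits?(:u64, -15))
# → false
IO.inspect(NumericRange.fits?(:i64, -15))
# → true
```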

### Boolean Fields

Boolean fields for true/false values:

```elixir
%{
  "published" => true,
  "featured" => false,
  "in_stock" => true,
  "on_sale" => false,
  "verified" => true
}
```

### Date Fields

Date and time values (stored as Unix timestamps):

```elixir
# ISO 8601 string format (recommended)
%{
  "created_at" => "2024-01-15T10:30:00Z",
  "updated_at" => "2024-01-15T14:45:30.123Z",
  "published_at" => "2024-01-15T09:00:00-08:00"  # With timezone
}

# Unix timestamp (integer seconds)
%{
  "created_at" => 1705314600,  # 2024-01-15T10:30:00Z
  "expires_at" => 1705401000   # 24 hours later
}

# Current time helper
%{
  "indexed_at" => DateTime.utc_now() |> DateTime.to_iso8601()
}
```
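
To see how the two representations relate, the sketch below converts ISO 8601 strings to Unix seconds using only the Elixir standard library. `DateValue` is a hypothetical helper name; how TantivyEx performs the conversion internally is not shown here.

```elixir
defmodule DateValue do
  # Hypothetical helper: converts an ISO 8601 string to Unix seconds (UTC).
  def to_unix(iso8601) when is_binary(iso8601) do
    case DateTime.from_iso8601(iso8601) do
      # from_iso8601/1 normalizes to UTC and also returns the original offset
      {:ok, datetime, _utc_offset} -> {:ok, DateTime.to_unix(datetime)}
      {:error, reason} -> {:error, reason}
    end
  end
end

IO.inspect(DateValue.to_unix("2024-01-15T10:30:00Z"))
# → {:ok, 1705314600}
IO.inspect(DateValue.to_unix("2024-01-15T09:00:00-08:00"))
# → {:ok, 1705338000}
IO.inspect(DateValue.to_unix("15/01/2024"))
# → {:error, :invalid_format}
```

Parsing up front like this surfaces malformed dates before they reach the index writer.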

### Facet Fields

Hierarchical facets for categorization and filtering:

```elixir
%{
  # Product categories
  "category" => "/electronics/computers/laptops",

  # Geographic hierarchy
  "location" => "/usa/california/san_francisco",

  # Content taxonomy
  "topic" => "/programming/languages/elixir/otp",

  # Multi-level tags
  "tags" => "/blog/technical/tutorial"
}

# Multiple facets
%{
  "primary_category" => "/books/fiction/mystery",
  "secondary_category" => "/books/bestsellers",
  "age_rating" => "/ratings/mature"
}
```
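
Facet paths are plain strings, so they are easy to build and inspect with ordinary string functions. This is a small sketch under that assumption; `FacetPath` and its functions are hypothetical helpers, not part of TantivyEx.

```elixir
defmodule FacetPath do
  # Hypothetical helper: joins segments into a "/a/b/c" facet path.
  def build(segments) when is_list(segments) do
    "/" <> Enum.join(segments, "/")
  end

  # Returns every ancestor of the path plus the path itself, shallowest first.
  def with_ancestors("/" <> rest) do
    rest
    |> String.split("/", trim: true)
    |> Enum.scan(fn segment, acc -> acc <> "/" <> segment end)
    |> Enum.map(&("/" <> &1))
  end
end

IO.inspect(FacetPath.build(["electronics", "computers", "laptops"]))
# → "/electronics/computers/laptops"
IO.inspect(FacetPath.with_ancestors("/usa/california/san_francisco"))
# → ["/usa", "/usa/california", "/usa/california/san_francisco"]
```

Listing the ancestors can help when you want to show which broader categories a document would fall under.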

### Binary Data (Bytes)

Store binary data as base64-encoded strings:

```elixir
# Base64-encoded data
%{
  "thumbnail" => "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==",
  "signature" => "SGVsbG8gV29ybGQ=",  # "Hello World" in base64
  "metadata" => Base.encode64(Jason.encode!(%{version: "1.0", type: "image"}))
}

# Binary data from files
%{
  "file_content" => File.read!("document.pdf") |> Base.encode64(),
  "image_data" => File.read!("image.png") |> Base.encode64()
}
```
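
A quick round trip with the standard `Base` module shows what the encoded values look like and how to detect input that is not valid base64. Only standard library calls are used; nothing here is TantivyEx-specific.

```elixir
# Encode raw bytes for a bytes field
encoded = Base.encode64("Hello World")
IO.inspect(encoded)
# → "SGVsbG8gV29ybGQ="

# Decode when reading the stored value back
IO.inspect(Base.decode64(encoded))
# → {:ok, "Hello World"}

# Invalid input returns :error instead of raising
IO.inspect(Base.decode64("not base64!"))
# → :error
```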

### JSON Objects

Complex nested data structures:

```elixir
%{
  "metadata" => %{
    "version" => "2.1",
    "format" => "json",
    "compression" => "gzip",
    "size_bytes" => 1024
  },

  "settings" => %{
    "notifications" => %{
      "email" => true,
      "push" => false,
      "sms" => true
    },
    "privacy" => %{
      "public_profile" => false,
      "show_email" => false
    }
  }
}
```
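
When debugging nested values it can help to list every leaf with its full path. The sketch below flattens a nested map into dotted paths; `JsonPaths` is a hypothetical helper for inspection only and makes no claim about how TantivyEx stores or queries JSON fields.

```elixir
defmodule JsonPaths do
  # Hypothetical helper: flattens a nested map into {dotted_path, value} pairs.
  def flatten(map, prefix \\ nil) when is_map(map) do
    map
    |> Enum.flat_map(fn {key, value} ->
      path = if prefix, do: "#{prefix}.#{key}", else: to_string(key)

      # Recurse into plain maps; treat everything else as a leaf
      if is_map(value) and not is_struct(value) do
        flatten(value, path)
      else
        [{path, value}]
      end
    end)
    |> Enum.sort()
  end
end

settings = %{
  "version" => "2.1",
  "settings" => %{"notifications" => %{"email" => true, "push" => false}}
}

IO.inspect(JsonPaths.flatten(settings))
# → [{"settings.notifications.email", true}, {"settings.notifications.push", false}, {"version", "2.1"}]
```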

### IP Addresses

IPv4 and IPv6 addresses:

```elixir
%{
  "client_ip" => "192.168.1.1",           # IPv4
  "server_ip" => "2001:db8::1",           # IPv6
  "proxy_ip" => "10.0.0.1",               # Private IPv4
  "cdn_ip" => "2606:4700:3034::ac43:c427" # IPv6 CDN
}
```
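
Address strings can be checked before indexing with Erlang's `:inet` module. This is a minimal sketch; `IpValue` is a hypothetical helper, and whether TantivyEx performs a similar check itself is not assumed here.

```elixir
defmodule IpValue do
  # Hypothetical helper built on :inet.parse_strict_address/1, which rejects
  # shortened forms such as "127.1".
  def classify(address) when is_binary(address) do
    case :inet.parse_strict_address(String.to_charlist(address)) do
      {:ok, tuple} when tuple_size(tuple) == 4 -> {:ok, :ipv4}
      {:ok, tuple} when tuple_size(tuple) == 8 -> {:ok, :ipv6}
      {:error, _reason} -> {:error, :invalid_ip}
    end
  end
end

IO.inspect(IpValue.classify("192.168.1.1"))
# → {:ok, :ipv4}
IO.inspect(IpValue.classify("2001:db8::1"))
# → {:ok, :ipv6}
IO.inspect(IpValue.classify("999.1.1.1"))
# → {:error, :invalid_ip}
```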

## Document Validation

### Schema-Based Validation

Validate documents against your schema before indexing:

```elixir
# Define your schema
{:ok, schema} = TantivyEx.Schema.new()
  |> TantivyEx.Schema.add_text_field("title", stored: true)
  |> TantivyEx.Schema.add_u64_field("price", stored: true)
  |> TantivyEx.Schema.add_bool_field("available", stored: true)
  |> TantivyEx.Schema.build()

# Create a document
document = %{
  "title" => "Product Name",
  "price" => 1999,
  "available" => true
}

# Validate the document
case TantivyEx.Document.validate(document, schema) do
  {:ok, validated_doc} ->
    IO.puts("Document is valid!")
    # Proceed with indexing

  {:error, reason} ->
    IO.puts("Validation failed: #{reason}")
    # Handle validation error
end
```

### Type Conversion and Validation

TantivyEx automatically converts compatible types:

```elixir
# These values will be automatically converted:
document = %{
  "price" => "1999",           # String -> u64
  "rating" => "4.5",           # String -> f64
  "available" => "true",       # String -> boolean
  "count" => 42.0              # f64 -> u64 (if whole number)
}

# Validation with helpful error messages
{:error, "Field 'price': Expected numeric value"} =
  TantivyEx.Document.validate(%{"price" => "not_a_number"}, schema)
```
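
The conversion rules listed above can be pictured with a small pure-Elixir sketch. This is an illustration only: `Coerce` is a hypothetical module, and its exact rules and error messages are assumptions, not TantivyEx's implementation.

```elixir
defmodule Coerce do
  # Illustration only: these rules are assumptions, not TantivyEx's code.
  def to_u64(v) when is_integer(v) and v >= 0, do: {:ok, v}

  def to_u64(v) when is_float(v) and v >= 0 do
    # Accept floats only when they carry no fractional part
    if Float.floor(v) == v, do: {:ok, trunc(v)}, else: {:error, "Expected whole number"}
  end

  def to_u64(v) when is_binary(v) do
    case Integer.parse(v) do
      # The empty remainder means the whole string was numeric
      {n, ""} when n >= 0 -> {:ok, n}
      _ -> {:error, "Expected numeric value"}
    end
  end

  def to_u64(_), do: {:error, "Expected numeric value"}

  def to_bool(v) when is_boolean(v), do: {:ok, v}
  def to_bool("true"), do: {:ok, true}
  def to_bool("false"), do: {:ok, false}
  def to_bool(_), do: {:error, "Expected boolean value"}
end

IO.inspect(Coerce.to_u64("1999"))
# → {:ok, 1999}
IO.inspect(Coerce.to_u64(42.0))
# → {:ok, 42}
IO.inspect(Coerce.to_u64("not_a_number"))
# → {:error, "Expected numeric value"}
```

Coercing explicitly in your own code keeps the accepted input formats visible instead of relying on implicit conversion.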

### Custom Validation Functions

Create custom validation logic:

```elixir
defmodule MyApp.DocumentValidator do
  def validate_product(document) do
    with {:ok, doc} <- validate_required_fields(document),
         {:ok, doc} <- validate_price_range(doc),
         {:ok, doc} <- validate_category_format(doc) do
      {:ok, doc}
    end
  end

  defp validate_required_fields(doc) do
    required = ["title", "price", "category"]
    missing = required -- Map.keys(doc)

    case missing do
      [] -> {:ok, doc}
      fields -> {:error, "Missing required fields: #{Enum.join(fields, ", ")}"}
    end
  end

  defp validate_price_range(doc) do
    price = Map.get(doc, "price", 0)

    if price > 0 and price < 1_000_000 do
      {:ok, doc}
    else
      {:error, "Price must be between 1 and 999,999"}
    end
  end

  defp validate_category_format(doc) do
    category = Map.get(doc, "category", "")

    if String.starts_with?(category, "/") do
      {:ok, doc}
    else
      {:error, "Category must start with '/'"}
    end
  end
end
```

## Indexing Operations

### Single Document Indexing

Add individual documents to the index:

```elixir
# Create or open an index and get a writer
{:ok, index} = TantivyEx.Index.create_in_dir("path/to/index", schema)
{:ok, writer} = TantivyEx.IndexWriter.new(index)

# Create and validate document
document = %{
  "title" => "New Article",
  "content" => "Article content here...",
  "published_at" => DateTime.utc_now() |> DateTime.to_iso8601()
}

# Add document to index
case TantivyEx.IndexWriter.add_document(writer, document) do
  :ok ->
    IO.puts("Document indexed successfully")

  {:error, reason} ->
    IO.puts("Failed to index document: #{reason}")
end

# Commit changes to make them searchable
:ok = TantivyEx.IndexWriter.commit(writer)
```

### Schema-Aware Indexing

Use schema information for better type handling:

```elixir
# Add document with schema-aware type handling
case TantivyEx.Document.add_with_schema(writer, document, schema) do
  :ok ->
    IO.puts("Document added with schema validation")

  {:error, reason} ->
    IO.puts("Schema validation failed: #{reason}")
end
```

### Handling Indexing Errors

Robust error handling for production use:

```elixir
def safely_index_document(writer, document, schema) do
  case TantivyEx.Document.validate(document, schema) do
    {:ok, validated_doc} ->
      case TantivyEx.Document.add_with_schema(writer, validated_doc, schema) do
        :ok ->
          {:ok, :indexed}

        {:error, reason} ->
          {:error, {:indexing_failed, reason}}
      end

    {:error, reason} ->
      {:error, {:validation_failed, reason}}
  end
rescue
  # Convert unexpected exceptions into an error tuple
  exception ->
    {:error, {:exception, Exception.message(exception)}}
end
```

## Batch Processing

### Batch Document Addition

Process multiple documents efficiently:

```elixir
# Prepare batch of documents
documents = [
  %{"title" => "Doc 1", "content" => "Content 1"},
  %{"title" => "Doc 2", "content" => "Content 2"},
  %{"title" => "Doc 3", "content" => "Content 3"}
]

# Batch add with comprehensive results
case TantivyEx.Document.add_batch(writer, documents, schema) do
  {:ok, results} ->
    IO.puts("Batch completed: #{results}")
    # Results format: {"successful": 3, "errors": 0}

  {:error, errors} ->
    IO.puts("Batch had errors: #{inspect(errors)}")
    # Errors format: [{index, error_message}, ...]
end
```

### Processing Large Datasets

Handle large document collections efficiently:

```elixir
defmodule MyApp.BulkIndexer do
  @batch_size 1000

  def index_all_documents(writer, documents, schema) do
    documents
    |> Stream.chunk_every(@batch_size)
    |> Stream.with_index()
    |> Enum.reduce({0, []}, fn {batch, batch_num}, {total_success, all_errors} ->
      IO.puts("Processing batch #{batch_num + 1}")

      case TantivyEx.Document.add_batch(writer, batch, schema) do
        {:ok, result} ->
          success_count = parse_success_count(result)
          {total_success + success_count, all_errors}

        {:error, errors} ->
          {total_success, all_errors ++ errors}
      end
    end)
  end

  defp parse_success_count(result_json) do
    case Jason.decode(result_json) do
      {:ok, %{"successful" => count}} -> count
      _ -> 0
    end
  end
end

# Usage
{success_count, errors} = MyApp.BulkIndexer.index_all_documents(writer, documents, schema)
IO.puts("Indexed #{success_count} documents with #{length(errors)} errors")
```
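
The chunk-and-accumulate pattern can be tried without an index by passing the indexing step in as a function. In this sketch `BatchRunner` and `fake_index` are hypothetical; the `{:ok, count}` and `{:error, errors}` return shapes are assumptions chosen for the example, not the library's contract.

```elixir
defmodule BatchRunner do
  # index_fun stands in for whatever call indexes one batch. It is assumed to
  # return {:ok, count} or {:error, list_of_errors}.
  def run(documents, batch_size, index_fun) do
    {ok_count, errors} =
      documents
      |> Stream.chunk_every(batch_size)
      |> Enum.reduce({0, []}, fn batch, {ok_count, errors} ->
        case index_fun.(batch) do
          {:ok, count} -> {ok_count + count, errors}
          # Prepend per-batch errors; reversed and flattened once at the end
          {:error, batch_errors} -> {ok_count, [batch_errors | errors]}
        end
      end)

    {ok_count, errors |> Enum.reverse() |> List.flatten()}
  end
end

# Fake indexer: rejects any batch containing a document without a title.
fake_index = fn batch ->
  case Enum.reject(batch, &Map.has_key?(&1, "title")) do
    [] -> {:ok, length(batch)}
    bad -> {:error, Enum.map(bad, &{:missing_title, &1})}
  end
end

docs = [
  %{"title" => "A"},
  %{"title" => "B"},
  %{"body" => "no title"},
  %{"title" => "D"},
  %{"title" => "E"}
]

IO.inspect(BatchRunner.run(docs, 2, fake_index))
# → {3, [missing_title: %{"body" => "no title"}]}
```

Note that document "D" is lost along with its failing batch in this sketch, which is why pre-filtering invalid documents is worth the effort.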

### Memory-Efficient Streaming

Stream large datasets without loading everything into memory:

```elixir
defmodule MyApp.StreamingIndexer do
  # Logger.error/1 and Logger.warning/1 are macros, so Logger must be required
  require Logger

  def index_from_stream(writer, document_stream, schema) do
    document_stream
    |> Stream.map(&validate_and_prepare/1)
    |> Stream.filter(&match?({:ok, _}, &1))
    |> Stream.map(fn {:ok, doc} -> doc end)
    |> Stream.chunk_every(500)
    |> Enum.each(fn batch ->
      case TantivyEx.Document.add_batch(writer, batch, schema) do
        {:ok, _} -> :ok
        {:error, errors} ->
          Logger.error("Batch indexing errors: #{inspect(errors)}")
      end
    end)
  end

  defp validate_and_prepare(raw_document) do
    # Custom preparation logic
    case prepare_document(raw_document) do
      {:ok, doc} -> {:ok, doc}
      {:error, reason} ->
        Logger.warning("Document preparation failed: #{inspect(reason)}")
        {:error, reason}
    end
  end

  # Placeholder: replace with your own preparation logic
  defp prepare_document(raw) when is_map(raw), do: {:ok, raw}
  defp prepare_document(_raw), do: {:error, :not_a_map}
end
```

## Advanced Topics

### Document Updates

TantivyEx doesn't support in-place updates, but you can rebuild indices:

```elixir
defmodule MyApp.DocumentUpdater do
  def update_document(old_index_path, new_index_path, doc_id, updates, schema) do
    # Create new index
    {:ok, new_index} = TantivyEx.Index.create_in_dir(new_index_path, schema)
    {:ok, writer} = TantivyEx.IndexWriter.new(new_index)

    # Copy all documents except the one being updated
    {:ok, searcher} = TantivyEx.Searcher.new(old_index_path)

    # This is a simplified approach - in practice you'd want to
    # stream through all documents more efficiently
    documents = get_all_documents(searcher)

    updated_documents = documents
    |> Enum.map(fn doc ->
      if doc["id"] == doc_id do
        Map.merge(doc, updates)
      else
        doc
      end
    end)

    # Index all documents to new index (add_document takes a single document,
    # so the list goes through add_batch)
    case TantivyEx.Document.add_batch(writer, updated_documents, schema) do
      {:ok, _} ->
        :ok = TantivyEx.IndexWriter.commit(writer)
        {:ok, new_index}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

### Document Deletion

Implement document deletion through index rebuilding:

```elixir
def delete_documents(index_path, new_index_path, doc_ids, schema) do
  delete_set = MapSet.new(doc_ids)

  {:ok, new_index} = TantivyEx.Index.create_in_dir(new_index_path, schema)
  {:ok, writer} = TantivyEx.IndexWriter.new(new_index)

  # Copy all documents except those being deleted
  documents = get_all_documents_from_index(index_path)

  filtered_documents = Enum.reject(documents, fn doc ->
    MapSet.member?(delete_set, doc["id"])
  end)

  case TantivyEx.Document.add_batch(writer, filtered_documents, schema) do
    {:ok, _} ->
      :ok = TantivyEx.IndexWriter.commit(writer)
      {:ok, new_index}

    {:error, reason} ->
      {:error, reason}
  end
end
```

### Complex Document Structures

Handle nested data and complex transformations:

```elixir
defmodule MyApp.ComplexDocuments do
  def transform_product_for_search(product) do
    %{
      "id" => product.id,
      "name" => product.name,
      "description" => product.description,

      # Flatten nested attributes
      "brand" => product.brand.name,
      "brand_id" => product.brand.id,

      # Create searchable text from multiple fields
      "searchable_text" => build_searchable_text(product),

      # Price information
      "price_cents" => product.price_cents,
      "price_usd" => product.price_cents / 100.0,

      # Category hierarchy
      "category" => build_category_path(product.categories),

      # Inventory data
      "in_stock" => product.inventory.quantity > 0,
      "quantity" => product.inventory.quantity,

      # Dates
      "created_at" => DateTime.to_iso8601(product.inserted_at),
      "updated_at" => DateTime.to_iso8601(product.updated_at),

      # Features and tags
      "features" => Enum.join(product.features, " "),
      "tags" => build_tag_facets(product.tags)
    }
  end

  defp build_searchable_text(product) do
    [
      product.name,
      product.description,
      product.brand.name,
      Enum.join(product.features, " "),
      Enum.map(product.tags, & &1.name) |> Enum.join(" ")
    ]
    |> Enum.join(" ")
    |> String.downcase()
  end

  defp build_category_path(categories) do
    categories
    |> Enum.map(& &1.slug)
    |> Enum.join("/")
    |> then(&"/#{&1}")
  end

  defp build_tag_facets(tags) do
    tags
    |> Enum.map(& &1.category)
    |> Enum.uniq()
    |> Enum.join("/")
    |> then(&"/tags/#{&1}")
  end
end
```

## Best Practices

### Document Design Principles

1. **Keep Fields Focused**: Each field should have a single, clear purpose
2. **Use Appropriate Types**: Choose the right field type for your data
3. **Design for Queries**: Structure documents to support your search patterns
4. **Denormalize When Appropriate**: Flatten related data into the document when it improves search performance

```elixir
# Good: Clear field purposes
%{
  "title" => "Document title",
  "content" => "Main document content",
  "author" => "Author name",
  "published_at" => "2024-01-15T10:30:00Z",
  "word_count" => 1500,
  "category" => "/articles/technical"
}

# Avoid: Mixed purposes in single fields
%{
  "metadata" => "Title: Document | Author: John | Date: 2024-01-15"  # Hard to search
}
```

### Performance Optimization

1. **Batch Operations**: Use batch processing for multiple documents
2. **Schema Design**: Design schema fields to match query patterns
3. **Field Options**: Use appropriate indexing options (stored, fast, indexed)
4. **Memory Management**: Process large datasets in chunks

```elixir
# Efficient batch processing
def index_efficiently(writer, large_dataset, schema) do
  large_dataset
  |> Stream.chunk_every(1000)  # Process in batches
  |> Enum.each(fn batch ->
    # Pre-validate batch (valid_document?/2 is an application-defined helper)
    validated_batch = Enum.filter(batch, &valid_document?(&1, schema))
    TantivyEx.Document.add_batch(writer, validated_batch, schema)
  end)
end
```

### Error Handling

1. **Validate Early**: Check documents before attempting to index
2. **Graceful Degradation**: Handle partial failures in batch operations
3. **Logging**: Log validation and indexing errors for debugging
4. **Recovery**: Implement retry logic for transient failures

```elixir
defmodule MyApp.SafeIndexer do
  require Logger

  def safe_index(writer, document, schema, retries \\ 3) do
    case TantivyEx.Document.validate(document, schema) do
      {:ok, validated_doc} ->
        attempt_index(writer, validated_doc, schema, retries)

      {:error, reason} ->
        Logger.error("Document validation failed: #{reason}")
        {:error, {:validation, reason}}
    end
  end

  defp attempt_index(writer, document, schema, retries) when retries > 0 do
    case TantivyEx.Document.add_with_schema(writer, document, schema) do
      :ok ->
        {:ok, :indexed}

      {:error, reason} when retries > 1 ->
        Logger.warning("Indexing failed, retrying: #{inspect(reason)}")
        Process.sleep(100)  # Brief delay
        attempt_index(writer, document, schema, retries - 1)

      {:error, reason} ->
        Logger.error("Indexing failed after retries: #{inspect(reason)}")
        {:error, {:indexing, reason}}
    end
  end
end
```
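
For transient failures, a fixed delay can be replaced by one that grows between attempts. The sketch below is a generic retry wrapper with exponential backoff; `Retry` is a hypothetical helper, and the doubling strategy is one reasonable choice rather than a recommendation from the library.

```elixir
defmodule Retry do
  # Hypothetical helper: retries fun while it returns {:error, _}, doubling
  # the delay after each failed attempt.
  def with_backoff(fun, attempts, delay_ms \\ 100)

  def with_backoff(fun, attempts, delay_ms) when attempts > 0 do
    case fun.() do
      {:error, _reason} when attempts > 1 ->
        Process.sleep(delay_ms)
        with_backoff(fun, attempts - 1, delay_ms * 2)

      # Success, or the final error once attempts are exhausted
      result ->
        result
    end
  end
end

# Simulated operation that fails twice before succeeding
{:ok, counter} = Agent.start_link(fn -> 0 end)

flaky = fn ->
  call = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
  if call < 3, do: {:error, :busy}, else: {:ok, call}
end

IO.inspect(Retry.with_backoff(flaky, 5, 1))
# → {:ok, 3}
```

Wrapping the indexing call in a zero-arity function keeps the retry logic independent of what is being retried.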

### Data Consistency

1. **Atomic Operations**: Ensure related documents are indexed together
2. **Version Control**: Include version information in documents
3. **Validation**: Implement comprehensive validation rules
4. **Backup Strategy**: Regular index backups before major updates

```elixir
# Version-aware documents
content = "Updated document body"

%{
  "id" => "doc_123",
  "version" => 3,
  "title" => "Updated Document",
  "last_modified" => DateTime.utc_now() |> DateTime.to_iso8601(),
  # SHA-256 of the content, hex-encoded
  "checksum" => :crypto.hash(:sha256, content) |> Base.encode16()
}
```
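
A stored checksum also makes it cheap to decide whether a document needs reindexing. This is a minimal sketch assuming the document keeps a lowercase hex SHA-256 of its content in a `"checksum"` field; `Checksum` is a hypothetical helper.

```elixir
defmodule Checksum do
  # Hypothetical helper: lowercase hex SHA-256 of the given binary.
  def sha256_hex(content) when is_binary(content) do
    :crypto.hash(:sha256, content) |> Base.encode16(case: :lower)
  end

  # True when the stored checksum no longer matches the current content.
  def changed?(document, content) do
    Map.get(document, "checksum") != sha256_hex(content)
  end
end

doc = %{"id" => "doc_123", "checksum" => Checksum.sha256_hex("abc")}

IO.inspect(Checksum.changed?(doc, "abc"))
# → false
IO.inspect(Checksum.changed?(doc, "abcd"))
# → true
```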

## Troubleshooting

### Common Issues

#### Type Mismatch Errors

```elixir
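# Problem: Wrong data type
document = %{"price" => "not_a_number"}

# Solution: Parse explicitly. Integer.parse/1 returns :error on bad input,
# whereas String.to_integer/1 would raise.
case Integer.parse(price_string) do
  {price, ""} -> {:ok, %{"price" => price}}
  _ -> {:error, "Invalid price: #{inspect(price_string)}"}
end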
```

#### Missing Required Fields

```elixir
# Problem: Document missing schema fields
document = %{"title" => "Test"}  # Missing other required fields

# Solution: Complete document validation
defp ensure_required_fields(document, required_fields) do
  missing = required_fields -- Map.keys(document)
  if missing == [], do: {:ok, document}, else: {:error, "Missing: #{inspect(missing)}"}
end
```

#### Large Document Performance

```elixir
# Problem: Very large documents causing memory issues
huge_document = %{"content" => very_large_text}

# Solution: Content chunking or field splitting
def split_large_content(content, max_size \\ 10_000) do
  if String.length(content) > max_size do
    content
    |> String.graphemes()
    |> Enum.chunk_every(max_size)
    |> Enum.map(&Enum.join/1)
  else
    [content]
  end
end
```
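
Once content is split, each piece usually becomes its own document. The sketch below shows one way to do that; `ChunkedDocument` is a hypothetical helper, and `"chunk_index"` is an assumed extra field that your schema would need to define.

```elixir
defmodule ChunkedDocument do
  # Hypothetical helper: emits one document per chunk of the "content" field,
  # copying all other fields and adding a "chunk_index".
  def split(document, max_graphemes) when max_graphemes > 0 do
    document
    |> Map.fetch!("content")
    |> String.graphemes()
    |> Enum.chunk_every(max_graphemes)
    |> Enum.with_index()
    |> Enum.map(fn {graphemes, index} ->
      document
      |> Map.put("content", Enum.join(graphemes))
      |> Map.put("chunk_index", index)
    end)
  end
end

chunks = ChunkedDocument.split(%{"id" => "doc_1", "content" => "abcdefg"}, 3)

IO.inspect(Enum.map(chunks, &{&1["chunk_index"], &1["content"]}))
# → [{0, "abc"}, {1, "def"}, {2, "g"}]
```

Keeping the original `"id"` on every chunk lets you group search hits back into the source document. Empty content yields no chunks in this sketch.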

#### Batch Processing Failures

```elixir
# Problem: Entire batch fails due to one bad document
documents = [good_doc1, bad_doc, good_doc2]

# Solution: Filter and process valid documents
def process_with_filtering(writer, documents, schema) do
  {valid_docs, invalid_docs} =
    Enum.split_with(documents, &valid_document?(&1, schema))

  result = TantivyEx.Document.add_batch(writer, valid_docs, schema)

  unless Enum.empty?(invalid_docs) do
    Logger.warning("Skipped #{length(invalid_docs)} invalid documents")
  end

  result
end
```

### Debugging Tips

1. **Enable Logging**: Use Logger to track document processing
2. **Validate Incrementally**: Test schema and documents separately
3. **Check Field Types**: Verify field type definitions match your data
4. **Monitor Memory**: Watch memory usage during large batch operations
5. **Test with Small Batches**: Start with small batches to identify issues

```elixir
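# Debug document structure
def debug_document(document, schema) do
  IO.puts("Document keys: #{inspect(Map.keys(document))}")
  IO.puts("Schema fields: #{inspect(TantivyEx.Schema.field_names(schema))}")

  # Check each field type
  Enum.each(document, fn {field, value} ->
    IO.puts("#{field}: #{inspect(value)} (#{type_name(value)})")
  end)
end

# Elixir has no built-in typeof/1, so classify values with pattern matching and guards
defp type_name(%module{}), do: inspect(module)
defp type_name(v) when is_binary(v), do: "string"
defp type_name(v) when is_boolean(v), do: "boolean"
defp type_name(v) when is_integer(v), do: "integer"
defp type_name(v) when is_float(v), do: "float"
defp type_name(v) when is_map(v), do: "map"
defp type_name(_), do: "other"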
```

This comprehensive document guide provides everything you need to work effectively with documents in TantivyEx, from basic operations to advanced patterns and troubleshooting.