
# Ragex Code Analysis Guide

Comprehensive guide to Ragex's code analysis capabilities powered by Metastatic and semantic embeddings.

## Table of Contents

1. [Overview](#overview)
2. [Analysis Approaches](#analysis-approaches)
3. [Code Duplication Detection](#code-duplication-detection)
4. [Dead Code Detection](#dead-code-detection)
5. [Dependency Analysis](#dependency-analysis)
6. [Impact Analysis](#impact-analysis)
7. [MCP Tools Reference](#mcp-tools-reference)
8. [Best Practices](#best-practices)
9. [Troubleshooting](#troubleshooting)

## Overview

Ragex provides advanced code analysis capabilities through two complementary approaches:

1. **AST-Based Analysis** - Precise structural analysis via Metastatic
2. **Embedding-Based Analysis** - Semantic similarity via ML embeddings

All analysis features are accessible via MCP tools and can be integrated into your development workflow.

### Supported Languages

- Elixir (.ex, .exs)
- Erlang (.erl, .hrl)
- Python (.py)
- JavaScript/TypeScript (.js, .ts)
- Ruby (.rb)
- Haskell (.hs)

## Analysis Approaches

### AST-Based Analysis (Metastatic)

**Advantages:**
- Precise structural matching
- Language-aware analysis
- Detects subtle code patterns
- No training required

**Use Cases:**
- Exact and near-exact code duplication
- Dead code detection (unreachable code)
- Structural similarity analysis

### Embedding-Based Analysis

**Advantages:**
- Semantic understanding
- Cross-language similarity
- Finds conceptually similar code
- Works with comments and documentation

**Use Cases:**
- Finding semantically similar functions
- Code smell detection
- Refactoring opportunities
- Cross-project similarity

## Code Duplication Detection

Ragex detects four types of code clones using Metastatic's AST comparison:

### Clone Types

#### Type I: Exact Clones
Identical code with only whitespace/comment differences.

```elixir
# File 1
defmodule A do
  def calculate(x, y) do
    x + y * 2
  end
end

# File 2 (Type I clone)
defmodule A do
  def calculate(x, y) do
    x + y * 2
  end
end
```

#### Type II: Renamed Clones
Same structure with different identifiers.

```elixir
# File 1
defmodule A do
  def process(data, options) do
    Map.put(data, :result, options.value)
  end
end

# File 2 (Type II clone)
defmodule A do
  def process(input, config) do
    Map.put(input, :result, config.value)
  end
end
```

#### Type III: Near-Miss Clones
Similar structure with minor modifications.

```elixir
# File 1
defmodule A do
  def process(x) do
    result = x * 10
    result + 100
  end
end

# File 2 (Type III clone)
defmodule A do
  def process(x) do
    result = x * 10
    result + 200  # Different constant
  end
end
```

#### Type IV: Semantic Clones
Different syntax, same behavior.

```elixir
# File 1
def sum_list(items) do
  Enum.reduce(items, 0, &+/2)
end

# File 2 (Type IV clone)
def sum_list(items) do
  items |> Enum.sum()
end
```

### API Usage

#### Detect Duplicates Between Two Files

```elixir
alias Ragex.Analysis.Duplication

# Basic usage
{:ok, result} = Duplication.detect_between_files("lib/a.ex", "lib/b.ex")

if result.duplicate? do
  IO.puts("Found #{result.clone_type} clone")
  IO.puts("Similarity: #{result.similarity_score}")
end

# With options
{:ok, result} = Duplication.detect_between_files(
  "lib/a.ex", 
  "lib/b.ex",
  threshold: 0.9  # Stricter matching
)
```

#### Detect Duplicates Across Multiple Files

```elixir
files = ["lib/a.ex", "lib/b.ex", "lib/c.ex"]
{:ok, clones} = Duplication.detect_in_files(files)

Enum.each(clones, fn clone ->
  IO.puts("#{clone.file1} <-> #{clone.file2}")
  IO.puts("  Type: #{clone.clone_type}")
  IO.puts("  Similarity: #{clone.similarity}")
end)
```

#### Scan Directory for Duplicates

```elixir
# Recursive scan with defaults
{:ok, clones} = Duplication.detect_in_directory("lib/")

# Custom options
{:ok, clones} = Duplication.detect_in_directory("lib/", 
  recursive: true,
  threshold: 0.8,
  exclude_patterns: ["_build", "deps", ".git", "test"]
)

IO.puts("Found #{length(clones)} duplicate pairs")
```

#### Embedding-Based Similarity

```elixir
# Find similar functions using embeddings
{:ok, similar} = Duplication.find_similar_functions(
  threshold: 0.95,  # High similarity
  limit: 20,
  node_type: :function
)

Enum.each(similar, fn pair ->
  IO.puts("#{inspect(pair.function1)} ~ #{inspect(pair.function2)}")
  IO.puts("  Similarity: #{pair.similarity}")
  IO.puts("  Method: #{pair.method}")  # :embedding
end)
```

#### Generate Comprehensive Report

```elixir
{:ok, report} = Duplication.generate_report("lib/", 
  include_embeddings: true,
  threshold: 0.8
)

IO.puts(report.summary)
IO.puts("AST clones: #{report.ast_clones.total}")
IO.puts("Embedding similar: #{report.embedding_similar.total}")

# Access detailed data
report.ast_clones.by_type  # %{type_i: 5, type_ii: 3, ...}
report.ast_clones.pairs    # List of clone pairs
report.embedding_similar.pairs  # List of similar pairs
```

### MCP Tools

#### `find_duplicates`

Detect duplicates using AST-based analysis.

```json
{
  "name": "find_duplicates",
  "arguments": {
    "mode": "directory",
    "path": "lib/",
    "threshold": 0.8,
    "format": "detailed"
  }
}
```

**Modes:**
- `"directory"` - Scan entire directory
- `"files"` - Compare specific files (provide `file1` and `file2`)

**Formats:**
- `"summary"` - Brief overview
- `"detailed"` - Full clone information
- `"json"` - Machine-readable JSON

#### `find_similar_code`

Find semantically similar code using embeddings.

```json
{
  "name": "find_similar_code",
  "arguments": {
    "threshold": 0.95,
    "limit": 20,
    "format": "summary"
  }
}
```

## Dead Code Detection

Ragex provides two types of dead code detection:

### 1. Interprocedural (Graph-Based)

Detects unused functions by analyzing the call graph.

```elixir
alias Ragex.Analysis.DeadCode

# Find unused public functions
{:ok, unused_exports} = DeadCode.find_unused_exports()
# Returns: [{:function, ModuleName, :function_name, arity}, ...]

# Find unused private functions
{:ok, unused_private} = DeadCode.find_unused_private()

# Find unused modules
{:ok, unused_modules} = DeadCode.find_unused_modules()

# Generate removal suggestions
{:ok, suggestions} = DeadCode.removal_suggestions(confidence_threshold: 0.8)
```

### 2. Intraprocedural (AST-Based via Metastatic)

Detects unreachable code patterns within functions.

```elixir
# Analyze single file
{:ok, patterns} = DeadCode.analyze_file("lib/my_module.ex")

Enum.each(patterns, fn pattern ->
  IO.puts("#{pattern.type}: Line #{pattern.line}")
  IO.puts("  #{pattern.description}")
end)

# Analyze directory
{:ok, results} = DeadCode.analyze_files("lib/")
# Returns: Map of file paths to dead code patterns
```

**Detected Patterns:**
- Unreachable code after an early exit (`return`, `raise`, `throw`, depending on the language)
- Constant conditions (always true/false)
- Unused variables
- Dead branches
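
For illustration, here is a contrived function containing three of the patterns above. This is example *input* for the analyzer, not Ragex code:

```elixir
defmodule DeadCodeExample do
  def check(x) do
    # Unused variable: `doubled` is never read.
    doubled = x * 2

    # Constant condition: always true, so the else branch is dead.
    if 1 == 1 do
      :ok
    else
      # Dead branch: this code is unreachable.
      :never
    end
  end
end
```

Running `DeadCode.analyze_file/1` on a file like this should report all three patterns with their line numbers.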

### MCP Tools

#### `find_dead_code`

Graph-based unused function detection.

```json
{
  "name": "find_dead_code",
  "arguments": {
    "confidence_threshold": 0.8,
    "include_private": true,
    "format": "detailed"
  }
}
```

#### `analyze_dead_code_patterns`

AST-based unreachable code detection.

```json
{
  "name": "analyze_dead_code_patterns",
  "arguments": {
    "path": "lib/my_module.ex",
    "format": "json"
  }
}
```

## Dependency Analysis

Analyze module dependencies and coupling.

### Finding Circular Dependencies

```elixir
alias Ragex.Analysis.DependencyGraph

# Find all circular dependencies
{:ok, cycles} = DependencyGraph.find_cycles()

Enum.each(cycles, fn cycle ->
  IO.puts("Cycle: #{inspect(cycle)}")
end)
```

### Coupling Metrics

```elixir
# Calculate coupling for a module
metrics = DependencyGraph.coupling_metrics(MyModule)

IO.puts("Afferent coupling: #{metrics.afferent}")  # Incoming deps
IO.puts("Efferent coupling: #{metrics.efferent}")  # Outgoing deps
IO.puts("Instability: #{metrics.instability}")     # 0.0 to 1.0
```

**Instability** = efferent / (afferent + efferent)
- 0.0 = Stable (many dependents, few dependencies)
- 1.0 = Unstable (few dependents, many dependencies)
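
A standalone sketch of the formula (illustrative only, not part of the Ragex API):

```elixir
defmodule InstabilityExample do
  # Instability: ratio of outgoing (efferent) coupling to total coupling.
  def instability(afferent, efferent) when afferent + efferent > 0 do
    efferent / (afferent + efferent)
  end

  # An uncoupled module is treated as maximally stable here (an assumption).
  def instability(_afferent, _efferent), do: 0.0
end

IO.inspect(InstabilityExample.instability(3, 1))  # 0.25 -> fairly stable
```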

### Finding God Modules

```elixir
# Modules with high coupling
{:ok, god_modules} = DependencyGraph.find_god_modules(threshold: 10)
```

### MCP Tools

#### `analyze_dependencies`

```json
{
  "name": "analyze_dependencies",
  "arguments": {
    "module": "MyModule",
    "include_transitive": true
  }
}
```

#### `find_circular_dependencies`

```json
{
  "name": "find_circular_dependencies",
  "arguments": {
    "min_cycle_length": 2
  }
}
```

#### `coupling_report`

```json
{
  "name": "coupling_report",
  "arguments": {
    "format": "json",
    "sort_by": "instability"
  }
}
```

## Impact Analysis

Predict the impact of code changes before making them, using graph traversal and metrics.

### Overview

Impact Analysis answers critical questions:
- Which code will be affected by this change?
- Which tests need to run?
- How risky is this refactoring?
- How much effort will this take?

**Key Features:**
- Graph-based call chain analysis
- Risk scoring (importance + coupling + complexity)
- Effort estimation for refactoring operations
- Test discovery

### Analyzing Change Impact

```elixir
alias Ragex.Analysis.Impact

# Analyze impact of changing a function
{:ok, analysis} = Impact.analyze_change({:function, MyModule, :process, 2})

IO.puts("Direct callers: #{length(analysis.direct_callers)}")
IO.puts("Total affected: #{analysis.affected_count}")
IO.puts("Risk score: #{analysis.risk_score}")
IO.puts("Importance: #{analysis.importance}")

# Show recommendations
Enum.each(analysis.recommendations, &IO.puts/1)
```

**Parameters:**
- `depth` - Maximum traversal depth (default: 5)
- `include_tests` - Include test files in analysis (default: true)
- `exclude_modules` - Modules to exclude from traversal

**Returns:**
- `target` - The node being analyzed
- `direct_callers` - Functions that directly call this
- `all_affected` - All reachable callers (transitive)
- `affected_count` - Total number of affected nodes
- `risk_score` - Overall risk (0.0 to 1.0)
- `importance` - PageRank-based importance
- `recommendations` - Actionable advice

### Finding Affected Tests

```elixir
# Find tests that will be affected by changing this function
{:ok, tests} = Impact.find_affected_tests({:function, MyModule, :process, 2})

IO.puts("#{length(tests)} tests affected")

Enum.each(tests, fn {:function, module, name, arity} ->
  IO.puts("  - #{inspect(module)}.#{name}/#{arity}")
end)
```

**Custom Test Patterns:**
```elixir
# Support non-standard test naming (e.g., specs)
{:ok, tests} = Impact.find_affected_tests(
  {:function, MyModule, :process, 2},
  test_patterns: ["Spec", "Test", "_test"]
)
```

### Estimating Refactoring Effort

```elixir
# Estimate effort for rename operation
{:ok, estimate} = Impact.estimate_effort(
  :rename_function,
  {:function, MyModule, :old_name, 2}
)

IO.puts("Operation: #{estimate.operation}")
IO.puts("Changes needed: #{estimate.estimated_changes} locations")
IO.puts("Complexity: #{estimate.complexity}")
IO.puts("Time estimate: #{estimate.estimated_time}")

# Review risks
IO.puts("\nRisks:")
Enum.each(estimate.risks, fn risk ->
  IO.puts("  - #{risk}")
end)

# Review recommendations
IO.puts("\nRecommendations:")
Enum.each(estimate.recommendations, fn rec ->
  IO.puts("  - #{rec}")
end)
```

**Supported Operations:**
- `:rename_function` - Rename a function
- `:rename_module` - Rename a module
- `:extract_function` - Extract code into new function
- `:inline_function` - Inline a function
- `:move_function` - Move function to another module
- `:change_signature` - Change function signature

**Complexity Levels:**
- `:low` - < 5 affected locations (< 30 min)
- `:medium` - 5-20 locations (30 min - 2 hours)
- `:high` - 20-50 locations (2-4 hours)
- `:very_high` - 50+ locations (1+ day)
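
The mapping above can be sketched as a small function (illustrative; the exact boundary handling at 5, 20, and 50 is an assumption):

```elixir
defmodule ComplexityExample do
  # Maps an affected-location count to the documented complexity level.
  def complexity(locations) when locations < 5, do: :low
  def complexity(locations) when locations <= 20, do: :medium
  def complexity(locations) when locations <= 50, do: :high
  def complexity(_locations), do: :very_high
end
```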

### Risk Assessment

```elixir
# Calculate risk score for a change
{:ok, risk} = Impact.risk_score({:function, MyModule, :critical_fn, 1})

IO.puts("Target: #{inspect(risk.target)}")
IO.puts("Overall risk: #{risk.overall} (#{risk.level})")
IO.puts("\nComponents:")
IO.puts("  Importance: #{risk.importance}")  # PageRank-based
IO.puts("  Coupling: #{risk.coupling}")      # edge count
IO.puts("  Complexity: #{risk.complexity}")  # code metrics
```

**Risk Levels:**
- `:low` - Overall < 0.3 (safe to change)
- `:medium` - 0.3 ≤ Overall < 0.6 (needs review)
- `:high` - 0.6 ≤ Overall < 0.8 (risky, comprehensive testing)
- `:critical` - Overall ≥ 0.8 (very risky, plan carefully)

**Risk Components:**
1. **Importance** - Based on PageRank (how central in the call graph)
2. **Coupling** - Number of incoming/outgoing edges (normalized)
3. **Complexity** - Code complexity metrics (if available)
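
A sketch of how these components might combine into an overall score. Equal weighting is an assumption here (Ragex's actual weighting may differ); only the level thresholds come from the documentation above:

```elixir
defmodule RiskExample do
  # Combine the three normalized components. Equal weighting is assumed.
  def overall(importance, coupling, complexity) do
    (importance + coupling + complexity) / 3
  end

  # Bucket the overall score into the documented risk levels.
  def level(overall) when overall < 0.3, do: :low
  def level(overall) when overall < 0.6, do: :medium
  def level(overall) when overall < 0.8, do: :high
  def level(_overall), do: :critical
end
```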

### MCP Tools

#### `analyze_impact`

Analyze the impact of changing a function or module.

```json
{
  "name": "analyze_impact",
  "arguments": {
    "target": "MyModule.process/2",
    "depth": 5,
    "include_tests": true,
    "format": "detailed"
  }
}
```

**Target Formats:**
- `"Module.function/arity"` - Specific function
- `"Module"` - Entire module
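
These target strings correspond to the tuples used by the Elixir API. A sketch of the mapping (the parser itself is illustrative, not part of Ragex):

```elixir
defmodule TargetExample do
  # "Module.function/arity" -> {:function, Module, :function, arity}
  # "Module"                -> {:module, Module}
  def parse(target) do
    case Regex.run(~r/^(.+)\.([a-z_][\w?!]*)\/(\d+)$/, target) do
      [_, module, fun, arity] ->
        {:function, Module.concat([module]), String.to_atom(fun),
         String.to_integer(arity)}

      nil ->
        {:module, Module.concat([target])}
    end
  end
end
```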

#### `estimate_refactoring_effort`

Estimate effort for a refactoring operation.

```json
{
  "name": "estimate_refactoring_effort",
  "arguments": {
    "operation": "rename_function",
    "target": "MyModule.old_name/2",
    "format": "summary"
  }
}
```

**Operations:** `rename_function`, `rename_module`, `extract_function`, `inline_function`, `move_function`, `change_signature`

#### `risk_assessment`

Calculate risk score for a change.

```json
{
  "name": "risk_assessment",
  "arguments": {
    "target": "MyModule.critical/1",
    "format": "detailed"
  }
}
```

### Workflow Example

**Before Refactoring:**

```elixir
# Step 1: Analyze impact
{:ok, impact} = Impact.analyze_change({:function, MyModule, :old_name, 2})

if impact.affected_count > 20 do
  IO.puts("Warning: Large impact (#{impact.affected_count} locations)")
end

# Step 2: Find affected tests
{:ok, tests} = Impact.find_affected_tests({:function, MyModule, :old_name, 2})
IO.puts("Tests to run: #{length(tests)}")

# Step 3: Estimate effort
{:ok, estimate} = Impact.estimate_effort(
  :rename_function, 
  {:function, MyModule, :old_name, 2}
)
IO.puts("Estimated time: #{estimate.estimated_time}")

# Step 4: Assess risk
{:ok, risk} = Impact.risk_score({:function, MyModule, :old_name, 2})

case risk.level do
  :low -> IO.puts("✓ Safe to proceed")
  :medium -> IO.puts("⚠ Review carefully")
  :high -> IO.puts("⚠ High risk - thorough testing required")
  :critical -> IO.puts("❌ Critical risk - consider alternative approach")
end

# Step 5: Proceed with refactoring if acceptable
if risk.level in [:low, :medium] do
  # Run refactoring
  # Run affected tests
  # Commit changes
end
```

### Best Practices

1. **Always analyze before refactoring** - Know the scope of changes
2. **Check risk levels** - Don't proceed with critical-risk changes without planning
3. **Run affected tests** - Use test discovery to optimize CI time
4. **Review transitive callers** - Indirect impacts can be significant
5. **Consider alternatives** - High-risk operations may have safer approaches
6. **Document high-impact changes** - Leave notes for future maintainers
7. **Use depth wisely** - Deep traversal (depth > 10) can be expensive
8. **Exclude test files for production impact** - Use `include_tests: false`

### Limitations

**Current limitations:**
- Dynamic function calls (apply, send) not fully tracked
- Macros may not be accurately analyzed
- Cross-module dependencies require full analysis
- Complexity metrics require quality analysis to be run first

**Workarounds:**
- Run comprehensive analysis before impact analysis
- Manually review dynamic call sites
- Use conservative estimates for macro-heavy code
- Lower confidence scores indicate potential dynamic usage

## MCP Tools Reference

### Summary of All Analysis Tools

| Tool | Purpose | Analysis Type |
|------|---------|---------------|
| `find_duplicates` | Code duplication detection | AST (Metastatic) |
| `find_similar_code` | Semantic similarity | Embedding |
| `find_dead_code` | Unused functions | Graph |
| `analyze_dead_code_patterns` | Unreachable code | AST (Metastatic) |
| `analyze_dependencies` | Module dependencies | Graph |
| `find_circular_dependencies` | Circular deps | Graph |
| `coupling_report` | Coupling metrics | Graph |
| `analyze_impact` | Change impact analysis | Graph |
| `estimate_refactoring_effort` | Effort estimation | Graph + Metrics |
| `risk_assessment` | Risk scoring | Graph + PageRank |

### Common Parameters

**Formats:**
- `"summary"` - Brief, human-readable
- `"detailed"` - Complete information
- `"json"` - Machine-readable JSON

**Thresholds:**
- Duplication: 0.8-0.95 (higher = stricter)
- Similarity: 0.9-0.99 (higher = more similar)
- Confidence: 0.7-0.9 (higher = more certain)

## Best Practices

### Duplication Detection

1. **Start with high thresholds** (0.9+) to find obvious duplicates
2. **Lower gradually** to find near-misses
3. **Review Type II/III clones carefully** - they may be intentional
4. **Use embedding-based search** for conceptual similarity
5. **Exclude build artifacts** - always exclude `_build`, `deps`, etc.

### Dead Code Detection

1. **Check confidence scores** - low confidence may indicate dynamic calls
2. **Review entry points** - callbacks, GenServer handlers, etc. may not appear in the call graph
3. **Combine both approaches** - graph-based for unused functions, AST-based for unreachable code
4. **Run regularly** - integrate into CI/CD pipeline
5. **Keep a whitelist** of intentionally unused functions (e.g., API compatibility)
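
A whitelist can be a simple set of the `{:function, module, name, arity}` tuples the API returns. A sketch (the module and function names here are hypothetical):

```elixir
# Functions kept intentionally, e.g. for public API compatibility.
whitelist = MapSet.new([
  {:function, MyApp.Api, :legacy_endpoint, 1}
])

# Drop whitelisted entries from a dead-code result list.
filter_whitelisted = fn unused ->
  Enum.reject(unused, &MapSet.member?(whitelist, &1))
end
```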

### Dependency Analysis

1. **Monitor instability** - high instability modules are risky to change
2. **Break circular dependencies** - they indicate poor separation of concerns
3. **Watch for God modules** - high coupling suggests need for refactoring
4. **Track trends over time** - coupling should decrease as code improves

### Performance Tips

1. **Use incremental analysis** - only analyze changed files
2. **Exclude test directories** for production analysis
3. **Limit depth** for transitive dependency analysis
4. **Cache results** - Ragex automatically caches embeddings
5. **Run in parallel** - analysis operations are concurrent-safe
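
For example, per-file analysis can be fanned out across cores with `Task.async_stream`. The `analyze` stand-in below is hypothetical (it just measures the path); substitute a real per-file call such as `DeadCode.analyze_file/1`:

```elixir
files = ["lib/a.ex", "lib/b.ex"]

# Stand-in for a real per-file analysis function.
analyze = fn path -> String.length(path) end

# Run one analysis task per scheduler; results come back in input order.
results =
  files
  |> Task.async_stream(fn path -> {path, analyze.(path)} end,
    max_concurrency: System.schedulers_online()
  )
  |> Enum.map(fn {:ok, result} -> result end)
```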

## Troubleshooting

### No Duplicates Found (Expected Some)

**Possible causes:**
- Threshold too high - try lowering to 0.7-0.8
- Files not in supported languages - check file extensions
- Structural differences too large - use embedding-based similarity

**Solutions:**
```elixir
# Try lower threshold
{:ok, clones} = Duplication.detect_in_directory("lib/", threshold: 0.7)

# Or use embedding-based similarity
{:ok, similar} = Duplication.find_similar_functions(threshold: 0.85)
```

### Too Many False Positives

**Possible causes:**
- Threshold too low
- Structural patterns common in the language (e.g., GenServer boilerplate)
- Short functions with similar structure

**Solutions:**
```elixir
# Increase threshold
{:ok, clones} = Duplication.detect_in_directory("lib/", threshold: 0.95)

# Filter by minimum size
clones
|> Enum.filter(fn clone -> 
  clone.details.locations 
  |> Enum.any?(fn loc -> loc.lines > 5 end)
end)
```

### Dead Code False Positives

**Possible causes:**
- Dynamic function calls (`apply/3`, `__MODULE__`)
- Reflection usage
- Entry points not in call graph (callbacks, tests)

**Solutions:**
1. Check confidence scores - low confidence = likely dynamic
2. Maintain whitelist of known entry points
3. Review before deletion

### Parse Errors

**Possible causes:**
- Invalid syntax in source files
- Unsupported language features
- Missing language parser

**Solutions:**
```elixir
# Check logs for specific parse errors
# Ragex logs warnings for unparseable files

# Exclude problematic files
{:ok, clones} = Duplication.detect_in_directory("lib/", 
  exclude_patterns: ["problem_file.ex"]
)
```

### Performance Issues

**Symptoms:**
- Slow analysis on large codebases
- Memory usage spikes

**Solutions:**
1. Analyze incrementally (changed files only)
2. Exclude large generated files
3. Use streaming for large result sets
4. Increase system resources

```elixir
# Analyze only changed files
changed_files = ["lib/a.ex", "lib/b.ex"]
{:ok, clones} = Duplication.detect_in_files(changed_files)
```

## Integration Examples

### CI/CD Pipeline

```bash
#!/bin/bash
# detect_issues.sh

# Find duplicates
echo "Checking for code duplication..."
mix ragex.analyze.duplicates --threshold 0.9 --format json > duplicates.json

# Find dead code
echo "Checking for dead code..."
mix ragex.analyze.dead_code --confidence 0.8 --format json > dead_code.json

# Check for circular dependencies
echo "Checking for circular dependencies..."
mix ragex.analyze.cycles --format json > cycles.json

# Fail if issues found. The reports are written even when clean, so check
# their contents rather than file size (this assumes each report is a JSON array).
issues=0
for report in duplicates.json dead_code.json cycles.json; do
  if jq -e 'length > 0' "$report" > /dev/null; then
    issues=1
  fi
done

if [ "$issues" -eq 1 ]; then
  echo "Code quality issues detected!"
  exit 1
fi
```

### Pre-commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit

# Get staged Elixir files
STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(ex|exs)$')

if [ -n "$STAGED_FILES" ]; then
  echo "Checking staged files for duplication..."
  mix ragex.analyze.duplicates --files $STAGED_FILES --threshold 0.95
fi
```

### Interactive Analysis

```elixir
# In IEx
alias Ragex.Analysis.Duplication

# Generate report
{:ok, report} = Duplication.generate_report("lib/")

# Display summary
IO.puts(report.summary)

# Investigate specific clones
report.ast_clones.pairs
|> Enum.filter(&(&1.clone_type == :type_i))
|> Enum.each(fn clone ->
  IO.puts("\n#{clone.file1} <-> #{clone.file2}")
  IO.puts("  #{clone.details.summary}")
end)
```

## Further Reading

- [Metastatic Documentation](https://github.com/oeditus/metastatic)

---

**Version:** Ragex 0.2.0
**Last Updated:** January 2026
**Status:** Production Ready