# Portfolio Index
<p align="center">
<img src="assets/portfolio_index.svg" alt="Portfolio Index Logo" width="200">
</p>
<p align="center">
<a href="https://hex.pm/packages/portfolio_index"><img alt="Hex.pm" src="https://img.shields.io/hexpm/v/portfolio_index.svg"></a>
<a href="https://hexdocs.pm/portfolio_index"><img alt="Documentation" src="https://img.shields.io/badge/docs-hexdocs-purple.svg"></a>
<a href="https://github.com/nshkrdotcom/portfolio_index/actions"><img alt="Build Status" src="https://img.shields.io/github/actions/workflow/status/nshkrdotcom/portfolio_index/ci.yml"></a>
<a href="https://opensource.org/licenses/MIT"><img alt="License" src="https://img.shields.io/hexpm/l/portfolio_index.svg"></a>
</p>
**Production adapters and pipelines for the PortfolioCore hexagonal architecture. Vector stores, graph databases, embedders, Broadway pipelines, and advanced RAG strategies.**
---
## Overview
Portfolio Index implements the port specifications defined in [Portfolio Core](https://github.com/nshkrdotcom/portfolio_core), providing:
- **Vector Store Adapters** - pgvector (PostgreSQL)
- **Graph Store Adapters** - Neo4j via boltx
- **Embedding Providers** - Google Gemini
- **LLM Providers** - Google Gemini (gemini-2.5-flash)
- **Broadway Pipelines** - Ingestion and embedding with backpressure
- **RAG Strategies** - Hybrid (Reciprocal Rank Fusion), Self-RAG (self-critique)
## Prerequisites
### PostgreSQL with pgvector
```bash
# Ubuntu/WSL
sudo apt install postgresql postgresql-contrib libpq-dev postgresql-16-pgvector
# Create database
createdb portfolio_index_dev
```
### Neo4j
```bash
# Install via apt (Ubuntu/WSL)
curl -fsSL https://debian.neo4j.com/neotechnology.gpg.key | \
  sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/neo4j.gpg
echo "deb https://debian.neo4j.com stable latest" | \
  sudo tee /etc/apt/sources.list.d/neo4j.list
sudo apt update && sudo apt install neo4j
# Set the initial password (only takes effect before the database is first started)
sudo neo4j-admin dbms set-initial-password password
# Start service
sudo systemctl enable neo4j && sudo systemctl start neo4j
```
**Access Points:**
| Service | URL | Credentials |
|---------------|-------------------------|--------------------|
| Neo4j Browser | http://localhost:7474 | neo4j / password |
| Bolt endpoint | bolt://localhost:7687 | neo4j / password |
### Gemini API Key
```bash
export GEMINI_API_KEY="your-api-key"
```
## Installation
Add `portfolio_index` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:portfolio_index, "~> 0.1.0"}
  ]
end
```
Then run:
```bash
mix deps.get
mix ecto.create
mix ecto.migrate
```
## Quick Start
### Vector Search
```elixir
alias PortfolioIndex.Adapters.VectorStore.Pgvector
alias PortfolioIndex.Adapters.Embedder.Gemini
# Create index
:ok = Pgvector.create_index("docs", %{dimensions: 768, metric: :cosine})
# Generate embedding and store
{:ok, %{vector: vec}} = Gemini.embed("Hello, world!")
:ok = Pgvector.store("docs", "doc_1", vec, %{content: "Hello, world!"})
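# Embed the query text, then search for the 10 nearest neighbors
{:ok, %{vector: query_vector}} = Gemini.embed("greeting")
{:ok, results} = Pgvector.search("docs", query_vector, 10, [])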
```
### Graph Operations
```elixir
alias PortfolioIndex.Adapters.GraphStore.Neo4j
# Create a graph namespace
:ok = Neo4j.create_graph("knowledge", %{})
# Create nodes
{:ok, node1} = Neo4j.create_node("knowledge", %{
  labels: ["Concept"],
  properties: %{name: "Elixir", type: "language"}
})
{:ok, node2} = Neo4j.create_node("knowledge", %{
  labels: ["Concept"],
  properties: %{name: "GenServer", type: "behaviour"}
})
# Create relationship
{:ok, _edge} = Neo4j.create_edge("knowledge", %{
  from_id: node1.id,
  to_id: node2.id,
  type: "HAS_FEATURE",
  properties: %{since: "1.0"}
})
# Query neighbors
{:ok, neighbors} = Neo4j.get_neighbors("knowledge", node1.id, direction: :outgoing)
```
### RAG Query
```elixir
alias PortfolioIndex.RAG.Strategies.Hybrid
{:ok, result} = Hybrid.retrieve(
  "How does authentication work?",
  %{index_id: "docs"},
  k: 10
)
# result.items contains ranked results
# result.timing_ms contains query duration
```
### Self-RAG with Critique
```elixir
alias PortfolioIndex.RAG.Strategies.SelfRAG
{:ok, result} = SelfRAG.retrieve(
  "What is GenServer?",
  %{index_id: "docs"},
  k: 5, min_critique_score: 3
)
# result.answer contains the generated answer
# result.critique contains relevance/support/completeness scores
```
### Broadway Pipeline
```elixir
# Start ingestion pipeline
{:ok, _} = PortfolioIndex.Pipelines.Ingestion.start(
  paths: ["/path/to/docs"],
  patterns: ["**/*.md", "**/*.ex"],
  index_id: "my_index",
  chunk_size: 1000,
  chunk_overlap: 200
)
# Start embedding pipeline
{:ok, _} = PortfolioIndex.Pipelines.Embedding.start(
  index_id: "my_index",
  rate_limit: 100,
  batch_size: 50
)
```
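To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It assumes both values are counted in characters (graphemes), which this README does not state, and it is not the format-aware `Recursive` chunker; `ChunkSketch` is a hypothetical module used only for illustration.

```elixir
defmodule ChunkSketch do
  @moduledoc "Illustrative only: fixed-size windows that share `overlap` graphemes."

  # Each window holds at most `size` graphemes and starts `size - overlap`
  # graphemes after the previous one.
  def chunk(text, size, overlap) when size > 0 and overlap >= 0 and overlap < size do
    text
    |> String.graphemes()
    |> windows(size, size - overlap)
  end

  defp windows([], _size, _step), do: []

  defp windows(graphemes, size, step) do
    {window, rest} = Enum.split(graphemes, size)
    chunk = Enum.join(window)

    case rest do
      # The window reached the end of the text, so this is the last chunk.
      [] -> [chunk]
      _ -> [chunk | windows(Enum.drop(graphemes, step), size, step)]
    end
  end
end

# Windows of 4 characters, each sharing 1 character with its predecessor.
chunks = ChunkSketch.chunk("abcdefghij", 4, 1)
IO.inspect(chunks)
```

With `chunk_size: 1000` and `chunk_overlap: 200`, the same idea would start a new chunk every 800 characters, so text near a chunk boundary appears in both neighbouring chunks.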
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection URL | - |
| `NEO4J_URI` | Neo4j Bolt URI | `bolt://localhost:7687` |
| `NEO4J_USER` | Neo4j username | `neo4j` |
| `NEO4J_PASSWORD` | Neo4j password | - |
| `GEMINI_API_KEY` | Google Gemini API key | - |
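If your application needs to wire these variables in itself, one option is a `config/runtime.exs` along the following lines. This is a sketch under assumptions: the README does not show how the library reads these variables, so the mapping below (including the `url:` Repo option and the reuse of the `:boltx` keys from the config examples) is illustrative rather than the library's own configuration.

```elixir
# config/runtime.exs (sketch; the key mapping is an assumption)
import Config

# Ecto repos accept a full connection URL through the :url option.
config :portfolio_index, PortfolioIndex.Repo,
  url: System.fetch_env!("DATABASE_URL")

# Defaults mirror the table above; variables without a default are required.
config :boltx, Boltx,
  uri: System.get_env("NEO4J_URI", "bolt://localhost:7687"),
  auth: [
    username: System.get_env("NEO4J_USER", "neo4j"),
    password: System.fetch_env!("NEO4J_PASSWORD")
  ]
```

`System.fetch_env!/1` raises at boot when a required variable is missing, which surfaces misconfiguration earlier than a failed connection would.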
### Config Files
```elixir
# config/dev.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_dev"
config :boltx, Boltx,
  uri: "bolt://localhost:7687",
  auth: [username: "neo4j", password: "password"],
  pool_size: 10
```
## Adapters
### Vector Store
| Adapter | Backend | Features |
|---------|---------|----------|
| `Pgvector` | PostgreSQL + pgvector | IVFFlat, HNSW indexes, cosine/euclidean/dot_product |
### Graph Store
| Adapter | Backend | Features |
|---------|---------|----------|
| `Neo4j` | Neo4j via boltx | Multi-graph isolation, Cypher queries |
### Embedder
| Adapter | Provider | Model |
|---------|----------|-------|
| `Gemini` | Google | text-embedding-004 (768 dims) |
### LLM
| Adapter | Provider | Model |
|---------|----------|-------|
| `Gemini` | Google | gemini-2.5-flash |
### Chunker
| Adapter | Strategy | Features |
|---------|----------|----------|
| `Recursive` | Recursive text splitting | Format-aware (markdown, code, plain) |
## RAG Strategies
| Strategy | Description |
|----------|-------------|
| `Hybrid` | Vector + keyword search with Reciprocal Rank Fusion |
| `SelfRAG` | Retrieval with self-critique and answer refinement |
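Reciprocal Rank Fusion merges several ranked lists by giving each document a score of `1 / (k + rank)` per list and summing the scores. The sketch below shows the general technique only; `RRFSketch` is a hypothetical module, and `k = 60` is the constant commonly used for RRF, not a value confirmed for the `Hybrid` strategy.

```elixir
defmodule RRFSketch do
  @moduledoc "Illustrative Reciprocal Rank Fusion over lists of document ids."

  # `rankings` is a list of ranked id lists, best match first.
  # Returns `{id, score}` tuples, highest fused score first.
  def fuse(rankings, k \\ 60) do
    rankings
    |> Enum.flat_map(fn ranking ->
      ranking
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1.0 / (k + rank)} end)
    end)
    |> Enum.reduce(%{}, fn {id, score}, acc ->
      Map.update(acc, id, score, &(&1 + score))
    end)
    # Sort by descending score; ties fall back to the id for a stable order.
    |> Enum.sort_by(fn {id, score} -> {-score, id} end)
  end
end

vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = RRFSketch.fuse([vector_hits, keyword_hits])
IO.inspect(Enum.map(fused, &elem(&1, 0)))
```

Because only ranks are used, the vector and keyword scores never need to be on the same scale, which is the main reason RRF is a common choice for hybrid retrieval.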
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       Portfolio Index                       │
├─────────────────────────────────────────────────────────────┤
│  Adapters                                                   │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐    │
│  │ Vector Store  │  │ Graph Store   │  │ Embedder      │    │
│  │ • Pgvector    │  │ • Neo4j       │  │ • Gemini      │    │
│  └───────────────┘  └───────────────┘  └───────────────┘    │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐    │
│  │ LLM           │  │ Chunker       │  │ Document Store│    │
│  │ • Gemini      │  │ • Recursive   │  │ • Postgres    │    │
│  └───────────────┘  └───────────────┘  └───────────────┘    │
├─────────────────────────────────────────────────────────────┤
│  Pipelines (Broadway)                                       │
│  ┌───────────────────────────┐ ┌───────────────────────────┐│
│  │ Ingestion                 │ │ Embedding                 ││
│  │ FileProducer → Chunker    │ │ ETSProducer → VectorStore ││
│  └───────────────────────────┘ └───────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│  RAG Strategies                                             │
│  ┌────────────────────┐  ┌────────────────────┐             │
│  │ Hybrid             │  │ Self-RAG           │             │
│  │ Vector + RRF fusion│  │ Critique + Refine  │             │
│  └────────────────────┘  └────────────────────┘             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Portfolio Core                        │
│              (Port Specifications & Registry)               │
└─────────────────────────────────────────────────────────────┘
```
## Testing
```bash
# Run unit tests (mocked adapters)
mix test
# Run integration tests (requires running services)
mix test --include integration
# Run only Neo4j integration tests
mix test test/adapters/graph_store/neo4j_test.exs --include integration
# Run only Pgvector integration tests
mix test test/adapters/vector_store/pgvector_test.exs --include integration
```
### Test Structure
The test suite separates **unit tests** (mocked, fast) from **integration tests** (live services):
| Test Type | Tag | Services Required | Run Command |
|-----------|-----|-------------------|-------------|
| Unit | (default) | None | `mix test` |
| Integration | `@tag :integration` | Neo4j, PostgreSQL | `mix test --include integration` |
Integration tests are **excluded by default** in `test/test_helper.exs`:
```elixir
ExUnit.start(exclude: [:integration, :skip])
```
### Test Fixtures
`test/support/fixtures.ex` provides test data generators:
```elixir
alias PortfolioIndex.Fixtures
# Vector fixtures
Fixtures.random_vector(768) # Random 768-dim vector
Fixtures.random_normalized_vector(768) # Normalized (unit length)
# Graph fixtures
Fixtures.sample_node("node_1") # %{id, labels, properties}
Fixtures.sample_edge("from", "to") # %{id, type, from_id, to_id, properties}
Fixtures.sample_graph(5) # %{nodes: [...], edges: [...]}
# Document fixtures
Fixtures.sample_document() # Sample markdown content
Fixtures.sample_code() # Sample Elixir code
Fixtures.sample_chunks(content, 3) # Split content into chunks
```
## Neo4j Details
### Schema Management
Unlike SQL databases, Neo4j doesn't use traditional migrations. Instead, `PortfolioIndex.Adapters.GraphStore.Neo4j.Schema` provides schema management:
```elixir
alias PortfolioIndex.Adapters.GraphStore.Neo4j.Schema
# Setup all constraints and indexes
Schema.setup!()
# Check current schema version
Schema.version()
#=> 1
# Run migrations up to a specific version
Schema.migrate!(2)
# Reset database (DANGEROUS - testing only)
Schema.reset!()
# Clean a specific graph namespace
Schema.clean_graph!("my_graph")
```
### Schema Versioning
Schema versions are tracked in a `:SchemaVersion` node:
```cypher
(:SchemaVersion {id: 'current', version: 1, updated_at: datetime()})
```
Each migration is idempotent and can be re-run safely.
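The version gate behind this kind of scheme can be sketched in a few lines. `MigrationSketch`, the migration list, and the `run` callback are hypothetical and do not reflect the actual `Schema` module; in Cypher, statement-level idempotency typically comes from `IF NOT EXISTS` and `MERGE`, though whether this library relies on them is an assumption.

```elixir
defmodule MigrationSketch do
  @moduledoc "Illustrative version gate; not the library's Schema module."

  # Runs every migration with current < version <= target in ascending
  # order and returns the resulting schema version.
  def migrate(migrations, current, target, run) do
    migrations
    |> Enum.filter(fn {version, _statement} -> version > current and version <= target end)
    |> Enum.sort_by(fn {version, _statement} -> version end)
    |> Enum.reduce(current, fn {version, statement}, _previous ->
      run.(statement)
      version
    end)
  end
end

# Placeholder descriptions stand in for real Cypher statements.
migrations = [
  {1, "migration 1: constraints created with IF NOT EXISTS"},
  {2, "migration 2: version node updated with MERGE"}
]

run = fn statement -> IO.puts("would run: " <> statement) end

version = MigrationSketch.migrate(migrations, 0, 2, run)
# A second call from the new version finds nothing left to apply.
^version = MigrationSketch.migrate(migrations, version, 2, run)
```

Two layers keep re-runs safe in this sketch: the gate skips versions that were already applied, and each statement is itself written so that applying it twice leaves the database unchanged.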
### Constraints and Indexes
The schema setup creates:
| Type | Name | Description |
|------|------|-------------|
| Constraint | `node_id_unique` | Unique node IDs within a graph |
| Constraint | `edge_id_unique` | Unique edge IDs within a graph |
| Index | `idx_node_graph_id` | Fast graph isolation queries |
| Index | `idx_node_labels` | Label-based queries |
| Index | `idx_fulltext_content` | Full-text search on content/name/title |
### Multi-Graph Isolation
All nodes and edges include a `_graph_id` property for namespace isolation:
```elixir
# Each graph is isolated by its graph_id
Neo4j.create_graph("project_a", %{})
Neo4j.create_graph("project_b", %{})
# Nodes in different graphs don't interfere
Neo4j.create_node("project_a", %{labels: ["File"], properties: %{path: "/app.ex"}})
Neo4j.create_node("project_b", %{labels: ["File"], properties: %{path: "/app.ex"}})
# Queries are scoped to a graph
Neo4j.get_neighbors("project_a", node_id, direction: :outgoing)
```
The underlying Cypher uses `_graph_id` for isolation:
```cypher
MATCH (n {id: $node_id, _graph_id: $graph_id})
RETURN n, labels(n) as labels
```
### Custom Cypher Queries
Execute arbitrary Cypher with automatic graph_id injection:
```elixir
cypher = """
MATCH (p:Person {_graph_id: $graph_id})
WHERE p.age > $min_age
RETURN p.name AS name, p.age AS age
ORDER BY p.age DESC
"""
{:ok, result} = Neo4j.query("my_graph", cypher, %{min_age: 25})
# result.records contains [%{"name" => "Alice", "age" => 30}, ...]
```
Both `$graph_id` and `$_graph_id` are available in queries.
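Conceptually, the injection amounts to merging the graph id into the caller's parameter map under both names before the query runs. The following is a minimal sketch of that idea; `GraphParamsSketch` is hypothetical, and the adapter's actual implementation is not shown in this README.

```elixir
defmodule GraphParamsSketch do
  @moduledoc "Illustrative only: adds both spellings of the graph id to query params."

  # Injected keys win over caller-supplied ones, so a query cannot be
  # pointed at another graph by passing its own :graph_id.
  def inject(graph_id, params) when is_binary(graph_id) and is_map(params) do
    Map.merge(params, %{graph_id: graph_id, _graph_id: graph_id})
  end
end

params = GraphParamsSketch.inject("my_graph", %{min_age: 25})
IO.inspect(params)
```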
### Boltx Driver
This adapter uses [boltx](https://hex.pm/packages/boltx) (v0.0.6+) for Neo4j connectivity:
```elixir
# config/dev.exs
config :boltx, Boltx,
  uri: "bolt://localhost:7687",
  auth: [username: "neo4j", password: "password"],
  pool_size: 10,
  name: Boltx # Required for connection pool registration
```
### Neo4j Integration Tests
Integration tests create isolated graph namespaces per test:
```elixir
defmodule MyNeo4jTest do
  use ExUnit.Case, async: true
  alias PortfolioIndex.Adapters.GraphStore.Neo4j
  describe "my feature" do
    @tag :integration
    test "creates nodes" do
      # Create unique graph for this test
      graph_id = "test_#{System.unique_integer([:positive])}"
      :ok = Neo4j.create_graph(graph_id, %{})
      # Register cleanup up front so it runs even if an assertion fails
      on_exit(fn -> Neo4j.delete_graph(graph_id) end)
      # Test logic...
      {:ok, node} = Neo4j.create_node(graph_id, %{
        labels: ["Test"],
        properties: %{name: "example"}
      })
      assert is_binary(node.id)
    end
  end
end
```
### Telemetry Events
The Neo4j adapter emits telemetry events:
```elixir
# Event names
[:portfolio_index, :graph_store, :create_node]
[:portfolio_index, :graph_store, :create_edge]
[:portfolio_index, :graph_store, :query]
# Measurements
%{duration_ms: 5}
# Metadata
%{graph_id: "my_graph"}
```
Attach handlers for observability:
```elixir
# Logger.info/1 is a macro, so the calling module must require Logger
require Logger
:telemetry.attach(
  "neo4j-logger",
  [:portfolio_index, :graph_store, :query],
  fn _event, %{duration_ms: ms}, %{graph_id: id}, _config ->
    Logger.info("Neo4j query on #{id} took #{ms}ms")
  end,
  nil
)
```
## Pgvector Details
### PostgreSQL Setup
```bash
# Install PostgreSQL and pgvector extension
sudo apt install postgresql postgresql-contrib libpq-dev postgresql-16-pgvector
# Create database
createdb portfolio_index_dev
# Enable pgvector extension
psql -d portfolio_index_dev -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Run migrations
mix ecto.migrate
```
### Index Configuration
Create vector indexes with customizable parameters:
```elixir
alias PortfolioIndex.Adapters.VectorStore.Pgvector
# Basic index with defaults
:ok = Pgvector.create_index("docs", %{dimensions: 768})
# Cosine similarity with HNSW index
:ok = Pgvector.create_index("embeddings", %{
  dimensions: 768,
  metric: :cosine,
  index_type: :hnsw,
  options: %{m: 16, ef_construction: 64}
})
# Euclidean distance with IVFFlat index
:ok = Pgvector.create_index("images", %{
  dimensions: 512,
  metric: :euclidean,
  index_type: :ivfflat,
  options: %{lists: 100}
})
```
### Distance Metrics
| Metric | Operator | Use Case |
|--------|----------|----------|
| `:cosine` | `<=>` | Text embeddings, normalized vectors |
| `:euclidean` | `<->` | Image embeddings, spatial data |
| `:dot_product` | `<#>` (negative inner product) | Vectors that are already normalized; ranks the same as cosine |
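For reference, the three operators compute the following quantities. This is a plain-Elixir sketch of the math on lists of floats, not how pgvector or the adapter computes distances; `DistanceSketch` is a hypothetical module.

```elixir
defmodule DistanceSketch do
  @moduledoc "Illustrative versions of the quantities behind <=>, <-> and <#>."

  def dot(a, b) do
    a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
  end

  def norm(a), do: :math.sqrt(dot(a, a))

  # `<=>` cosine distance: 1 minus the cosine similarity.
  def cosine_distance(a, b), do: 1.0 - dot(a, b) / (norm(a) * norm(b))

  # `<->` Euclidean (L2) distance.
  def euclidean_distance(a, b) do
    a
    |> Enum.zip(b)
    |> Enum.reduce(0.0, fn {x, y}, acc -> acc + (x - y) * (x - y) end)
    |> :math.sqrt()
  end

  # `<#>` negative inner product, so that smaller still means closer.
  def negative_inner_product(a, b), do: -dot(a, b)
end

a = [1.0, 0.0]
b = [0.6, 0.8]

IO.inspect(DistanceSketch.cosine_distance(a, b), label: "cosine")
IO.inspect(DistanceSketch.euclidean_distance(a, b), label: "euclidean")
IO.inspect(DistanceSketch.negative_inner_product(a, b), label: "negative inner product")
```

For unit-length vectors the cosine distance equals `1 + negative_inner_product`, so both metrics produce the same ordering and the dot product avoids computing the norms.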
### Index Types
| Type | Description | Best For |
|------|-------------|----------|
| `:ivfflat` | Inverted file index | Large datasets, good recall |
| `:hnsw` | Hierarchical navigable small world | Fast queries, high recall |
| `:flat` | No index (exact search) | Small datasets, perfect accuracy |
### Vector Operations
```elixir
alias PortfolioIndex.Adapters.VectorStore.Pgvector
# Store a vector with metadata
:ok = Pgvector.store("docs", "doc_1", embedding_vector, %{
  source: "/path/to/file.md",
  title: "My Document",
  chunk_index: 0
})
# Batch store (more efficient)
items = [
  {"doc_1", vector1, %{source: "/a.md"}},
  {"doc_2", vector2, %{source: "/b.md"}},
  {"doc_3", vector3, %{source: "/c.md"}}
]
{:ok, 3} = Pgvector.store_batch("docs", items)
# Search with k nearest neighbors
{:ok, results} = Pgvector.search("docs", query_vector, 10, [])
# results = [%{id: "doc_1", score: 0.95, metadata: %{...}}, ...]
# Search with metadata filter
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  filter: %{source: "/a.md"}
)
# Search with minimum score threshold
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  min_score: 0.8
)
# Include vectors in results
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  include_vector: true
)
# Delete a vector
:ok = Pgvector.delete("docs", "doc_1")
# Get index statistics
{:ok, stats} = Pgvector.index_stats("docs")
# stats = %{count: 1000, dimensions: 768, metric: :cosine, size_bytes: ...}
# Check if index exists
Pgvector.index_exists?("docs") # => true or false
# Delete entire index
:ok = Pgvector.delete_index("docs")
```
### Table Structure
Each index creates a table with this schema:
```sql
CREATE TABLE vectors_<index_id> (
  id VARCHAR(255) PRIMARY KEY,
  embedding vector(<dimensions>),
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
```
Index metadata is tracked in the registry:
```sql
CREATE TABLE vector_index_registry (
  index_id VARCHAR(255) PRIMARY KEY,
  dimensions INTEGER NOT NULL,
  metric VARCHAR(50) NOT NULL,
  index_type VARCHAR(50) NOT NULL,
  options JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```
### Ecto Configuration
```elixir
# config/dev.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_dev",
  pool_size: 10
# config/test.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_test",
  pool: Ecto.Adapters.SQL.Sandbox
```
### Pgvector Integration Tests
Integration tests use Ecto sandbox for isolation:
```elixir
defmodule MyVectorTest do
  use ExUnit.Case, async: false
  alias PortfolioIndex.Adapters.VectorStore.Pgvector
  setup do
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(PortfolioIndex.Repo, shared: true)
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end
  @tag :integration
  test "stores and searches vectors" do
    index_id = "test_#{System.unique_integer([:positive])}"
    :ok = Pgvector.create_index(index_id, %{dimensions: 768})
    vector = for _ <- 1..768, do: :rand.uniform()
    :ok = Pgvector.store(index_id, "doc_1", vector, %{})
    {:ok, results} = Pgvector.search(index_id, vector, 1, [])
    assert hd(results).id == "doc_1"
    Pgvector.delete_index(index_id)
  end
end
```
### Telemetry Events
The Pgvector adapter emits telemetry events:
```elixir
# Event names
[:portfolio_index, :vector_store, :store]
[:portfolio_index, :vector_store, :store_batch]
[:portfolio_index, :vector_store, :search]
# Measurements
%{duration_ms: 5} # store
%{duration_ms: 50, count: 100} # store_batch
%{duration_ms: 10, k: 10, results: 8} # search
# Metadata
%{index_id: "my_index"}
```
Attach handlers for monitoring:
```elixir
# Logger.info/1 is a macro, so the calling module must require Logger
require Logger
:telemetry.attach(
  "pgvector-logger",
  [:portfolio_index, :vector_store, :search],
  fn _event, %{duration_ms: ms, results: n}, %{index_id: id}, _config ->
    Logger.info("Search on #{id}: #{n} results in #{ms}ms")
  end,
  nil
)
```
### Performance Tips
1. **Use HNSW for production** - Better speed/recall trade-off at query time than IVFFlat, at the cost of slower index builds and more memory
2. **Batch inserts** - Use `store_batch/2` for bulk ingestion
3. **Tune HNSW parameters**:
   - `m`: Higher = better recall, more memory (default: 16)
   - `ef_construction`: Higher = better index quality, slower build (default: 64)
4. **Use metadata filters** - Reduces search space before vector comparison
5. **Set appropriate `min_score`** - Filters low-quality matches early
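For very large ingests, the batching tip can be taken one step further by splitting the input into fixed-size batches so that no single call carries the whole dataset. The helper below is a hypothetical sketch: `BatchSketch` is not part of the library, the batch size of 500 is arbitrary, and the `{:error, reason}` return shape is an assumption about `store_batch/2`.

```elixir
defmodule BatchSketch do
  @moduledoc "Illustrative helper: feed items to a store function in fixed-size batches."

  # `store_fun` receives one batch and is expected to return {:ok, count}.
  # Processing stops at the first error, which is returned unchanged.
  def store_in_batches(items, batch_size, store_fun) when batch_size > 0 do
    items
    |> Enum.chunk_every(batch_size)
    |> Enum.reduce_while({:ok, 0}, fn batch, {:ok, total} ->
      case store_fun.(batch) do
        {:ok, count} -> {:cont, {:ok, total + count}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end
end

# A stub stands in for the adapter so the sketch runs without a database.
fake_store = fn batch -> {:ok, length(batch)} end
{:ok, 1050} = BatchSketch.store_in_batches(Enum.to_list(1..1050), 500, fake_store)

# With the adapter, the call would look like:
# BatchSketch.store_in_batches(items, 500, &Pgvector.store_batch("docs", &1))
```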
## Documentation
- [HexDocs](https://hexdocs.pm/portfolio_index)
## Related Packages
- [`portfolio_core`](https://github.com/nshkrdotcom/portfolio_core) - Hexagonal architecture primitives
- [`portfolio_manager`](https://github.com/nshkrdotcom/portfolio_manager) - CLI and application layer
## License
MIT License - see [LICENSE](LICENSE) for details.