README.md

# ConfluenceLoader

An Elixir library for fetching and reading Confluence pages, inspired by the Python llama-index-readers-confluence library.

## Features

- Fetch pages from Confluence Cloud and Server instances
- Support for both REST API v1 and v2
- Convert Confluence pages to a document format suitable for LLMs
- Pagination support for large result sets
- Flexible authentication (API tokens for Cloud, username/password for Server)

## Installation

Add `confluence_loader` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:confluence_loader, "~> 0.1.0"}
  ]
end
```

## Configuration

You'll need:
- Your Confluence instance URL (e.g., `https://your-domain.atlassian.net`)
- Authentication credentials:
  - For Confluence Cloud: Email address and API token
  - For Confluence Server: Username and password

## Usage

### Creating a Client

```elixir
# For Confluence Cloud (default)
client = ConfluenceLoader.new_client(
  "https://your-domain.atlassian.net",
  "your-email@example.com",
  "your-api-token"
)

# For Confluence Server with custom API path
client = ConfluenceLoader.new_client(
  "https://confluence.company.com",
  "username",
  "password",
  api_base_path: "/rest/api"  # v1 API path
)
```

### Loading Documents

```elixir
# Load all pages as documents
{:ok, documents} = ConfluenceLoader.load_documents(client)

# Load pages from a specific space
{:ok, documents} = ConfluenceLoader.load_space_documents(client, "PROJ")

# Load with specific parameters
{:ok, documents} = ConfluenceLoader.load_documents(client, %{
  space_id: ["123", "456"],  # Multiple space IDs
  limit: 50,                 # Number of pages per request
  status: "current",         # Page status
  body_format: "storage"     # Format of the content body
})

# Load documents created since a specific timestamp
{:ok, since_date} = DateTime.new(~D[2024-01-01], ~T[00:00:00], "Etc/UTC")
{:ok, recent_docs} = ConfluenceLoader.load_documents_since(client, "PROJ", since_date)

# Or using an ISO timestamp string
{:ok, recent_docs} = ConfluenceLoader.load_documents_since(client, "PROJ", "2024-01-01T00:00:00Z")
```

### Document Structure

Each document contains:
- `id`: The page ID
- `text`: The page content (HTML stripped)
- `metadata`: Additional information including:
  - `title`: Page title
  - `space_id`: Space ID
  - `space_key`: Space key
  - `status`: Page status
  - `created_at`: Creation timestamp
  - `updated_at`: Last update timestamp
  - `url`: Web URL of the page
  - `parent_id`: Parent page ID (if applicable)

### Working with Documents

```elixir
# Load documents
{:ok, documents} = ConfluenceLoader.load_documents(client)

# Access document properties
Enum.each(documents, fn doc ->
  IO.puts("Title: #{doc.metadata.title}")
  IO.puts("Content: #{String.slice(doc.text, 0, 100)}...")
  
  # Format for LLM consumption
  formatted = ConfluenceLoader.Document.format_for_llm(doc)
  IO.puts(formatted)
end)
```

### Timestamp-Based Filtering

Load documents from a specific space that were created at or after a given timestamp. This is useful for incremental updates or processing only recent content changes:

```elixir
# Using DateTime struct
{:ok, since_date} = DateTime.new(~D[2024-01-01], ~T[00:00:00], "Etc/UTC")
{:ok, recent_docs} = ConfluenceLoader.load_documents_since(client, "PROJ", since_date)

# Using ISO 8601 timestamp string
{:ok, recent_docs} = ConfluenceLoader.load_documents_since(client, "PROJ", "2024-01-01T00:00:00Z")

# With additional parameters
{:ok, recent_docs} = ConfluenceLoader.load_documents_since(client, "PROJ", since_date, %{limit: 50})

# Load documents from the last 30 days
thirty_days_ago = DateTime.utc_now() |> DateTime.add(-30, :day)
{:ok, recent_docs} = ConfluenceLoader.load_documents_since(client, "SPACE_KEY", thirty_days_ago)
```

### Low-Level API Access

You can also use the lower-level API functions directly:

```elixir
# Get pages with specific parameters
{:ok, response} = ConfluenceLoader.get_pages(client)

# Get pages with query parameters
{:ok, response} = ConfluenceLoader.get_pages(client, %{
  space_id: ["123"],
  limit: 25,
  sort: "-created-date"
})

# Get a specific page
{:ok, page} = ConfluenceLoader.get_page(client, "page_id")

# Get pages in a space
{:ok, response} = ConfluenceLoader.get_pages_in_space(client, "PROJ")

# Get pages by label
{:ok, response} = ConfluenceLoader.get_pages_for_label(client, "label_id")
```

### Pagination

The library handles pagination automatically when using `load_documents` functions. For manual pagination with the low-level API:

```elixir
# Create client
client = ConfluenceLoader.new_client(
  "https://your-domain.atlassian.net",
  "email@example.com",
  "api-token"
)

# Function to fetch all pages
def fetch_all_pages(client, cursor \\ nil, accumulated \\ []) do
  params = %{limit: 25}
  params = if cursor, do: Map.put(params, :cursor, cursor), else: params
  
  case ConfluenceLoader.get_pages(client, params) do
    {:ok, response} ->
      pages = response["results"] || []
      accumulated = accumulated ++ pages
      
      case response["_links"]["next"] do
        nil -> 
          {:ok, accumulated}
        _ ->
          # Extract cursor from next link
          alias ConfluenceLoader.Pages
          next_cursor = Pages.extract_cursor_from_next_link(response)
          fetch_all_pages(client, next_cursor, accumulated)
      end
      
    {:error, reason} ->
      {:error, reason}
  end
end

# Use it
{:ok, first_page} = ConfluenceLoader.get_pages(client, %{limit: 25})
```

## Testing

The library includes comprehensive test coverage using Bypass for mocking HTTP requests:

```bash
mix test
```

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin feature/my-new-feature`)
5. Create a new Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

This library is inspired by the Python [llama-index-readers-confluence](https://pypi.org/project/llama-index-readers-confluence/) library and provides similar functionality for the Elixir ecosystem.