docs/architecture/pipeline_organization.md

# Pipeline Organization and Categorization System

## Overview
This document defines the organizational structure for the AI engineering pipeline library, establishing a systematic approach to pipeline discovery, reuse, and composition.

## Directory Structure

```
pipeline_ex/
├── pipelines/                      # Main pipeline library
│   ├── registry.yaml              # Global pipeline registry
│   ├── data/                      # Data processing pipelines
│   │   ├── cleaning/
│   │   ├── enrichment/
│   │   ├── transformation/
│   │   └── quality/
│   ├── model/                     # Model development pipelines
│   │   ├── prompt_engineering/
│   │   ├── evaluation/
│   │   ├── comparison/
│   │   └── fine_tuning/
│   ├── code/                      # Code generation pipelines
│   │   ├── api_generation/
│   │   ├── test_generation/
│   │   ├── documentation/
│   │   └── refactoring/
│   ├── analysis/                  # Analysis pipelines
│   │   ├── codebase/
│   │   ├── security/
│   │   ├── performance/
│   │   └── dependencies/
│   ├── content/                   # Content generation pipelines
│   │   ├── blog/
│   │   ├── tutorial/
│   │   ├── api_docs/
│   │   └── changelog/
│   ├── devops/                    # DevOps pipelines
│   │   ├── ci_cd/
│   │   ├── deployment/
│   │   ├── monitoring/
│   │   └── infrastructure/
│   ├── components/                # Reusable components
│   │   ├── steps/                # Reusable step definitions
│   │   ├── prompts/              # Prompt templates
│   │   ├── functions/            # Gemini function definitions
│   │   ├── validators/           # Validation components
│   │   └── transformers/         # Data transformation components
│   └── templates/                 # Pipeline templates
│       ├── basic/                # Simple pipeline patterns
│       ├── advanced/             # Complex pipeline patterns
│       └── enterprise/           # Production-grade patterns
├── examples/                      # Example usage and demos
│   ├── tutorials/                # Step-by-step tutorials
│   └── case_studies/             # Real-world implementations
└── tests/                        # Pipeline-specific tests
    ├── pipeline_tests/           # Integration tests for pipelines
    └── component_tests/          # Unit tests for components
```

## Pipeline Registry Schema

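The `pipelines/registry.yaml` file is the central catalog of all available pipelines. Each entry records a pipeline's identity, category, version, dependencies, and provider requirements: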

```yaml
version: "1.0"
last_updated: "2025-06-30"

pipelines:
  - id: "data-cleaning-standard"
    name: "Standard Data Cleaning Pipeline"
    category: "data/cleaning"
    description: "Multi-stage data cleaning with validation"
    version: "1.0.0"
    tags: ["data", "cleaning", "validation"]
    dependencies:
      - "components/steps/validation"
      - "components/transformers/data"
    complexity: "medium"
    estimated_tokens: 5000
    providers: ["claude", "gemini"]
    
  - id: "api-rest-generator"
    name: "REST API Generator"
    category: "code/api_generation"
    description: "Generate complete REST API with tests"
    version: "2.1.0"
    tags: ["api", "code-generation", "rest"]
    dependencies:
      - "components/steps/code"
      - "components/prompts/api"
    complexity: "high"
    estimated_tokens: 15000
    providers: ["claude"]
```
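Both entries above declare dependencies that point into `pipelines/components/`. As a minimal sketch of how such declarations could be checked, the module below flags any dependency that does not look like `components/<kind>/<name>`. `DependencyCheck` is a hypothetical name, the list of kinds is taken from the directory tree above, and entries are assumed to be already parsed into string-keyed maps.

```elixir
defmodule DependencyCheck do
  @moduledoc "Hypothetical helper, not part of the pipeline library."

  # Component kinds, taken from the directories under pipelines/components/.
  @kinds ~w(steps prompts functions validators transformers)

  @doc "Returns the dependencies of `entry` that are not shaped like components/<kind>/<name>."
  def invalid_dependencies(entry) when is_map(entry) do
    entry
    |> Map.get("dependencies", [])
    |> Enum.reject(&valid?/1)
  end

  # A dependency is accepted when it starts with "components", names a known
  # kind, and has a non-empty third path segment.
  defp valid?(dep) do
    case String.split(dep, "/") do
      ["components", kind, name | _rest] -> kind in @kinds and name != ""
      _ -> false
    end
  end
end

entry = %{
  "id" => "data-cleaning-standard",
  "dependencies" => ["components/steps/validation", "components/transformers/data"]
}

# An empty list means every declared dependency has the expected shape.
DependencyCheck.invalid_dependencies(entry)
```

This only checks the shape of the path. Confirming that the referenced component exists on disk would be a separate step.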

## Categorization Taxonomy

### 1. Primary Categories
- **Data**: Pipelines focused on data manipulation and processing
- **Model**: AI/ML model development and optimization
- **Code**: Software development and code generation
- **Analysis**: System and code analysis workflows
- **Content**: Documentation and content creation
- **DevOps**: Infrastructure and deployment automation

### 2. Complexity Levels
- **Basic**: Single-step or simple multi-step pipelines
- **Medium**: Multi-step with conditional logic
- **High**: Complex workflows with parallel execution
- **Enterprise**: Production-grade with full error handling

### 3. Provider Requirements
- **Claude-only**: Requires Claude-specific features
- **Gemini-only**: Requires Gemini function calling
- **Multi-provider**: Can use either provider
- **Hybrid**: Requires both providers

## Component Classification

### Step Components
```yaml
# components/steps/validation/input_validator.yaml
component:
  type: "step"
  id: "input-validator"
  name: "Input Validation Step"
  description: "Validates input data against schema"
  
  parameters:
    schema:
      type: "object"
      description: "JSON Schema for validation"
    strict:
      type: "boolean"
      default: true
      
  outputs:
    valid:
      type: "boolean"
    errors:
      type: "array"
      items:
        type: "string"
```

### Prompt Templates
```yaml
# components/prompts/analysis/code_review_prompt.yaml
component:
  type: "prompt"
  id: "code-review-prompt"
  name: "Code Review Prompt Template"
  
  variables:
    - code_content
    - review_focus
    - severity_level
    
  template: |
    Review the following code with focus on {review_focus}:
    
    ```
    {code_content}
    ```
    
    Provide feedback at {severity_level} level.
```
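The template above uses `{name}` placeholders that correspond to the entries under `variables`. A minimal sketch of how such a template could be rendered is shown below. `PromptTemplate` is a hypothetical module, and the choice to return an error for unfilled placeholders is an assumption; the actual rendering logic of the pipeline library may differ.

```elixir
defmodule PromptTemplate do
  @moduledoc "Hypothetical helper that fills `{name}` placeholders in a prompt template."

  # Matches a placeholder such as {review_focus} and captures the name.
  defp placeholder, do: ~r/\{(\w+)\}/

  @doc """
  Replaces each `{name}` placeholder with the matching value from `vars`.
  Returns `{:error, {:missing_variables, names}}` if any placeholder has no value.
  """
  def render(template, vars) when is_binary(template) and is_map(vars) do
    # Collect placeholder names that have no corresponding key in `vars`.
    missing =
      placeholder()
      |> Regex.scan(template, capture: :all_but_first)
      |> List.flatten()
      |> Enum.uniq()
      |> Enum.reject(&Map.has_key?(vars, &1))

    case missing do
      [] ->
        rendered =
          Regex.replace(placeholder(), template, fn _match, name ->
            to_string(Map.fetch!(vars, name))
          end)

        {:ok, rendered}

      names ->
        {:error, {:missing_variables, names}}
    end
  end
end

PromptTemplate.render(
  "Provide feedback at {severity_level} level.",
  %{"severity_level" => "high"}
)
# => {:ok, "Provide feedback at high level."}
```

Reporting missing variables before substituting anything keeps a half-filled prompt from being sent to a provider.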

## Naming Conventions

### Pipeline Files
- Format: `{purpose}_{variant}_pipeline.yaml`
- Examples:
  - `data_cleaning_standard_pipeline.yaml`
  - `api_generation_rest_pipeline.yaml`
  - `security_audit_comprehensive_pipeline.yaml`
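
As an illustration of the `{purpose}_{variant}_pipeline.yaml` format, the sketch below builds and checks pipeline file names. `PipelineNaming` is a hypothetical module, and the normalization rule (lowercase, with non-alphanumeric runs turned into underscores) is an assumption inferred from the examples rather than a documented rule.

```elixir
defmodule PipelineNaming do
  @moduledoc "Hypothetical helper for the `{purpose}_{variant}_pipeline.yaml` convention."

  @doc "Builds a pipeline file name from a purpose and a variant."
  def file_name(purpose, variant) do
    [purpose, variant, "pipeline"]
    |> Enum.map(&normalize/1)
    |> Enum.join("_")
    |> Kernel.<>(".yaml")
  end

  @doc "Checks that a file name is snake_case and ends in `_pipeline.yaml`."
  def valid_file_name?(name) when is_binary(name) do
    Regex.match?(~r/\A[a-z0-9]+(_[a-z0-9]+)*_pipeline\.yaml\z/, name)
  end

  # Lowercases the part and turns runs of other characters into single underscores.
  defp normalize(part) do
    part
    |> String.downcase()
    |> String.replace(~r/[^a-z0-9]+/, "_")
    |> String.trim("_")
  end
end

PipelineNaming.file_name("Data Cleaning", "standard")
# => "data_cleaning_standard_pipeline.yaml"
```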

### Component Files
- Format: `{function}_{type}.yaml`
- Examples:
  - `input_validator.yaml`
  - `json_transformer.yaml`
  - `code_review_prompt.yaml`

### Version Tags
- Semantic versioning: `MAJOR.MINOR.PATCH`
- Beta versions: `X.Y.Z-beta.N`
- Release candidates: `X.Y.Z-rc.N`
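
Elixir's standard `Version` module already parses semantic versions with pre-release identifiers, so the tag formats above can be validated and ordered without extra dependencies. The sketch below is one way to do that; `VersionTags` is a hypothetical module, and rejecting every pre-release label other than `beta` and `rc` is an assumption based on the list above.

```elixir
defmodule VersionTags do
  @moduledoc "Hypothetical helper built on Elixir's standard Version module."

  @doc "Returns true for `X.Y.Z`, `X.Y.Z-beta.N`, and `X.Y.Z-rc.N`."
  def valid?(tag) when is_binary(tag) do
    case Version.parse(tag) do
      # Plain release with no pre-release part and no build metadata.
      {:ok, %Version{pre: [], build: nil}} ->
        true

      # Pre-release of the form beta.N or rc.N, where N is parsed as an integer.
      {:ok, %Version{pre: [label, n], build: nil}}
      when label in ["beta", "rc"] and is_integer(n) ->
        true

      _ ->
        false
    end
  end

  @doc "Sorts tags from oldest to newest using semantic-version precedence."
  def sort(tags), do: Enum.sort(tags, Version)
end

VersionTags.sort(["1.0.0", "1.0.0-rc.1", "0.9.0", "1.0.0-beta.2"])
# => ["0.9.0", "1.0.0-beta.2", "1.0.0-rc.1", "1.0.0"]
```

Under semantic-version precedence a pre-release sorts before its final release, which is why `1.0.0-rc.1` comes ahead of `1.0.0`.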

## Discovery Mechanisms

### 1. CLI Commands
```bash
# List all pipelines
mix pipeline.list

# Search by category
mix pipeline.list --category data/cleaning

# Search by tags
mix pipeline.list --tags "api,rest"

# Show pipeline details
mix pipeline.info api-rest-generator
```

### 2. Web Interface (Future)
- Visual pipeline browser
- Dependency graph visualization
- Performance metrics dashboard
- Usage analytics

### 3. API Access
```elixir
# Pipeline discovery API
Pipeline.Registry.list_by_category("data/cleaning")
Pipeline.Registry.search(tags: ["api", "rest"])
Pipeline.Registry.get_details("api-rest-generator")
```
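
The calls above could be backed by plain list filtering over registry entries. The sketch below is illustrative only: `RegistrySketch` is a hypothetical module and not the actual `Pipeline.Registry` implementation. It takes the entries as an explicit argument, treats a parent category such as `data` as matching `data/cleaning`, and assumes a tag search returns entries that carry all requested tags.

```elixir
defmodule RegistrySketch do
  @moduledoc "Illustrative in-memory lookups over registry entries (string-keyed maps)."

  @doc "Returns entries whose category equals `category` or is nested under it."
  def list_by_category(entries, category) do
    Enum.filter(entries, fn entry ->
      c = Map.get(entry, "category", "")
      c == category or String.starts_with?(c, category <> "/")
    end)
  end

  @doc "Returns entries that carry every tag given in `opts[:tags]`."
  def search(entries, opts) do
    wanted = Keyword.get(opts, :tags, [])

    Enum.filter(entries, fn entry ->
      tags = Map.get(entry, "tags", [])
      Enum.all?(wanted, &(&1 in tags))
    end)
  end

  @doc "Looks up a single entry by its id."
  def get_details(entries, id) do
    case Enum.find(entries, &(Map.get(&1, "id") == id)) do
      nil -> {:error, :not_found}
      entry -> {:ok, entry}
    end
  end
end

entries = [
  %{"id" => "data-cleaning-standard", "category" => "data/cleaning",
    "tags" => ["data", "cleaning", "validation"]},
  %{"id" => "api-rest-generator", "category" => "code/api_generation",
    "tags" => ["api", "code-generation", "rest"]}
]

RegistrySketch.search(entries, tags: ["api", "rest"])
RegistrySketch.get_details(entries, "api-rest-generator")
```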

## Metadata Standards

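Each pipeline's registry entry must include:
1. Unique identifier (`id`)
2. Descriptive name (`name`)
3. Clear category placement (`category`, for example `data/cleaning`)
4. Version information (`version`, following the version tag conventions)
5. Dependency declarations (`dependencies`)
6. Performance estimates (for example `estimated_tokens`)
7. Provider requirements (`providers`)
8. Comprehensive tags (`tags`)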
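
The eight requirements above can be enforced with a simple presence check. The sketch below assumes they map onto the registry fields `id`, `name`, `category`, `version`, `dependencies`, `estimated_tokens`, `providers`, and `tags`, and that entries are string-keyed maps. `MetadataCheck` is a hypothetical module name.

```elixir
defmodule MetadataCheck do
  @moduledoc "Hypothetical presence check for required registry metadata."

  # Registry fields assumed to back the eight metadata requirements.
  @required ~w(id name category version dependencies estimated_tokens providers tags)

  @doc "Returns :ok, or {:error, missing} listing fields that are absent or blank."
  def validate(entry) when is_map(entry) do
    missing = Enum.filter(@required, &blank?(Map.get(entry, &1)))
    if missing == [], do: :ok, else: {:error, missing}
  end

  # Only nil and empty strings count as blank. An empty list passes, so a
  # pipeline may declare `dependencies: []` explicitly.
  defp blank?(nil), do: true
  defp blank?(""), do: true
  defp blank?(_value), do: false
end

MetadataCheck.validate(%{"id" => "api-rest-generator", "name" => "REST API Generator"})
# => {:error, ["category", "version", "dependencies", "estimated_tokens", "providers", "tags"]}
```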

## Migration Path

For existing pipelines:
1. Analyze current pipeline files
2. Categorize according to new taxonomy
3. Add required metadata
4. Update file locations
5. Register in central registry
6. Update references in code

## Governance

### Adding New Pipelines
1. Define clear purpose and category
2. Follow naming conventions
3. Include all required metadata
4. Add comprehensive tests
5. Document usage examples
6. Submit for review

### Deprecation Process
1. Mark as deprecated in registry
2. Add deprecation notice to file
3. Provide migration guide
4. Keep maintaining it for two major versions
5. Archive after removal

## Benefits

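1. **Discoverability**: Pipelines can be found by category, tag, or ID through the registry
2. **Reusability**: Shared steps, prompts, and validators live in `components/` with clear boundaries
3. **Maintainability**: Each pipeline has one predictable location and name
4. **Scalability**: New pipelines and subcategories fit the existing structure as the library grows
5. **Consistency**: Naming, versioning, and metadata standards apply to every pipeline
6. **Quality**: Every new pipeline requires tests, usage examples, and review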