Architecture
The Autonomous Incident Agent (AIA) is a modular, event-driven system designed to automatically detect, analyze, and create fix suggestions for application failures. It uses OpenTelemetry for observability, AI for analysis, and GitHub for delivering fixes.
#System Overview
AIA is composed of several specialized microservices that coordinate to handle the incident lifecycle.
Core Services
1. Agent (apps/agent)
The OpenTelemetry receiver running on port 4318. Receives OTEL traces and logs from your application, detects errors using built-in detectors (HTTP 5xx, exceptions, latency spikes, crashes), and forwards incidents to the Router.
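As a rough illustration of how such detectors could classify a span, here is a minimal sketch. The `SpanSummary` shape, the `detect` function, and the 2000 ms latency threshold are assumptions made for this example, not the Agent's actual types or defaults. Crash detection is left out because it depends on process-level signals.

```typescript
// Hypothetical, simplified view of a received span (not the Agent's real type).
interface SpanSummary {
  traceId: string;
  name: string;
  durationMs: number;
  httpStatus?: number;
  exceptionMessage?: string;
}

type IncidentKind = "http_5xx" | "exception" | "latency_spike";

// A detector returns the kind of incident it found, or null for a healthy span.
type Detector = (span: SpanSummary) => IncidentKind | null;

// Assumed threshold, for illustration only.
const LATENCY_THRESHOLD_MS = 2000;

const detectors: Detector[] = [
  (s) =>
    s.httpStatus !== undefined && s.httpStatus >= 500 && s.httpStatus <= 599
      ? "http_5xx"
      : null,
  (s) => (s.exceptionMessage !== undefined ? "exception" : null),
  (s) => (s.durationMs > LATENCY_THRESHOLD_MS ? "latency_spike" : null),
];

// Runs every detector and collects all incident kinds that matched.
function detect(span: SpanSummary): IncidentKind[] {
  const found: IncidentKind[] = [];
  for (const detector of detectors) {
    const kind = detector(span);
    if (kind !== null) {
      found.push(kind);
    }
  }
  return found;
}
```

Keeping each detector as a small pure function makes it easy to add or remove one without touching the others.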
Key Features:
- OTEL HTTP endpoint (/v1/traces, /v1/logs)
- Built-in error detectors
- Trace ID-based deduplication (30-second window)
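Trace ID-based deduplication with a 30-second window could look roughly like the sketch below. The `TraceDeduplicator` class is hypothetical. It assumes the window starts at the first forwarded occurrence and is not extended by later duplicates, which the Agent may or may not do.

```typescript
const DEDUP_WINDOW_MS = 30 * 1000;

// Hypothetical helper: remembers when each trace ID was last forwarded.
class TraceDeduplicator {
  private lastForwarded = new Map<string, number>();
  private windowMs: number;

  constructor(windowMs: number = DEDUP_WINDOW_MS) {
    this.windowMs = windowMs;
  }

  // Returns true if the incident should be forwarded, false if it is a
  // duplicate of one forwarded less than windowMs ago.
  shouldForward(traceId: string, nowMs: number): boolean {
    const previous = this.lastForwarded.get(traceId);
    if (previous !== undefined && nowMs - previous < this.windowMs) {
      return false;
    }
    this.lastForwarded.set(traceId, nowMs);
    return true;
  }

  // Drops entries whose window has expired so the map does not grow forever.
  prune(nowMs: number): void {
    this.lastForwarded.forEach((seenAt, id) => {
      if (nowMs - seenAt >= this.windowMs) {
        this.lastForwarded.delete(id);
      }
    });
  }
}
```

Passing the current time in as `nowMs` (for example `Date.now()`) keeps the class deterministic and easy to test.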
2. Router (apps/router)
The incident orchestrator. Receives detected incidents from the Agent, enriches them with context (snapshots, file paths), stores them in the State service, and triggers the Autopsy service for analysis.
Key Features:
- Incident enrichment with code snapshots
- State persistence coordination
- Autopsy triggering
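The enrich, store, trigger sequence can be sketched as a single handler with injected dependencies. All names here (`Incident`, `RouterDeps`, `handleIncident`) are hypothetical stand-ins for whatever the Router actually uses to reach the State and Autopsy services.

```typescript
// Hypothetical incident shapes, for illustration only.
interface Incident {
  id: string;
  traceId: string;
  message: string;
}

interface EnrichedIncident extends Incident {
  snapshots: string[];
  filePaths: string[];
}

// The Router's collaborators, injected so they can be real clients or test fakes.
interface RouterDeps {
  enrich(incident: Incident): Promise<EnrichedIncident>;
  saveIncident(incident: EnrichedIncident): Promise<void>;
  triggerAutopsy(incidentId: string): Promise<void>;
}

async function handleIncident(
  incident: Incident,
  deps: RouterDeps
): Promise<EnrichedIncident> {
  // 1. Add context such as code snapshots and file paths.
  const enriched = await deps.enrich(incident);
  // 2. Persist first, so the incident exists in State before analysis starts.
  await deps.saveIncident(enriched);
  // 3. Ask Autopsy to analyze the stored incident.
  await deps.triggerAutopsy(enriched.id);
  return enriched;
}
```

Each step is awaited in turn, so a failure in enrichment or persistence stops the flow before Autopsy is triggered.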
3. Autopsy (apps/autopsy)
The AI-powered analysis engine. Uses the You.com API to analyze incidents and generate:
- Root cause analysis
- Code patches (git diff format)
- AI fix prompts for developers
- Step-by-step manual remediation instructions
Key Features:
- AI reasoning (You.com API)
- Patch generation with validation
- Detailed fix instructions
- Confidence scoring
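To make the outputs above concrete, here is a sketch of what an autopsy result and a basic patch check might look like. The field names, the 0 to 1 confidence range, and the 0.5 cutoff are assumptions. The validator only checks that the text is shaped like a unified diff, not that it applies cleanly.

```typescript
// Hypothetical result shape; field names are illustrative.
interface AutopsyResult {
  incidentId: string;
  rootCause: string;
  patch: string; // unified diff text
  fixPrompt: string;
  manualSteps: string[];
  confidence: number; // assumed to be in the range [0, 1]
}

// Structural check: file headers plus at least one hunk header.
function looksLikeUnifiedDiff(patch: string): boolean {
  const lines = patch.split("\n");
  const hasOldHeader = lines.some((l) => l.startsWith("--- "));
  const hasNewHeader = lines.some((l) => l.startsWith("+++ "));
  const hasHunk = lines.some((l) => /^@@ -\d+(,\d+)? \+\d+(,\d+)? @@/.test(l));
  return hasOldHeader && hasNewHeader && hasHunk;
}

// A result is worth handing to the Git service only if the patch is
// well-formed and the confidence clears an (assumed) minimum.
function isActionable(result: AutopsyResult, minConfidence: number = 0.5): boolean {
  return result.confidence >= minConfidence && looksLikeUnifiedDiff(result.patch);
}
```

A result that fails this check can still be useful: the root cause, fix prompt, and manual steps remain available to developers.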
4. State (apps/state)
The persistence layer using PostgreSQL (Neon). Stores all incident data, autopsy results, and PR metadata.
Database Schema:
- incidents table: Core incident data
- autopsy_results table: AI analysis results
- Indexed by incident_id for fast lookups
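The relationship between the two tables can be pictured with an in-memory stand-in keyed by `incident_id`. Apart from `incident_id`, every column name below is hypothetical, and the sketch assumes at most one autopsy result per incident. The real service uses PostgreSQL rather than maps.

```typescript
// Hypothetical row shapes; only incident_id is taken from the schema description.
interface IncidentRow {
  incident_id: string;
  service: string;
  error_message: string;
}

interface AutopsyResultRow {
  incident_id: string;
  root_cause: string;
  confidence: number;
}

// In-memory stand-in: each Map plays the role of an index on incident_id.
class InMemoryState {
  private incidents = new Map<string, IncidentRow>();
  private autopsyResults = new Map<string, AutopsyResultRow>();

  saveIncident(row: IncidentRow): void {
    this.incidents.set(row.incident_id, row);
  }

  // Rejects results for unknown incidents, like a foreign-key constraint would.
  saveAutopsyResult(row: AutopsyResultRow): void {
    if (!this.incidents.has(row.incident_id)) {
      throw new Error("unknown incident " + row.incident_id);
    }
    this.autopsyResults.set(row.incident_id, row);
  }

  // Returns the incident with its autopsy result (null if not analyzed yet).
  lookup(
    incidentId: string
  ): { incident: IncidentRow; autopsy: AutopsyResultRow | null } | null {
    const incident = this.incidents.get(incidentId);
    if (incident === undefined) {
      return null;
    }
    const autopsy = this.autopsyResults.get(incidentId);
    return { incident, autopsy: autopsy === undefined ? null : autopsy };
  }
}
```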
5. Git (apps/git)
The GitHub integration service. Handles all git operations:
- Clones repository using GitHub token
- Creates incident branch (aia/incident-{id})
- Attempts to apply AI-generated patch
- Commits changes (or saves failed patch)
- Pushes to GitHub
- Creates Pull Request with fix details
Key Features:
- Automatic patch application
- Failed patch handling (patch_failed_*.diff)
- PR creation with detailed context
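The branch naming and the apply-or-save fallback might be sketched as follows. The `GitOps` interface, the `applyOrSave` function, and the commit message are hypothetical. Only the `aia/incident-{id}` branch format and the `patch_failed_*.diff` file pattern come from the description above.

```typescript
// Branch name for an incident, following the aia/incident-{id} format.
function incidentBranchName(incidentId: string): string {
  return `aia/incident-${incidentId}`;
}

// True if a file name matches the patch_failed_*.diff pattern.
function isFailedPatchFile(fileName: string): boolean {
  return /^patch_failed_.*\.diff$/.test(fileName);
}

type PatchOutcome = { applied: true } | { applied: false; reason: string };

// Hypothetical wrapper around the underlying git operations.
interface GitOps {
  applyPatch(patch: string): PatchOutcome;
  commit(message: string): void;
  saveFailedPatch(patch: string): void;
}

// Commits the change when the patch applies; otherwise keeps the patch
// on disk so a developer can inspect or apply it by hand.
function applyOrSave(patch: string, incidentId: string, git: GitOps): PatchOutcome {
  const outcome = git.applyPatch(patch);
  if (outcome.applied) {
    git.commit(`fix: apply AI patch for incident ${incidentId}`); // assumed message format
  } else {
    git.saveFailedPatch(patch);
  }
  return outcome;
}
```

For example, `incidentBranchName("42")` yields `aia/incident-42`.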
6. Dashboard (apps/dashboard)
The web UI for monitoring incidents. Provides:
- Real-time incident list
- Autopsy results viewer
- AI fix prompt (copy-to-clipboard)
- Manual remediation steps
- GitHub PR links
- PDF report generation
Tech Stack: Bun + HTML templates
7. Web (apps/web)
The marketing/landing page with Clerk authentication.
Tech Stack: Next.js 14, Clerk, shadcn/ui
8. Docs (apps/docs)
The documentation site you're reading now.
Tech Stack: Next.js 14, MDX
9. Sample App (apps/sample-app)
A demo application that emits OTEL traces for testing.
Features:
- /trigger endpoint to cause intentional errors
- OTEL instrumentation
- Error simulation
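For reference, a failing request reported over OTLP/HTTP in JSON form has roughly the shape built below. This is a hand-written sketch of the payload, not the sample app's instrumentation code. The `sample-app` service name, the scope name, and the `buildErrorTrace` helper are assumptions.

```typescript
interface OtlpAttribute {
  key: string;
  value: { stringValue: string } | { intValue: number };
}

interface OtlpSpan {
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters
  name: string;
  kind: number;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes: OtlpAttribute[];
  status: { code: number; message: string };
}

interface OtlpTracePayload {
  resourceSpans: {
    resource: { attributes: OtlpAttribute[] };
    scopeSpans: { scope: { name: string }; spans: OtlpSpan[] }[];
  }[];
}

// Builds a one-span trace for a failed server request.
// startMs and durationMs are expected to be whole milliseconds.
function buildErrorTrace(
  traceId: string,
  spanId: string,
  startMs: number,
  durationMs: number
): OtlpTracePayload {
  const toNanos = (ms: number): string => String(Math.floor(ms)) + "000000";
  return {
    resourceSpans: [
      {
        resource: {
          attributes: [{ key: "service.name", value: { stringValue: "sample-app" } }],
        },
        scopeSpans: [
          {
            scope: { name: "manual-sketch" },
            spans: [
              {
                traceId,
                spanId,
                name: "GET /trigger",
                kind: 2, // SPAN_KIND_SERVER
                startTimeUnixNano: toNanos(startMs),
                endTimeUnixNano: toNanos(startMs + durationMs),
                attributes: [
                  { key: "http.response.status_code", value: { intValue: 500 } },
                ],
                status: { code: 2, message: "intentional failure" }, // STATUS_CODE_ERROR
              },
            ],
          },
        ],
      },
    ],
  };
}
```

Serialized with `JSON.stringify`, a payload like this could be POSTed with `Content-Type: application/json` to the Agent's /v1/traces endpoint on port 4318.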
#Data Flow
1. Application Error
↓
2. OTEL Trace → Agent (port 4318)
↓
3. Detector identifies incident
↓
4. Router enriches with context
↓
5. State stores incident
↓
6. Autopsy analyzes with AI
↓
7. State stores autopsy result
↓
8. Git creates PR
↓
9. Dashboard displays results
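One way to reason about this flow in code is as an ordered list of stages that an incident moves through. The stage names and the `nextStage` helper below are invented for this sketch and do not correspond to identifiers in the codebase.

```typescript
// The nine steps above, in order. Names are illustrative.
const PIPELINE_STAGES = [
  "error_emitted",
  "trace_received",
  "incident_detected",
  "incident_enriched",
  "incident_stored",
  "autopsy_completed",
  "autopsy_stored",
  "pr_created",
  "displayed",
] as const;

type Stage = (typeof PIPELINE_STAGES)[number];

// Returns the stage that follows `current`, or null at the end of the flow.
function nextStage(current: Stage): Stage | null {
  const index = PIPELINE_STAGES.indexOf(current);
  const next = PIPELINE_STAGES[index + 1];
  return next === undefined ? null : next;
}
```

Modeling the flow as data makes it straightforward to log or display how far a given incident has progressed.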
#Storage
Database (PostgreSQL/Neon)
- Incident metadata
- Autopsy results
- PR status
Cloudflare R2 (Optional)
- Autopsy result backups
- Large artifacts
Local Filesystem
- Git workspace (apps/git/git_workspace/)
- Logs
- Failed patches
#Configuration
All services are configured via aia.config.yaml:
- Service ports and URLs
- GitHub credentials
- AI API keys
- Database connection
- Storage settings
See Configuration Reference for details.
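As a sketch of how a service might sanity-check the parsed file at startup, the types below mirror the five groups listed above. Every key name here (`services`, `github`, `ai`, `database`, `storage`, and their fields) is a guess made for illustration. The Configuration Reference is the authority on the real keys.

```typescript
// Hypothetical shape of the parsed aia.config.yaml; key names are assumptions.
interface AiaConfigSketch {
  services: Record<string, { port: number; url: string }>;
  github: { token: string };
  ai: { apiKey: string };
  database: { connectionString: string };
  storage: { localWorkspace: string };
}

const REQUIRED_SECTIONS = ["services", "github", "ai", "database", "storage"] as const;

// Returns the names of top-level sections that are absent from the parsed config.
function missingSections(parsed: Partial<AiaConfigSketch>): string[] {
  return REQUIRED_SECTIONS.filter(
    (key) => parsed[key] === undefined || parsed[key] === null
  );
}
```

Failing fast on a non-empty result gives a clearer error than a later crash when a service first reads a missing value.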
#Deployment
Development
bun run dev
Starts all services locally with hot reload.
Production
Services can be deployed as:
- Docker containers
- Standalone Node.js/Bun processes
- Kubernetes pods
- Serverless functions (with modifications)
See deployment guides for specific platforms.