Klarsicht is an AI agent that receives your Grafana alerts, inspects your cluster, and delivers root cause analysis before you open a terminal.
The problem
The fix takes 5 minutes. Finding the root cause takes 2 hours. You already know this.
How it works
Works with your existing stack. No rip-and-replace.
Grafana sends a webhook to Klarsicht. Your existing alert rules, thresholds, and routing stay exactly the same.
Reads pod status, container logs, and K8s events, and queries Prometheus metrics. Like a senior SRE, but in 30 seconds.
Root cause, confidence score, fix steps with kubectl commands, and a postmortem draft. Copy as markdown, paste anywhere.
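The webhook body Grafana sends follows its unified-alerting format. A minimal sketch of such a payload, with illustrative label values borrowed from the example below (the agent URL matches the one used in Get started):

```shell
# Sketch of a Grafana-style alert payload. Field names follow Grafana's
# unified alerting webhook format; the label values are illustrative.
cat > /tmp/mock-alert.json <<'EOF'
{
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "CrashLoopBackOff",
        "namespace": "production",
        "pod": "api-gateway-7f8b9c-x2k9p"
      },
      "annotations": {
        "summary": "Pod is crash-looping"
      },
      "startsAt": "2024-01-01T03:10:00Z"
    }
  ]
}
EOF

# In-cluster, this is what gets POSTed to the agent:
# curl -X POST -H 'Content-Type: application/json' \
#   --data @/tmp/mock-alert.json \
#   http://klarsicht-agent.klarsicht.svc:8000/alert
```

Klarsicht picks the pod and namespace out of the alert labels and starts the investigation from there.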
See it work
This is what appears in your dashboard when a CrashLoopBackOff fires.
╭─ Alert received: CrashLoopBackOff
┃  namespace: production  pod: api-gateway-7f8b9c-x2k9p
┃
┣─ Inspecting pod...
┃  State: CrashLoopBackOff (7 restarts in 12m)
┃  Exit code: 1
┃
┣─ Reading logs (last 200 lines)...
┃  FATAL: failed to connect to postgres:5432
┃  error: connection refused, retries exhausted
┃
┣─ Checking related resources...
┃  postgres-0: OOMKilled at 03:09 (memory: 509Mi/512Mi limit)
┃  Service endpoints: 0/1 ready
┃
┣─ Querying Prometheus...
┃  Memory spike: 340% increase from batch job (started 01:15)
┃
┗─ Investigation complete (14.2s)

root_cause: "Postgres OOMKilled due to memory spike from overnight batch job. API gateway crash-looping because it cannot connect to the database."
confidence: 0.94
category: "resource_exhaustion"
fix_steps:
  1. Increase postgres memory limit to 1Gi
     kubectl patch sts postgres -n production -p '...'
  2. Restart api-gateway deployment
     kubectl rollout restart deploy/api-gateway -n production
  3. Reschedule batch job to off-peak hours
time_to_rca: 14.2 seconds
Architecture
Choose between cloud LLM or fully on-prem. Your data, your rules.
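For the fully on-prem path, the agent can point at an in-cluster model server instead of a cloud API. A sketch of what those values might look like with Ollama; the key names (llmProvider, llmBaseUrl, model) are assumptions, so check `helm show values` for the chart's actual schema:

```shell
# Hypothetical values for an air-gapped setup: local LLM via Ollama,
# in-cluster Prometheus. Key names other than agent.metricsEndpoint
# are assumptions about the chart schema.
cat > /tmp/klarsicht-values.yaml <<'EOF'
agent:
  llmProvider: ollama                              # hypothetical key
  llmBaseUrl: http://ollama.ollama.svc:11434       # hypothetical key
  model: llama3.1:70b                              # hypothetical key
  metricsEndpoint: http://prometheus.monitoring.svc:9090
EOF

# Applied the same way as the cloud install:
# helm upgrade klarsicht oci://registry.gitlab.com/outcept/klarsicht/helm/klarsicht \
#   --namespace klarsicht -f /tmp/klarsicht-values.yaml
```

No alert data or logs leave the cluster in this configuration.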
Integrations
Klarsicht connects to the tools already running in your cluster. No new agents, no sidecars, no data export.
Honest comparison
We're not for everyone. Here's where we win and where we don't.
| | Klarsicht | IncidentFox | Datadog AI | K8sGPT |
|---|---|---|---|---|
| Self-hosted / on-prem | ✓ | ✓ Open core | ✗ | ✓ |
| Data stays in-cluster | ✓ Always | SaaS or VPC | ✗ | ✓ |
| Reads pod logs + events | ✓ | ✓ | ✓ | ✓ |
| Queries Prometheus metrics | ✓ | ✓ | ✓ | ✗ |
| Correlates across resources | ✓ | ✓ | ✓ | ✗ |
| Generates postmortem | ✓ | ✓ | ✗ | ✗ |
| Slack-native | Dashboard + API | ✓ Built-in | ✓ | ✗ |
| Integrations | 4 live, 8 coming | 40+ tools | Full Datadog | K8s only |
| Full air-gap / on-prem LLM | ✓ Ollama/vLLM | ✗ | ✗ | ✓ |
| Free tier | ✓ Free to start | Free trial | ✗ $$$$ | ✓ Open source |
| Setup time | 5 minutes | 30 minutes | Days + sales call | 5 minutes |
| EU data residency | ✓ Your infra | VPC option | US-hosted | ✓ Your infra |
Battle-tested
We ran 81 failure scenarios across 9 categories on a live K3s cluster. Here's what happened.
| Category | Tests | Examples | Result |
|---|---|---|---|
| Missing env vars | 10 | DATABASE_URL, API_KEY, JWT_SECRET, REDIS_URL, KAFKA_BROKERS, S3_BUCKET, SMTP_HOST, MONGO_URI | ✓ 100% |
| Connection failures | 10 | API refused, DNS fail, HTTP timeout, gRPC unavailable, Redis, Elasticsearch, RabbitMQ, MySQL, SSL expired, Vault sealed | ✓ 100% |
| Auth & authorization | 10 | OAuth expired, AWS creds, K8s RBAC forbidden, LDAP bind, Stripe key revoked, mTLS mismatch, Firebase, GitHub rate limit | ✓ 100% |
| Data & schema errors | 10 | JSON parse, DB migration, Protobuf decode, Avro schema mismatch, malformed XML, encoding error, YAML syntax, GraphQL validation | ✓ 100% |
| Resource exhaustion | 10 | Python OOM, disk full, fd limit, thread limit, connection pool exhausted, Node.js heap OOM, ephemeral storage, shared memory | ✓ 100% |
| Application logic | 10 | Java NullPointer, ZeroDivision, Go nil pointer, stack overflow, Rust panic, Node.js unhandled promise, PostgreSQL deadlock | ✓ 100% |
| Dependency & version | 10 | Python ModuleNotFoundError, Node module missing, shared library, pip conflict, Java class version, npm peer dep, Go module 410 | ✓ 100% |
| Network & DNS | 10 | NXDOMAIN, network unreachable, TCP reset, envoy 502, mTLS handshake, service mesh, NetworkPolicy blocking, redirect loop | ✓ 100% |
| Startup & init | 1 | Port already in use | ✓ 100% |
We're collecting real-world incident data from early adopters. If you're running Klarsicht on your cluster, we'd love to feature your results here — anonymized, with your permission.
Get started
One Helm chart. Three environment variables. First investigation in under 30 minutes.
1. Install with Helm
# Install Klarsicht into your cluster
helm install klarsicht oci://registry.gitlab.com/outcept/klarsicht/helm/klarsicht \
--namespace klarsicht --create-namespace \
--set agent.llmApiKey=<your-anthropic-api-key> \
--set agent.metricsEndpoint=http://prometheus.monitoring.svc:9090
2. Point Grafana at Klarsicht
# Grafana → Alerting → Contact points → New
# Type: Webhook
# URL:
http://klarsicht-agent.klarsicht.svc:8000/alert
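If you manage Grafana declaratively, the same contact point can be created through Grafana's file-based alerting provisioning instead of the UI. A sketch (the name and uid are arbitrary; the file would live under Grafana's provisioning/alerting directory):

```shell
# Sketch of a Grafana alerting provisioning file declaring the
# Klarsicht webhook contact point. Name and uid are arbitrary.
cat > /tmp/klarsicht-contact-point.yaml <<'EOF'
apiVersion: 1
contactPoints:
  - orgId: 1
    name: klarsicht
    receivers:
      - uid: klarsicht-webhook
        type: webhook
        settings:
          url: http://klarsicht-agent.klarsicht.svc:8000/alert
EOF
```

Then route your existing notification policies at the `klarsicht` contact point; alert rules and thresholds stay untouched.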
3. Test the pipeline
# Deploy a pod that intentionally crashes (missing env var)
kubectl apply -f https://gitlab.com/outcept/klarsicht/-/raw/main/examples/test-crashloop.yaml
# Or send a mock alert directly
curl -X POST http://klarsicht-agent.klarsicht.svc:8000/test