Self-hosted · Your infrastructure · Your data

Your next K8s incident,
diagnosed in 60 seconds

Klarsicht is an AI agent that receives your Grafana alerts, inspects your cluster, and delivers root cause analysis before you open a terminal.

<60s
Time to RCA
1
Helm chart
0
Raw data exported

The problem

80% of incident time is investigation, not fixing

The fix takes 5 minutes. Finding the root cause takes 2 hours. You already know this.

Without Klarsicht

03:12 · PagerDuty fires. CrashLoopBackOff in production.
03:15 · SSH into bastion. Switch kubecontext. Which namespace?
03:24 · kubectl logs. kubectl describe. Scroll through walls of text.
03:41 · Open Grafana. Compare dashboards. Was it the deploy?
04:08 · Message a colleague. They're asleep.
05:15 · Finally find root cause. Write postmortem tomorrow.
~2 hours

With Klarsicht

03:12 · PagerDuty fires. Grafana sends webhook to Klarsicht.
03:12 · Agent inspects pod, reads logs, checks events, queries metrics.
03:13 · Root cause identified: Postgres OOMKilled, memory limit too low.
03:13 · Fix steps delivered with kubectl commands. Postmortem drafted.
03:18 · Fix applied. Incident resolved. Back to sleep.
<6 minutes

How it works

One Helm chart. One webhook. One Slack channel.

Works with your existing stack. No rip-and-replace.

01

Alert fires

Grafana sends a webhook to Klarsicht. Your existing alert rules, thresholds, and routing stay exactly the same.
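What arrives at Klarsicht is whatever Grafana's webhook contact point emits. A minimal sketch of such a payload, trimmed to the fields an investigation needs (the alert values themselves are illustrative):

```python
import json

# Sketch of a Grafana webhook contact-point payload, trimmed to the
# fields that matter here; the alert values are illustrative.
payload = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "KubePodCrashLooping",
            "namespace": "production",
            "pod": "api-gateway-7f8b9c-x2k9p",
        },
        "annotations": {"summary": "Pod is crash looping"},
        "startsAt": "2025-06-01T03:12:00Z",
    }],
}

body = json.dumps(payload)
# Grafana delivers this as: POST http://klarsicht-agent.klarsicht.svc:8000/alert
```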

02

Agent investigates

Reads pod status, container logs, and K8s events, and queries Prometheus metrics. Like a senior SRE, but in 30 seconds.
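The step above can be sketched as a tiny loop: run a tool, record the observation, hand the observations to the reasoner. The tool names and the fixed plan below are hypothetical stand-ins; the real agent picks its next action with an LLM (ReAct-style) rather than following a static list.

```python
# Hypothetical tool stubs standing in for read-only K8s API calls.
def get_pod_status(pod: str) -> str:
    return "CrashLoopBackOff (7 restarts in 12m), exit code 1"

def get_logs(pod: str) -> str:
    return "FATAL: failed to connect to postgres:5432"

TOOLS = {"get_pod_status": get_pod_status, "get_logs": get_logs}

def investigate(pod: str, plan=("get_pod_status", "get_logs")):
    """Run each tool and collect (tool, observation) pairs for the reasoner."""
    return [(name, TOOLS[name](pod)) for name in plan]

obs = investigate("api-gateway-7f8b9c-x2k9p")
```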

03

RCA delivered

Root cause, confidence score, fix steps with kubectl commands, and a postmortem draft. Copy as markdown, paste anywhere.
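A sketch of how such a result could be rendered to markdown for pasting into a postmortem. The field names (root_cause, confidence, fix_steps) mirror the sample output on this page; the renderer itself is illustrative:

```python
# Illustrative renderer: RCA dict -> markdown snippet for a postmortem.
def rca_to_markdown(rca: dict) -> str:
    lines = [
        "## Root cause",
        rca["root_cause"],
        f"**Confidence:** {rca['confidence']:.0%}",
        "## Fix steps",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(rca["fix_steps"], 1)]
    return "\n".join(lines)

md = rca_to_markdown({
    "root_cause": "Postgres OOMKilled; api-gateway cannot reach the database.",
    "confidence": 0.94,
    "fix_steps": ["Raise postgres memory limit to 1Gi",
                  "Restart the api-gateway deployment"],
})
```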

See it work

Real output from a real incident

This is what appears in your dashboard when a CrashLoopBackOff fires.

klarsicht — investigation
╭─ Alert received: CrashLoopBackOff
┃  namespace: production  pod: api-gateway-7f8b9c-x2k9p

┣─ Inspecting pod...
┃  State: CrashLoopBackOff (7 restarts in 12m)
┃  Exit code: 1

┣─ Reading logs (last 200 lines)...
┃  FATAL: failed to connect to postgres:5432
┃    error: connection refused, retries exhausted

┣─ Checking related resources...
┃  postgres-0: OOMKilled at 03:09 (memory: 509Mi/512Mi limit)
┃  Service endpoints: 0/1 ready

┣─ Querying Prometheus...
┃  Memory spike: 340% increase from batch job (started 01:15)

┗─ Investigation complete (14.2s)

root_cause:
  "Postgres OOMKilled due to memory spike from overnight
   batch job. API gateway crash-looping because it cannot
   connect to the database."

confidence: 0.94
category:  "resource_exhaustion"

fix_steps:
  1. Increase postgres memory limit to 1Gi
     kubectl patch sts postgres -n production -p '...'
  2. Restart api-gateway deployment
     kubectl rollout restart deploy/api-gateway -n production
  3. Reschedule batch job to off-peak hours

time_to_rca: 14.2 seconds
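Spelled out, the patch command in fix step 1 might look like the following sketch; the patch body and the 1Gi value are illustrative, sized up from the 512Mi limit the pod was hitting.

```python
import json

# Illustrative patch body raising the postgres memory limit to 1Gi.
patch = {"spec": {"template": {"spec": {"containers": [
    {"name": "postgres", "resources": {"limits": {"memory": "1Gi"}}},
]}}}}

cmd = ("kubectl patch sts postgres -n production --type merge -p '"
       + json.dumps(patch) + "'")
```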

Architecture

Two deployment models. Same intelligence.

Choose between cloud LLM or fully on-prem. Your data, your rules.

[Diagram: inside your Kubernetes cluster, Grafana sends the alert webhook to the Klarsicht Agent (ReAct reasoning loop), which reads the K8s API (pods, logs, events), queries Prometheus (PromQL metrics), and stores RCAs in PostgreSQL. LLM reasoning goes to an external LLM API provider.]
Agent Mode — Klarsicht runs in your cluster with read-only K8s and Prometheus access. Only investigation context (pod names, log snippets, metrics) is sent to the LLM for reasoning. No raw data export. Best for fast setup and top-tier reasoning quality.
[Diagram: the same pipeline, fully air-gapped. Grafana, the Klarsicht Agent, K8s API, Prometheus, and PostgreSQL as above, with LLM reasoning served by a local LLM (Ollama / vLLM) inside the cluster.]
Full On-Prem — Zero external calls. The LLM runs inside your cluster via Ollama, vLLM, or any OpenAI-compatible endpoint. Nothing leaves your network. Ideal for air-gapped environments and regulated industries (FINMA, BaFin, GDPR).
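"Any OpenAI-compatible endpoint" means the agent only swaps its base URL. A sketch of the request it would send to an in-cluster Ollama service; the service hostname and model name below are assumptions for illustration, not shipped defaults.

```python
# Hypothetical in-cluster Ollama service exposing the OpenAI-compatible API.
base_url = "http://ollama.klarsicht.svc:11434/v1"
endpoint = base_url + "/chat/completions"  # standard OpenAI-compatible route

request_body = {
    "model": "llama3.1:70b",  # whatever model your cluster serves
    "messages": [
        {"role": "system", "content": "You are an SRE diagnosing a K8s incident."},
        {"role": "user", "content": "api-gateway is in CrashLoopBackOff."},
    ],
}
# POSTing request_body to `endpoint` stays entirely in-cluster: the
# hostname resolves to a Service, so no traffic leaves your network.
```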

Integrations

Plugs into your existing stack

Klarsicht connects to the tools already running in your cluster. No new agents, no sidecars, no data export.

Live
Kubernetes
Pods, logs, events, deployments, nodes, replicasets. Read-only via K8s API.
Live
Prometheus
PromQL range & instant queries. CPU, memory, error rate, custom metrics.
Live
Mimir
Grafana Mimir compatible. Same PromQL interface, long-term storage.
Live
Grafana
Webhook contact point. Alert history, one-click setup via API.
Coming soon
Loki
LogQL queries across all pods. Correlate logs with metrics during investigation.
Coming soon
Tempo
Distributed trace lookup for latency and error investigations.
Coming soon
Slack
Post RCA summaries to incident channels. Interactive thread updates.
Coming soon
ArgoCD
Recent syncs, rollout history. Correlate deployments with incidents.
Coming soon
Cert-Manager
Certificate expiry checks. Detect TLS-related failures automatically.
Coming soon
Cilium / Hubble
Network flow logs. Identify NetworkPolicy blocks and DNS failures.
Coming soon
Flux
GitOps reconciliation status. Source and kustomization health.
Coming soon
PagerDuty
Bi-directional incident sync. Auto-acknowledge when RCA is delivered.
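Under the hood, the Prometheus and Mimir integrations reduce to the standard HTTP query API. A sketch of building an instant query for pod memory; the hostname matches the metricsEndpoint used at install time, and the metric and selector are illustrative.

```python
from urllib.parse import urlencode

# Instant query against the standard Prometheus HTTP API.
base = "http://prometheus.monitoring.svc:9090"
params = {
    "query": 'container_memory_working_set_bytes{namespace="production",pod=~"postgres-.*"}',
    "time": "2025-06-01T03:12:00Z",  # evaluate at alert time
}
url = f"{base}/api/v1/query?{urlencode(params)}"
# A range query swaps /query for /query_range and adds start/end/step.
```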

Honest comparison

How Klarsicht stacks up

We're not for everyone. Here's where we win and where we don't.

| | Klarsicht | IncidentFox | Datadog AI | K8sGPT |
| Self-hosted / on-prem | ✓ Open core | | | |
| Data stays in-cluster | ✓ Always | SaaS or VPC | | |
| Reads pod logs + events | ✓ | | | |
| Queries Prometheus metrics | ✓ | | | |
| Correlates across resources | ✓ | | | |
| Generates postmortem | ✓ | | | |
| Slack-native | Dashboard + API | ✓ Built-in | | |
| Integrations | 4 live, 8 coming | 40+ tools | Full Datadog | K8s only |
| Full air-gap / on-prem LLM | ✓ Ollama/vLLM | | | |
| Free tier | ✓ Free to start | Free trial | ✗ $$$$ | ✓ Open source |
| Setup time | 5 minutes | 30 minutes | Days + sales call | 5 minutes |
| EU data residency | ✓ Your infra | VPC option | US-hosted | ✓ Your infra |

Battle-tested

154 incidents. 94.5% average confidence.

We ran 81 failure scenarios across nine categories on a live K3s cluster. Here's what happened.

133
Incidents resolved
94.5%
Avg confidence
38
Unique pods diagnosed
85-100%
Confidence range

Failure scenarios tested

| Category | Tests | Examples | Result |
| Missing env vars | 10 | DATABASE_URL, API_KEY, JWT_SECRET, REDIS_URL, KAFKA_BROKERS, S3_BUCKET, SMTP_HOST, MONGO_URI | 100% |
| Connection failures | 10 | API refused, DNS fail, HTTP timeout, gRPC unavailable, Redis, Elasticsearch, RabbitMQ, MySQL, SSL expired, Vault sealed | 100% |
| Auth & authorization | 10 | OAuth expired, AWS creds, K8s RBAC forbidden, LDAP bind, Stripe key revoked, mTLS mismatch, Firebase, GitHub rate limit | 100% |
| Data & schema errors | 10 | JSON parse, DB migration, Protobuf decode, Avro schema mismatch, malformed XML, encoding error, YAML syntax, GraphQL validation | 100% |
| Resource exhaustion | 10 | Python OOM, disk full, fd limit, thread limit, connection pool exhausted, Node.js heap OOM, ephemeral storage, shared memory | 100% |
| Application logic | 10 | Java NullPointer, ZeroDivision, Go nil pointer, stack overflow, Rust panic, Node.js unhandled promise, PostgreSQL deadlock | 100% |
| Dependency & version | 10 | Python ModuleNotFoundError, Node module missing, shared library, pip conflict, Java class version, npm peer dep, Go module 410 | 100% |
| Network & DNS | 10 | NXDOMAIN, network unreachable, TCP reset, envoy 502, mTLS handshake, service mesh, NetworkPolicy blocking, redirect loop | 100% |
| Startup & init | 1 | Port already in use | 100% |

Root cause categories identified

Misconfiguration
106
Dependency failure
14
Deployment issue
6
Resource exhaustion
4
Coming soon

User-submitted test results

We're collecting real-world incident data from early adopters. If you're running Klarsicht on your cluster, we'd love to feature your results here — anonymized, with your permission.

Get started

Deploy in five minutes

One Helm chart. Three environment variables. First investigation in under 30 minutes.

1. Install with Helm

# Install Klarsicht into your cluster
helm install klarsicht oci://registry.gitlab.com/outcept/klarsicht/helm/klarsicht \
  --namespace klarsicht --create-namespace \
  --set agent.llmApiKey=<your-anthropic-api-key> \
  --set agent.metricsEndpoint=http://prometheus.monitoring.svc:9090

2. Point Grafana at Klarsicht

# Grafana → Alerting → Contact points → New
# Type: Webhook
# URL: http://klarsicht-agent.klarsicht.svc:8000/alert

3. Test the pipeline

# Deploy a pod that intentionally crashes (missing env var)
kubectl apply -f https://gitlab.com/outcept/klarsicht/-/raw/main/examples/test-crashloop.yaml

# Or send a mock alert directly
curl -X POST http://klarsicht-agent.klarsicht.svc:8000/test