Klarsicht is an AI agent that receives your Grafana alerts, inspects your cluster, and delivers root cause analysis before you open a terminal.
The problem
The fix takes 5 minutes. Finding the root cause takes 2 hours. You already know this.
How it works
Works with your existing stack. No rip-and-replace.
Grafana sends a webhook to Klarsicht. Your existing alert rules, thresholds, and routing stay exactly the same.
Reads pod status, container logs, and K8s events, and queries Prometheus metrics. Like a senior SRE, but in 30 seconds.
Root cause, confidence score, fix steps with kubectl commands, and a postmortem draft. Copy as markdown, paste anywhere.
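The webhook body Grafana sends follows its unified-alerting format. A minimal sketch of such a payload, with illustrative label values borrowed from the example below (the agent URL matches the one used in Get started):

```shell
# Sketch of a Grafana-style alert payload. Field names follow Grafana's
# unified alerting webhook format; the label values are illustrative.
cat > /tmp/mock-alert.json <<'EOF'
{
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "CrashLoopBackOff",
        "namespace": "production",
        "pod": "api-gateway-7f8b9c-x2k9p"
      },
      "annotations": {
        "summary": "Pod is crash-looping"
      },
      "startsAt": "2024-01-01T03:10:00Z"
    }
  ]
}
EOF

# In-cluster, this is what gets POSTed to the agent:
# curl -X POST -H 'Content-Type: application/json' \
#   --data @/tmp/mock-alert.json \
#   http://klarsicht-agent.klarsicht.svc:8000/alert
```

Klarsicht picks the pod and namespace out of the alert labels and starts the investigation from there.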
See it work
This is what appears in your dashboard when a CrashLoopBackOff fires.
╭─ Alert received: CrashLoopBackOff
┃  namespace: production  pod: api-gateway-7f8b9c-x2k9p
┃
┣─ Inspecting pod...
┃  State: CrashLoopBackOff (7 restarts in 12m)
┃  Exit code: 1
┃
┣─ Reading logs (last 200 lines)...
┃  FATAL: failed to connect to postgres:5432
┃  error: connection refused, retries exhausted
┃
┣─ Checking related resources...
┃  postgres-0: OOMKilled at 03:09 (memory: 509Mi/512Mi limit)
┃  Service endpoints: 0/1 ready
┃
┣─ Querying Prometheus...
┃  Memory spike: 340% increase from batch job (started 01:15)
┃
┗─ Investigation complete (14.2s)

root_cause: "Postgres OOMKilled due to memory spike from overnight batch job. API gateway crash-looping because it cannot connect to the database."
confidence: 0.94
category: "resource_exhaustion"
fix_steps:
  1. Increase postgres memory limit to 1Gi
     kubectl patch sts postgres -n production -p '...'
  2. Restart api-gateway deployment
     kubectl rollout restart deploy/api-gateway -n production
  3. Reschedule batch job to off-peak hours
time_to_rca: 14.2 seconds
Architecture
Choose between cloud LLM or fully on-prem. Your data, your rules.
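For the fully on-prem path, the agent can point at an in-cluster model server instead of a cloud API. A sketch of what those values might look like with Ollama; the key names (llmProvider, llmBaseUrl, model) are assumptions, so check `helm show values` for the chart's actual schema:

```shell
# Hypothetical values for an air-gapped setup: local LLM via Ollama,
# in-cluster Prometheus. Key names other than agent.metricsEndpoint
# are assumptions about the chart schema.
cat > /tmp/klarsicht-values.yaml <<'EOF'
agent:
  llmProvider: ollama                              # hypothetical key
  llmBaseUrl: http://ollama.ollama.svc:11434       # hypothetical key
  model: llama3.1:70b                              # hypothetical key
  metricsEndpoint: http://prometheus.monitoring.svc:9090
EOF

# Applied the same way as the cloud install:
# helm upgrade klarsicht oci://registry.gitlab.com/outcept/klarsicht/helm/klarsicht \
#   --namespace klarsicht -f /tmp/klarsicht-values.yaml
```

No alert data or logs leave the cluster in this configuration.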
Integrations
Klarsicht connects to the tools already running in your cluster. No new agents, no sidecars, no data export.
Honest comparison
We're not for everyone. Here's where we win and where we don't.
| | Klarsicht | IncidentFox | Datadog AI | K8sGPT |
|---|---|---|---|---|
| Self-hosted / on-prem | ✓ | ✓ Open core | ✗ | ✓ |
| Data stays in-cluster | ✓ Always | SaaS or VPC | ✗ | ✓ |
| Reads pod logs + events | ✓ | ✓ | ✓ | ✓ |
| Queries Prometheus metrics | ✓ | ✓ | ✓ | ✗ |
| Correlates across resources | ✓ | ✓ | ✓ | ✗ |
| Generates postmortem | ✓ | ✓ | ✗ | ✗ |
| Slack-native | Dashboard + API | ✓ Built-in | ✓ | ✗ |
| Integrations | 4 live, 8 coming | 40+ tools | Full Datadog | K8s only |
| Full air-gap / on-prem LLM | ✓ Ollama/vLLM | ✗ | ✗ | ✓ |
| Free tier | ✓ Free to start | Free trial | ✗ $$$$ | ✓ Open source |
| Setup time | 5 minutes | 30 minutes | Days + sales call | 5 minutes |
| EU data residency | ✓ Your infra | VPC option | US-hosted | ✓ Your infra |
Battle-tested
We ran 81 failure scenarios across 9 categories on a live K3s cluster. Here's what happened.
| Category | Tests | Examples | Result |
|---|---|---|---|
| Missing env vars | 10 | DATABASE_URL, API_KEY, JWT_SECRET, REDIS_URL, KAFKA_BROKERS, S3_BUCKET, SMTP_HOST, MONGO_URI | ✓ 100% |
| Connection failures | 10 | API refused, DNS fail, HTTP timeout, gRPC unavailable, Redis, Elasticsearch, RabbitMQ, MySQL, SSL expired, Vault sealed | ✓ 100% |
| Auth & authorization | 10 | OAuth expired, AWS creds, K8s RBAC forbidden, LDAP bind, Stripe key revoked, mTLS mismatch, Firebase, GitHub rate limit | ✓ 100% |
| Data & schema errors | 10 | JSON parse, DB migration, Protobuf decode, Avro schema mismatch, malformed XML, encoding error, YAML syntax, GraphQL validation | ✓ 100% |
| Resource exhaustion | 10 | Python OOM, disk full, fd limit, thread limit, connection pool exhausted, Node.js heap OOM, ephemeral storage, shared memory | ✓ 100% |
| Application logic | 10 | Java NullPointer, ZeroDivision, Go nil pointer, stack overflow, Rust panic, Node.js unhandled promise, PostgreSQL deadlock | ✓ 100% |
| Dependency & version | 10 | Python ModuleNotFoundError, Node module missing, shared library, pip conflict, Java class version, npm peer dep, Go module 410 | ✓ 100% |
| Network & DNS | 10 | NXDOMAIN, network unreachable, TCP reset, envoy 502, mTLS handshake, service mesh, NetworkPolicy blocking, redirect loop | ✓ 100% |
| Startup & init | 1 | Port already in use | ✓ 100% |
We're collecting real-world incident data from early adopters. If you're running Klarsicht on your cluster, we'd love to feature your results here — anonymized, with your permission.
Get started
One Helm chart. Three environment variables. First investigation in under 30 minutes.
1. Install with Helm
# Install Klarsicht into your cluster
helm install klarsicht oci://registry.gitlab.com/outcept/klarsicht/helm/klarsicht \
--namespace klarsicht --create-namespace \
--set agent.llmApiKey=<your-anthropic-api-key> \
--set agent.metricsEndpoint=http://prometheus.monitoring.svc:9090
2. Point Grafana at Klarsicht
# Grafana → Alerting → Contact points → New
# Type: Webhook
# URL:
http://klarsicht-agent.klarsicht.svc:8000/alert
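If you manage Grafana declaratively, the same contact point can be created through Grafana's file-based alerting provisioning instead of the UI. A sketch (the name and uid are arbitrary; the file would live under Grafana's provisioning/alerting directory):

```shell
# Sketch of a Grafana alerting provisioning file declaring the
# Klarsicht webhook contact point. Name and uid are arbitrary.
cat > /tmp/klarsicht-contact-point.yaml <<'EOF'
apiVersion: 1
contactPoints:
  - orgId: 1
    name: klarsicht
    receivers:
      - uid: klarsicht-webhook
        type: webhook
        settings:
          url: http://klarsicht-agent.klarsicht.svc:8000/alert
EOF
```

Then route your existing notification policies at the `klarsicht` contact point; alert rules and thresholds stay untouched.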
3. Test the pipeline
# Deploy a pod that intentionally crashes (missing env var)
kubectl apply -f https://gitlab.com/outcept/klarsicht/-/raw/main/examples/test-crashloop.yaml
# Or send a mock alert directly
curl -X POST http://klarsicht-agent.klarsicht.svc:8000/test