RUNTIME RISK
ArgoCD Controller OOM After v3.3.2 Upgrade
What Happened
A team upgraded ArgoCD from v3.0.6 to v3.3.2 across a multi-cluster environment. The application controller's memory spiked from ~512 MiB to 1.27 GiB — 27% over the 1 GiB limit. The controller was OOMKilled, and every managed application entered an Unknown state. The team found out through alert noise and a post-mortem; no tool gave them a proactive signal before the memory limit was breached.
What ChangeGuard Shows
Runtime Score Drop: OOMKilled pod degrades the runtime category (−24 pts) before the outage is declared
Pod Stability Alert: Restart loop flagged with a count trend in the risk panel
Cross-Cluster Correlation: Instability correlated across all managed clusters in real time
Historical Decay: Repeated restarts degrade the historical category, creating a persistent signal even after recovery
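The runtime deductions above can be sketched as a simple additive penalty model. The flat −24 OOMKill penalty matches the number quoted here; everything else (function names, the restart-trend penalty, the floor at zero) is an illustrative assumption, not ChangeGuard's published algorithm.

```python
# Illustrative sketch of a runtime-category penalty model.
# Weights other than the −24 OOMKill penalty are assumptions.

OOM_KILL_PENALTY = 24      # matches the −24 pts quoted above
RESTART_LOOP_PENALTY = 5   # hypothetical per-restart-trend penalty

def runtime_score(base: int, oom_killed: bool, restart_trend: int) -> int:
    """Return the degraded runtime-category score, floored at 0."""
    score = base
    if oom_killed:
        score -= OOM_KILL_PENALTY
    score -= RESTART_LOOP_PENALTY * restart_trend
    return max(score, 0)

print(runtime_score(100, oom_killed=True, restart_trend=0))  # 76
```

The point of a flat, immediate penalty is that the score moves the moment the OOMKill lands, rather than waiting for an outage to be declared.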
CSC Score Timeline
81 · Before Upgrade · Stable
54 · Upgrade + OOMKill · Runtime −24 pts
46 · Post-OOM · Historical decay begins
"Your team found out about the OOM through alert noise and a post-mortem. ChangeGuard would have shown the runtime score dropping in real time — before the controller entered a restart loop."
RUNTIME · CLUSTER
Sharded ArgoCD Controller — Replica Imbalance at 500 Apps
What Happened
An 8-replica sharded ArgoCD controller (consistent-hashing, no dynamic distribution) was managing 500 Kustomize + Helm applications. Some shards sat near-idle while others hit the 3000 MiB memory limit. The overloaded replicas were OOMKilled, and every application on that shard lost its reconciliation loop. No single alert surfaced the imbalance before the crash — it was detected post-crash via pod logs.
What ChangeGuard Shows
Cluster Category Drop: Shard imbalance flagged as an infrastructure misconfiguration (−15 pts) before any replica crashes
Runtime Score Drop: OOMKilled replicas drive runtime down (−28 pts) per affected shard
Fleet View Impact: Apps on the failed shard show stale sync state — visible across all 500 apps
Historical Decay: Repeated shard failures create a persistent signal that survives recovery
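The imbalance has a simple mechanical cause: modulo-style hashing assigns whole clusters to replicas, so shard load follows the cluster distribution, not the app count. A minimal sketch (MD5 mod replica count — an illustration of the idea, not ArgoCD's exact hash function):

```python
# Sketch: static modulo-based sharding can leave some replicas idle
# while others carry entire clusters' worth of apps.
import hashlib
from collections import Counter

REPLICAS = 8

def shard_for(cluster_name: str) -> int:
    """Assign a cluster to one of REPLICAS shards by hashing its name."""
    digest = hashlib.md5(cluster_name.encode()).hexdigest()
    return int(digest, 16) % REPLICAS

# 500 apps spread over only 5 clusters: at most 5 of the 8 shards can
# ever receive work, and one big cluster pins all its apps to a single
# replica. Names are hypothetical.
apps = {f"app-{i}": f"cluster-{i % 5}" for i in range(500)}
load = Counter(shard_for(cluster) for cluster in apps.values())
print(dict(load))
```

With 5 clusters hashed into 8 shards, at least 3 replicas sit idle no matter how the hashes fall — which is exactly the near-idle/overloaded split described above.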
CSC Score Timeline
78 · Pre-Failure · Imbalance present
49 · Shard OOM · Runtime −28, Cluster −15
41 · Apps Stale · Fleet view shows impact
"At 500 applications with a sharded controller, you can't watch every replica. ChangeGuard surfaces the imbalance as a cluster-category risk before a shard OOMs — and when one goes down, you see exactly which apps lost their reconciliation loop."
FLEET MANAGEMENT
Hub-and-Spoke ArgoCD — Blind Spot on Remote Clusters
What Happened
A central ArgoCD instance managed 100+ remote clusters via a hub-and-spoke model. When a network partition disconnected the hub from three spoke clusters, ArgoCD continued showing "Synced" for all applications on those clusters. In reality, the clusters had drifted — deployments were failing and pods were in CrashLoopBackOff. The team discovered it 40 minutes later through customer-reported errors.
What ChangeGuard Shows
Independent Agent Monitoring: ChangeGuard agents run inside each spoke cluster — they detect issues even when ArgoCD's hub loses connectivity
Cross-Cluster Intelligence: Dashboard correlates risk across all 100+ clusters and flags the disconnected ones immediately
Runtime Degradation: CrashLoopBackOff pods surface as runtime risks within 10 seconds of an agent snapshot
Slack Alert: Score drop triggers a notification before any customer reports the issue
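The independent-agent detection above reduces to a staleness check on per-spoke snapshot timestamps: a spoke whose agent stops reporting is flagged regardless of what the hub believes. A minimal sketch, assuming a hypothetical 30-second reporting budget:

```python
# Sketch: detect partitioned spokes by comparing each agent's last
# snapshot time against a staleness budget. Field names and the 30 s
# budget are assumptions for illustration.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=30)

def stale_spokes(last_snapshot: dict, now: datetime) -> list:
    """Spokes whose agents have not reported within the budget."""
    return [name for name, ts in last_snapshot.items()
            if now - ts > STALE_AFTER]

now = datetime.now(timezone.utc)
snapshots = {
    "spoke-1": now - timedelta(seconds=5),   # healthy
    "spoke-2": now - timedelta(minutes=4),   # partitioned
    "spoke-3": now - timedelta(minutes=4),   # partitioned
}
print(stale_spokes(snapshots, now))  # ['spoke-2', 'spoke-3']
```

Because the timestamps come from agents inside each spoke, the check keeps working when the hub's view is frozen on "Synced".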
CSC Score Timeline
92 · Normal · Hub connected to all spokes
61 · Partition · 3 spoke clusters drifting
38 · CrashLoop · Customer impact begins
"ArgoCD showed green across the board. ChangeGuard's independent agents saw the reality: three clusters in trouble, pods crashing, and customers about to feel it. That's the value of monitoring that doesn't depend on your CD tool's connectivity."
POLICY COMPLIANCE
Missing Network Policies — Silent Security Regression
What Happened
A new microservice was deployed without a NetworkPolicy. It worked fine functionally, so no alerts fired. But the pod had unrestricted network access to every other service in the namespace — including the database. A routine security audit caught it 3 weeks later. During that window, any compromised container could have moved laterally.
What ChangeGuard Shows
Policy Score Drop: Missing NetworkPolicy detected as a policy compliance violation (−8 pts) on the very first snapshot
Change Analysis: New deployment flagged in Change Analysis with a "no network policy" risk label
Persistent Signal: Score stays degraded until the policy is applied — it doesn't auto-resolve
Audit Trail: The compliance gap is recorded with timestamp, cluster, and namespace for SOC 2 evidence
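The first-snapshot check can be approximated by asking, for each workload, whether any NetworkPolicy in its namespace selects the pod's labels. A simplified sketch (matchLabels subsets only; matchExpressions and ingress/egress direction ignored; all names hypothetical):

```python
# Sketch: flag pods whose labels are not selected by any NetworkPolicy
# in the namespace. Matching is simplified to matchLabels subsets.

def selects(policy_match: dict, pod_labels: dict) -> bool:
    """True if every matchLabels entry is present on the pod.
    An empty matchLabels selects all pods (as in Kubernetes)."""
    return all(pod_labels.get(k) == v for k, v in policy_match.items())

def unprotected(pods: list, policies: list) -> list:
    """Names of pods not covered by any policy selector."""
    return [pod["name"] for pod in pods
            if not any(selects(p["matchLabels"], pod["labels"])
                       for p in policies)]

pods = [
    {"name": "checkout", "labels": {"app": "checkout"}},
    {"name": "new-svc",  "labels": {"app": "new-svc"}},  # deployed without a policy
]
policies = [{"matchLabels": {"app": "checkout"}}]
print(unprotected(pods, policies))  # ['new-svc']
```

A namespace-wide default-deny policy (empty selector) would cover every pod, which is why the empty-matchLabels case must select everything.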
CSC Score Timeline
95 · Before Deploy · All policies in place
87 · Deploy · Missing NetworkPolicy flagged
95 · Remediated · Policy applied, score restored
"Your security audit found it 3 weeks later. ChangeGuard would have flagged it on the first snapshot — 10 seconds after the deployment landed."
RUNTIME SECURITY
Cryptominer Process Detected in Production Pod
What Happened
A CI pipeline dependency was compromised with a supply chain attack. The malicious package executed a cryptominer binary inside the container after deployment. CPU spiked to 100%, but on a large node, the pod didn't OOM or trigger HPA — it just silently consumed resources. The team noticed higher-than-expected AWS bills two weeks later.
What ChangeGuard Shows
Falco Alert (Critical): "Terminal shell in container" and "Launch Ingress Remote File Copy Tool" — flagged within seconds of execution. Rule: unexpected process spawned in a container that should only run the app binary.
CSC Score Drop (−16 pts): Two critical Falco alerts (−8 each) immediately degrade the Runtime Signals category
Slack/Teams Alert: Notification fires with pod name, namespace, container image, and the exact process that triggered the rule
Image → SBOM Correlation: Syft SBOM shows the compromised package in the image's dependency tree. Grype CVE scan confirms a known vulnerability in that library version.
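The image → SBOM correlation step is essentially a lookup: given Syft-style SBOM documents (JSON with an "artifacts" list of name/version entries), find every image that ships the compromised package. Package and image names below are hypothetical:

```python
# Sketch: search Syft-style SBOMs for a compromised package to find
# which images are affected. Names are hypothetical.

def images_with_package(sboms: dict, pkg: str) -> list:
    """Map of image name -> SBOM dict; return images shipping `pkg`."""
    return [image for image, sbom in sboms.items()
            if any(a["name"] == pkg for a in sbom.get("artifacts", []))]

sboms = {
    "registry.local/api:v2": {
        "artifacts": [{"name": "left-pad", "version": "1.3.0"},
                      {"name": "evil-miner-dep", "version": "0.0.9"}],
    },
    "registry.local/web:v7": {
        "artifacts": [{"name": "left-pad", "version": "1.3.0"}],
    },
}
print(images_with_package(sboms, "evil-miner-dep"))  # ['registry.local/api:v2']
```

The same lookup, run fleet-wide, answers the follow-up question a supply chain incident always raises: which other images pulled the same dependency?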
CSC Score Timeline
92 · Before Deploy · Cluster healthy
61 · Post-Deploy · Falco: critical process alerts
91 · Remediated · Image rebuilt with patched dependency
"Your AWS bill told you 2 weeks later. ChangeGuard's Falco integration would have flagged the cryptominer process in under 30 seconds — before it mined a single block."
IDENTITY RISK
Deploy Bot ServiceAccount Has cluster-admin — Nobody Noticed
What Happened
During initial cluster setup, the CI/CD deploy bot ServiceAccount was bound to cluster-admin for convenience. The plan was to scope it down later. That was 8 months ago. The SA now runs in 4 namespaces, is used by 12 pods, and has full unrestricted access to every resource in the cluster — including secrets containing database credentials, API keys, and TLS certificates. Any compromised pod using this SA can take over the entire cluster.
What ChangeGuard Shows
Identity Risk Score 100 (Critical): deploy-bot SA scores maximum risk: cluster-admin (+50), cluster blast radius (+20), secrets access (+10), wildcard permissions (+15)
Blast Radius (Cluster): 12 pods across 4 namespaces use this SA. If any one is compromised, the attacker gets cluster-admin.
Escalation Path Detected: This SA can create ClusterRoleBindings — meaning it can grant cluster-admin to any other identity
Remediation Guidance: "Create a scoped Role with only the permissions needed. Remove the cluster-admin binding. Use separate SAs per namespace."
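The additive scoring above can be sketched as weighted risk factors with a cap at 100. The four quoted weights come from the text; the escalation-path weight and the cap itself are illustrative assumptions, not ChangeGuard's exact model:

```python
# Sketch: additive identity-risk scoring, capped at 100. The first
# four weights mirror the numbers quoted above; the escalation-path
# weight and the cap are assumptions.

WEIGHTS = {
    "cluster_admin": 50,
    "cluster_blast_radius": 20,
    "secrets_access": 10,
    "wildcard_permissions": 15,
    "can_create_clusterrolebindings": 25,  # hypothetical escalation factor
}

def identity_risk(factors: set) -> int:
    """Sum the weights of the identity's risk factors, capped at 100."""
    return min(sum(WEIGHTS[f] for f in factors), 100)

deploy_bot = {"cluster_admin", "cluster_blast_radius",
              "secrets_access", "wildcard_permissions",
              "can_create_clusterrolebindings"}
print(identity_risk(deploy_bot))  # 100 (capped)
```

Capping keeps a maximally over-privileged identity pinned at the top of the list instead of letting stacked factors inflate past the scale.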
CSC Score Timeline
72 · Current · cluster-admin SA drags the Policy score
88 · After Scoping · Dedicated roles per namespace
94 · Fully Remediated · SA per workload, no wildcards
"8 months of cluster-admin exposure — because nobody audits RBAC manually. ChangeGuard's identity risk graph catches it on the first snapshot and scores every identity by blast radius."
See it in your cluster
14-day free trial. One cluster. No credit card. Install the agent in 60 seconds and see your first CSC score immediately.
Start Free Trial →