RUNTIME RISK
ArgoCD Controller OOM After v3.3.2 Upgrade
What Happened
A team upgraded ArgoCD from v3.0.6 to v3.3.2 across a multi-cluster environment. The application controller's memory spiked from ~512 MiB to 1.27 GiB — 27% over the 1 GiB limit. The controller was OOMKilled, and every managed application entered an Unknown state. The team found out through alert noise and a post-mortem; no tool gave them a proactive signal before the memory limit was breached.
What ChangeGuard Shows
Runtime Score Drop: OOMKilled pod degrades the runtime category (−24 pts) before the outage is declared
Pod Stability Alert: Restart loop flagged with a count trend in the risk panel
Cross-Cluster Correlation: Instability correlated across all managed clusters in real time
Historical Decay: Repeated restarts degrade the historical category, creating a persistent signal even after recovery
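The runtime deductions above can be sketched as a simple additive penalty model. The flat −24 OOMKill penalty matches the number quoted here; everything else (function names, the restart-trend penalty, the floor at zero) is an illustrative assumption, not ChangeGuard's published algorithm.

```python
# Illustrative sketch of a runtime-category penalty model.
# Weights other than the −24 OOMKill penalty are assumptions.

OOM_KILL_PENALTY = 24      # matches the −24 pts quoted above
RESTART_LOOP_PENALTY = 5   # hypothetical per-restart-trend penalty

def runtime_score(base: int, oom_killed: bool, restart_trend: int) -> int:
    """Return the degraded runtime-category score, floored at 0."""
    score = base
    if oom_killed:
        score -= OOM_KILL_PENALTY
    score -= RESTART_LOOP_PENALTY * restart_trend
    return max(score, 0)

print(runtime_score(100, oom_killed=True, restart_trend=0))  # 76
```

The point of a flat, immediate penalty is that the score moves the moment the OOMKill lands, rather than waiting for an outage to be declared.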
CSC Score Timeline
81 · Before Upgrade · Stable
54 · Upgrade + OOMKill · Runtime −24 pts
46 · Post-OOM · Historical decay begins
"Your team found out about the OOM through alert noise and a post-mortem. ChangeGuard would have shown the runtime score dropping in real time — before the controller entered a restart loop."
RUNTIME · CLUSTER
Sharded ArgoCD Controller — Replica Imbalance at 500 Apps
What Happened
An 8-replica sharded ArgoCD controller (consistent-hashing, no dynamic distribution) was managing 500 Kustomize + Helm applications. Some shards sat near-idle while others hit the 3000 MiB memory limit. The overloaded replicas were OOMKilled, and every application on that shard lost its reconciliation loop. No single alert surfaced the imbalance before the crash — it was detected post-crash via pod logs.
What ChangeGuard Shows
Cluster Category Drop: Shard imbalance flagged as an infrastructure misconfiguration (−15 pts) before any replica crashes
Runtime Score Drop: OOMKilled replicas drive runtime down (−28 pts) per affected shard
Fleet View Impact: Apps on the failed shard show stale sync state — visible across all 500 apps
Historical Decay: Repeated shard failures create a persistent signal that survives recovery
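The imbalance has a simple mechanical cause: modulo-style hashing assigns whole clusters to replicas, so shard load follows the cluster distribution, not the app count. A minimal sketch (MD5 mod replica count — an illustration of the idea, not ArgoCD's exact hash function):

```python
# Sketch: static modulo-based sharding can leave some replicas idle
# while others carry entire clusters' worth of apps.
import hashlib
from collections import Counter

REPLICAS = 8

def shard_for(cluster_name: str) -> int:
    """Assign a cluster to one of REPLICAS shards by hashing its name."""
    digest = hashlib.md5(cluster_name.encode()).hexdigest()
    return int(digest, 16) % REPLICAS

# 500 apps spread over only 5 clusters: at most 5 of the 8 shards can
# ever receive work, and one big cluster pins all its apps to a single
# replica. Names are hypothetical.
apps = {f"app-{i}": f"cluster-{i % 5}" for i in range(500)}
load = Counter(shard_for(cluster) for cluster in apps.values())
print(dict(load))
```

With 5 clusters hashed into 8 shards, at least 3 replicas sit idle no matter how the hashes fall — which is exactly the near-idle/overloaded split described above.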
CSC Score Timeline
78 · Pre-Failure · Imbalance present
49 · Shard OOM · Runtime −28, Cluster −15
41 · Apps Stale · Fleet view shows impact
"At 500 applications with a sharded controller, you can't watch every replica. ChangeGuard surfaces the imbalance as a cluster-category risk before a shard OOMs — and when one goes down, you see exactly which apps lost their reconciliation loop."
FLEET MANAGEMENT
Hub-and-Spoke ArgoCD — Blind Spot on Remote Clusters
What Happened
A central ArgoCD instance managed 100+ remote clusters via a hub-and-spoke model. When a network partition disconnected the hub from three spoke clusters, ArgoCD continued showing "Synced" for all applications on those clusters. In reality, the clusters had drifted — deployments were failing and pods were in CrashLoopBackOff. The team discovered it 40 minutes later through customer-reported errors.
What ChangeGuard Shows
Independent Agent Monitoring: ChangeGuard agents run inside each spoke cluster — they detect issues even when ArgoCD's hub loses connectivity
Cross-Cluster Intelligence: Dashboard correlates risk across all 100+ clusters and flags the disconnected ones immediately
Runtime Degradation: CrashLoopBackOff pods surface as runtime risks within 10 seconds of an agent snapshot
Slack Alert: Score drop triggers a notification before any customer reports the issue
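The independent-agent detection above reduces to a staleness check on per-spoke snapshot timestamps: a spoke whose agent stops reporting is flagged regardless of what the hub believes. A minimal sketch, assuming a hypothetical 30-second reporting budget:

```python
# Sketch: detect partitioned spokes by comparing each agent's last
# snapshot time against a staleness budget. Field names and the 30 s
# budget are assumptions for illustration.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=30)

def stale_spokes(last_snapshot: dict, now: datetime) -> list:
    """Spokes whose agents have not reported within the budget."""
    return [name for name, ts in last_snapshot.items()
            if now - ts > STALE_AFTER]

now = datetime.now(timezone.utc)
snapshots = {
    "spoke-1": now - timedelta(seconds=5),   # healthy
    "spoke-2": now - timedelta(minutes=4),   # partitioned
    "spoke-3": now - timedelta(minutes=4),   # partitioned
}
print(stale_spokes(snapshots, now))  # ['spoke-2', 'spoke-3']
```

Because the timestamps come from agents inside each spoke, the check keeps working when the hub's view is frozen on "Synced".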
CSC Score Timeline
92 · Normal · Hub connected to all spokes
61 · Partition · 3 spoke clusters drifting
38 · CrashLoop · Customer impact begins
"ArgoCD showed green across the board. ChangeGuard's independent agents saw the reality: three clusters in trouble, pods crashing, and customers about to feel it. That's the value of monitoring that doesn't depend on your CD tool's connectivity."
POLICY COMPLIANCE
Missing Network Policies — Silent Security Regression
What Happened
A new microservice was deployed without a NetworkPolicy. It worked fine functionally, so no alerts fired. But the pod had unrestricted network access to every other service in the namespace — including the database. A routine security audit caught it 3 weeks later. During that window, any compromised container could have moved laterally.
What ChangeGuard Shows
Policy Score Drop: Missing NetworkPolicy detected as a policy compliance violation (−8 pts) on the very first snapshot
Change Analysis: New deployment flagged in Change Analysis with a "no network policy" risk label
Persistent Signal: Score stays degraded until the policy is applied — it doesn't auto-resolve
Audit Trail: The compliance gap is recorded with timestamp, cluster, and namespace for SOC 2 evidence
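The first-snapshot check can be approximated by asking, for each workload, whether any NetworkPolicy in its namespace selects the pod's labels. A simplified sketch (matchLabels subsets only; matchExpressions and ingress/egress direction ignored; all names hypothetical):

```python
# Sketch: flag pods whose labels are not selected by any NetworkPolicy
# in the namespace. Matching is simplified to matchLabels subsets.

def selects(policy_match: dict, pod_labels: dict) -> bool:
    """True if every matchLabels entry is present on the pod.
    An empty matchLabels selects all pods (as in Kubernetes)."""
    return all(pod_labels.get(k) == v for k, v in policy_match.items())

def unprotected(pods: list, policies: list) -> list:
    """Names of pods not covered by any policy selector."""
    return [pod["name"] for pod in pods
            if not any(selects(p["matchLabels"], pod["labels"])
                       for p in policies)]

pods = [
    {"name": "checkout", "labels": {"app": "checkout"}},
    {"name": "new-svc",  "labels": {"app": "new-svc"}},  # deployed without a policy
]
policies = [{"matchLabels": {"app": "checkout"}}]
print(unprotected(pods, policies))  # ['new-svc']
```

A namespace-wide default-deny policy (empty selector) would cover every pod, which is why the empty-matchLabels case must select everything.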
CSC Score Timeline
95 · Before Deploy · All policies in place
87 · Deploy · Missing NetworkPolicy flagged
95 · Remediated · Policy applied, score restored
"Your security audit found it 3 weeks later. ChangeGuard would have flagged it on the first snapshot — 10 seconds after the deployment landed."
RUNTIME SECURITY
Cryptominer Process Detected in Production Pod
What Happened
A CI pipeline dependency was compromised with a supply chain attack. The malicious package executed a cryptominer binary inside the container after deployment. CPU spiked to 100%, but on a large node, the pod didn't OOM or trigger HPA — it just silently consumed resources. The team noticed higher-than-expected AWS bills two weeks later.
What ChangeGuard Shows
Falco Alert (Critical): "Terminal shell in container" and "Launch Ingress Remote File Copy Tool" — flagged within seconds of execution. Rule: unexpected process spawned in a container that should only run the app binary.
CSC Score Drop (−16 pts): Two critical Falco alerts (−8 each) immediately degrade the Runtime Signals category
Slack/Teams Alert: Notification fires with pod name, namespace, container image, and the exact process that triggered the rule
Image → SBOM Correlation: Syft SBOM shows the compromised package in the image's dependency tree. Grype CVE scan confirms a known vulnerability in that library version.
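The image → SBOM correlation step is essentially a lookup: given Syft-style SBOM documents (JSON with an "artifacts" list of name/version entries), find every image that ships the compromised package. Package and image names below are hypothetical:

```python
# Sketch: search Syft-style SBOMs for a compromised package to find
# which images are affected. Names are hypothetical.

def images_with_package(sboms: dict, pkg: str) -> list:
    """Map of image name -> SBOM dict; return images shipping `pkg`."""
    return [image for image, sbom in sboms.items()
            if any(a["name"] == pkg for a in sbom.get("artifacts", []))]

sboms = {
    "registry.local/api:v2": {
        "artifacts": [{"name": "left-pad", "version": "1.3.0"},
                      {"name": "evil-miner-dep", "version": "0.0.9"}],
    },
    "registry.local/web:v7": {
        "artifacts": [{"name": "left-pad", "version": "1.3.0"}],
    },
}
print(images_with_package(sboms, "evil-miner-dep"))  # ['registry.local/api:v2']
```

The same lookup, run fleet-wide, answers the follow-up question a supply chain incident always raises: which other images pulled the same dependency?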
CSC Score Timeline
92 · Before Deploy · Cluster healthy
61 · Post-Deploy · Falco: critical process alerts
91 · Remediated · Image rebuilt with patched dependency
"Your AWS bill told you 2 weeks later. ChangeGuard's Falco integration would have flagged the cryptominer process in under 30 seconds — before it mined a single block."
IDENTITY RISK
Deploy Bot ServiceAccount Has cluster-admin — Nobody Noticed
What Happened
During initial cluster setup, the CI/CD deploy bot ServiceAccount was bound to cluster-admin for convenience. The plan was to scope it down later. That was 8 months ago. The SA now runs in 4 namespaces, is used by 12 pods, and has full unrestricted access to every resource in the cluster — including secrets containing database credentials, API keys, and TLS certificates. Any compromised pod using this SA can take over the entire cluster.
What ChangeGuard Shows
Identity Risk Score 100 (Critical): deploy-bot SA scores maximum risk: cluster-admin (+50), cluster blast radius (+20), secrets access (+10), wildcard permissions (+15)
Blast Radius (Cluster): 12 pods across 4 namespaces use this SA. If any one is compromised, the attacker gets cluster-admin.
Escalation Path Detected: This SA can create ClusterRoleBindings — meaning it can grant cluster-admin to any other identity
Remediation Guidance: "Create a scoped Role with only the permissions needed. Remove the cluster-admin binding. Use separate SAs per namespace."
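The additive scoring above can be sketched as weighted risk factors with a cap at 100. The four quoted weights come from the text; the escalation-path weight and the cap itself are illustrative assumptions, not ChangeGuard's exact model:

```python
# Sketch: additive identity-risk scoring, capped at 100. The first
# four weights mirror the numbers quoted above; the escalation-path
# weight and the cap are assumptions.

WEIGHTS = {
    "cluster_admin": 50,
    "cluster_blast_radius": 20,
    "secrets_access": 10,
    "wildcard_permissions": 15,
    "can_create_clusterrolebindings": 25,  # hypothetical escalation factor
}

def identity_risk(factors: set) -> int:
    """Sum the weights of the identity's risk factors, capped at 100."""
    return min(sum(WEIGHTS[f] for f in factors), 100)

deploy_bot = {"cluster_admin", "cluster_blast_radius",
              "secrets_access", "wildcard_permissions",
              "can_create_clusterrolebindings"}
print(identity_risk(deploy_bot))  # 100 (capped)
```

Capping keeps a maximally over-privileged identity pinned at the top of the list instead of letting stacked factors inflate past the scale.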
CSC Score Timeline
72 · Current · cluster-admin SA drags the Policy score
88 · After Scoping · Dedicated roles per namespace
94 · Fully Remediated · SA per workload, no wildcards
"8 months of cluster-admin exposure — because nobody audits RBAC manually. ChangeGuard's identity risk graph catches it on the first snapshot and scores every identity by blast radius."
See it in your cluster
14-day free trial. One cluster. No credit card. Install the agent in 60 seconds and see your first CSC score immediately.
Start Free Trial →