AI agents are quietly generating chaos engineering failures

A new category of production incident is emerging in enterprise environments, silently triggered by autonomous AI agents and largely untracked by conventional engineering methodologies. This critical oversight is leading to cascading system failures that go unrecognized as agent-initiated events, according to Sayali Patil, an infrastructure automation expert from Cisco and Splunk, who warns that the gap between governing autonomous agents and practicing chaos engineering is creating significant, undetected risks.

With 79% of organizations already deploying some form of AI agent in production and 96% planning expansion, the scale of this exposure is no longer theoretical. Gartner predicts that 33% of enterprise software will incorporate agentic AI by 2028, yet simultaneously forecasts that 40% of these projects will be canceled due to inadequate risk controls. Patil highlights a specific failure mode that sits in the gap between those two figures: agents that operate as intended yet inadvertently generate infrastructure events that nobody categorizes as risks, so the resulting incidents go unacknowledged.

The Hidden Problem: Agents Skipping Critical Judgment

The core issue lies in how autonomous agents interact with complex production systems compared to human engineers. When a human engineer initiates a chaos experiment (deliberately injecting faults to test system resilience), they typically make a critical judgment call first. This involves assessing current system capacity, checking dashboards, reviewing error budgets, and evaluating dependency stability. This human-in-the-loop check helps ensure the system can absorb the stress without tipping into a larger outage.

Autonomous remediation agents, however, lack this holistic judgment. Designed to detect anomalies and act quickly—restarting services, rerouting traffic, or modifying configurations—they operate within a narrow context. Patil describes a common scenario: an agent detects elevated latency and restarts a service cluster. While technically correct given its training, the agent is unaware that three other services are at peak traffic, a shared connection pool is nearing saturation, or a dependent database is undergoing an index rebuild. The restart, intended to fix a minor issue, triggers a “thundering herd” against the recovering service, leading to a cascade of failures never modeled or tested by the organization’s chaos engineering program.

Crucially, these agent-induced failures often remain invisible in post-mortems. Incidents are typically logged as service restarts, connection pool saturations, or latency events, with the agent’s initiating role obscured. The AI Incidents Database reported a 21% increase in AI-related incidents from 2024 to 2025, a figure Patil suggests significantly understates the true exposure due to this classification gap.

The Missing "Absorb Capacity" Language

The underlying systemic problem is the absence of a shared understanding and language for “absorb capacity”—the real-time measure of how much additional stress a system can handle before violating its Service Level Objectives (SLOs). Traditional chaos engineering relies on implicit human judgment or static thresholds that often trigger after a problem has occurred. Agents, meanwhile, don't manage this capacity at all.

Patil proposes a “resilience budget” model, treating absorb capacity as a continuously recomputed and consumable resource. This budget would draw on four live signal classes:

SLO burn rate: Directly reflects the system’s health against commitments.
P99 latency trend: Indicates subtle, ongoing degradation rather than just absolute values.
Dependency saturation state: Crucial for understanding shared resource availability.
Application behavioral signals: User-centric metrics that often precede infrastructure alerts.

This budget would be shared across teams and consumed by both human-initiated chaos experiments and autonomous agent actions. Without such a shared ledger, simultaneous actions from multiple teams or agents can inadvertently combine to create an unmanageable blast radius.
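One way to picture the shared ledger is the minimal Python sketch below. It is an illustration under stated assumptions, not Patil's implementation: the signal normalization, the weights, and the names Signals, absorb_capacity, and ResilienceLedger are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Signals:
    """Live signals, each normalized so 0.0 is healthy and 1.0 is exhausted."""
    slo_burn: float
    p99_trend: float
    dependency_saturation: float
    behavioral: float


# Hypothetical weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "slo_burn": 0.4,
    "p99_trend": 0.2,
    "dependency_saturation": 0.3,
    "behavioral": 0.1,
}


def absorb_capacity(s: Signals) -> float:
    """Return remaining absorb capacity in [0, 1]; 1.0 means no stress."""
    stress = sum(weight * getattr(s, name) for name, weight in WEIGHTS.items())
    return min(1.0, max(0.0, 1.0 - stress))


@dataclass
class ResilienceLedger:
    """One shared ledger that human experiments and agent actions both draw from."""
    capacity: float
    reservations: dict = field(default_factory=dict)

    def available(self) -> float:
        return self.capacity - sum(self.reservations.values())

    def reserve(self, actor: str, cost: float) -> bool:
        """Reserve budget for an action; refuse if it would overdraw the ledger."""
        if cost > self.available():
            return False
        self.reservations[actor] = self.reservations.get(actor, 0.0) + cost
        return True

    def release(self, actor: str) -> None:
        """Return an actor's reserved budget once its action has settled."""
        self.reservations.pop(actor, None)


# A chaos experiment and an agent restart compete for the same budget.
ledger = ResilienceLedger(capacity=absorb_capacity(Signals(0.5, 0.5, 0.5, 0.5)))
chaos_ok = ledger.reserve("chaos-team-experiment", 0.3)
agent_ok = ledger.reserve("remediation-agent-restart", 0.3)
```

In this sketch the second reservation is refused because the first one already consumed most of the available capacity, which is the combined-blast-radius case a shared ledger is meant to catch. A production version would recompute capacity continuously from live signals rather than fixing it at construction time.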

Where AI Helps, and Where it Fails

Large language models (LLMs) show promise in generating chaos hypotheses by analyzing dependency graphs and past incident post-mortems, offering faster insights than manual methods. However, their utility is limited by data staleness; an LLM operating on an outdated dependency graph can confidently propose experiments with incorrect blast radius assumptions, leading to real-world outages. Stanford’s Trustworthy AI Research Lab has highlighted that model-level guardrails are insufficient, reinforcing that models cannot be trusted with critical safety boundaries if their foundational data is flawed.
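A simple guard against the staleness problem is to check the dependency graph before accepting any LLM-proposed experiment. The sketch below is a hypothetical example: the 24-hour limit, the graph layout, and the function name validate_hypothesis are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness limit; the right value depends on how often topology changes.
MAX_GRAPH_AGE = timedelta(hours=24)


def validate_hypothesis(targets, graph, snapshot_time, now, max_age=MAX_GRAPH_AGE):
    """Return a list of reasons to reject an LLM-proposed chaos experiment.

    targets: service names the proposed experiment would touch
    graph: dict mapping each known service to its downstream dependencies
    snapshot_time: when the dependency graph was last captured
    """
    problems = []
    if now - snapshot_time > max_age:
        problems.append("dependency graph is stale")
    unknown = sorted(t for t in targets if t not in graph)
    if unknown:
        problems.append("unknown services: " + ", ".join(unknown))
    return problems


graph = {"checkout": ["payments", "inventory"], "payments": [], "inventory": []}
now = datetime(2025, 6, 2, 12, 0, tzinfo=timezone.utc)
fresh = validate_hypothesis(["checkout"], graph, now - timedelta(hours=1), now)
stale = validate_hypothesis(["checkout", "search"], graph, now - timedelta(days=3), now)
```

An empty list means the hypothesis passed these two checks only. It says nothing about whether the experiment is safe to run, which remains a separate decision.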

Patil stresses that while LLMs can derive valuable insights from validated post-mortem data, they should not be entrusted with execution decisions when signals are ambiguous. This judgment requires context beyond any monitoring system, such as pending deployments, on-call staffing levels, or critical customer commitments. Building agent architectures that disregard this limitation inevitably leads to consequential decisions made with incomplete information and no human oversight.

Governing Agents in Production: A Path Forward

The immediate governance implication is clear: every autonomous agent action touching infrastructure must register against the same live signal layer that governs human-initiated chaos experiments. This means agents should be gated by SLO burn rates, latency trends, and dependency saturation states. If the resilience budget falls below a defined threshold, the agent must wait or escalate rather than act.
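The act, wait, or escalate rule can be sketched as a small gate function. The floor value, the wait limit, and the names Decision and gate_action are hypothetical; the budget is assumed to be a score between 0 and 1 computed elsewhere from the live signals.

```python
from enum import Enum


class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate"


def gate_action(budget, action_cost, floor=0.3, waited_seconds=0, max_wait_seconds=300):
    """Decide whether an agent may act against the current resilience budget.

    budget: current resilience budget, assumed to be in [0, 1]
    action_cost: estimated budget the action would consume (an assumption)
    floor: budget level that must remain after the action
    """
    if budget - action_cost >= floor:
        return Decision.ACT
    # Not enough headroom: wait for recovery, but only for a bounded time.
    if waited_seconds < max_wait_seconds:
        return Decision.WAIT
    return Decision.ESCALATE


healthy = gate_action(budget=0.8, action_cost=0.2)
tight = gate_action(budget=0.4, action_cost=0.2)
stuck = gate_action(budget=0.4, action_cost=0.2, waited_seconds=600)
```

Bounding the wait matters in this design: an agent that waits forever on a degraded system is as unhelpful as one that acts blindly, so the sketch escalates once the wait limit is reached.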

Furthermore, agent actions should be modeled as experiments, not just logged as events. When an agent restarts a service, the analysis shouldn’t stop at successful completion but extend to evaluating the action’s blast radius and cascading effects relative to available absorb capacity. This data must feed back into the resilience budget model.
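As a rough illustration, an agent action recorded as an experiment might carry the fields below. The record layout and the name ActionExperiment are assumptions; the point is that the initiator stays attached to the outcome and the measured cost can be compared with what was expected.

```python
from dataclasses import dataclass, field


@dataclass
class ActionExperiment:
    """An agent action recorded as an experiment rather than a bare event."""
    initiator: str            # which agent or human triggered the action
    action: str               # e.g. "restart"
    target: str               # service the action was aimed at
    budget_before: float      # resilience budget when the action started
    budget_after: float       # lowest budget observed after the action
    affected_services: list = field(default_factory=list)

    def budget_consumed(self) -> float:
        return self.budget_before - self.budget_after

    def blast_radius(self) -> int:
        """Count distinct services affected other than the intended target."""
        return len(set(self.affected_services) - {self.target})

    def exceeded_expectation(self, expected_cost: float) -> bool:
        """True when the action cost more budget than the model predicted."""
        return self.budget_consumed() > expected_cost


record = ActionExperiment(
    initiator="remediation-agent",
    action="restart",
    target="orders",
    budget_before=0.6,
    budget_after=0.1,
    affected_services=["orders", "payments", "db-pool"],
)
```

Records where exceeded_expectation returns True are the ones worth feeding back into the budget model, since they show where the cost estimate for an action was too optimistic.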

Crucially, when signals are ambiguous—due to unclear budget scores, recent topological changes, or flux in dependency states—the execution decision must be handed off to a human. This “circuit breaker” mechanism is not a weakness but a fundamental requirement for making agent architectures trustworthy in production. Intent-based verification, which formalizes correct agent behavior and continuously probes its boundaries, is key to this approach.
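The three ambiguity conditions named above can be sketched as a check that runs before any automated decision. All thresholds and names here (ambiguity_reasons, route, the margin and settle values) are hypothetical and would need tuning per environment.

```python
def ambiguity_reasons(budget, floor, margin, seconds_since_topology_change,
                      dependency_state_changes, settle_seconds=900,
                      max_state_changes=3):
    """Return the reasons, if any, that the signals are too ambiguous to act on.

    budget, floor: resilience budget score and its floor, assumed in [0, 1]
    margin: band around the floor where the score is considered unclear
    dependency_state_changes: recent count of dependency state flips
    """
    reasons = []
    if abs(budget - floor) < margin:
        reasons.append("budget score too close to floor")
    if seconds_since_topology_change < settle_seconds:
        reasons.append("recent topology change")
    if dependency_state_changes > max_state_changes:
        reasons.append("dependency states in flux")
    return reasons


def route(reasons):
    """Hand the execution decision to a human whenever any ambiguity exists."""
    return "human" if reasons else "agent"


unclear = ambiguity_reasons(0.32, 0.3, 0.05, 3600, 0)
clear = ambiguity_reasons(0.8, 0.3, 0.05, 3600, 0)
```

Returning the reasons rather than a bare boolean gives the on-call engineer the context for the handoff instead of an unexplained page.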

Enterprises successfully operating autonomous agents at scale are those that have already recognized that every agent action is inherently a chaos event and have built their governance layers accordingly. The practical first step involves an unglamorous but vital audit of every autonomous agent currently impacting infrastructure. This audit should map agent actions against live SLO burn rate signals and establish explicit floor conditions requiring agents to pause or escalate. Organizations will likely discover agents operating entirely outside their resilience accounting—and it’s critical to find them before production systems do.
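A first pass at such an audit could be as simple as the sketch below, which flags infrastructure-touching agents that have no floor condition or no escalation path. The inventory format and field names are hypothetical; a real inventory would come from wherever agents are registered.

```python
def audit_agents(agents):
    """Return (agent name, finding) pairs for agents outside resilience accounting."""
    findings = []
    for agent in agents:
        if not agent["touches_infra"]:
            continue  # out of scope for this audit
        if agent.get("burn_rate_floor") is None:
            findings.append((agent["name"], "no floor condition"))
        elif not agent.get("escalation_target"):
            findings.append((agent["name"], "no escalation path"))
    return findings


# Hypothetical inventory for illustration.
inventory = [
    {"name": "auto-restarter", "touches_infra": True, "burn_rate_floor": None},
    {"name": "traffic-shifter", "touches_infra": True, "burn_rate_floor": 2.0,
     "escalation_target": ""},
    {"name": "config-tuner", "touches_infra": True, "burn_rate_floor": 1.5,
     "escalation_target": "sre-oncall"},
    {"name": "ticket-summarizer", "touches_infra": False},
]
findings = audit_agents(inventory)
```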

FAQ

Q: What is the primary risk posed by autonomous AI agents in enterprise production environments?
A: The primary risk is that AI agents are quietly initiating actions that function as chaos engineering experiments, but without the benefit of human judgment or a comprehensive understanding of the system's real-time absorb capacity. This leads to cascading failures that are not properly tracked or attributed to the agent, creating blind spots in incident response and resilience planning.

Q: How can enterprises better govern the actions of AI agents to prevent these hidden failures?
A: Enterprises should integrate autonomous agent governance with chaos engineering principles. This involves treating every agent action as an experiment, registering these actions against a live “resilience budget” that tracks system absorb capacity (based on SLOs, latency trends, and dependency states), and implementing human circuit breakers to intervene when signals are ambiguous or critical context is missing.

Q: Can Large Language Models (LLMs) help in improving system resilience with AI agents?
A: LLMs can be useful for generating chaos hypotheses by analyzing historical incident data and dependency graphs, speeding up the identification of potential failure modes. However, they are unreliable for making real-time execution decisions, especially when dependency graphs are stale or when human-specific context (like upcoming deployments or staffing levels) is required. Their role should be limited to analysis and hypothesis generation, not autonomous action in ambiguous situations.