The Game Day That Did Not Game
An SRE lead at a financial services company described her organization's first chaos engineering game day to me as a quiet disappointment. The team had planned for six weeks. They had selected scenarios. They had announced the exercise to leadership. They had run the day.
Most scenarios produced no learning because the team had implicitly designed scenarios they knew the system could handle. A few scenarios produced real incidents that the team handled adequately. After-action reports got written. The lessons were modest. Subsequent game days were scheduled and then quietly postponed.
She told me the experience taught her that chaos engineering is easy to do ceremonially and hard to do operationally. The ceremonial version produces leadership-friendly artifacts and limited learning. The operational version produces actual reliability improvement and is more uncomfortable to execute.
The pattern is common. Most chaos engineering programs that get launched do not become sustained operational practice. The programs that do become sustained share recognizable characteristics. Knowing the difference is what distinguishes programs that produce reliability improvements from programs that produce announcements.
Building AI-Ready Healthcare Data Infrastructure
How one health tech CTO unblocked four staged clinical AI models in 90 days with three infrastructure changes.
What Operational Chaos Engineering Actually Looks Like
Three characteristics distinguish operational chaos engineering from ceremonial.
The first characteristic is continuous rather than episodic. Operational chaos engineering runs frequently enough to be part of how the team operates. Not annual exercises. Not quarterly drills. Continuous experiments running in production environments with appropriate guardrails.
Netflix's chaos engineering work pioneered this pattern. Continuous experiments run automatically. Failures get injected into production traffic at controlled blast radius. The system either handles the failure or surfaces a problem. The pattern produces operational learning faster than episodic exercises.
The second characteristic is targeted at known unknowns. Operational chaos engineering tests hypotheses about how the system responds to specific failure modes. The hypotheses are formed by analyzing past incidents and known architectural assumptions. The experiments verify or falsify the hypotheses.
This is different from ceremonial chaos engineering, which often tests scenarios that engineers find interesting rather than scenarios that surface unknowns. The hypothesis-driven approach produces more learning because the experiments are designed to teach something specific.
The third characteristic is integrated with incident response. The chaos engineering practice connects to how the team responds to real incidents. Experiments that succeed inform what the team trusts. Experiments that surface failures trigger remediation. The two practices reinforce each other.
Without the integration, chaos engineering produces findings that do not connect to operational priority. The findings get logged and de-prioritized. The practice loses momentum.
The Adoption Phases That Work
Chaos engineering adoption that succeeds follows recognizable phases. The phases respect the cultural and technical reality of the organization.
The first phase is local resilience testing in lower environments. Specific teams run failure injection against their own services in staging. The blast radius is contained. The learning is concrete. The cultural barrier is low because nothing in production is affected.
This phase establishes practice. Teams learn what kinds of experiments produce useful learning. Teams develop tooling and process. Teams build confidence in chaos engineering as a discipline.
The second phase is targeted production experiments with guardrails. Specific failure injection runs in production with explicit safety mechanisms. Circuit breakers that abort if customer impact exceeds threshold. Limited blast radius. Stop conditions that engineers can trigger.
This phase produces actual operational learning because production behaves differently than staging. Real traffic. Real dependencies. Real load. The learning is what makes chaos engineering valuable.
The third phase is continuous experimentation as part of platform operation. Chaos experiments run automatically on a schedule. The platform team monitors results. The discipline becomes embedded in how the platform operates rather than as a separate initiative.
This phase is what mature chaos engineering looks like. The investment in earlier phases pays back through continuous reliability improvement.
The fourth phase is organization-wide adoption with shared tooling and patterns. Multiple teams operate chaos engineering. Shared infrastructure supports their work. The discipline becomes part of how the organization builds systems rather than as a specialized capability.
Most organizations stay in phases two or three. The progression to phase four requires sustained investment that organizations sometimes do not maintain.
What Goes Wrong With Chaos Engineering Programs
Three patterns of chaos engineering program failure are common.
The first pattern is leadership-driven adoption without engineering buy-in. Leadership decides chaos engineering is important. Initiatives launch with executive sponsorship. Engineering teams comply minimally. The program produces artifacts that satisfy leadership and produces little operational learning.
The second pattern is tooling-first adoption without practice maturity. Teams buy chaos engineering tooling (Gremlin, Chaos Mesh, AWS Fault Injection Service) and expect the tooling to produce the practice. The tooling is useful and does not substitute for the operational discipline. The tooling sits underused.
The third pattern is episodic exercises without continuous practice. Annual or quarterly chaos days produce moments of focused attention. The attention does not translate into ongoing operational rigor. Between exercises, the system reverts to its untested state. The findings from exercises accumulate without being addressed.
Each pattern is preventable through deliberate program design. Engineering ownership of practice. Discipline before tooling. Continuous rather than episodic experimentation.
What Distinguishes Programs That Survive
Programs that produce sustained value share specific characteristics that the failed programs lack.
Engineering ownership and leadership. The program is led by engineers who care about reliability rather than by program managers who care about ceremony. The engineers have credibility with their peers. The program has technical depth rather than process depth.
Specific reliability targets that experiments serve. The program is in service of specific reliability commitments. SLOs, customer experience targets, incident reduction goals. The experiments are means rather than ends.
Integration with normal engineering work. Chaos engineering happens during normal engineering activity rather than as separate initiative. New services get chaos-tested as part of launch. Existing services get continuous experimentation as part of operation. The practice is part of how things are built.
Tooling that supports the practice rather than substituting for it. The team picks tools that fit the workflows they have established. The tools accelerate practice that already exists.
Honest reporting on findings. Experiments that surface problems get visibility. The findings affect priority. The team that runs experiments does not minimize results to avoid uncomfortable conversations.
These characteristics are observable in mature chaos engineering programs across multiple organizations. They are not exotic; they are the practice that survives past launch enthusiasm.
What This Costs
Chaos engineering practice at sustained operation typically requires one or two dedicated engineers plus participation from product engineering teams. The dedicated capacity owns tooling, methodology, and cross-team coordination. The product teams run experiments on their own services with support from the dedicated capacity.
Tooling investment typically lands in the $50K to $200K annual range. Gremlin and similar commercial platforms target this range. Open-source alternatives reduce the licensing cost but increase the engineering investment.
The return is hard to measure precisely. The metric that matters is reliability improvement over time as the discipline matures. Programs that survive past phase two typically show measurable reduction in production incidents within twelve to eighteen months.
Multi-Tenant Healthcare AI
Why row-level security and application-layer RBAC are necessary but not sufficient for multi-tenant clinical AI.
What Logiciel Does Here
Logiciel works with engineering and SRE leadership designing or maturing chaos engineering programs. The work is typically structured around assessment of current practice against the adoption phases followed by sequencing appropriate to organizational readiness.
The Cloud Reliability and Resilience Patterns for Mission-Critical Systems framework covers the broader reliability engineering that chaos engineering serves. The DevOps + Cloud Delivery Pipeline framework covers the pipeline practices that chaos engineering integrates with.
A 30-minute working session is enough to assess your current chaos engineering practice and identify the next phase to pursue.
Frequently Asked Questions
Can I start chaos engineering in production immediately?
Almost never advisable. Start in lower environments. Build practice. Move to production with guardrails. Pure production chaos without prior practice typically produces incidents that erode confidence rather than building reliability.
What is the right team for chaos engineering?
Engineers with strong operational experience and an interest in reliability. The skill profile is closer to SRE than to traditional development. The team also needs interpersonal skills because chaos engineering involves coordination across product teams.
How do I get product teams to participate?
Through value demonstration on a willing team first. The success becomes a reference. Mandates to participate produce reluctant compliance. Demonstrated value produces voluntary participation.
What about chaos engineering for AI workloads?
Applies with AI-specific extensions. Model provider outages. Embedding service unavailability. Vector database failures. AI workload-specific failure modes need explicit testing alongside general infrastructure chaos.
How does this interact with compliance and audit?
Mostly positively. Auditors increasingly view chaos engineering as evidence of operational rigor. Documentation of experiments and findings supports audit narratives. The practice has to be designed with appropriate change management for regulated workloads. Sources: - Netflix Technology Blog, Chaos Engineering Principles - Gremlin, "Chaos Engineering Adoption Report 2024"