The five mistakes I see teams new to Chaos Engineering make
Chaos Engineering is taking over the world. It provides a valuable framework and methodology to help teams better understand their complex systems and how those systems behave during real-world events. It strives to improve the user experience by validating assumptions about resiliency and detecting failures before users run into them during a real incident. I love showing teams how to begin instituting Chaos Engineering in their organizations, but I tend to see the same problems repeated often.
Software development teams tend to get excited about Chaos Engineering and go all in a bit too quickly without really thinking about how to best use it to improve the experience for their users. After all, Chaos Engineering is meant to help identify issues during experiments in a controlled environment rather than finding them when users start experiencing problems. It’s great to see teams embrace Chaos Engineering on behalf of their users, but all too often I see them make these mistakes.
1. Not monitoring enough
In order to run a successful chaos experiment, teams need to ensure they have a valid hypothesis that can actually be measured. To do so, they need to identify the metrics that will prove or disprove the hypothesis, instrument them in the system, and watch them in real time during the experiment. If a metric begins to show signs of a degraded user experience, the experiment should be shut off immediately so the issue can be diagnosed and addressed before the next run. Without the right metrics, an experiment is just a shot in the dark, more likely to cause harm than prove a hypothesis.
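To make that concrete, here's a minimal sketch in Python of the kind of watchdog I mean: it polls a metrics endpoint for an error rate and kills the experiment the moment the metric crosses a threshold. The endpoint URL, the error_rate field, and the abort_experiment stand-in are all made up for the example; any real setup would wire these to its own monitoring and tooling.

```python
import json
import time
import urllib.request

# All of these values are assumptions for the sketch -- substitute your own
# metrics endpoint, field name, threshold, and abort mechanism.
METRICS_URL = "http://metrics.internal/api/checkout/error_rate"  # hypothetical
ERROR_RATE_THRESHOLD = 0.02      # abort if more than 2% of requests fail
POLL_INTERVAL_SECONDS = 10

def current_error_rate() -> float:
    """Fetch the current error rate from the (hypothetical) metrics endpoint."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as response:
        payload = json.load(response)
    return float(payload["error_rate"])  # hypothetical field name

def abort_experiment() -> None:
    """Stand-in for the real shutoff: stop injecting the fault here."""
    print("Error rate crossed the threshold; aborting the experiment.")

def watch_experiment() -> None:
    """Poll the metric for the life of the experiment and abort on degradation."""
    while True:
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            abort_experiment()
            break
        time.sleep(POLL_INTERVAL_SECONDS)

if __name__ == "__main__":
    watch_experiment()
```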
2. Breaking things just to break them
Oftentimes, teams new to Chaos Engineering get so excited to run their first experiment that they neglect to design it in a way that provides value. If a component is already believed, or known, to lack resiliency, an experiment run just to prove it will only hurt the user experience, potentially lose users, and degrade trust with other teams who might otherwise support Chaos experiments. Chaos Engineering should never be used as a tool to prove a system is brittle; that can be shown in non-production environments with other types of testing that are lower cost and won't impact users. Chaos experiments should be used to validate a hypothesis about the system, not to prove a flaw in it.
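For contrast, a hypothesis-driven experiment can be written down before anything gets broken. The sketch below is a tool-agnostic illustration in plain Python, with an invented service, metrics, and thresholds; the point is the shape of the plan, not the format.

```python
# An illustrative, tool-agnostic experiment definition. The service name,
# metrics, and numbers below are invented for the example.
experiment = {
    "hypothesis": (
        "If one checkout-service instance is terminated, the load balancer "
        "reroutes traffic and checkout p99 latency stays under 800 ms."
    ),
    "steady_state_metrics": ["checkout_error_rate", "checkout_p99_latency_ms"],
    "stimulus": "terminate one checkout-service instance",
    "blast_radius": "5% of production traffic in one region",
    "abort_conditions": {
        "checkout_error_rate": 0.02,       # abort above 2% errors
        "checkout_p99_latency_ms": 1200,   # abort above 1.2 s p99
    },
}
```

If a team can't fill in the hypothesis and the abort conditions, the experiment isn't ready to run.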
3. Lacking a proper shutoff switch
The first experiments teams run should be simple, manual ones that can be stopped quickly to short-circuit any harm to the user experience. With proper monitoring, any degradation in the user experience will be seen quickly and the experiment can be shut off immediately. Experiments like CPU hogs or removing hosts are good for this because the stimulus they introduce can be removed quickly to restore the normal experience. The goal should be to detect a fault as quickly as possible, but also to restore normal service to users just as quickly. Even organizations using sophisticated automation platforms for experiments should ensure they have a proper big red button that allows them, or the platform, to kill an experiment immediately.
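As one concrete shape for such an experiment, here's an illustrative Python sketch (not any particular tool) of a CPU hog that burns a couple of cores for a bounded time and stops the instant someone hits Ctrl+C, which acts as the manual kill switch.

```python
import multiprocessing
import time

BURN_SECONDS = 120   # hard upper bound so the experiment ends on its own
WORKER_COUNT = 2     # number of cores to hog (a small, deliberate blast radius)

def burn_cpu(deadline: float) -> None:
    """Busy-loop until the deadline passes or the process is terminated."""
    while time.time() < deadline:
        pass

def main() -> None:
    deadline = time.time() + BURN_SECONDS
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(deadline,))
        for _ in range(WORKER_COUNT)
    ]
    for worker in workers:
        worker.start()
    try:
        for worker in workers:
            worker.join()
    except KeyboardInterrupt:
        # The "big red button": Ctrl+C removes the stimulus immediately.
        for worker in workers:
            worker.terminate()
        print("Experiment aborted; CPU hog removed.")

if __name__ == "__main__":
    main()
```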
4. Never running in production
Chaos experiments need to run in production with real users in order to be most effective. Still, many teams lack the nerve to run experiments in production environments, worrying that they'll impact users. What these teams may not realize is that they have essentially decided to let their users find the issues for them instead of proactively detecting them with controlled experiments. Chaos experiments can find some issues in staging or test environments, but the real issues, especially the unexpected ones, only occur with actual user behavior and traffic. Cascading failures and retry storms are hard to predict and tend to show up only in the real world. Even the best test environments only approximate production, so the setup and configuration are never truly the same. To find the production issues that would otherwise surface at 3am for whoever is on call, experiments need to run with real users.
5. Replacing other kinds of tests
Chaos Engineering is not a catch-all type of testing. Unit tests, integration tests, and load tests all provide value to software development teams and should, in combination with Chaos Engineering experiments, form a holistic strategy around testing and resiliency. Each type of test catches issues and defects at a lower level and at a lower cost to implement and run. Unit tests are great at finding logical bugs or failures to uphold interface contracts, and take only seconds to run, allowing developers to run them with every single build. Integration tests require more time and setup, but can find issues across systems that otherwise wouldn't surface until actual production usage. Chaos experiments are costly to implement relative to unit and integration tests, so they are a poor choice for detecting issues those tests could catch. Instead, Chaos experiments should be designed to find the truly hidden flaws that only surface in real-world usage with real user traffic and production configuration.
Chaos Engineering provides a powerful framework and toolset for teams with complex distributed systems to validate or disprove their understanding of those systems. These experiments let teams see how their systems behave in the real world under stimulus like high CPU load or latency between services. Using these techniques, teams can observe how their systems behave in a controlled situation, monitor them closely, and immediately shut off the experiment if the user experience goes awry. Instead of following these anti-patterns, teams who excel with Chaos Engineering use experiments to find issues that affect the user experience within a small experiment group rather than waiting for all users to hit a problem. These teams use Chaos Engineering to find Chaos, not to cause it.