I am a force of Chaos: Why I practice Chaos Engineering to improve user experiences
As systems grow in size, they inherently grow in complexity. As complexity increases, a system eventually reaches a point where no single individual can reasonably understand it in its entirety. At this point, when the interactions between components become unpredictable, the system becomes chaotic. This chaos can manifest in many ways, including unintended retry storms, broken and degraded experiences for users, and often cascading failures of an entire system. Rather than letting users discover this chaos, Chaos Engineering seeks to identify it in a controlled environment, with measurement, where it can be addressed before it impacts users.
Real-world events can trigger cascading effects within a complex system, even when each individual component is working exactly as designed. Complex interactions between components can cause a ripple effect that disrupts the entire system, effectively making non-critical components critical because they can affect key metrics and, more importantly, the user experience. Worse, these failures can push users into unpredictable behavior as they try to work around the issues, placing additional and unexpected load on other parts of the system and causing breakage to cascade throughout it.
For many internet media companies, a key metric is playback events started. The most important feature or experience for a user is being able to actually start watching, listening to, or viewing content. There may be many systems that are critical to this experience, but many more that are not and should not cause disruptions to it. These companies use Chaos Engineering to deliberately validate that only the critical systems affect this experience and that those systems are resilient to external events, whether expected or not. But introducing Chaos Engineering can be a challenge for organizations that aren’t already practicing it. Fear, trepidation, and doubt can prevent engineering teams from implementing Chaos Engineering because it might cause an issue. To this, I say that Chaos Engineering will never cause an issue that users wouldn’t otherwise also experience. It’s far better to discover the chaos in the system purposefully, in a controlled setting, when there is time to address it, rather than letting users find it in the middle of the night.
Chaos Engineering introduces a set of principles and best practices for running experiments that test hypotheses about complex distributed systems. Practitioners define a hypothesis and measurement criteria, then introduce deliberate stimuli into the system to verify the hypothesis. For example, we might hypothesize that failures within a catalog or device messaging service will not impact our key user experience of being able to listen. Chaos Engineering gives us a methodology and framework to validate the hypothesis or discover where it is incorrect and the system breaks.
Modern distributed systems built on microservices, or even just a service-oriented architecture, tend to follow a pattern of services calling each other to fetch information. For example, service A calls service B, which in turn calls both service C and service D. If designed well, a failure in service D should be handled gracefully by service B, since it can still get data from service C. In many systems, though, this assumption does not actually hold, and a failure in service D can cascade all the way up to service A, impacting the user’s experience.
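As a minimal sketch of this graceful degradation, assuming hypothetical service interfaces and a fetchCatalog method invented only for illustration, service B can treat service D as optional and fall back to an empty result rather than failing the whole request:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: ServiceB aggregates data from ServiceC (critical) and ServiceD (non-critical).
// A failure in ServiceD degrades the response instead of propagating all the way up to ServiceA.
public class ServiceB {
    private final ServiceC serviceC;
    private final ServiceD serviceD;

    public ServiceB(ServiceC serviceC, ServiceD serviceD) {
        this.serviceC = serviceC;
        this.serviceD = serviceD;
    }

    public List<String> fetchCatalog(String userId) {
        // Critical dependency: let its failures propagate and be handled further up.
        List<String> primary = serviceC.call(userId);

        // Non-critical dependency: degrade gracefully with an empty fallback.
        List<String> enrichment;
        try {
            enrichment = serviceD.call(userId);
        } catch (RuntimeException e) {
            enrichment = Collections.emptyList();
        }

        List<String> result = new ArrayList<>(primary);
        result.addAll(enrichment);
        return result;
    }
}

// Minimal interfaces assumed only for this sketch.
interface ServiceC { List<String> call(String userId); }
interface ServiceD { List<String> call(String userId); }
```

A Chaos Experiment that makes service D unavailable is exactly how we validate that this fallback actually works under real traffic.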
In traditional testing, such as unit tests, a component or unit of code is tested by providing defined input and validating the unit’s output. A level above, integration tests provide automated validation of systems or services and the interactions between them, again using known input data and output validation. In both cases, only interactions that are known can be tested. Chaos Experiments provide a way to test against the unknowns in a system by introducing stimuli such as failures.
To test this, we set up an experiment with a control cluster and an experimental one. The control reflects the normal, healthy state of the system as it runs in production. The experimental version introduces a deliberate chaos factor such as high network latency, packet loss, CPU spikes, or data center unavailability. An automated framework routes equal traffic to the control cluster and the experimental one. With the stimulus introduced, automated monitoring verifies the health of the system and validates or refutes the hypothesis by comparing the key metric in the experimental cluster against the control. If the difference in the metric exceeds a threshold, we know the hypothesis was incorrect and we have found a failure scenario. In that case, the automated framework immediately stops routing traffic to the experimental cluster so users don’t suffer an unduly poor experience.
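A minimal sketch of the guardrail check such a framework might perform, assuming a relative-delta threshold and made-up metric values; the class and method names here are my own, not any specific framework’s API:

```java
// Hypothetical sketch: compare the key metric (e.g., playback starts per minute) between
// the control and experimental clusters and stop the experiment when the delta breaches
// a configured threshold.
public class ExperimentGuardrail {
    private final double maxRelativeDelta; // e.g., 0.05 = stop if metrics diverge by more than 5%

    public ExperimentGuardrail(double maxRelativeDelta) {
        this.maxRelativeDelta = maxRelativeDelta;
    }

    /** Returns true when the experiment should be halted and traffic drained. */
    public boolean shouldStop(double controlMetric, double experimentMetric) {
        if (controlMetric == 0.0) {
            return experimentMetric != 0.0; // no baseline signal: be conservative
        }
        double relativeDelta = Math.abs(controlMetric - experimentMetric) / controlMetric;
        return relativeDelta > maxRelativeDelta;
    }

    public static void main(String[] args) {
        ExperimentGuardrail guardrail = new ExperimentGuardrail(0.05);
        // 1000 playback starts/min in control vs 900 in the experiment is a 10% drop: stop.
        System.out.println(guardrail.shouldStop(1000, 900)); // true
        System.out.println(guardrail.shouldStop(1000, 980)); // false, within threshold
    }
}
```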
I have conducted several Chaos Experiments, including packet loss across systems, high network latency to upstream dependencies, and high CPU load on individual hosts. As a result of what we learned from these experiments, we have introduced improvements that increased resiliency and reduced poor user experiences due to failure.
A great source of best practices for Chaos Engineering is the Principles of Chaos (http://principlesofchaos.org/). Developed by Chaos practitioners at Netflix, Amazon, and others, these principles establish both a set of guidelines for implementing Chaos Engineering and a way to measure the success and maturity of a Chaos Engineering organization.
In order to implement Chaos Engineering, practitioners follow these guidelines.
- Start by defining ‘steady state’ as some measurable output of the system that indicates normal behavior. For media companies, this might be playback events started. Any change to this metric should indicate a change in the health of the system and serve as an immediate signal that the experiment has detected an issue.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real-world events: servers that crash, hard drives that malfunction, network connections that are severed, high latency, CPU problems, and so on.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group. Route traffic to both groups and measure the metric and its delta across them. Define the threshold at which a delta indicates a failure and, when it is breached, immediately stop the experiment.
Knowing what to experiment with can also be a challenge for teams embracing chaos for the first time. Some good scenarios to begin with are:
- Add latency
- Make services and dependencies unavailable
- Throw exceptions randomly
- Cause packet loss across the network
- Fail requests by dropping them or providing failure responses
- Add resource contention such as CPU-hogging processes, high network volume, broken sockets, or unavailable file descriptors (see the sketch below)
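As one concrete illustration of the last scenario, resource contention, here is a minimal sketch of a CPU-hogging process. The 80% duty-cycle approach is an assumption to match the example later in this post; real host agents use more precise controls:

```java
// Hypothetical sketch: burn CPU at a rough duty cycle on every core so system behavior
// under contention can be observed during an experiment.
public class CpuHog {
    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        double targetLoad = 0.8; // roughly 80% busy per 100 ms window (assumption)
        for (int i = 0; i < threads; i++) {
            Thread hog = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    long busyUntil = System.currentTimeMillis() + (long) (targetLoad * 100);
                    while (System.currentTimeMillis() < busyUntil) {
                        // spin: keep the CPU busy for ~80 ms of each 100 ms window
                    }
                    try {
                        Thread.sleep((long) ((1 - targetLoad) * 100)); // idle for the rest of the window
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            hog.start();
        }
    }
}
```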
To run these scenarios, two common tools are used: an injection framework and a host agent. The injection framework allows code to be instrumented with these different types of failures based on configuration that can be read quickly at runtime during the experiment. A call to serviceA.call(request) can be wrapped or injected with this failure library, which adds the latency, exception, or other failure based on the experiment being run. A host agent runs on system hosts to create the other failure scenarios, such as resource contention; a simple experiment might run a process that uses 80% of the CPU. With these tools, it’s easy to begin running experiments.
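A minimal sketch of what such an injection wrapper might look like, assuming a configurable latency and failure rate; the FailureInjector class and its wrap method are hypothetical, not any specific library’s API:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical sketch of an injection wrapper: based on the active experiment's
// configuration, a wrapped call gains added latency and/or a probability of throwing.
public class FailureInjector {
    private final long addedLatencyMillis; // latency to inject per call
    private final double failureRate;      // probability of throwing instead of delegating

    public FailureInjector(long addedLatencyMillis, double failureRate) {
        this.addedLatencyMillis = addedLatencyMillis;
        this.failureRate = failureRate;
    }

    public <T> T wrap(Supplier<T> call) {
        if (addedLatencyMillis > 0) {
            try {
                Thread.sleep(addedLatencyMillis); // inject latency before the real call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        if (ThreadLocalRandom.current().nextDouble() < failureRate) {
            throw new RuntimeException("Injected failure for chaos experiment");
        }
        return call.get(); // delegate to the real dependency
    }
}

// Usage, mirroring the serviceA.call(request) example from the text:
//   FailureInjector injector = new FailureInjector(200, 0.1);
//   Response response = injector.wrap(() -> serviceA.call(request));
```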
I also suggest automating Chaos Engineering processes and running them often, eventually without warning, as organizations mature in their practice. Doing so ensures that teams implement best practices and adhere to them continuously so that they do not fail these experiments. Automation can also be used to immediately shut off an experiment if unintended behavior occurs, ensuring that customers do not suffer as a result. There are several automated Chaos Experimentation frameworks that allow for defining and scheduling experiments, monitoring them, and measuring their results. While it’s a good idea to run experiments when teams are available to respond, a good automation framework can even allow teams to run experiments without any manual intervention and report results automatically.
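As a rough sketch of that kind of automation, a supervisor could periodically compare the control and experimental metrics, reusing the hypothetical ExperimentGuardrail above, and halt the experiment on a breach. The polling interval and the fetchMetric and stopExperiment hooks are assumptions standing in for real integrations:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of automated supervision: poll the key metric on a schedule and
// abort the experiment automatically when the guardrail is breached.
public class ExperimentSupervisor {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ExperimentGuardrail guardrail = new ExperimentGuardrail(0.05);

    public void superviseEveryMinute() {
        scheduler.scheduleAtFixedRate(() -> {
            double control = fetchMetric("control");
            double experiment = fetchMetric("experiment");
            if (guardrail.shouldStop(control, experiment)) {
                stopExperiment();     // drain traffic from the experimental cluster immediately
                scheduler.shutdown(); // end supervision once the experiment is halted
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    // Placeholders for a real metrics system and traffic router.
    private double fetchMetric(String cluster) { return 0.0; }
    private void stopExperiment() { }
}
```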
Chaos Experiments can be run in testing or QA environments at first to build practice, familiarity, and comfort with the process, but at some point they should move to actual production environments. It’s impossible to predict everything users will do, so testing only in QA will never uncover every failure. Only by experimenting in production will real user-experience issues be detected and fixed. Many teams are afraid to run experiments on real production traffic, but doing so greatly increases the value of the experiments and allows issues to be found in a controlled setting. It’s much better to find these issues when teams are prepared, present, and able to stop the experiment than when user behavior exposes them in the middle of the night, and it gives teams time to address what they find.
Chaos Engineering may be a relatively new quality discipline within technology organizations, but when used well, it is an invaluable tool for detecting issues and faults in the resiliency of large-scale systems, which engineering teams can then mitigate without users ever suffering a poor experience. Great engineering teams build the tools and frameworks needed to run these experiments systematically and automate them, consistently raising the bar in validating assumptions about the resiliency of their systems on behalf of users.