I'm using Chaos Engineering to monkey with my children

January 9, 2018 - 8 minutes read - 1646 words

What’s more complex, a large-scale distributed system or a child? For parents, it’s obvious. At least systems tend to behave rationally. However, much like children, large systems are inherently complex. As complexity increases, unforeseen behavior emerges and causes unpredictable results. Sound like a child? Chaos Engineering, a software engineering methodology, aims to understand that complexity through experiments.

Chaos Engineering is the practice of using experiments to better understand complex systems by intentionally causing chaos and measuring the results. In large organizations, many teams own components and modules that work together, each with understandable and discrete logic and behavior. As the number of these systems increases, the number of relationships increases, often exponentially, and complexity expands to the point where no individual can fully understand the entire system. Because of this, even when individuals believe they understand the system, it can behave in unpredictable ways.

As an example, an online video streaming service might have components for discrete usages such as a module for serving up the image for the video, one for providing personalized recommendations to a user, and one for checking the user’s location. Each of these may seem discrete and resilient to failure of one of the others. In practice, the image module may encounter an issue where fetching the image takes the entire machine’s memory, leaving none for other processes. In an attempt to minimize hardware costs and because these modules were always minimal in the past, the image module and the recommendations might be running on the same machine. Now, as a result of the problem with the image module, recommendations can’t be fetched. Because users now see blank images and content they don’t expect, they hit refresh a bunch of times, hoping the experience will return to normal. This additional traffic hits the location module hard, bringing it down.

This type of scenario is common in large complex systems. Studies of flight incidents and airline crashes often find similar unintended behavior due to the complex interaction between seemingly unrelated pieces. Chaos Engineering seeks to remedy these issues through understanding the complex relationships and behaviors. First, a hypothesis is formed about the result of a certain stimulus to the system. For example, an engineer might hypothesize that a 100x increase in traffic to the recommendation system will result in no reduction in overall plays of videos across the site, or that the recommendation system being unable to reach the image system will not result in any additional traffic to the location system. An experiment is then run in which the stimulus is introduced, and the hypothesis is confirmed or refuted by measuring the actual result.
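For the software side of the analogy, the hypothesis-then-experiment loop described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `ToySite`, its capacity numbers, and the helper names `Experiment`, `run_experiment`, and `traffic_spike` are all invented for illustration and do not come from any real chaos tooling.

```python
from dataclasses import dataclass
from typing import Callable


class ToySite:
    """Stand-in for the streaming service; every number here is made up."""

    def __init__(self, recs_capacity: int) -> None:
        self.recs_capacity = recs_capacity  # load multiple the recs module absorbs
        self.recs_load = 1                  # current load, as a multiple of normal

    def set_recs_load(self, multiple: int) -> None:
        self.recs_load = multiple

    def plays_per_minute(self) -> float:
        # Plays hold steady until recommendation traffic exceeds capacity,
        # then fall in proportion to the overload.
        if self.recs_load <= self.recs_capacity:
            return 1000.0
        return 1000.0 * self.recs_capacity / self.recs_load


@dataclass
class Experiment:
    inject: Callable[[], None]    # introduces the stimulus
    rollback: Callable[[], None]  # removes it again
    measure: Callable[[], float]  # reads the metric the hypothesis is about
    tolerance: float              # allowed relative drop in that metric


def run_experiment(exp: Experiment) -> bool:
    """Return True if the hypothesis held, False if it was refuted."""
    baseline = exp.measure()
    exp.inject()
    try:
        observed = exp.measure()
    finally:
        exp.rollback()  # always restore normal conditions
    return observed >= baseline * (1 - exp.tolerance)


def traffic_spike(site: ToySite) -> Experiment:
    """Hypothesis: 100x recommendation traffic causes no reduction in plays."""
    return Experiment(
        inject=lambda: site.set_recs_load(100),
        rollback=lambda: site.set_recs_load(1),
        measure=site.plays_per_minute,
        tolerance=0.0,
    )


roomy = ToySite(recs_capacity=200)
tight = ToySite(recs_capacity=50)
print("roomy site, hypothesis held:", run_experiment(traffic_spike(roomy)))
print("tight site, hypothesis held:", run_experiment(traffic_spike(tight)))
```

In this toy model the hypothesis holds for the site with spare capacity and is refuted for the one without it. The rollback sits in a `finally` block so the stimulus is removed even if measuring fails.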

For parents, no system is more complex and chaotic than their children. Their behavior is inherently unpredictable and difficult to understand. Though parents may form hypotheses about their children’s behavior, those hypotheses may be inaccurate due to the complexity of the system. Just as Chaos Engineering helps engineers better understand complex software systems and reduce risk, it can help parents better understand their children.

One of the simplest tests an organization can run is to bring down one or more instances of a service and see what happens. In a cloud-based architecture, systems are supposed to be distributed and resilient to a single node failing, but due to the complex interaction of systems, this isn’t always the case. Similarly, kids often get used to things they encounter all the time and can take them for granted. Certain toys, types of food, and daily routines can become ingrained to the point where any deviation causes havoc. Introducing chaos deliberately can help reduce dependencies and make children more resilient to change. In my first chaos experiment with my kids, I changed out a few of their toys and offered them new food that they hadn’t yet experienced. It took time, but by introducing these changes, my kids became less fussy when they didn’t have access to a favorite toy and began to eat more varied food.
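For the engineers reading along, the instance-termination test mentioned above might look roughly like this in miniature. It is a simulation only: the `Cluster` class, its round-robin routing, and the node names are hypothetical, and a real experiment would terminate actual instances through the cloud provider rather than flip a flag in a dictionary.

```python
import random


class Cluster:
    """A toy pool of service instances behind a round-robin load balancer."""

    def __init__(self, names: list[str]) -> None:
        self.healthy = {name: True for name in names}

    def live_instances(self) -> list[str]:
        return [name for name, ok in self.healthy.items() if ok]

    def kill(self, name: str) -> None:
        self.healthy[name] = False

    def handle(self, request_id: int) -> str:
        """Route a request to a live instance and return that instance's name."""
        live = self.live_instances()
        if not live:
            raise RuntimeError("no healthy instances left")
        return live[request_id % len(live)]


def kill_random_instance(cluster: Cluster, rng: random.Random) -> str:
    """Take down one live instance, chosen at random, and return its name."""
    victim = rng.choice(cluster.live_instances())
    cluster.kill(victim)
    return victim


cluster = Cluster(["node-a", "node-b", "node-c"])
victim = kill_random_instance(cluster, random.Random(7))
served_by = {cluster.handle(i) for i in range(100)}
print(f"killed {victim}; requests still served by {sorted(served_by)}")
```

The hypothesis being checked is that every request is still served after one node disappears. If `handle` raised instead, the experiment would have exposed a single point of failure.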

Another easy test engineers run is called the CPU Hog. In this experiment, a high percentage of a machine’s computational power is deliberately used up, leaving a dramatically constrained amount for the actual process. The image component might be left with 5% of the power it normally gets. The purpose of this test is to determine whether the system will degrade the experience gracefully, or whether such an unexpected reduction will cause breakage or downstream failures. Parents are often required to give children 100% of their attention, which is also a constrained resource. When attention drops, children may lash out and take drastic measures to get it back, such as screaming, tantrums, or throwing things. To test this deliberately and learn from it, parents can run experiments in which their attention capacity is reduced. By constraining attention to less than the normal full capacity, parents can test the results in a controlled environment, quickly back off the experiment if it goes wrong, and recover, much like Chaos experiments are run on software. It’s always better to run these experiments in a controlled setting like home, where problems can be dealt with more easily, than in a less controlled environment like out in public.
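On the software side, a CPU Hog can be approximated with nothing more than a busy loop. The sketch below is a deliberately small stand-in, not how production chaos tools do it: `CpuHog`, `timed`, and `render_image` are hypothetical names, and because the hog is a thread inside the same CPython process it mostly competes with the main thread for the interpreter lock rather than saturating every core of the machine.

```python
import threading
import time


class CpuHog:
    """Busy-loops in a background thread for the duration of a `with` block."""

    def __init__(self) -> None:
        self.spins = 0
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._spin, daemon=True)

    def _spin(self) -> None:
        while not self._stop_event.is_set():
            self.spins += 1  # pointless work that only burns CPU time

    def __enter__(self) -> "CpuHog":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        # Backing off the experiment: stop the hog even if the body raised.
        self._stop_event.set()
        self._thread.join()


def timed(work) -> float:
    """Return how many seconds `work()` took."""
    start = time.perf_counter()
    work()
    return time.perf_counter() - start


def render_image() -> int:
    # Hypothetical stand-in for the image component's real work.
    return sum(i * i for i in range(200_000))


normal = timed(render_image)
with CpuHog():
    constrained = timed(render_image)
print(f"normal: {normal:.4f}s, under CPU hog: {constrained:.4f}s")
```

The exact timings depend on the machine, so none are shown here. The `with` block is what makes backing off quick: leaving it stops the hog whether the measured work succeeded or not.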

A more complex experimental test involves validating what happens when a dependency is unavailable or not behaving in the expected manner. For example, if the image component needs to get data from another system, the test might make this other system unreachable from the image system to verify that it can still perform some functions and not completely break. For parents, there are countless dependencies that might encounter issues where the parent believes no major problem will result but the reality is quite different. For example, our boys were fairly dependent on their pacifiers to fall asleep each night as well as to stay calm in the car. After weaning them off pacifiers at bedtime, we thought they were no longer dependent on them in the car either. Our first experiment in which we made this dependency unavailable went terribly wrong. However, we were prepared to back off the experiment, gave them the pacifier after a few minutes, and regrouped. As a result, we knew the dependency existed and what would happen if it was unavailable, so we could ensure we always had a backup. Additionally, we then defined a plan to add other distractions for them, essentially providing options for a graceful fallback. In this case, we brought along more toys and books in the car and provided those instead of going straight to the pacifier. We were able to greatly reduce, though not quite eliminate, this dependency.
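In code, the same "make the dependency unavailable and check for a graceful fallback" experiment can be sketched as follows. Everything here is illustrative: the metadata service, the `DependencyUnavailable` exception, and the placeholder file name are assumptions made up for the example, standing in for whatever the image component really depends on.

```python
class DependencyUnavailable(Exception):
    """Raised when a downstream system cannot be reached."""


PLACEHOLDER = "placeholder.png"


def healthy_metadata_service(video_id: str) -> str:
    return f"artwork/{video_id}.png"


def unavailable_metadata_service(video_id: str) -> str:
    # The injected fault: every call fails as if the dependency were down.
    raise DependencyUnavailable("metadata service unreachable")


def image_for(video_id: str, metadata_service) -> str:
    """Return artwork for a video, falling back to a placeholder on failure."""
    try:
        return metadata_service(video_id)
    except DependencyUnavailable:
        return PLACEHOLDER


print(image_for("v42", healthy_metadata_service))
print(image_for("v42", unavailable_metadata_service))
```

The placeholder plays the role of the toys and books in the car: not as good as the real thing, but enough to keep the experience from breaking completely.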

One of the most complex Chaos experiments is to simulate partial data loss. In software, this can be simulated by injecting packet loss in the network, essentially meaning that not all of the data needed by the system is received. In a truly resilient system, some level of loss can be tolerated without causing a broken or unavailable service. While parents tend to be direct and communicate often with their children, information can often be lost due to noise in the system, mainly lapses in attention and distractions. In this vein, we experimented with limiting information to gauge the results. We rely on an app for communicating with our kids’ daycare about minor things, and wanted to see how resilient this system was. Instead of communicating over it, we sent in handwritten notes to see if the messages still got through. We found that they did not, which showed that we were more dependent on the app than anticipated, and we discussed this with the daycare.
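A rough software counterpart to this experiment is to drop a fraction of messages at random and measure how many still arrive once the sender retries. The sketch below simulates the loss in-process with a seeded random number generator; the function names and the retry policy are assumptions for illustration, and a real test would inject loss at the network layer instead.

```python
import random
from typing import Optional


def lossy_send(message: str, loss_rate: float, rng: random.Random) -> Optional[str]:
    """Return the message, or None if the simulated network dropped it."""
    return None if rng.random() < loss_rate else message


def send_with_retries(
    message: str, loss_rate: float, rng: random.Random, max_attempts: int
) -> Optional[int]:
    """Return the attempt number that got through, or None if all were lost."""
    for attempt in range(1, max_attempts + 1):
        if lossy_send(message, loss_rate, rng) is not None:
            return attempt
    return None


def delivery_rate(
    n_messages: int, loss_rate: float, max_attempts: int, seed: int = 0
) -> float:
    """Fraction of messages delivered under the given loss rate and retry budget."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    delivered = sum(
        send_with_retries(f"msg-{i}", loss_rate, rng, max_attempts) is not None
        for i in range(n_messages)
    )
    return delivered / n_messages


for attempts in (1, 3, 5):
    rate = delivery_rate(1000, loss_rate=0.3, max_attempts=attempts)
    print(f"30% loss, up to {attempts} attempt(s): {rate:.1%} delivered")
```

Comparing the delivery rate across retry budgets gives a measurable answer to the hypothesis "the system tolerates this much loss", which is exactly what the handwritten-note experiment lacked until it failed.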

Many of these opportunities for testing will arise naturally or accidentally. However, just because they do occur doesn’t mean parents should wait for them. Proactive testing has several advantages over waiting for an actual problem: the experiment can be stopped if things start going poorly, the results are easier to measure, and there is more opportunity to absorb the lessons learned and make changes. It’s important to form a strategy around this experimentation and ensure that the hypothesis is well formed and that there is a mechanism for measuring the results.

One of the things technology organizations struggle with most when first embracing Chaos is using real traffic and running the experiment without warning. Real traffic has to be used in order to get the most relevant results. It has patterns and complexities that test data will not reproduce, so testing only with synthetic data leaves risks uncovered. For parents, this means devising real-world experiments that take place in actual interactions, not play-acting or role-playing. The reason not to give a warning is that warnings change behavior. Teams who are aware of the test and prepared can alter their behavior to maximize test results, which defeats the purpose of testing. Similarly, parents can’t just warn their children about a change; they need to actually practice it.

Chaos Engineering is a fascinating field that is just coming into maturity. The basic idea is to use the scientific method: a hypothesis is formed, an experiment is run, and the results inform changes to the hypothesis. In the tech world, this practice has recently been used to great effect in understanding hugely complex systems and in improving them for users. Parents can apply these methods in their daily lives to build resiliency and tolerance to change in their children and to make their lives better. Children who can deal with change and differences in daily routine tend to grow up more risk tolerant and successful, as well as being more manageable. This also means fewer tantrums and less screaming for parents to deal with. By purposefully adding Chaos to their lives, parents can actually enjoy more manageable and productive child rearing.