I’m a huge proponent of Continuous Integration and Continuous Deployment. I believe the fast feedback cycle these processes enable is hugely beneficial to a software development team, and lets teams get valuable features out to customers faster. However, Continuous Deployment is far from a silver bullet, and there are many steps teams need to take before they can use it successfully. This is the story of how I learned this lesson the hard way and had to give up on CD.
Automation is a powerful tool for improving the delivery of high-quality software and features to users. Continuous Integration is often the first tool teams adopt to make development less painful. With it, every committed code change is automatically and immediately built against the current version of the rest of the code and its dependencies, and is typically run through unit tests, and potentially integration tests, to validate behavior and prevent regressions in functionality. This is a great first step for teams to take to ensure their changes do not have unintended consequences.
Building on this, Continuous Delivery and Continuous Deployment carry the automation further, into the testing and deployment process. Continuous delivery fully automates the testing and provides a release candidate that is validated and ready to deploy to production environments when someone chooses to do so. Full continuous deployment automates even this step, deploying code many times a day as long as all verification and testing steps succeed. For large organizations with many teams and dependencies that form a complex web of interaction, this can provide a powerful way to quickly build features.
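The difference between the two comes down to what happens after the checks pass. Here is a minimal sketch of that decision; the function name, the `auto_deploy` flag, and the returned labels are all made up for illustration and don't correspond to any particular CI/CD tool.

```python
def next_step(checks_passed: bool, auto_deploy: bool) -> str:
    """Decide what a pipeline does with a build once validation has run.

    Hypothetical sketch: `auto_deploy=False` models continuous delivery,
    `auto_deploy=True` models full continuous deployment.
    """
    if not checks_passed:
        # In both models, a failed check stops the release candidate.
        return "stop"
    if auto_deploy:
        # Continuous deployment: no human in the loop after the checks.
        return "deploy"
    # Continuous delivery: the validated build waits for someone to push it.
    return "await_approval"
```

In both cases the checks are the only automated gate, so whatever they miss is what can reach production.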
I had encouraged my team to enable full continuous deployments for many of our services. Many of them already had what was essentially continuous delivery: basic validation was done, and changes then queued up between staging and production, waiting for someone to push them to users. Because of this, we often ended up with a large stack of changes whose combined behavior was hard to predict because of the unknown interactions between them. We also tended to run little or no further manual validation before deploying to production, other than waiting for a good time during the day when people were around in case support was needed during an issue.
To address these problems, I pushed the team to quickly enable continuous deployments. First, we added automated deployment windows, allowing us to deploy only during core working hours and never on Fridays, when much of the team tended to be remote. One of the biggest hurdles to going further was getting past the fear of losing control over the process. It took some convincing, but once everyone realized that we wouldn’t be doing anything beyond what the system was already doing anyway, we saw the value in just enabling pushes to production. However, we didn’t add any additional validation or testing steps. This turned out to be our biggest mistake.
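For a sense of what a deployment window amounts to, here is a minimal sketch of the check. The specific hours (10:00 to 16:00) and the function and constant names are assumptions for illustration only; the rule itself is just "core working hours, never on Fridays".

```python
from datetime import datetime, time

# Hypothetical values: "core working hours" is assumed to mean 10:00-16:00.
CORE_START = time(10, 0)
CORE_END = time(16, 0)
# Monday=0 ... Thursday=3; Friday and the weekend are excluded.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}


def in_deploy_window(now: datetime) -> bool:
    """Return True if an automated deployment may start at `now`.

    `now` is assumed to already be in the team's local time zone.
    """
    return now.weekday() in ALLOWED_WEEKDAYS and CORE_START <= now.time() < CORE_END
```

A window like this only controls when a change ships, not whether it is safe to ship.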
Full continuous deployments actually worked well for us for a little while. We had far more frequent releases with smaller batch sizes, which meant fewer cases of things breaking due to unintended interactions. Our time from code commit to production went down, as did our frequency of issues and rollbacks. Even when we did have issues, rollbacks took less time, because we had more known good previous states to go back to and smaller changes were quicker to roll back.
However, one day we had a major issue with a core feature. Due to an extremely complex set of systems interacting with each other in an unpredictable and unknown way, a bug was introduced that was not detected by any tests, metrics, or alarms. We only found out about it when customers started contacting us, and by the time we found the problem and mitigated it, a large number of them had been affected. Not only were we missing key tests for this functionality, but the functionality was complex enough that we couldn’t identify an automated way to test it correctly. As a result, we decided to disable continuous deployments until we could figure it out.
Unfortunately, this left us worse off than when we began. Not only did we have to go back to manual deployments, losing the benefits we had been experiencing, but we also needed to rely on manual testing from our front end to the services via the website and mobile apps, which required non-trivial effort and was error prone. Because our validation had failed under Continuous Deployment, we had to take this step backward.
Eventually, we were able to automate the key parts of validating this functionality via tests. This allowed us to go back to automated deployments, but the team and I learned a valuable lesson along the way. Continuous Deployment is not a magic process that will solve every problem. Teams need to do the work to be ready for it and ensure they actually have the right processes in place first. Much like teams should never test something they can’t measure and observe, no one should deploy things they don’t fully validate or test. Teams should never automatically deploy something unless they have at least the same controls in place as they had when deploying manually. It really isn’t safe to just forge ahead without putting this work in.
This was an important lesson for me to learn as a manager and as an engineer. It’s very easy to fall into the trap of thinking a shiny new technology will solve a myriad of problems and that a cursory understanding of it will be sufficient to prevent major issues. The same thing happens with processes like CD. But like any tool, it needs to be understood and respected in order to wield it safely and correctly. Jumping straight into CD without thinking about the safety nets we needed or respecting the manual processes in place was a mistake, and not one to make again.
I still believe teams are much better off with automated continuous deployments. The resulting speed of delivery and faster feedback loop greatly assist software teams and make for a better product. But rather than just jumping to the next technology or blindly embracing processes I have seen work with other teams, I learned to be more thoughtful and measured in my approach. From now on, I will make sure there is a better understanding of the risks of such a plan, and keep in mind that teams need to do the upfront work to be ready for changes like this.