Extreme Experimentation - How to minimize the feedback loop and deliver software constantly

May 8, 2018 - 7 minutes read - 1324 words

In order to outpace competitors, technology companies need to move faster in delivering features of value to their customers. Those who hit the market first often emerge as winners due to network and first-mover effects. While some companies can beat others with similar features based on size alone, notably Instagram and Facebook, most companies need to maximize delivery speed to win. Methodologies like Agile or Extreme Programming seek to shorten development cycles, and with them feedback loops, so that efforts can be adjusted based on realtime feedback. But on their own, they aren’t enough.

Simply adjusting a team’s development methodology can speed delivery up, but without attention to the other aspects of software development, it won’t speed things up nearly enough. Teams also need to speed up the rest of their processes beyond development, including testing, infrastructure, release management, and the features themselves. The most effective teams take nothing for granted and are willing to change just about everything about their development process in order to be more efficient. Efficiency means not only increasing speed, but also reducing slowdowns and bottlenecks. A truly effective team drives improvements through all phases of its process.

Switching to Agile or XP can help improve development cycles, but teams also need to adapt their development itself. Many teams know the benefit of Continuous Integration, in which all code is checked into a single mainline code branch and automatically built (or integrated) against the most recent version of the code, ensuring that the code builds and the unit tests pass. With a continuous build process, the feedback loop for code commits is shortened and developers know immediately if they have broken the build. Broken builds are reduced in frequency and duration. Since developers don’t need to wait days, weeks, or months before the next build finds their problems, they know to fix things immediately.

Many teams follow continuous integration practices but don’t always follow up on broken builds, which limits the usefulness of the practice. Broken builds should be treated with top priority, as a broken branch can prevent features, and even bug fixes, from reaching users. Mature teams cut tickets for broken builds and assign them to the developer whose changes caused the break.
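As a rough illustration of that ticketing rule, the sketch below walks a build history and assigns a ticket to the author of the commit that first turned the build red. The `Build` record, the `ticket_for_break` function, and the ticket fields are all hypothetical names invented for this example; a real team would wire the same idea into its own CI server and issue tracker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Build:
    """One CI run on the mainline branch (hypothetical record shape)."""
    commit: str
    author: str
    passed: bool


def ticket_for_break(history: list[Build]) -> Optional[dict]:
    """Return a ticket for the commit that broke the build, or None if green.

    `history` is ordered oldest first. The culprit is the first failing
    build after the most recent passing one.
    """
    if not history or history[-1].passed:
        return None  # mainline is green (or nothing has been built yet)

    culprit = history[-1]
    for build in reversed(history):
        if build.passed:
            break  # reached the last green build, stop walking back
        culprit = build

    return {
        "title": f"Build broken by commit {culprit.commit}",
        "assignee": culprit.author,
        "priority": "highest",  # broken builds block everyone, so they go first
    }


history = [
    Build("a1", "ana", True),
    Build("b2", "ben", False),  # first red build after the last green one
    Build("c3", "cy", False),
]
print(ticket_for_break(history))
```

Assigning the ticket to the first failing commit, rather than the latest one, keeps later committers from being blamed for a build that was already red when they pushed.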

Building on continuous integration, effective teams also continuously run their integration tests against test environments. Automated tests can quickly find defects in new features or regressions caused by code changes. These teams may also cut tickets to the offending developer when tests are broken, since this can also block the delivery pipeline to users. If all tests pass, some manual tests may still run, and the code is then deployed to production environments after a manual approval. This is known as continuous delivery. Though a manual approval is needed for releases, the software is, in theory, ready for release to production at any time, an important philosophical change required to succeed with this type of development.

The last step in continuous, automated delivery of software is continuous deployment, in which no manual process slows down the release. All tests are automated and as soon as they pass, the code is deployed to production. In this type of development, code is released very frequently, which encourages developers to build in quality, since their code can go to production at any time. These teams can deliver new features, bug fixes, and value to their users incredibly rapidly, but many teams never get this far due to fear.
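The difference between the two stages described above comes down to whether a human approval sits between a green test run and production. The sketch below models that single decision; `next_step` and its parameters are hypothetical names for illustration, not part of any particular pipeline tool.

```python
def next_step(tests_passed: bool,
              require_manual_approval: bool,
              approved: bool = False) -> str:
    """Decide what a release pipeline does after the automated tests finish."""
    if not tests_passed:
        return "block"  # a red test run never reaches production
    if require_manual_approval and not approved:
        return "await_approval"  # releasable at any time, but a human decides when
    return "deploy"  # no manual gate left: green tests go straight out


# With a manual gate, a green build waits for sign-off.
print(next_step(tests_passed=True, require_manual_approval=True))
# Without one, the same green build is deployed immediately.
print(next_step(tests_passed=True, require_manual_approval=False))
```

Seen this way, moving to the fully automated stage is a one-flag change in the pipeline; the hard part is earning enough confidence to flip it.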

These teams are afraid of the risk of deploying continuously and rely on extensive manual testing to ensure their features are ready for release. Instead of automating, they merge changes that aren’t quite ready, or that haven’t been tested in isolation, knowing that any issues can be caught in staging environments before going to users. This safety net is a detriment: it slows down the process and effectively blocks other developers from getting their changes released while features are tested. Teams in this state tend to release less frequently, with large batches of changes. In reality, these releases are riskier than a single small, automated release, because the batched changes may interact with each other in unpredictable ways and, having stacked up, may not have been effectively tested together.

There are two major ways to increase confidence in releases enough to make fully automated releases feasible: increasing automated test coverage and improving realtime monitoring. The first is almost always the only one teams focus on, and many never reach the threshold they have defined as comfortable, often some arbitrary metric such as 70 or 80% line coverage. Having the main test cases covered with automation is important, but it’s not the only way to get to full CD. If teams not only know their key metrics well, but can monitor them in realtime and get immediate alerts, they can respond to issues fast enough that full CD becomes feasible without a massive suite of tests. Automated realtime monitors and alerts on a key metric like orders can show that an issue has occurred immediately after a deployment, and the release can be aborted or rolled back to the previous state right away. Some sophisticated systems can even monitor this and perform the rollback automatically. Importantly, this minimizes the impact of issues on users.
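A minimal sketch of that kind of automated guard is shown below, assuming orders per minute is the key metric. The function names, the 30% drop threshold, and the callback-style `deploy`, `roll_back`, and `read_orders` hooks are all assumptions made for illustration; a real system would read from its own monitoring stack and would likely use a more careful statistical test.

```python
from statistics import mean
from typing import Callable, Sequence


def should_roll_back(baseline_orders: Sequence[float],
                     post_deploy_orders: Sequence[float],
                     max_drop: float = 0.3) -> bool:
    """True when average orders after a deploy fall more than `max_drop`
    (as a fraction) below the pre-deploy baseline."""
    if not baseline_orders or not post_deploy_orders:
        return False  # not enough data to judge either way
    baseline = mean(baseline_orders)
    if baseline == 0:
        return False  # no baseline traffic, so a drop cannot be measured
    drop = (baseline - mean(post_deploy_orders)) / baseline
    return drop > max_drop


def deploy_with_guard(deploy: Callable[[], None],
                      roll_back: Callable[[], None],
                      read_orders: Callable[[], Sequence[float]],
                      baseline_orders: Sequence[float],
                      max_drop: float = 0.3) -> str:
    """Deploy, check the key metric, and roll back automatically on a drop."""
    deploy()
    if should_roll_back(baseline_orders, read_orders(), max_drop):
        roll_back()
        return "rolled_back"
    return "released"


# Orders fall from about 100 per minute to about 45 after the deploy.
status = deploy_with_guard(
    deploy=lambda: None,
    roll_back=lambda: None,
    read_orders=lambda: [40, 50, 45],
    baseline_orders=[100, 110, 90],
)
print(status)
```

The point of the sketch is the shape of the loop: the check runs immediately after the deploy and the rollback needs no human in the path, which is what keeps the impact on users small.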

Further, teams may also institute a variety of methods to incrementally roll out changes to users, minimizing the potential blast radius of any issue in these features. Simplest is the feature toggle or flag. A binary on/off switch, it lets feature code be deployed in an off state so that the feature can be controlled independently of the release. When ready, the flag can be enabled, effectively turning the feature on for users. This allows for deployment of incomplete features or ones that need testing before enablement. More advanced configuration systems allow features to be dialed up incrementally by percentage, and allow A/B testing across differing percentages of users so that the feature can be tested and measured. Some even allow the feature to be enabled for individual users, making it easy to test with test accounts or run a beta.

Canary deployments, in which a deployment targets only a subset of hosts in the fleet, perhaps 10%, also allow features to roll out slowly and in a more controlled way. Paired with automated monitoring, this technique can quickly validate new code in production without exposing a large number of users, and allows a quick rollback if an issue is found.
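A canary rollout along these lines might look like the sketch below. The function names and the `deploy_to`, `is_healthy`, and `roll_back` callbacks are hypothetical stand-ins for whatever deployment and monitoring tooling a team already has; the 10% default mirrors the figure used above.

```python
from typing import Callable, Sequence


def pick_canary_hosts(hosts: Sequence[str], percent: int = 10) -> list[str]:
    """Pick roughly `percent` of the fleet (at least one host) as canaries."""
    if not hosts:
        return []
    count = max(1, (len(hosts) * percent + 99) // 100)  # integer ceiling
    return sorted(hosts)[:count]


def canary_release(hosts: Sequence[str],
                   deploy_to: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   roll_back: Callable[[str], None],
                   percent: int = 10) -> str:
    """Deploy to the canaries first; continue to the rest only if they stay healthy."""
    canaries = pick_canary_hosts(hosts, percent)
    for host in canaries:
        deploy_to(host)

    if not all(is_healthy(host) for host in canaries):
        for host in canaries:
            roll_back(host)  # only the canary hosts ever ran the new code
        return "rolled_back"

    for host in hosts:
        if host not in canaries:
            deploy_to(host)
    return "released"


fleet = [f"host{i:02d}" for i in range(20)]
print(pick_canary_hosts(fleet))
```

In practice the health check would be the same kind of automated metric monitor used for full releases, given some time to observe real traffic on the canary hosts before the rollout continues.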

Teams also need to evaluate how they get development and test data. Many teams create test environments that developers code against and that the initial round of automated tests runs against. They spend countless hours maintaining the data in these environments and ensuring that the services there are up, running, and stable. Some teams, though, don’t bother, and test directly in production. Rather than trying to keep environments in sync through complicated processes, they just use production data. This means they can move faster, as they aren’t burning cycles maintaining test environments, setting up and nursing test data, and chasing issues in staging environments that are actually caused by inconsistent data.

Effective software development teams move as fast as possible, but don’t break things. They automate everything except development itself, and they build tools and systems to monitor the health of their systems and the user experience, so that when something does break, they know quickly and the system can respond, minimizing the disruption to the user. Doing so allows them to deliver new features and improve experiences lightning-fast and to shorten the feedback loop with the user, so they learn what their users need much faster. These organizations can outpace competitors and deliver the best value to their users by focusing on velocity and automation.