Everyone thinks their system is reliable until suddenly it isn’t. With a service-oriented or microservice-based architecture, it can be easy to overlook a single point of failure that can cause a massive cascading failure, quickly harming users and losing their trust. Ensuring that a single failure is handled gracefully is of paramount importance in a large-scale distributed system, but that’s much easier said than done.
I’ve worked for multiple companies building large-scale distributed systems, and each took a slightly different approach. At Goldman Sachs, working on a high-throughput order and trade management system, efficiency and consistency were more important than outright reliability. Processing faster and faster was always the goal, and while the system needed to be highly available, especially during trading hours, graceful degradation wasn’t a key focus. Instead, the system needed to be consistent, relying on active backups and very fast failover in the case of an issue. Availability was such a high concern that alarms and paging were barely used; instead, the system was designed to be self-healing, which we tested frequently, rather than relying on human intervention, which could be slow and error-prone. This configuration and failover behavior were tested with every developer commit as part of the continuous integration test suite.
At Audible, the system is designed to handle massive scale and concurrent users. Instead of perfect consistency, the goal is to maintain high availability through a distributed system of small, redundant nodes for each microservice running across a large fleet. No one host or server is important enough to matter; resiliency comes from the size of the fleet. No one service should be so critical that its failure causes an outage of the entire system. Though an individual feature may stop working, the rest of the system should remain available and usable, degrading gracefully so that the user can still use the rest of the application and may not even notice a given feature is unavailable.
Achieving this level of resiliency and reliability is no easy feat, though. It’s one thing to set a reliability goal and an entirely different thing to actually achieve it and ensure it is maintained as new services and features get added. The only way to realistically accomplish this is by enforcing it through culture and through mandatory, regular testing. Only by pairing these together can organizations achieve real reliability. Automated, enforced processes like testing ensure that once the bar is set, it cannot be lowered without a conscious agreement to do so. Minimizing the feedback loop also ensures these tests provide value to developers rather than becoming a pain point where tests always fail and developers begin skipping them. It’s important to make these tests part of the regular development and CI/CD cycle so that developers can immediately see whether anything has broken as a result of their changes. Building a culture that treats reliability as a core tenet also encourages the team to continuously raise the bar for building resilient software and ensures that the bar won’t be lowered over time. A successful team will have engineers who refuse to ship a code commit without reliability tests included.
Once these tests are created, they need to be run often and regularly. Because systems change so quickly in fast-paced technology organizations, the results of a test are only valid for a short time; essentially, the results have a shelf life. Teams with truly reliable software run these tests regularly, as often as with every check-in as part of their normal CI/CD pipeline, and allow their developers to run them locally. Reliability tests should be treated as first-class citizens in the same way as unit tests and functional integration tests.
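As a concrete sketch, a reliability test can inject a dependency failure directly and assert that the service degrades instead of crashing. The names below (`fetch_recommendations`, `failing_catalog`) are illustrative, not from any real codebase:

```python
def fetch_recommendations(catalog_client):
    """Return recommendations, or an empty list if the dependency fails."""
    try:
        return catalog_client()
    except Exception:
        # Degrade gracefully: the feature disappears, the app keeps working.
        return []

def failing_catalog():
    # The reliability test injects the failure instead of waiting for a real one.
    raise TimeoutError("catalog service unavailable")

def test_degrades_when_catalog_is_down():
    result = fetch_recommendations(failing_catalog)
    assert result == []  # no exception escaped; the response is still usable

test_degrades_when_catalog_is_down()
```

A test like this is cheap enough to run on every check-in, which is what keeps its results within their shelf life.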
Teams also need to establish their base tenets and best practices for how they handle resiliency. In an optimal distributed system, one service, though dependent on several others for information and functionality, should be able to handle any of those dependencies failing. An error, missing data, or timeout should be handled by logging an error with debugging information, emitting a metric that can be alarmed on, and continuing with whatever logic can still be achieved. Services may also handle downstream failures with a circuit breaker: the service detects a downstream failure and backs off that dependency for some time. After that period, a single request may trigger a retry of the dependency; if it fails, the back-off continues, otherwise the switch flips back and the service resumes calling the dependency.
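A minimal circuit breaker along those lines might look like the following sketch (the class and parameter names are illustrative; production systems typically reach for a hardened library or a service-mesh feature rather than hand-rolling this):

```python
import time

class CircuitBreaker:
    """Sketch of a circuit breaker: open after repeated failures,
    allow a single trial call through after a cooldown period."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after  # back-off window in seconds
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, dependency, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()       # still backing off: skip the call
            # Cooldown elapsed: let one trial request through.
        try:
            result = dependency()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            return fallback()
        # Success: flip the switch back and resume normal calls.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback here is whatever degraded behavior the service has decided on: cached data, an empty result, or a disabled feature.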
Another technique to provide some data where possible is to separate a service’s response data by type, which often corresponds to a specific dependency. For example, one of our services may provide playback details including the asset itself, chapter metadata, bookmarks and notes, and product metadata, all of which come from different sources. Though they are requested in a single call to minimize the number of network round trips, if only a single dependency fails, we don’t want the entire call to fail. If we cannot get the chapter details, we can still allow playback and disable chapter info. The service then continues to return a response with all of the available data and provides a status indicating that the chapter data was not available. The client can then do what it likes with this, still providing the core functionality even during an outage of one service.
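One way to sketch this pattern: wrap each dependency fetch independently and record a per-section status, so a single failure never fails the whole response. The section names below are hypothetical stand-ins for the playback example:

```python
def get_playback_details(fetchers):
    """Assemble a playback response from independent dependencies.

    `fetchers` maps a section name (e.g. "asset", "chapters", "bookmarks")
    to a zero-argument callable that fetches that section's data.
    """
    response = {"data": {}, "status": {}}
    for section, fetch in fetchers.items():
        try:
            response["data"][section] = fetch()
            response["status"][section] = "available"
        except Exception:
            # One failed dependency must not fail the whole call.
            response["data"][section] = None
            response["status"][section] = "unavailable"
    return response
```

A client receiving `status["chapters"] == "unavailable"` can hide the chapter UI while playback itself continues to work.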
All of this should be done with the end user in mind. The only reason for this resiliency is to ensure the best possible experience for the user even when systems are failing or otherwise unavailable. When done well, customers shouldn’t even notice that an outage is happening. Where possible, disabling buttons, removing details from pages, or queueing up a retry for a workload that can be attempted later are good ways to gracefully degrade the user experience. Good resiliency means that users can still get to the functionality they most desire and need (checkout flows, consumption, and core use of the app) even when services or other parts of the system are down.
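The queued-retry idea can be sketched as follows, using an in-memory deque as a stand-in for a durable queue; `submit_annotation` and the worker are hypothetical names:

```python
import collections

retry_queue = collections.deque()  # stand-in for a durable queue in a real system

def submit_annotation(upload, annotation):
    """Try to sync a user's annotation; queue it for later if the sync fails.

    The user's note is kept either way, so the experience is unaffected.
    """
    try:
        upload(annotation)
        return "synced"
    except Exception:
        retry_queue.append(annotation)  # retried later by a background worker
        return "queued"

def drain_retry_queue(upload):
    """Background worker: replay queued work once the dependency recovers."""
    while retry_queue:
        annotation = retry_queue.popleft()
        try:
            upload(annotation)
        except Exception:
            retry_queue.appendleft(annotation)  # still down; stop and try later
            break
```

From the user’s perspective the save succeeded immediately; the system quietly reconciles once the dependency comes back.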
Systems shouldn’t be so “graceful,” however, that operators don’t even know when a problem happens. Though the user shouldn’t see an error, the team itself definitely should, typically through automated alarming based on error metrics. Downstream errors should be hidden from the user but still treated as errors from an operations standpoint, with the same urgency. If left unchecked, a downstream failure, even when handled gracefully, can cause a cascading failure if something else unplanned occurs. Treat these failures as if they were impacting all users, even if they aren’t currently.
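In code, a gracefully handled failure should still leave an operational trail. This sketch uses a plain dictionary as a stand-in for a real metrics client (StatsD, CloudWatch, and similar), and the names are illustrative:

```python
import logging

logger = logging.getLogger("playback-service")
metrics = {}  # stand-in for a real metrics client

def emit_metric(name, value=1):
    # A real client would push this to a monitoring system with an alarm on it.
    metrics[name] = metrics.get(name, 0) + value

def fetch_chapters(chapter_client):
    """Handle a downstream failure gracefully, but never silently."""
    try:
        return chapter_client()
    except Exception:
        # The user sees a degraded response; operators see the failure.
        logger.exception("chapter service call failed")
        emit_metric("chapters.dependency_error")
        return None
```

The alarm lives on the metric, not on user-visible errors, so the team gets paged even when no user ever saw a problem.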
As systems grow in scale and complexity, it’s more important than ever to ensure they remain reliable and resilient. The best engineering teams design for failure scenarios and plan for the worst to happen because they know that one day, often in the middle of the night, it will. Nothing loses users’ trust faster than a complete system outage where they can’t find or derive value from a product. Applications need to be designed with reliability in mind, but teams also need to instill a culture of reliability and enforce it through automated processes ingrained in the development lifecycle. Only by doing so can a team fully ensure they are providing the best user experience.