In my first job, server problems could have severe consequences. The client-side software relying on the servers was used in doctors' surgeries and hospitals throughout the UK. There was a dedicated process for dealing with server problems that streamlined everything, right down to sprinting safely into the server room.
Immediately after each server problem had been remedied, a dedicated team would be formed. Its job was to analyse the nature of the problem, identify the underlying cause, and prioritise the fix based on how critical it was.
Some of the problems were obvious to spot and rectify, such as those caused by recent code or infrastructure changes. Others were more subtle and elusive, and therefore more dangerous. Every post-mortem was documented, and we'd search for patterns, ever hoping that some master trend would be revealed. But is it ever possible to predict the unpredictable?
As software systems evolve and grow in complexity, it becomes increasingly difficult for any one person to understand every part of the system, and therefore to think about every possible scenario where something could go wrong. This is especially true of large and complex systems with decoupled services, each with their own set of external and internal dependencies.
Some potential errors can be guarded against through defensive coding and best practices: automated and manual testing, code reviews, circuit breakers, monitoring, and so on. But if you don't know about a problem until it happens, you can't bake in resilience.
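To make one of these guards concrete, here is a minimal sketch of a circuit breaker, one of the defensive practices listed above. All names and thresholds are illustrative, not from any particular library: after a run of consecutive failures, it stops calling the flaky dependency for a cool-down period instead of hammering it.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures,
    stop calling the dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency presumed down")
            # Cool-down elapsed: half-open, allow one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

The point of the breaker is that a failing dependency degrades into a fast, predictable error rather than a pile-up of slow timeouts.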
The Netflix proposal, Principles of Chaos, formalises an approach to testing unexpected behaviours in systems through controlled experimentation. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
The Four Principles of Chaos Engineering
So how do you go about this? According to Netflix, the four principles of Chaos Engineering are:
- “Build a Hypothesis around Steady State Behaviour”
Rather than the nuts and bolts of a system, look at the measurable outputs: throughput, latency, errors, and so on, and focus on these metrics during experiments. This way, you have a 'control' set of metrics to measure against.
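One way to make this concrete is to express the steady-state hypothesis as tolerance bands around baseline metrics, then check observed values against them during an experiment. The metric names and baseline numbers below are illustrative, not from a real system.

```python
# Steady-state hypothesis as tolerance bands around baselines.
# These baselines are illustrative placeholders.
STEADY_STATE = {
    "error_rate": {"baseline": 0.01, "tolerance": 0.005},     # fraction of requests
    "p99_latency_ms": {"baseline": 250.0, "tolerance": 50.0},
}

def steady_state_violations(observed):
    """Return the metrics that drifted outside their tolerance band."""
    violations = {}
    for metric, bounds in STEADY_STATE.items():
        drift = abs(observed[metric] - bounds["baseline"])
        if drift > bounds["tolerance"]:
            violations[metric] = observed[metric]
    return violations

# An empty result means the steady-state hypothesis held.
print(steady_state_violations({"error_rate": 0.012, "p99_latency_ms": 410.0}))
# → {'p99_latency_ms': 410.0}
```

If an experiment produces violations, the hypothesis is falsified and you have found a weakness worth fixing.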
- “Vary Real-world Events”
Real-world events are the unpredictable influences that cause faults. Consider events such as a service going down or a spike in traffic, and treat each one as a variable in your experiments.
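As a sketch of how such an event becomes a variable, the wrapper below injects failures and latency into a dependency call at a configurable rate. The names and rates are illustrative; real fault-injection tools work at the network or infrastructure level rather than in application code.

```python
import random
import time

def chaos_wrap(fn, failure_rate=0.1, extra_latency_s=0.0, rng=random.random):
    """Return a version of fn that sometimes fails or slows down,
    simulating an unreliable downstream service."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault: dependency unavailable")
        if extra_latency_s:
            time.sleep(extra_latency_s)  # injected latency spike
        return fn(*args, **kwargs)
    return wrapped
```

Turning `failure_rate` or `extra_latency_s` up and down is exactly the "vary real-world events" knob: each setting is one experimental condition.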
- “Run Experiments in Production”
Probably the biggest psychological barrier for a developer adopting Chaos Engineering is the great taboo: testing in production. It is, however, what guarantees authenticity in the way the system responds to events.
- “Automate Experiments to Run Continuously”
This means building automation into the system, both for orchestrating the experiments and for analysing the results.
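The orchestration half of that automation can be sketched as a loop that applies each perturbation, checks the steady-state hypothesis, and always rolls the perturbation back. Everything here (the experiment registry, the check function) is an illustrative stand-in for real tooling.

```python
def run_experiments(experiments, check_steady_state):
    """Run each (inject, rollback) experiment and record pass/fail.

    experiments: dict mapping a name to a pair of callables.
    check_steady_state: returns True if the hypothesis held.
    """
    results = {}
    for name, (inject, rollback) in experiments.items():
        inject()  # perturb the system
        try:
            results[name] = "pass" if check_steady_state() else "fail"
        finally:
            rollback()  # always undo the perturbation, even on error
    return results
```

In a continuous setup this loop would run on a schedule, with the results feeding alerts and the post-mortem record.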
Implement it in your company
You may struggle to make Chaos Engineering routine in your company: there are considerable risks, such as breaking production systems, and it ties up the time and energy of an unknown number of participants (voluntary or not). It may simply not be possible in systems that require near-constant availability, without first investing in reliable fallback mechanisms and disaster recovery plans.
Here are some suggestions on how to make Chaos Engineering part of your testing practices, short of fully embracing it:
- Focus on the critical failures
Not everything that could go wrong is actually a problem. For example, a server that serves content may experience downtime. However, if the dependent server caches the content, the outage might not be a problem unless it persists for a long time.
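The caching fallback described above could look something like the sketch below; `fetch_fn` and the plain dict cache are illustrative stand-ins for a real client and cache layer.

```python
def get_content(key, fetch_fn, cache):
    """Prefer fresh content; fall back to the last cached copy on outage."""
    try:
        value = fetch_fn(key)
        cache[key] = value  # refresh the cache on every success
        return value
    except ConnectionError:
        if key in cache:
            return cache[key]  # serve stale content during the outage
        raise  # no fallback available: surface the failure
```

With this in place, the chaos experiment "take the content server down" should show no steady-state violation for cached keys, which is precisely the kind of non-critical failure you can deprioritise.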
- Start small
Start by looking at isolated services or components, rather than what could happen if everything goes down. This reduces the chance that in your early experimentation, you cause lasting harm. By introducing Chaos Engineering in one part of the system at a time, you can incrementally make improvements.
- Set aside dedicated time
Rather than continuously testing, try a ‘chaos day’, where a day is set aside to investigate and test system resilience. It is a good way to strengthen teams, and a good way to time box experiments.
From an adrenaline junkie perspective, Chaos Engineering sounds like fun. It is a powerful tool for embracing uncertainty, and a good way to examine how you guard against failure. Ultimately, it is difficult (not to mention philosophically impossible) to test against every unpredictable circumstance, but it is a worthwhile attempt.
For more information, see the Netflix article on introducing Chaos Engineering.