Software is developed and operated by fallible human beings, and it runs atop unreliable distributed infrastructure. Yet we need the software to work reliably. To make things worse, as the number and complexity of the services that make up a system increase, our confidence in the reliability of the overall system shrinks.
At Netflix, our software infrastructure is implemented as a large number of networked services. To increase our confidence that the overall system will remain available in the face of real-world events, we run experiments on the production system, including injecting failures. We call this approach “Chaos Engineering”. This talk will discuss the principles underlying Chaos Engineering and how we apply them at Netflix. Filmed at Devoxx 2015.