Chaos Monkey, the tool that causes minor faults in order to prevent greater ones
Alejandro Guirao, developer at intelygenz, gives insight into Chaos Monkey, a resilience-testing tool that triggers random system failures. It is used by Netflix and is winning over other major companies.
In software development, any change can set off a chain of errors throughout the whole system, so it's essential to always be prepared. The idea behind Chaos Monkey is that developers themselves should trigger faults in their own tools and systems as a form of training.
Alejandro Guirao, a DevOps engineer at intelygenz and an expert in this tool, talked about its scope at the event Haciendo el Chaos Monkey (Making Chaos Monkey), held at the BBVA Innovation Center in Madrid. We asked him to explain in detail how the tool works.
What's the process behind the tool?
Tools of this type belong to the discipline known as resilience engineering, and they follow a kind of scientific method. First you have to measure your system: you must make sure it's working and that its performance is adequate before you do anything to it. Then you formulate a hypothesis: if I attack it, if I mess around with it like this, will it hold up? That's when you start launching attacks with the tool. At the end you measure again and compare, and you may come to a different conclusion, which leads to a new experiment. It's a fairly iterative cycle.
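The cycle described above (measure, hypothesize, attack, measure again, compare) can be sketched in a few lines of Python. This is only an illustrative sketch and not Chaos Monkey's actual implementation: `SimulatedCluster`, `run_experiment`, the capacity figures and the success-rate threshold are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedCluster:
    """Hypothetical stand-in for a real system: a pool of identical instances."""
    instances: int = 5
    capacity_per_instance: int = 100  # requests/s each instance can serve (assumed)
    healthy: set = field(init=False)

    def __post_init__(self):
        self.healthy = set(range(self.instances))

    def terminate(self, instance_id):
        """Simulate the fault: one instance disappears."""
        self.healthy.discard(instance_id)

    def success_rate(self, load):
        """Fraction of `load` requests/s that the healthy instances can serve."""
        if load <= 0:
            return 1.0
        capacity = len(self.healthy) * self.capacity_per_instance
        return min(1.0, capacity / load)


def run_experiment(cluster, load, threshold, rng):
    """Run one measure -> attack -> measure -> compare cycle."""
    # 1. Measure the steady state before touching anything.
    baseline = cluster.success_rate(load)
    if baseline < threshold:
        raise RuntimeError("system is unhealthy before the experiment")

    # 2. Hypothesis: the success rate stays at or above `threshold`
    #    even after one randomly chosen instance is lost.
    victim = rng.choice(sorted(cluster.healthy))

    # 3. Launch the attack.
    cluster.terminate(victim)

    # 4. Measure again and compare with the hypothesis.
    after = cluster.success_rate(load)
    return {
        "baseline": baseline,
        "victim": victim,
        "after": after,
        "hypothesis_held": after >= threshold,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for load in (350, 450):
        result = run_experiment(SimulatedCluster(), load, threshold=0.99, rng=rng)
        print(load, result)
```

In this toy model, five instances of 100 requests/s each survive the loss of one instance at a load of 350 requests/s but not at 450. A failed hypothesis like the second one is what would prompt a fix and a new experiment.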
What are the main advantages of using Chaos Monkey?
It basically tests a system's resilience. A system is composed of software, an architecture on which the software is installed, and a series of processes that are sometimes business processes and sometimes human ones. All this forms a pyramid, and this is what's ultimately tested with Chaos Monkey. The simple fact of provoking a minor glitch in your input means you can see whether you're really able to survive it.
Is Chaos Monkey a worthwhile tool for companies in any sector?
The tool itself is mainly focused on the technical and IT side of companies. That's why its greatest benefit is for systems or software development departments. But if we go a little further and don't look so much at the tool but at its principles and practices, it can be extrapolated to any area. In fact, this is something that's been done for some time now in the aeronautics industry, the security industry and even in the medical industry.
Could these simulated production failures ever actually harm the company?
Yes, in theory they could. In fact, this would be one way to uncover a major problem in what we've created, whether in the architecture, the software or the processes. Problems may arise; it's a real risk, and one that has to be accepted at the very top levels of the company. That is, the entire senior and middle management must be aware that these experiments are being done and that there is a not insignificant probability that they may affect production. They must also understand that, in the end, the medium- to long-term benefits are going to vastly outweigh this risk.
Isn't using Chaos Monkey like throwing stones at your own glass house?
Let's say that instead of waiting for a really big stone to land on your glass house, you start by throwing a tiny stone to see whether it holds up or collapses. That way you can see whether there are any gaps in your system, and you get the chance to fix them.
Does it test teamwork?
Yes, it does. Tools like these, which deliberately cause problems, end up requiring multidisciplinary teams to resolve them, so they encourage teamwork not only among people from systems, operations and development, but among everybody. For example, when Google, following the resilience engineering philosophy, ran simulations of flooding in its data centers, there was one case where they needed a diesel generator during the simulation and didn't have any diesel. So people started working out how to buy some: the engineers began to call around, but so did the people in administration, and other departments came up with the phone numbers of people they knew who could get hold of diesel or lend them the money for it. One employee even offered the company his credit card so they could buy diesel. In the end it's an effort that involves multidisciplinary teams.
“The best way of avoiding failure is to fail constantly.” What do you make of this phrase?
When you learn judo, the first thing you learn is how to fall (what's known as "ukemi waza"), so that when you do the exercises you don't hurt yourself: you're no longer afraid and you know how to fall correctly. This is much the same thing: if you're used to a series of minor faults, you can avoid major ones thanks to the experience you've gained. It's linked to the lean startup philosophy of failing fast.
How has Chaos Monkey been received by the open source community?
Very favorably, although of course it was a real shock. When Netflix first announced it, nobody knew the company was doing something on this scale. Everything Netflix does sets a precedent. They said they were constantly provoking these failures in production, but that the failures had no effect because they had reached such a level of software development and engineering that they were almost immune to numerous catastrophic errors... That made a lot of people want to emulate them. That's when we began to see posts from companies that had decided to take the leap. And then the open source community as a whole began to use it.
Does the fact that Chaos Monkey was successful with the Netflix development team endorse the tool?
Today the Netflix team is unique in the world. It has engineers who are genuine experts in many performance issues. Netflix currently operates on the Amazon Web Services (AWS) platform and they have people who know more about AWS than the people at Amazon themselves. It's impressive. So it's always an endorsement when it comes from them.
What other major companies have used or use the tool?
As well as Netflix, Google (which uses its own version of Chaos Monkey) and Amazon, there's Cover Flow, IBM and Yahoo, for example, which have published articles on their technical blogs saying they were beginning to use the tool. There are also other brands like Nike, which has a technology division, although that's maybe not what comes to mind when you think of Chaos Monkey.
Do you recommend using it all the time or just as a stress test in particular situations in a process?
I think it should be used constantly in production, that is to say, fairly regularly. It shouldn't be done just once and then stopped; the frequency should be increased until you reach a point where you're more or less satisfied. But you have to be careful when you first install it, as at the beginning it's bound to be rather catastrophic and you'll have some faults that affect production. With time things will settle down. Your system will have improved considerably if you've learned from your mistakes, and you can finally use it on an ongoing basis. What's more, as Netflix said in the introduction to this tool: “You never know if that change you made yesterday has caused your platform to become weaker”. There are always new changes, developers take on new features, and someone may have gone in to fix something at a particular time and provoked totally unforeseen consequences.
Do you think open source is the future for companies in terms of software and development?
I personally believe it's the future. The quality that open code guarantees, particularly when it's free software, and the ability not only to see the code but to modify it, extend it and adapt it, are things that can't be achieved with proprietary software. In fact, the software the big companies use, the software Netflix and Google use... everything they're built on, and all the technology the Internet runs on, is based on free software tools, so I'm 100% sure of it.