At first I thought it was a bit of a stretch to call it a self-healing system but honestly, it’s not. As long as you only deal with software failures, they can be fixed with software. Hardware is harder to fix like that (thus the option to escalate to a human admin at the end of the chain).
A guy (Patrick) at Facebook found himself doing the same things over and over again and started writing scripts that did these things for him. The number of scripts grew, their complexity increased, and after a while the entire team found them useful. At that point it is a pretty obvious step to turn this into a full-fledged system, and they did just that, creating FBAR: Facebook Auto-Remediation.
The article doesn’t really go into the details but that’s okay. The thinking and the methods used here are solid. The first stumbling steps of FBAR weren’t the workings of a complete enterprise-level tool. They were a few lines of code that helped eliminate time-consuming and probably mind-numbing manual tasks. It looks like it is polling at the moment. An event-driven approach would have been cooler but hey, as long as this Gets It Done it doesn’t really matter. Besides, it would take an insanely big server farm for the polling to become an issue.
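Since the article skips the details, here is a minimal sketch of what a polling remediation chain could look like. To be clear, this is my own guess at the shape of the idea, not FBAR’s actual code: `remediate`, `poll_once`, the health check and the remedies are all hypothetical names, and the "fleet" is a simulated dictionary. Each pass checks every host, tries the software fixes in order, and hands the host to a human only when none of them helps.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical types and names throughout; nothing here comes from FBAR itself.
Check = Callable[[str], bool]   # returns True when the host is healthy
Remedy = Callable[[str], None]  # attempts one software-level fix


def remediate(host: str, is_healthy: Check, remedies: List[Remedy]) -> str:
    """Try each remedy in order; escalate to a human if none of them helps."""
    if is_healthy(host):
        return "healthy"
    for remedy in remedies:
        remedy(host)
        if is_healthy(host):
            return "fixed by " + remedy.__name__
    return "escalated to admin"


def poll_once(hosts: Iterable[str], is_healthy: Check,
              remedies: List[Remedy]) -> Dict[str, str]:
    """One polling pass over every host; a real loop would sleep and repeat."""
    return {host: remediate(host, is_healthy, remedies) for host in hosts}


if __name__ == "__main__":
    # Simulated fleet: web2 has a crashed service, web3 has a dead disk.
    state = {"web1": "ok", "web2": "service down", "web3": "disk dead"}

    def is_healthy(host: str) -> bool:
        return state[host] == "ok"

    def restart_service(host: str) -> None:
        if state[host] == "service down":
            state[host] = "ok"

    def reboot(host: str) -> None:
        pass  # assumption: a reboot cannot repair broken hardware

    for host, outcome in poll_once(list(state), is_healthy,
                                   [restart_service, reboot]).items():
        print(host, "->", outcome)
```

In the simulated run, the crashed service on web2 is fixed by the first remedy, while web3 runs out of remedies and ends up with the admin. An event-driven variant would call `remediate` from an alert handler instead of looping over every host.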
Right on, Facebook. The last time we spoke, you didn’t have any QA, but it does give me some comfort that you are in fact doing some of these clever things. Who knows, maybe the step to automating some sort of high-level tests for your services isn’t that alien after all?
On a related subject, I’ve been working on TAS (The Automation System), which is now part of TTS (The Testing System). It’s bare metal, it’s light, it gets the job done. I need to make it good enough for dogfooding before I release the first drop. Any day now. Or, before Christmas =)