Man, this autopsy of the last great internet breakage was a breath of fresh air when it comes to public disclosure of mistakes made by companies with thumbs in a great deal of interwebs infrastructure. Like most of the major outages (remember the S3 one a few years ago?), it came down to a single mistake that cascaded out to wreak havoc. At least in this write-up, the person responsible wasn’t thrown under any buses, because regular expressions are hard and, in many cases, difficult to test definitively. It also reminds me how easy it is to break many things at great speed when you have a pile of automation in your stack: that automation not only doesn’t fix everything, it can also turn small issues global pretty quickly.
Anyway, it was nice to see a comprehensive write-up of what happened that didn’t lay blame on any individual human or service. Reading it also made me clench with fear, since I have the opportunity to make mistakes like this all the time, often under pressure, albeit nowhere near the scale of this one.