When one little gotcha hits over 100 servers at once

It’s not really a little gotcha anymore, is it?



The real forensics and reverse engineering will start Monday, but the basics are already known. An updated cfengine RPM from an upstream (third-party) repo somehow reacted differently to our configs than the old version did and set the permissions on “/” to 700. That would have been a fascinating “what if” scenario if it hadn’t happened to about a hundred servers at once at 5am on a Saturday. Turns out it has nifty little side effects like breaking ssh, because sshd cannot read its key files, and of course all our Apache servers started tossing out 403s for everything in the docroot. So not only were most of our sites down, but we couldn’t log in to fix it! It gets better though: for reasons I still haven’t debugged, this *also* breaks console logins. So all our serial console and DRAC out-of-band access was no good either!
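For the curious, here is a rough sketch of the kind of sanity check that would flag this condition. It relies on one fact: a process that is neither the owner nor in the group of “/” needs the "other" execute (search) bit on it to resolve any path underneath, and 700 takes that bit away. The function names are made up for illustration; this is not something we actually had running.

```python
import os
import stat


def others_can_search(mode: int) -> bool:
    """Return True if the 'other' execute (search) bit is set in a mode.

    Without this bit on a directory, a process that is neither the owner
    nor in the group cannot resolve any path that passes through it.
    """
    return bool(mode & stat.S_IXOTH)


def root_is_sane(path: str = "/") -> bool:
    """Check the permission bits of `path` (hypothetical helper name)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return others_can_search(mode)


if __name__ == "__main__":
    # 0o755 is the usual mode for "/", 0o700 is what we ended up with.
    for example in (0o755, 0o700):
        print(oct(example), others_can_search(example))
    print("live check of /:", root_is_sane("/"))
```

Something like this could run from cron or a monitoring agent, although once “/” is already 700 anything not running as root will have trouble even starting, so it is more useful as a post-update check on a canary box.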

We wound up having a five-man parallel effort to reboot every affected machine into single-user mode, fix the perms, turn off cfengine, and reboot again. Whenever you do that many unclean reboots you’re guaranteed to spend the next week debugging “fallout” issues.

The inevitable question of course is “what the hell were you updating packages for at 5am on a Saturday”. There are a couple of answers to that. Currently all our recent machines yum-update every 15 min. Now that comes off our own repos, which are mirrors of several popular upstream CentOS/RHEL 3rd party repos. I don’t know off the top of my head how often the repos update, but I wouldn’t be surprised if the sync just runs out of /etc/cron.daily, which kicks off just after 4am. The reason our updating is so aggressive is twofold. One is that we are incredibly strapped for manpower/time, so a lot of times we do risky things because that’s how you make up time. If you cut 10 corners and only one burns you, sometimes it’s worth just taking the occasional burn. Two is that even though it sounds risky, we’ve had hundreds of servers doing it for nearly two years and this is the first time it’s actually caused a significant problem. That’s a real testament to the stability and QA/regression-testing of the Red Hat package base. In fact the package that burned us here was from repoforge. The only other time we’ve hit a snag was when the CentOS project released a few kernel updates but didn’t do the clustersuite updates at the same time (clustersuite packages are very picky about kernel versions). This didn’t break anything live, but we found out when rebooting our standby mailserver that our failover setup was broken and wouldn’t have worked when we needed it.

The opposite end is to only do updates one server at a time on an as-needed basis. The problem with that is that with any popular Linux distro the updates tend to stack up so much that if you haven’t updated in a few months and go to install something new, the dependency chain will often be dozens of packages long. So you wind up making large leap-forward changes on a host when you really only wanted to install one thing, which in turn makes everyday activity riskier and makes any given project go slower as we deal with more fallout or take more precautions.

The funny thing is, we had actually gone for an 80/20 tradeoff and hedged our bets on this. A list of key packages (of which cfengine was one) was in an rsync exclude list. So it shouldn’t have been updated. But, it turns out, the quoting in the config wasn’t quite right and it was “failing silently” by just allowing the updates through. Boy, the littlest things just getcha hard sometimes.
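The lesson I take from that is to verify the exclude list independently instead of trusting the sync config. Below is a minimal sketch of such a check: after the mirror sync, list the files in the local repo and complain if any of them match a pattern that was supposed to be excluded. The function name, the file names and the version numbers are all hypothetical. The quoted pattern at the end is only my guess at what a quoting mistake can look like, not the actual bug in our config, and Python's `fnmatch` globbing only approximates rsync's pattern rules.

```python
from fnmatch import fnmatchcase


def leaked_packages(mirror_files, exclude_patterns):
    """Return mirror files that match a pattern meant to be excluded.

    An empty result means the exclude list did its job. Anything else
    means a "protected" package made it into the local repo anyway.
    """
    return sorted(
        name
        for name in mirror_files
        if any(fnmatchcase(name, pattern) for pattern in exclude_patterns)
    )


if __name__ == "__main__":
    # Hypothetical contents of the local mirror after a sync.
    mirror = ["cfengine-1.0.rpm", "htop-1.0.rpm"]

    # The patterns we *intended* to exclude, kept separate from the sync config.
    intended = ["cfengine*"]
    print("leaked:", leaked_packages(mirror, intended))

    # Illustration of a silent failure: a pattern that carries literal quote
    # characters matches nothing, so nothing is excluded and nothing errors.
    print(fnmatchcase("cfengine-1.0.rpm", '"cfengine*"'))
```

The point of keeping the intended patterns in the checker rather than parsing them back out of the sync config is that a quoting error in the config cannot hide itself from the check.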