Author: Preston St. Pierre
events that I can only describe as “interesting,” I celebrated it on Sunday,
November 21st. For my birthday, I learned a good number of very important
lessons. Just a few examples: UPS batteries are wired in series; out-of-band
communication devices are a lifesaver when your mail server is down and two
thirds of your crew is 800 miles away; document, document, document, and much
more.
Here’s the scoop
So it’s the Friday before my birthday, and I’m saying goodbye to most of the
people I work with, who won’t be back until after the LISA conference. I’m not
feeling stressed about their departure, as I’ve now been in the department for
a little over three years, and feel fairly confident that I can muddle through
pretty much anything that comes along. Simple resource requests from users,
debugging wireless networking or DSL issues, and debugging services in our environment
are all pretty much old hat. So I leave to begin celebrating my 31st birthday.
Sunday the 14th started as a very relaxing day. I had been out Saturday night
quite late, and didn’t rise until 10:30. On purpose. It was my birthday, and
nobody was gonna rush me into doing anything. I walk past my office down to the
kitchen and make some coffee, I have some breakfast, and I walk back upstairs
to see what’s new in my inbox. Only there is no inbox, per se. Just an error
message from Evolution saying it couldn’t reach my IMAP server. I’m not shaken
a bit. I recently moved into a new place, and was forced to give up the awesome
Speakeasy DSL service I had for Comcast cable internet. I figured it was a
simple matter of having to reboot my cable modem to pick up a new IP address or
something. But wait… I can get to Slashdot… I can get to Google… I can
get to sites I’ve never been to before, so I know it’s not cached… Odd… But
I’m still not quite worried.
Hey, I can’t get to the department website! Hey, I can’t ping the web server!
ACK! I can’t ping any of our servers! Oh no. My new Blackberry 7290 lets
out a buzz in the background of my office, and I perk up. “Mail! Someone got
through to the mail server, so all is not lost!” Wrong. Blackberries can send
PIN-to-PIN using nothing more than the cellular network, bypassing any notion
of a mail server, and that’s exactly how this message came to me, from a member
of our group who was about to board a plane for the LISA conference in Atlanta.
It read something like “Hey, something amiss here, anyone around?” Well, turns
out, I was around, and that was pretty much it.
As a group, we keep a pretty close watch on our infrastructure even when we’re
not around. We get emails from our UPS to let us know if there’s an outage,
emails from our syslog server, emails from our air handlers, even emails from
various key entry points to our machine rooms and networking cabinets. There
was an email from our UPS earlier about a power hit, but then another email
came that would seem to have indicated a recovery. I was hopeful that things
weren’t in some “day after” state when I suited up and headed into the office.
Early Discoveries
The first thing I noticed was via my ears, not my eyes. The UPS has a horn on
it that is louder than I had remembered. I silenced it, and then noticed it was
still in some faulty state which I hadn’t seen before. Flipping through the
menus on the LCD screen, I saw that the battery voltage was at 0%. This was
bad. However, the room, and the building, had power.
Next, I logged into our console server to connect to our file server. My heart
sank when I saw the prompt on the Sun 4500 server: “hit ctrl-d to continue
booting”. Oh no. Why hadn’t it booted? A quick glance over to the racks
of disk showed that an entire T3 storage array was black. I quickly power
cycled the array, continued booting the file server, and then quickly switched
consoles to check on another machine that I knew would be lost without the file
server. It was also in a weird, half-booted and broken state, and running
uptime on both machines showed that they had been up only a couple
of hours.
By now, I’m thinking that at some point our entire machine room was
black. Just then a member of the user community comes by and says his machine
rebooted some time ago, and he’s since been unable to get an IP address. So now
I know the entire building lost power, which leads me to the only reasonable
conclusion, which was kind of scary, because it’s never happened before: our
UPS completely and catastrophically failed. Oh joy. I call the UPS tech
support, and they get somebody on the road to visit within the hour.
Of course, the fact that the power is back means that something else pretty
catastrophic happened: every machine in our entire machine room was powered on
and booted at the exact same time. Service dependencies be damned!
Calling for Backup
At this point, it’s pretty clear that I’ll be needing backup. By now, a Sun
tech is on his way to replace a disk in our array, and a UPS tech is coming out
to get our UPS back to normal. It’s clear that the entire world will have to be
brought back down, if only to ensure that it is brought up in the order it’s
supposed to come up in. My boss and coworker are still in town, and they come
in to lend a hand.
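
For the curious, "the order it's supposed to come up in" is a dependency-ordering problem, and a topological sort is one way to work it out. What follows is only a minimal sketch: the service names and the dependency edges are made up for illustration and are not a record of our actual environment. It relies on `graphlib`, which is in the Python standard library from version 3.9 on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must already be up.
# Names and edges are illustrative assumptions, not our real machine room.
DEPENDS_ON = {
    "storage-array": [],
    "file-server": ["storage-array"],
    "dhcp": [],
    "mail-server": ["file-server"],
    "web-server": ["file-server"],
}


def boot_order(depends_on):
    """Return a start order in which every service follows its dependencies."""
    # TopologicalSorter takes {node: predecessors} and raises CycleError
    # if two services depend on each other.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, name in enumerate(boot_order(DEPENDS_ON), start=1):
        print(f"{step}. {name}")
```

Reversing the same list gives a reasonable shutdown order, and a cycle error is a useful early warning that the documented dependencies contradict each other.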
All this time, it should also be noted that, although we were without a mail
server, probably 50 or so messages passed through my Blackberry to communicate
with others in my group who were out of town. This was invaluable since there
were slight changes and additions to our services which were not yet reflected
in the documentation.
The UPS tech says a battery in the UPS failed, and shows us splatterings of
battery acid inside the battery compartment. There are 20 batteries, so I’m a
little baffled, as I’m not an expert on UPSes. Turns out, UPS batteries are
wired in series, which makes some sense if you think about it. Wiring in series
adds the voltages of all of the batteries together, which is how the UPS gets
the high DC voltage it needs. Wiring in parallel would add up capacity instead,
but only at the voltage of a single battery. Of course, the side effect of this
is that UPS batteries resemble old-style Christmas lights: one goes out, they
all go out.
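
To make the series-versus-parallel difference concrete, here is a small sketch of the arithmetic. The 12 V and 9 Ah figures are assumptions picked for illustration; I don't have the real ratings of our batteries in front of me. It also models only a battery that fails open, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Battery:
    volts: float = 12.0       # assumed nominal voltage (illustrative)
    amp_hours: float = 9.0    # assumed capacity (illustrative)
    failed_open: bool = False


def series(bank):
    """Series string: voltages add, capacity is limited by the weakest unit.

    One open battery breaks the only current path, so the string gives nothing.
    """
    if any(b.failed_open for b in bank):
        return 0.0, 0.0
    return sum(b.volts for b in bank), min(b.amp_hours for b in bank)


def parallel(bank):
    """Parallel bank: voltage stays at one battery's level, capacities add.

    An open battery simply drops out and the rest keep supplying current.
    """
    healthy = [b for b in bank if not b.failed_open]
    if not healthy:
        return 0.0, 0.0
    return healthy[0].volts, sum(b.amp_hours for b in healthy)


if __name__ == "__main__":
    bank = [Battery() for _ in range(20)]
    print(series(bank))      # (240.0, 9.0)
    bank[7].failed_open = True
    print(series(bank))      # (0.0, 0.0)
    print(parallel(bank))    # (12.0, 171.0)
```

With twenty healthy batteries the series string reaches 240 V under these assumed ratings, and a single open battery takes the whole string to zero, which matches what we saw.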
The Sun guy shows up a little later, and, though he seems a bit confused by
some of the errors in the logs, replaces the drive and things are well with the
world. At this point, it’s after 5 PM, and my boss instructs me to go home for
my birthday dinner. I followed his orders, and left them there to bring the
rest of the service machines back up.
Post Mortem
After all is said and done, we’re not happy that this event took place, but the
reality is that UPS batteries don’t often blow up, and we don’t often get
simultaneous disk failures in T3 arrays that cause a file server not to boot.
At the same time, we are still discussing ways to shorten both routine and
emergency downtimes, and how to make the process smoother. The machine room is
constantly evolving, and along with it our process for keeping up with the care
and feeding of our systems and services.
Some investments, like the Blackberries, which we had upgraded only a
couple of days before, were totally justified. Others, like a call-in number
for downtime (similar to a school’s inclement weather line), are being
explored. Still others, like a paid-for inspection of our UPS batteries only a
couple of weeks before the failure, are being questioned.
These are all signs of a healthy admin team. There was no infighting, no
blaming, no finger pointing, no mumbling under our breath. Things needed doing,
and they got done. What didn’t get done is getting done with the help of
others. What couldn’t be done is being researched, and all is peaceful in the
user community.