ALERT! EMERGENCY FIX REQUIRED!!!

I'm taking a vacation day today. I woke up at 5:00 to take my daughter to school (trip to Six Flags), then went downstairs to put some finishing touches on some HTML I'd been fiddling with for some new screens.

While doing this, a developer on my team started IM'ing me in a panic about a production problem that they were going to do an emergency deploy to fix. I attempted to calmly ask her to explain the exact situation to me, but it came out more like "ARE YOU F'ING JOKING!? TELL ME YOU AREN'T SERIOUS! WHAT THE HELL?"

Whoops, no more coffee for me, my bad.

So then I opened my inbox and discovered no fewer than a dozen emails about how, the day before, some sort of error had cropped up based on a specific data condition and stopped an extract process dead in its tracks.

For some reason, this software has run without exception for two months, and then suddenly yesterday at 5:00pm this data condition started to happen. What puzzles me is that this tiny, minuscule detail seems to have eluded everyone involved. Nobody seems to have stopped and asked, "Hmmmm, WHY did this start happening? What changed?" We all just started digging to get ourselves out of the little hole we found ourselves in.

Some fixes I heard were "patch the code to convert nulls to 0 or -1" or "update the production database and change the nulls to 0s or -1s". What I didn't hear (still waiting, by the way) is "Hmmmm, I wonder what changed to cause this to happen?" And worse yet, what I also didn't hear was "I wonder what we'll break if we suddenly change all this data to a different (invalid) value?"
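
To make that last question concrete, here's a minimal sketch of the difference between letting the job fail on the unexpected nulls and quietly swapping in a sentinel value. The row layout and field names are hypothetical, not our actual extract code:

# Hypothetical sketch of the extract step, not the real code.
# Each row is a dict pulled from the source table; 'amount' is the
# column that suddenly started arriving as NULL (None).

def extract_fail_fast(rows):
    """Roughly what the current code does: blow up on bad data."""
    for row in rows:
        if row["amount"] is None:
            # Failing here is noisy, but it tells us the upstream feed
            # changed and nothing downstream gets polluted.
            raise ValueError(f"Unexpected NULL amount for id {row['id']}")
        yield row

def extract_with_sentinel(rows):
    """The proposed 'fix': silently convert NULLs to -1."""
    for row in rows:
        if row["amount"] is None:
            # This keeps the job running, but -1 is not a real amount;
            # every downstream consumer now quietly gets garbage.
            row = {**row, "amount": -1}
        yield row

if __name__ == "__main__":
    sample = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
    print(list(extract_with_sentinel(sample)))  # runs fine, data is now wrong
    try:
        print(list(extract_fail_fast(sample)))
    except ValueError as err:
        print(f"Extract stopped: {err}")        # noisy, but nothing is polluted

The fail-fast version is the annoying one, but it's also the one that surfaced the fact that something upstream changed.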


Now that I'm a bit calmer, I'm also asking, "What is the business impact of this?" From what I can tell, we had a "red alert, all hands on deck emergency" response to a relatively trivial problem. I'm glad everyone wants to help and react quickly, but in my experience, reacting quickly without thought (Oh, your arm hurts? Cut it off!) more often than not has some serious side effects.

Now, someone might come back with an explanation about why this particular problem garnered the response it did, but I'm still waiting on that one. It's now 10:00am and I'm ready to take a break... I think I'll go outside and mow the lawn.

As a follow-up, we ended up on a conference call about this, and it turns out the problem was caused by a bad data condition that really shouldn't have happened. In this case it is probably good that it blew up, because now we can figure out how people are putting incorrect data into the system. While it's not ideal that we had such a big to-do, it is good that we didn't put in one of our "fixes". They would have masked the real problem and caused even MORE bad and incorrect data to be put into the system.

My takeaway from this is that sometimes I need to sit everyone down and keep asking questions. A lot of people go into "hammy" mode, and sometimes someone needs to sit down with them and help them calm down. It's funny, because I have a reputation as a hothead, but it usually takes quite a bit to work me up if everybody's being reasonable. I took the ARSE test a while back and I'm not TOO bad (yet).

Comments

Unknown said…
You, a hothead? Pshaw, mon frère, I don't believe these lies.

Of course it was the data! Has no one heard of Occam's Razor?

I might have responded like this...

-- OMG Matt the "arm durch Arbeit" program abended! We should hammer in some db updates to fix the problem!
-- Never fear, I'm going to give you a script that will pinpoint the problem. Call me when the script stops running.

#!/usr/bin/env python3
# The condition never changes, so this loops forever -- which is the joke:
# "call me when the script stops running."
x = 0
while x < 1:
    print("The data changed! Buy a clue!")
