I brought down our SQL server

Asgorath

Well, I did a simple thing... I installed a new KVM over IP at work. I needed an IP for it, so I pinged .20 on our range and got no responses. Set up the KVM, gave it the .20 address. It was working intermittently, but I figured it was a KVM problem. Went home.

Come back the next morning and it turns out our SQL server was at the .20 address. :-( I had it down for 2 hours before our other tech figured out the problem. My boss is going to have a talk with me but doesn't have time right now. I have a full day to stew in my mistake and wonder what's going to happen. I think he's just going to take away some of my admin rights and tell me not to touch certain boxes.

This is the worst mistake I've made. I thought I was being safe just working on the KVM... but a small careless error brought down our website for 2 hours during a high-traffic period.

I don't expect sympathy here, just a good thrashing. Let this be a lesson to all the admins out there... be very, very careful when you're doing anything that could have major consequences. I thought that if I made a mistake, someone would complain. But it didn't bring down the SQL server at first; it wasn't until later that all SQL traffic was rerouted to the KVM. The two devices were fighting for the .20 address, so it was intermittent. The KVM ended up winning. :-( I'm a bit scared and damn angry at myself right now. Just needed to vent on how some bad admin'ing can go wrong.

If you want to scold me...feel free.

If you want to tell a tale of you doing something stupid like this...just as well.

Ops...you can lock this one if it gets out of hand, but please let me just voice this one for a little while.

Thanks,

Greg
 
A simple mistake... but one that's easily made when your network documentation wasn't consulted first.

Every device with a static IP needs to be documented... be it an Excel spreadsheet or a full-blown CMDB (configuration management database). Ideally there's an entire change control process that should be happening. Your change process (or change manager) should ask you a few questions...

1) What do you want to do? - Deploy a new TCP/IP device on the network.
2) What IP are you going to give it? - .20
3) What's the possible impact of giving it .20? - Uh... the SQL server is .20... I guess it would totally bring it down.

etc etc etc.
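
None of that needs fancy tooling, either. If all you have is a shared spreadsheet exported to CSV, even a tiny script can answer questions 2 and 3 before anything gets plugged in. A rough sketch in Python; the file name and columns are just an example layout, not anyone's real inventory:

```python
#!/usr/bin/env python3
"""Check a candidate static IP against a simple inventory before assigning it.

Rough sketch only: the inventory file and its columns (ip, hostname, owner)
are made-up examples; substitute whatever your spreadsheet/CMDB actually uses.
"""
import csv
import subprocess
import sys

INVENTORY = "static_ips.csv"  # hypothetical export of the IP spreadsheet


def documented_owner(ip):
    """Return 'hostname (owner)' if the IP is already allocated, else None."""
    with open(INVENTORY, newline="") as f:
        for row in csv.DictReader(f):
            if row["ip"].strip() == ip:
                return "{} ({})".format(row["hostname"], row["owner"])
    return None


def answers_ping(ip):
    """One ping, one-second timeout (Linux-style flags).
    A silent host is NOT proof the address is free."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    candidate = sys.argv[1]  # e.g. 192.168.1.20
    owner = documented_owner(candidate)
    if owner:
        sys.exit("{} is already documented as {}; pick another.".format(candidate, owner))
    if answers_ping(candidate):
        sys.exit("{} answered a ping; something undocumented lives there.".format(candidate))
    print("{} looks free; add it to {} before you plug anything in.".format(candidate, INVENTORY))
```

Even a check like that wouldn't have saved the OP on the ping side (the SQL box never answered), but the inventory lookup would have.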
 
Don't stress yourself out about it too much. Stuff like this happens in our line of work. At least you weren't an accountant who missed a few zeros.... :)
 
Why wasn't the SQL server ping-able?

Also, when dealing with static IP addresses you should have a list of assigned addresses and so on, so this doesn't happen.

Quite frankly I wouldn't blame you (much) lol
 
I really don't want to insult you, but how do you not know the static IP addresses on your network? Especially the servers. Even if you didn't know all the IPs because your network was huge, common sense would be to reference the log and add in the new KVM before even installing it.
 
Don't stress yourself out about it too much. Stuff like this happens in our line of work. At least you weren't an accountant who missed a few zeros.... :)

Indeed - we've all done something like this in our time. It happens.
 
Funny thing...

after I installed the KVM, I started documenting the network. I drew a diagram, labeled everything, and got about halfway through labeling the static IPs.
 
I'm pretty sure your boss won't give you much of a thrashing. It's not like you MEANT to do that. It was an honest mistake (sort of).
 
Funny thing...

after I installed the KVM, I started documenting the network. I drew a diagram, labeled everything, and got about halfway through labeling the static IPs.

So that was not documented before? That's a mess.
 
So that was not documented before? That's a mess.

Documentation would always be great, but sometimes I have a hard time organizing all of that and end up having a hard time finding the documentation.
 
Every static IP should be documented. If it is anyone's mistake, it is your boss's for not having the network documented properly. I would not worry too much about it, really.
 
There's an old saying....

"Shit happens."

I would not believe a single person who says they have never made a stupid mistake. It happens. Just take this as a lesson. Never assign a static IP address without first making sure it's not being used: check your (good) network documentation, ping the IP, and check DNS to see if the IP is registered to a host name. You did the right thing by creating good documentation, but it really should have been done a long time ago by whoever first installed the network.

If this was your first or even second serious mistake, I wouldn't expect too harsh a discipline unless your company lost a lot of money by having their SQL server down for two hours.
 
There's an old saying....

"Shit happens."

I would not believe a single person who says they have never made a stupid mistake. It happens. Just take this as a lesson. Never assign a static IP address without first making sure it's not being used: check your (good) network documentation, ping the IP, and check DNS to see if the IP is registered to a host name. You did the right thing by creating good documentation, but it really should have been done a long time ago by whoever first installed the network.

If this was your first or even second serious mistake, I wouldn't expect too harsh a discipline unless your company lost a lot of money by having their SQL server down for two hours.

Welcome to my world hehe, taking over and fixing someone else's network sucks for documentation.
 
Why did the ping time out?

I don't know. I'd never had a problem with pinging inside the network before. I pinged some other clients to be sure and they worked. I was convinced the IP was free.

In the future:

Check DNS records
ping
check documentation (I'm working on building it)
 
If you have access to the routers you can also check arp tables.
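
For what it's worth, the reverse-DNS and ARP parts of that checklist are easy to script too (a ping check is sketched further up the thread). A rough Python sketch; it assumes a Linux box with net-tools, the candidate address is made up, and the local ARP cache only knows about hosts this machine has talked to recently, so querying the router's table is still the better check:

```python
#!/usr/bin/env python3
"""Two more pre-assignment checks for a candidate IP: reverse DNS and ARP cache.

Sketch only; the address below is an example, and a host that drops ICMP and
has no DNS record can still be quietly sitting on the address.
"""
import socket
import subprocess

CANDIDATE = "192.168.1.20"  # example address, not the OP's real range


def reverse_dns(ip):
    """Return the PTR record for the IP, if DNS has one registered."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return None


def local_arp_entry(ip):
    """Look for the IP in the local ARP cache (net-tools 'arp -n')."""
    out = subprocess.run(["arp", "-n", ip],
                         capture_output=True, text=True).stdout
    return None if "no entry" in out.lower() else out.strip()


if __name__ == "__main__":
    print("reverse DNS:  ", reverse_dns(CANDIDATE))
    print("ARP cache hit:", local_arp_entry(CANDIDATE))
```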
 
I don't know. I'd never had a problem with pinging inside the network before. I pinged some other clients to be sure and they worked. I was convinced the IP was free.

In the future:

Check DNS records
ping
check documentation (I'm working on building it)

Sounds like you're already taking steps to correct the problem; this is what you need to tell your boss.
 
That's usually why I assign servers to one VLAN and BS network devices (KVMoIP, environmental monitors, IP-addressable UPS's) to their own.
 
I'll try to make this as quick as I can.

First, don't beat yourself up too bad. Mistakes do happen. I've seen a few things in the fire service: fire engines running over hydrants, paramedics giving the wrong drugs, etc. Those are serious mistakes by professionals.

So again, don't beat yourself up too bad. But on the other hand, you're an IT professional, and those examples above are the equivalent of your mistake. I'm not on a high horse; I did the exact same thing you did much earlier in my career.

I'm not sure what your level of expertise is or how many years of service you have, but you will lose some trust with your team and manager.

The longer you are in your career, the more these types of mistakes will piss you off when guys junior to you make them.

The most important thing you can take away from this is to quickly admit your mistake to your boss and team. Ownership is the biggest thing you can show when you make a mistake.
 
Like the posters above already mentioned, just learn from your mistakes - this is what makes you a better admin. I just started out as an admin, and I know that only through mistakes and problems can I really understand something. What's the difference between the senior admin working next door and myself? Experience handling problems. Let's face it, you will now think twice before handing out a network address :)
 
Gotta admit that is a nice little screwup but as others have said, shit happens.

When the boss says something to you, be sure to show what you are doing (documenting the network) so this doesn't happen again. I've seen shit like this happen all the time. Here are some good ones I've seen.

One of our clients was having trouble with their DSL. Instead of calling us they called Verizon. Verizon had them reset their Linksys router (which killed the ports that Exchange, DNS, remote access, etc. all use), set the server to DHCP on its LAN card (this really caused some issues), and do a bunch of other shit. I walked in to fix the issues when everyone was at lunch and was like, WTF? Took me a while to figure it all out and get everything running right.

My assistant made the mistake of running the DSL software on a server when she replaced a DSL modem once. Pretty much did the same thing as above. My boss showed her how to fix it and got them running for the most part. I had a few issues to fix the next day when I took a look at it.

One of our clients went to reimage a notebook. Ghost managed to ping the entire network into nothing. Took out like 5 remote branches. As I walked in I noticed the switches in the main branch going nuts. As I stared at them I got a call from their Citrix host asking what the hell was going on. That guy figured out quickly why I had a switch sitting next to the computer we ran images off of.
 
Funny, I'm actually updating some documentation right now so crap like this doesn't happen to me at the office. I get lazy sometimes though, and I have done the exact same thing once... took out an entire school district, but only for about 5 minutes. They didn't even notice (or at least never complained).
 
No biggie really.
As serious screw-ups go, that one was pretty minor.

As has been said before, we've all done something like that at one time or another. Good technical people are hard on themselves (just like you are) when they make a mistake. That's how they got to be so good. Keep up the good work. :)

I've seen admins (no, not me) lose data, lots of important corporate data. :eek:
That's a bad day for everybody, and those are the screw-ups that get people fired.
 
Why didn't your monitoring system go off and alert on the SQL server? You do have a monitoring system, right?
 
Why didn't your monitoring system go off and alert on the SQL server? You do have a monitoring system, right?

Good question, that and why did it take 2 hours to figure out. The last time I logged into a test server it told me, hey, you have an IP conflict on your network.
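
Even without a real monitoring package (Nagios, Zabbix, whatever), a dumb check script run from cron that emails on failure would have flagged this in minutes. A plain ping check wouldn't have caught this particular outage, since the KVM was happily answering on .20, so the sketch below does a TCP connect to the SQL port instead. Purely illustrative; hosts, ports, and mail settings are placeholders:

```python
#!/usr/bin/env python3
"""Dead-simple service check meant to run from cron every few minutes.

Placeholder values throughout; this is a stopgap sketch, not a substitute
for real monitoring. A TCP connect catches the "IP answers but the service
is gone" case that a plain ping misses.
"""
import smtplib
import socket
from email.message import EmailMessage

# name -> (address, port); 1433 is the default SQL Server port
SERVICES = {
    "sql-server": ("192.168.1.20", 1433),
    "web-server": ("192.168.1.10", 80),
}
SMTP_RELAY = "mail.example.com"    # placeholder relay
ALERT_ADDR = "admins@example.com"  # placeholder recipient


def is_up(host, port, timeout=3):
    """True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def send_alert(name, host, port):
    msg = EmailMessage()
    msg["Subject"] = "ALERT: {} ({}:{}) is not answering".format(name, host, port)
    msg["From"] = ALERT_ADDR
    msg["To"] = ALERT_ADDR
    msg.set_content("{} at {}:{} failed its connect check. Go look at it.".format(name, host, port))
    with smtplib.SMTP(SMTP_RELAY) as s:
        s.send_message(msg)


if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        if not is_up(host, port):
            send_alert(name, host, port)
```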
 
Just clear up the .200 IP real quick and say you forgot to add an extra 0 on the end. :p
 
Welcome to my world hehe, taking over and fixing someone else's network sucks for documentation.

Man - every single network I have ever screwed with has been this way. I don't have a bit of a clue how it is set up at all.
I know what machines have static IP addresses, but I have no idea where all the switches are or anything like that. It can drive you mad.


What I can say to you: no amount of bookwork can prepare you for everything an IT job throws at you. This is why experience is so much preferred over "certifications" and items like that. A degree in IT won't get you a whole lot, but it generally does require at least some hands-on components - which is better than nothing (and why they aren't quite worthless).

You learn from mistakes, that is all there is to it. And having a poorly documented network just adds to the problem. Each little problem or mistake you run into tells you a little bit more about how the network works.

I will say, though - I would have tested the server to make sure it worked before leaving like you did. I don't "assume" anything is going to work; test it out and make sure it does.
Sure, sometimes it works great when you leave and the next day it all comes crashing down. You at least know it worked when you left, though.

Remote access is your friend as well. After everyone goes home for the day you can remote in and reboot the server if you applied patches during the day that required a reboot. Before they came up with remote access it must've been a pain (thankfully I never had the opportunity to experience that...); even something basic like Telnet is better than nothing!
 
One time, I was assisting a lower-level tech with a server that I was the admin of (I'm in Dallas, they're in NYC). Anyway, long story short, while I was on the phone with this guy, somehow he got lost in the BIOS and unconfigured a 14x72GB drive RAID array. 800GB of data went poof. Adding insult to injury, they had not been changing their backup tapes, and the last good backup was 3 months prior. They were also in the middle of a giant pitch to a new client, so hundreds of millions of new dollars were at stake.

It took about a month and about 50 grand for a company here in the Dallas area to disassemble those 14 drives and, bit by bit, get the data off that broken RAID 5 array.

Prior to this incident, the CTO of the company used to fly me out for every problem that every server in our enterprise had... I have no idea why they asked me to work with this guy over the phone that time. He wasn't an idiot... but he was a desktop guy trying to make sense of server hardware I was describing over the phone. In the end, they had me on a plane the next day to go out and start damage control.

so, your SQL server being down for a couple hours... is not that big of a problem in the grand scheme of things. (worse accidents happen!)
 