Scary moments in system administration

I was working in the FAB a few days ago, went to work on a server, and the mouse decided to jump to the top-right corner just as I was clicking on something. Fortunately the software asks if you really want to shut down, which saved me from having to explain to the boss why I'd just scrapped all the wafers in the tool. Needless to say, that mouse was in the trash shortly after.

A few years ago I was sent out after hours on a Friday to a dental office that we took care of. They were switching software packages and the vendor was doing the install; I was just supposed to be on site to let the vendor in and keep an eye on things. I watched the vendor start the install on the server, but then saw something strange: the installation package started deleting old files for software that had never been installed there before. I questioned it, but the vendor said it was normal. The software finished installing and the vendor asked me where the old program kept its data so he could import it. I tried to point him in the right direction, but all of that data was missing. I pulled the vendor away from the server and went to the tape backup that had been done earlier that day, only to find that the tape in the server was broken and the client hadn't rotated tapes in several months.

In the end, I became very familiar with Ontrack that weekend, the client's insurance company brought in someone to verify all the paper records, and we started doing all of the backups for the client. Lesson learned: when a client says they're doing their backups and rotating them offsite weekly, don't believe them.
 
We recently had a new room built to house all of our new equipment: new HP SANs, new servers for VMs, new switches, new storage repositories for files and archiving, new backup storage arrays, etc. We hadn't had rain in this part of Texas for months until a week ago, when it rained about 3/4" in 15 minutes in our part of town (the other side of town got no rain). Anyway, the electrical contractor had been running conduit through the roof for the electrical cables going to the condensers for our mini-split A/Cs and didn't patch the hole he had made in the roof. As you can imagine, water ran through the hole and got all over our new server racks.

We could literally unrack the servers, tilt them a little, and water would come streaming out of the back. Everything in those two racks was ruined.

Proper backups saved my ass for our SQL data and other important data that we would have lost otherwise. But we're still crippled right now, because we're awaiting delivery of new hardware to replace everything that was ruined. We use Citrix XenApp heavily, and the entire farm had recently been moved onto virtual machines instead of physical servers. Now it's all unavailable, and our other offices and subsidiaries can't connect through XenApp.
 
Wow........
 
What action have you taken against the contractor, if any?
 
So far we have taken a lot of pictures and filed a claim against his insurer. The company I work for is a subsidiary of a very large conglomerate, and corporate's legal office is instructing us on how it wants to proceed.
 
A couple of months ago, one of our VMs kept randomly crashing a couple of times a day. After looking into the issue, I found that the hard drive was throwing a ton of errors and appeared to be dying, so I contacted our dedicated server host to get the drive replaced. The plan was to have them add an identical drive, move the VM data over, then have them remove the faulty drive. They had to shut the server down each time as well, since I guess the drives weren't hot-swappable...

So I finally got a window of downtime to add the drive to the system. Once that was done, moving the data went smoothly and the VMs were running again with no crashes or errors. Since this was one of our production servers, it took me a week and a half to get another window to take the server down and remove the faulty drive. It also had to be during off hours, so that meant the 12-3am window. I had a bad feeling and decided to stay up to make sure nothing happened.

At about 1am they finally took out the drive and booted the system back up. Once I got in to check the system, I found that the tech had removed the wrong drive. Of course, the drive he removed was being used by two other mission-critical production servers. The guy had also already updated the ticket and moved it to accounting, so he couldn't see my update telling him how badly he'd eff'd up. So I spent the next hour trying to get help via a new ticket and frantically calling in before they wiped the drive. Finally, an hour later, I got in contact with the tech and found out he still had the drive. Of course, he had to take the server down again to swap the drives. So the night I wasn't even planning on staying up had me freaking out until past 3am.
 
Installed Dell OpenManage Server Administrator remotely on a client's system tonight. Now I can't see the server. No DRAC or console access either :( Early morning tomorrow.
 
Ugh, I absolutely hate, hate, HATE rebooting remote client servers when they don't have DRAC/iLO. Especially those 7+ year old SBS boxes that take a good 30 minutes to boot up normally.

My pants-shitting moment this week was coming into work and finding that all the processes had crashed on our server with a 22-drive 10TB RAID 10 array. The computer showed no drives, and the app/system logs were absolutely chock full of errors. The array manager still showed the array, so I crossed my fingers and rebooted.

It came back completely fine. Turns out the SAS card isn't too fond of having the external tape drive plugged in: the tape drive tends to shit itself every so often and send a reset over the bus. Normally that doesn't cause an issue, but this time it caused the card to drop the array as well (hell if I know why). Whew, I was really not looking forward to restoring multiple TB of data from all sorts of different tapes.

Guess that will finally force me to get a separate HBA for the tape drive. My lazy fix for now is just making sure the B2D and D2T backups don't overlap, which seems to be working. :)
 
Created a super locked-down IT policy for our Blackberries and accidentally enabled Force Smart Card Two Factor Authentication. Everything was fine until the server got restarted later in the day by updates and pushed the new policy out to the handhelds. Ended up having to manually wipe each handheld and resync it to the server. After that, they got me the second copy of BES I had been asking for, for testing. lol
 
Never, ever put tape drives on the same controller as your disk drives.
I learned that the hard way multiple times over the last 15 years (though not on my own systems after the first time).
It always causes issues eventually.
 
Worst I've seen was when a guy configuring a backbone switch that connects our global offices to HQ entered the wrong subnet mask, causing the entire enterprise network to propagate that route and shutting everyone down. Good thing he caught it... like ten minutes later, after eHealth alarms were blaring all over the place. lol
 
Funny, I had a similar experience with a dental office and Ontrack. They too had not been doing backups, and the data drive on their ancient server had crashed.

I guess it doesn't really just pertain to dentists; the majority of the clients at my old job weren't very consistent about switching tapes out and verifying that backups had completed successfully.

I had a hard drive fail on my main production server last week, but it was the OS drive in a RAID 1 array, so no downtime. :)

If this had been 8 years ago it would have been a bad day; hardly any of the servers my past company's clients were on had RAID arrays back then. :p
 
I've had a few little things here and there, nothing really SCARY... co-workers downing servers... people tripping over power cords in the switch rooms...

Biggest annoyance was a number of years ago when the Blaster worm hit. I was working for a bank, and our "brilliant" security person decided to push the updated virus definitions out to our company servers. Of course, since they worked on two of our NT 4.0 desktops, they would work fine on the 180 servers... then again, maybe not. All 180 servers crashed. A dozen of my co-workers and I got phone calls around midnight saying we needed to be in our main office by 5:30 the following morning. They were able to bring about 85-90 of them back up and back the update out; the rest we had to drive around the suburbs of Chicago to bring back up and remove the update from manually. And our genius security person actually had the balls to wear his Network Associates polo in that day... needless to say, we almost found out whether he actually was the superman he believed himself to be, or just how much of a stain he'd make on the pavement after a 35-story drop.

After many hours of driving through Chicago traffic, we got the remaining servers back up and online... but I still wish they'd handed the whole thing over to superman and let his dumb ass deal with it all.
 
This is an awesome thread... and I have so many stories...

I was working as the network/systems/comm admin for a private college that ran camps during the summer. We had a very reckless heavy-machine operator who was swapping out a large water slide that had a phone mounted to the side of it. We told him to call us before he pulled the section of the slide with the phone on it out of the ground. He did not call us. We did, however, get a call from someone on campus to inform us that half the phones for the entire campus were not working. Turns out that when he pulled the phone cable up with the slide section, he pulled the underground copper-wire junction box out of the ground and ripped it to shreds. Of course, there were over 500 pairs of copper in there, all torn apart, so I spent the rest of the day under the hot summer sun soldering wires back together. This was not the first or last time this guy did major damage; he was also the snow plow driver, and he wrecked more than one comm box while I was there.
 
About 5 years ago my office was moving, so our office was split into two parts. We had point-to-point paired T1s connecting the two halves, since the server room was still at the old location; all voice and data traffic went over the VPN to the other office. We were sharing a DSL connection at the main office, which everybody's internet traffic was coming from. One day I unplugged the DSL modem. lol. That was kinda funny.

We have a 100 Mbps fiber connection now... thank god times have changed!
 
The biggest thing is when I boot the VM server. I don't have an IP KVM on it and it only has the basic iDRAC, so I don't get to watch it from boot. Other than that, it isn't bad. The stuff users make into big deals is usually PICNIC errors (problem in chair, not in computer) or issues with management not wanting to make decisions.
 
I get to add this as a funny facepalm moment.

I was working on my test ASA firewall, preparing to upgrade our production firewall to the 8.4 firmware.

I downgraded my test firewall to 8.2 with an old config file... forgetting that the old config had IP settings matching the production firewall. Needless to say, I bumped our VPN offline. After a minute of unsuccessfully pinging the test firewall, I realized: oh shit, I bumped it. Ran to the NOC, unplugged the cabling, and brought the production firewall's interfaces back up.

Now I have to sit in the server room with a laptop, redoing the test firewall from proper backups. Boo.
 
CISCO6500>switchport trunk allowed vlan 303

is MUCH different than

CISCO6500>switchport trunk allowed vlan add 303

Sending a command to a switch and having it immediately lose its connection to you is always a fun one. Especially when it's one of the two BIG switches that connect everything.
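
For anyone who hasn't been bitten by that one yet: without the "add" keyword, the command replaces the trunk's entire allowed-VLAN list instead of appending to it. A rough sketch of the difference (the prompt and interface context here are just illustrative):

CISCO6500(config-if)# switchport trunk allowed vlan 303
! allowed list is now ONLY VLAN 303; every other VLAN just got dropped off the trunk

CISCO6500(config-if)# switchport trunk allowed vlan add 303
! allowed list keeps its existing VLANs and gains 303

Running "show interfaces trunk" before and after is a quick sanity check on which VLANs are actually still allowed.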
 
Hahaha, yeah. I remember making that mistake in my test lab and breaking my home network (I was putting a 3550 into place to get gigabit for the roommates). Fun times.
 