Scary moments in system administration

What's your scariest moment that you can share about doing systems administration?

For example: you had to get something back online, the pressure was on, people were complaining, and one slip of the finger under all that pressure meant starting all over.

IT never used to be this stressful; 10-15 years ago, data was not the lifeline of business that it is today. I was so worried about my snafu that I was physically ill. I mean, I wanted to go throw up.

So, let's do some sharing. We've all been there, or seen someone have to go through the mess, I'm sure.
 
The guy on our team with MS cluster experience had left about 2 weeks before. We were patching a clustered file server with two 1.5 TB disk arrays (this was back in 2005), following his instructions.

After the first server was patched and rebooted, one of the 1.5 TB arrays (they were about 80% full) errored out, and chkdsk reported that the file system was unrecoverable. Ten minutes into our troubleshooting, we realized our cluster "expert" had given us a patching process that was exactly the opposite of MS best practice, since it relied on the failover mechanism every time we patched. We also realized the quorum was trashed as well.

The person I was working with said "If we leave the country, they'll just find us, so I guess we better get it back in some sort of working order."

Since we had DFS in front of it anyway, we just brought the one array that was still working back into service on one cluster member and redirected DFS to use that server's name. We reformatted the other array and started a data restore. Five days of tape restores later, all the shares were available again.

We had notified everybody when it happened, and the next morning (I was obviously there all night) I gave the client's IT director our initial root cause analysis: we had followed a patching procedure that wasn't best practice. He thanked me for being candid and taking responsibility, and for starting the tape restore as quickly as we did rather than continuing to try to recover the array.

A couple of months later, we had another issue with that same array, and we found out the manufacturer was having all kinds of problems with them, and had been issuing firmware updates about once a month. It was another 6 months before we had really good firmware.
 
Nothing important compared to the above, but I recently restarted work in what I would consider a stressful environment, where everyone is on your back the second something goes down. I quit 2 years ago and was contracted back for a network revamp 2 months ago. During my audit, I found WUS had been disabled on a production server. After scheduling a mid-weekend 11pm-5am maintenance window, I proceeded to remotely run Windows Update and bring the machine back up to date. All went well, until I jumped over to the primary server and found WUS had been disabled on it... for 2 years... It was 5am by now and my maintenance window had passed.

After several "should I risk my only connection to the site SBS vpn" and lacking a key to the building, I said screw it and ran the updates the following Sunday night. (I also proceeded to check the remaining servers for updates and brought them all up to date.) At about 4:30am Monday morning, I was kicked with no warning. For nearly 2 hours I tried everything I possibly could think of and suddenly everything came back up, 10 minutes before the first worked arrived for the day. What did it? As far as I can tell, a NIC card update which failed. The server rebooted and reverted to the former driver. I always run NIC updates in front of the physical machine, but must have missed it due to the extreme number of updates needing to be installed or the two consecutive late nights.

Anyways, that was my bit of fun :)
 
We had a tape backup unit reporting an overvoltage on one of our PDUs. I took a voltmeter and tested the PDU: it had a short, and neither the circuit breaker on the PDU nor the breaker on that outlet block had tripped.

Long story short, the breaker on our 15 kVA UPS tripped, dropping the load to our server room floor for a split second.

It was a long day after that, recovering the Exchange database.
 
I was doing some remote administration on one of my client's Windows SBS 2003 servers. I wanted to right-click the network connection icon in the system tray and click Status, but I accidentally clicked Disable. This was the primary and only network card hooked to the network. :(

I tried to instruct them on how to re-enable the network connection, but they could not figure it out. So I had to drive 30 minutes to the client's office and fix the issue. Then I had to file a claim with my business insurance for the downtime to my client's office. :(

I guess that is one advantage of UAC. You get one final chance to back out before making a major system change.
 
I've probably had several big ones, but unfortunately the only one I committed to memory wasn't necessarily the worst.

I was adding a secondary IP address to the primary SVI on our main datacenter Cisco 6509 and forgot to use the "secondary" keyword. I ended up replacing the primary IP, which was most servers' default gateway. It stopped all traffic in and out of our datacenter for a minute, until I could get back into the switch a different way to fix it.
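
For anyone who hasn't been bitten by this one yet, that single keyword is the whole difference between adding a second subnet and overwriting the address everyone is using as their gateway. Roughly like this (VLAN number and addresses are made up for the example):

    interface Vlan10
     ! existing primary address (most servers' default gateway)
     ip address 10.1.1.1 255.255.255.0
     ! what I meant to type - adds a second subnet without touching the primary
     ip address 10.1.2.1 255.255.255.0 secondary
     ! what I actually typed - with no "secondary" keyword, IOS replaces the primary
     ip address 10.1.2.1 255.255.255.0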

As far as management was concerned, it was a network anomaly.
 
I had to help rebuild an entire domain once, shortly after I arrived. There were about 350 user accounts lost, and probably double that in devices.

The company did not want to give any money for a secondary server (AD, DNS, DHCP, and WINS were all running on a single box) or a proper backup solution. They didn't even shell out for RAID for redundancy. They were using a Dell Dimension 3000 desktop as the server. It didn't even have official driver support for the network card under Windows 2000 Server; the driver had to be hacked to make it work. lol
 
I have a few, though no really good ones I can remember off the top of my head.

About 5-6 years ago I was reinstalling Win98 on a customer's machine. Without paying attention, I hit "format and install" and, well, lost the customer's data. The customer was irate, but as soon as my supervisor pointed out the "we are not liable for lost data" clause in the contract he'd signed, he shut up. I still felt bad.

Just a few months ago, while updating some NetFlow statements on a 3845 router at a remote location, I removed a sub-interface. My heart dropped for a split second while I did a "show start", copied the sub-interface config, and pasted it back into the running config. Luckily, it was an unused sub-interface reserved for future use. My co-workers just laughed at me.
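
For what it's worth, the save really is that simple as long as you haven't saved the running config yet, since the deleted sub-interface is still sitting in the startup config. Something along these lines (interface, VLAN, and addressing are made up for the example):

    show startup-config
    ! find and copy the sub-interface block you just removed, then paste it back:
    configure terminal
     interface GigabitEthernet0/0.101
      description reserved for future use
      encapsulation dot1Q 101
      ip address 10.20.30.1 255.255.255.0
     end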
 
The couple I remember...

The last job I had was doing consulting for small rural-area hospitals. We had an update to their HIS system that we started at 11:00 PM, after the last rounds of stuff. As part of this, the HIS vendor needed to copy a bunch of tables and rows from the test database. About 4:30 in the morning we were ready for them to do their thing.

About 5:00 AM we got a call from them asking if they had the right database, and asking us to check on it as well. We found out that at some point in the night the database had become corrupt. So at 5:00 AM we were frantically rushing to find tapes or something we could use to get the database online again. We found a tape with the most recent full backup from Backup Exec and attempted to restore the database... Backup Exec further fucked up the database (imagine that), so we continued to rush to find something else. Being the good SQL admin I am, I had a maintenance plan that backed up the database every night on top of the Backup Exec job. So we found the most recent flat-file backup and restored it. We only lost a handful of things they had entered. Thank God, finally, about 8:00 AM we got everything back up and running mostly as expected. No sleep for the WIN

So the Run down:

10:00 PM Start
4:30 AM Database Blowup
5:00 AM Frantic Restore
8:00 AM Everything is Running
4:30 PM head to motel
5:00 PM Crash

My most recent scare at my current job was when I got a frantic call from our California production manager. He asked me if I was working on his computer. I told him no, why? He proceeded to tell me that his mouse was moving around and it wasn't him doing it. So I quickly VNC'd into his machine and sure enough, something (or someone) was sifting through his email.

I shut down the VNC service, changed the password, and started it again, then started sifting through anything and everything I could think of, logs-wise. We went through the firewall logs to make sure the old consultants weren't snooping or anything, and didn't see anything from the outside. We're thinking someone figured out the VNC password and was digging for something. The bright side of this scare is that it gave me a business case for DameWare. I was able to get two copies of DameWare NT Utilities, one for me and one for my boss.
 
Long story short, we have a fiber ring that goes Des Moines -> Newton -> us -> Huxley -> Kamrar. Got a call from the NOC one night that they had lost the connection to us from both sides of the ring, but were able to switch to backup fibers and get us going again. Sadly, we have a ~10 mile stretch where both sides of the ring are on the same physical cable. :eek:

To make a long story short, we found a spot where mice had chewed our fiber in a handhole and respliced it. And it didn't fix it. Some fiddling with the OTDR and we found a spot where another independent LEC had put a fiber marker on top of our fiber, displacing it and causing micro-fractures that didn't affect it at the time, but when it got cold it finally gave up.

So in the end we started at 3:00 am Sunday and finished at 6:00 am the following Monday. In total I spent three all-nighters getting everything sorted out, most of that in 3 feet of snow and -20 wind chills.

Had some smaller issues too, like CWDM splitters breaking in outdoor cabinets due to the cold, or remote cabinets losing power, but those are easy fixes once you get past the initial one.
 
Had a dead RAID card on the only server for the college, since they were cheap. At least it was the middle of summer.

It was a dual Pentium III 500 or some such Acer-brand box that we sold at the time. It was under warranty, so we ordered up a card expecting it to show up the next day, Saturday, because that's what they told us.

Saturday rolls around and no card. Monday rolls around and we get chewed out by the boss. We find out the card will arrive Tuesday, but in the meantime we had to make it look like we were doing something.

After sifting through the shitty Acer website, I found the strong-arm BIOS flash software for the RAID card and figured, what the hell. I found the most common memory address for the card and flashed away. Sure enough, it booted after a power cycle.

They had no backups either, by the way :rolleyes:
 
Interesting stuff. Most stories are software/server room based.

 
Until 2011, our server room was in one of our own company buildings. I got called many times in the night for these things:

power failures
A/C failures

Luckily, we've since moved to two redundant datacenters.
 
Interesting stuff. Most stories are software/server room based.

Yeah, sadly, working for a small company like this you get stuck doing everything, from small items like DSL installs and phone issues, to setting up networking gear and servers, to digging holes and splicing fiber and copper.

Software-wise, most of our stuff is supported by third-party companies like Macc or Metaswitch.
 
Worked for a LAN gaming center before I went to school. The internet went down and all hell broke loose. I wouldn't say it was that stressful, but it was disconcerting to say the least. A bunch of little pre-pubescent fuck-wads kicking and yelling at me while I tried to reboot the router and troubleshoot the issue.
 
You win with that fiber-ring story. That must have really sucked.
 
My best was almost 10 years ago, when I was pretty green. I was at a client's in the evening doing some patching and decided to patch SQL 7 with SP3, just because it was the latest SP and, heck, why not, right? Little did I know that SQL service packs could not be uninstalled... So after the SP3 install and reboot, NT blue-screened. I managed to do a parallel install of NT, got her booted, got on the phone with MS, and found out I was screwed. I spent 12 hours onsite, my boss another 12, and we finally got the server back up and running. The client lost one day's worth of data and was not happy, but I still had a job.

Another good one was at the same client, though not my fault. When the "I Love You" virus hit, they were pretty heavy mail users for the time, and within minutes the virus had propagated through the company. I just turned off the Exchange 5.5 server and managed to get through to Symantec, who helped me through the cleanup process. What a mess.
 
Mine are mostly hardware related. I can't share the story that made me start this topic just yet, but it was a heart-stopper without any doubt. When the problem is 100% resolved, I'll share :)

I once plugged a customer's power supply into his motherboard backwards (back in the AT PSU days when you had left and right connectors). The plume of smoke that rose up off the test bench, and the customer asking "Did you get it working back there?", was rather humorous. Thankfully it was a common board for us, and I just handed him another one without his knowledge ;)
 
Old IBM SAN, multiple red and amber lights on it, no backups. It was about to blow any day.

It was months and months until they finally did something about it. If by chance it had gone down, it would have been our fault. Medical data.
 
A client of ours had an IT intern whom they let roam free in Active Directory to "learn" about AD. The intern apparently was in the default domain policy and thought the term "DCOM" was interesting. Long story short, according to him, he changed something related to DCOM in the default domain policy, which:

1) Replicated the change to all machines on the domain (servers and workstations)
2) Broke directory services on every machine, essentially isolating them from the rest of the network. So even if we found out what was changed, it wouldn't replicate back.

The intern "couldn't remember exactly" what he had changed, so we didn't know what to fix. We had to go through group policy to figure it out, then manually hit every machine with that change. Even after changing the setting back on the DCs, they still wouldn't talk. We called Microsoft, and they had no clue other than to restore the DCs from backup.

We ended up doing a restore on all the DCs to get them talking, then running around to all the workstations to disjoin and rejoin them to the domain.

I was up for 36 hours straight; the intern left after about 8 hours because "he was tired". Amazingly enough, that didn't get him fired. He did end up leaving after a few months, though.

Riley
 
Had one a few days ago....
New client called me all freaked out... server went down and they'd had no one doing anything IT for 6 months.
I arrive,
I asked them if they had any backups; not in at least 6 months.
Open it up.
Burst caps all over the motherboard, power supply smells burnt.
.
.
.
Mylex PCI SCSI RAID card with 2 drives.
Pop the card and drives into a workstation: "degraded RAID array," says the RAID BIOS.
One drive of a RAID 1 array was dead.
Data is still OK on the other drive.
Copy it off onto the workstation HD.
Talk to him about a new server and he said they couldn't afford it.
So I just shared a folder off the workstation hard drive, used SyncToy to set it to back up to another workstation, and left with a check.
Total time 6.5 hours.
Check bounced so I have to bug them to get paid.
To top it off the entire time I had 10 people asking me if it was fixed yet every 5 min.
 
I posted this one over in the power supply forum a while back...
It was freaky...

You drive up to an office and there is a power pole split down the middle with the top blown off and a transformer with scorch marks and a huge hole in it near the building.

Direct lightning hit on the pole 30 feet from your client's server.



You grab a flashlight and go in.

There is water in the basement at the bottom of the stairs; the surge blew the sump pump.

You look into the electrical room and see melted circuit breakers, and a scorched and partially melted PBX sitting on a brand-T UPS that is also melted.



Expecting the worst, you go into the server area.

The APC Smart-UPS smells like smoke and has burn marks on the back.

It is toast.

Expecting the worst, you take the server back to the office.

You plug it in and it powers right up and boots into Windows.

You take it to the office they rented until theirs is repaired, then go to CompUSA and buy them 5 PCs to get them back up and running. (The insurance company wouldn't let us pull the workstations; as I remember it, 5 out of 25 were still good, and those were on APC surge strips and were off at the time of the hit. The rest were on brand-B surge strips.)

They close on a multi-million dollar deal the next day.
 
When I was very young (10 or so), I remember highlighting the C:\ drive and deleting everything. Needless to say, rundll32 was somewhere in there and the system wasn't booting anytime soon. A $150 repair bill and an ass-chewing later, I became damn good at repairing computers; and I regret the "blessing" of such an enslaving talent.
 
Haha, always fun to look back at when I was new to computers. I remember the days when DOS was dark and scary; I would freak out if I got stuck in it somehow (back in the Win98 days).

Curious, I once wondered why there was an A and a C drive but no B drive. So I typed B: somewhere (maybe in Run, I forget) to see what would happen. A blue screen popped up saying "Please insert disk for drive B" or something to that effect. I was a little scared, but I did insert one and pressed a key. This left me at a full-screen DOS prompt. Scared I'd get in trouble for "screwing up the computer", I just hit the reset button. :p All I actually needed to do was type exit, but I didn't know that.

I also went into Safe Mode by accident once; I was in such a panic. :D Sometimes I wish I was as clueless about computers as I was back then. Maybe I would be a building contractor or something instead. LOL
 
Yeah, I remember deleting some rather large ".ini" files on my old DTX computer back in 1994 or so to gain back some HDD space. Needless to say, my dad had a lot of work to do on the computer that evening, and with no .ini files the computer no workey at all, just a pissed-off DOS prompt screen.

I got my ass seriously chewed out by my dad and had to help pay the bill, as the tech at the local service center had to work after hours to get our computer up and running.

A lesson learned, children....
 
Total time 6.5 hours.
Check bounced so I have to bug them to get paid.
To top it off the entire time I had 10 people asking me if it was fixed yet every 5 min.

That shit makes me blow a gasket. Nothing pisses me off more than when I go out of my way for someone to help them, and then their only form of repayment is a shafting of my wallet.

People asking every 5 minutes makes me laugh too. Don't they understand that distractions = more time = bigger bill
 
Oh that pisses me off when people ask every 5 minutes. People have no clue the true work involved.

When I used to work at helpdesk, every time someone would call and ask that often, I would lower the priority of their ticket. Screw them.
 
Right on! I did the same thing and always put them at the bottom of the list if the ticket includes ASAP.
 
Back in the Windows 95 days, I remember doing something inside a program and it crashed and came up with the "this program has performed an illegal operation" message.
I remember being scared, thinking I had actually done something illegal.
 
You used to be able to type c:\ con \ con (minus the spaces) and get a blue screen.

Odd, the forum edited out the command when I typed con/con without the spaces.
 
That shit makes me blow a gasket. Nothing pisses me off more than when I go out of my way for someone to help them, and then their only form of repayment is a shafting of my wallet.

People asking every 5 minutes makes me laugh too. Don't they understand that distractions = more time = bigger bill

It is almost always the people you go out of your way to help that shaft you, too...
Middle of the night? Weekend? Holiday? Drop everything and help... and the check bounces.
 
Great stories everyone. This really makes me appreciate going into IT administration once I get my degree. I worked at my university for nearly 6 years as our department's senior technician and technology assistant.

Those late nights, long hours, and PR duty everyone here has to go through are definitely appreciated as I've gone through the same thing many times.

Keep the stories coming, I'd love to hear more.
 
What is your degree in? Rather, who from and what program? I am starting on a Technology/Education degree this fall. I loved my time spent working in schools.
 
A long time ago I worked for a Broker/Dealer. We had electrical problems with the A/C in the tiny datacenter and I had to bring in portable A/C while it was being worked on.

I plugged them into outlets that were marked "NONUPS". They were marked WRONG.

I plugged 3 portable A/C units into the UPS that powered the entire room, including the PBX. It took a little less than 30 minutes to blow through the reserve battery and then the UPS just shut down. I bypassed the UPS and immediately started bringing systems up, but the phone system was toast. Me and the Nortel guy became good friends that night.

That ended up being a 28 hour day, but we managed to get it all back up.
 
So I quickly VNC'd into his machine and sure enough, something (or someone) was sifting through his email.
We're thinking someone figured out the VNC password and was digging for something.

Port 5900 is a common target, as is 3389. I was logging nightly dictionary attacks against my port 5900 forward to the DVR in my living room until I moved it to port 5901 on the WAN side, lol. Haven't seen anything at all since then.
 
Phone center - a BIG phone center. They did Compaq (the main call center, a huge floor with about 300 workstations), IBM (smaller), and a lot of other consumer-level tech support. I was in the "IS" dept - code for "workstation side". I worked 10am-11pm on the Sun-Wed shift (half day Wednesday), and one Sunday morning I got buzzed by the IT director to go run some pings. It seemed half the building had lost data and the other half had lost voice. Huge problem.

I was fairly noob, but I helped him troubleshoot (he was out of town) and traced it back to a Cisco flash he had done the night before, which took out three fiber modules in three separate 4000 switches.

The best part was watching the floor from the fishbowl-like server room as folks milled about (recall that it's tech support and the phones aren't ringing, so they're in the aisles talking, carrying on, etc.) - pressing Ctrl+V to paste in the 4000's config, watching the phones light up, and hearing a collective boooooooo from the floor.
 
I was working at an internet provisioning company and brought down internet access for an entire exposition wing during an event. It was servicing about 200 customers who were paying thousands of dollars to use the internet that weekend, simply because, while messing with DHCP settings on a router, I typed "172.168.16.0" (because I'm so used to typing 192.168) instead of "172.16.16.0" and was in too much of a hurry to check my final config before I submitted the changes.
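
If it was an IOS-style router (just an assumption on my part; the pool name and gateway below are made up), the pool config makes it easy to see how one wrong octet in the network statement takes the whole wing with it:

    ip dhcp pool EXPO-WING
     ! correct scope
     network 172.16.16.0 255.255.255.0
     default-router 172.16.16.1
     ! what I typed instead - the pool no longer matches the interface
     ! subnet, so clients stop getting usable addresses:
     ! network 172.168.16.0 255.255.255.0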

When I lost the connection to one of the servers running off that network, my heart must have skipped a beat. I think it's the fastest I've ever moved in my life, rushing back to the terminal I'd just been at to check the config; I found my mistake in seconds and changed it.

A few seconds later I was able to hit that part of our network again and VNC back into the server. Luckily it was only down for about 3-5 minutes, I got no phone calls about it, and I felt as if I'd been relieved after holding the contents of my bladder in for hours.
 