Biggest challenge you've faced?

Mods, if you don't mind, could you rename this thread "Biggest Challenge/Disaster you've faced?"

Not a lot has happened while I've been working. I've had to deal with a network failure that came as a result of a bad UPS, a bad line between buildings, and the occasional DNS issue. Other than that it's been pretty smooth sailing, but I've gotten curious about what people face out in the "real world" of IT. I'm pretty sheltered when it comes to dealing with serious issues, and I would like to hear about what happened and how the problem was dealt with.
 
Have a client call and their server isn't booting up?

Have a client call (big golf resort)... their main building got whacked by a solid, direct lightning hit that blew the roof off... their server, 3/4 of the workstations, the main network switch, and the network switch down at the beach house were all dead dead dead.
 
Have a client call and their server isn't booting up?

Have a client call (big golf resort)... their main building got whacked by a solid, direct lightning hit that blew the roof off... their server, 3/4 of the workstations, the main network switch, and the network switch down at the beach house were all dead dead dead.

Sounds like a good time to dust off your bare metal recovery plan.

And as a reference to my backup thread going on right now... good reason to take tapes off-site every night.
 
In January of '07 we purchased an additional disk array for our SAN. Our SAN is one of those hack SANs that run on commodity servers connected to whatever mix of SCSI/SAS/iSCSI/FC you want (meh). We had a relatively entry-level FC-->SATA disk array and wanted to upgrade to something a little better to support our new ESX installation. We added the shelf (recommended by our VAR, on our SAN vendor's HCL).

A few weeks later we started getting complaints that saves to the MDB in our custom in-house application were taking an inconsistent amount of time (2-60 sec). We'd just done the ESX install a few months prior, so I thought maybe it was some virtualization issue. I first looked for a load/capacity problem; I spent tons of time poring over ESX counters and perfmon counters, trying to correlate the save issues to something. Eventually we were able to show that yes, the write queues on the server did spike at the same time people had issues, but nothing in this application had changed; it's essentially the same thing it was two years ago.
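
In case anyone ends up chasing something similar: the "correlation" really came down to lining up the perfmon export against the times users reported slow saves. Here's a rough sketch of that idea in Python - the CSV file, column names, timestamp format, threshold, and sample times below are placeholders, not what we actually had:

```python
# Rough sketch: flag disk write queue spikes that land near reported slow saves.
# Assumes a perfmon log exported to CSV; the column names, timestamp format,
# threshold, and sample times are hypothetical.
import csv
from datetime import datetime, timedelta

PERFMON_CSV = "disk_counters.csv"            # hypothetical perfmon export
SLOW_SAVES = [datetime(2007, 3, 5, 10, 42),  # times users reported 30-60s saves
              datetime(2007, 3, 5, 14, 7)]   # (placeholders)
WINDOW = timedelta(minutes=2)                # "close enough" window around a complaint
QUEUE_THRESHOLD = 2.0                        # queue length we'd call a spike

with open(PERFMON_CSV, newline="") as f:
    for row in csv.DictReader(f):
        ts = datetime.strptime(row["Timestamp"], "%m/%d/%Y %H:%M:%S")
        queue = float(row["Avg. Disk Write Queue Length"])
        if queue > QUEUE_THRESHOLD and any(abs(ts - s) <= WINDOW for s in SLOW_SAVES):
            print(f"{ts}: write queue {queue:.1f} lines up with a reported slow save")
```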

I then spent hours with VMware support (who are fantastic, btw), trying to figure out whether it was a VM issue. Eventually we ruled that out. Then I moved on to our SAN vendor. The hardware is on their HCL, but they weren't seeing anything terribly useful in the logs at first. Delving deeper, they saw that the storage was occasionally taking a while to respond to writes. We brought our VAR onto the case and they worked with us, our SAN vendor, and the disk array manufacturer.

Finally, our VAR and the array manufacturer determined that the way our SAN issued commands had some type of issue with the way the cache worked in the array. We were promised firmware updates. Months passed and no firmware. Our VAR came through, got the manufacturer to refund the unit, and we purchased something else. That whole time, especially early on when all I had to go on was "saves take a long time in this app" (which eventually expanded to include Excel files), was extremely frustrating - especially since I was responsible for implementing ESX and thought the problem was somewhere in that layer and I just didn't understand it well enough. At one point I thought we'd have to roll back everything. I really just wanted someone else to deal with it, but with enough time and through the process of elimination we got it resolved.
 
Nothing major.

Major problem (servers down) but minor fix (just a new PSU).

And of course there are some problems I have zero explanation for to this day. Like the whole network not communicating with itself. Eventually I power-cycled all the switches and that fixed the issue. Go figure. Closest I can figure is the MAC addresses got fudged or something.
 
Cisco IOS bug... (what a surprise!)

After configuring a multicast feature on our global MPLS network, everything was fine around 11pm on a Friday. Monday morning, 9am, when the market feeds started to come in... that's when the shit really hit the fan. All of our Cisco 7613s in several MPLS cores were averaging 94% CPU utilization on their Sup720s. Turns out, our Cisco team said it was "the order in which the commands were entered that caused multicast to be in software mode". Pleeeeeeeeeeease... tell that to the traders lol. We couldn't risk bringing down the production network completely, so we waited. After biting our nails all day, we re-entered the config at midnight, and all was well. Lots of stress/pressure that day...

Then, after fixing that, we ran into another issue... but it wasn't as bad.
 
How about going to the director's location on multiple occasions because he keeps breaking his goddamn laptop? I think he has gone through at least 4 of them. I swear he treats it like garbage.
 
Client got new carpet. The carpeting people unplugged the Windows Server 2003 box, moved it, plugged it in, then unplugged it again to move it again, and so on. Killed the RAID controller, the RAID drives were messed up, and they "didn't know" to put a tape in the tape backup, so the only backup they had was from 3 years ago... Title company, so a LOT of records were on the server. So, for all the "backups" and the RAID 5, they destroyed it all.

I told them that it could be recovered, but it would take a while and cost a bit... They said no, we're done. I didn't go back to work for them, but they ended up going with a data recovery place and spent $14,000 to get the data back. And a whole month. They concluded it was the fault of the constant power interruptions and the moving. The client's head director bitched out the other office users for it (making them cry... SHITTY BOSS!), and one of them came to me, told me what happened, and apologized for her boss not letting me work on it anymore. She said the boss claimed it was my fault right up until the recovery company said otherwise. So, it was a big "I told you so" - I had been telling them for a while to do backups. Oh well, I would never -NEVER- work with them again. EVER.

The biggest challenge was working with her. The computer problem wasn't too bad; it was doable. The boss lady was a HORRIBLE experience. I'm surprised she can even survive in business with an attitude like that. I still doubt they do backups to this day.

Other big challenge: worst damn malware/virus protection EVER! I mean EVER!
 
Internally - a GPO was affecting our Exchange server, one of our sniffing VMs became corrupted after it wasn't properly shut down, and we blew a circuit.

Externally - the only big issue was a newspaper company that had some major problems with their routers, which caused them to lose hosting of their website. It took around 12 hours to find the problem and fix it.
 
Lost the company email server after a power cut, then found out that the backup was not set up correctly (not my job), so we lost all our emails. I had to single-handedly get a new mail server, set it up, re-create all the accounts, reconnect all the users, and then explain why their important emails were gone forever.

Moral of the story: never use RAID 1 in a server (RAID 5 minimum), and never let other people do the backups without checking them over yourself.
 
A few years ago I had to migrate a small business from a WinNT domain controller to a new Server 2003 Active Directory setup, along with adding domain A/V, updates, file shares, VPN, and backup.

Overall probably not that hard for the domain experts around here, but for me it was new and uncharted territory that was hair-pulling at times. I had never had to do much serious work with domains (most of my clients are small enough that they do not have a serious need for one), so it was all fairly new to me.

The client is happy now though, and everything has been working great for the last few years since it was set up. It is a very simple domain, but it does what it needs to ;)
 
One of our clients thought it would be funny to put some porn on their secretary's computer last week... ended up infecting it with Win Antivirus 09 lol. Not a difficult one, but I had a good laugh!
 
I worked on a computer once that had a possessed case. We replaced every single piece of hardware in it. Checked all the stand-offs, etc. Every time we installed a motherboard/CPU/RAM/PSU, it would not POST. The components would POST fine outside the case. This was true with two completely separate sets of components: one was Intel-based, the other AMD. We ended up telling the customer he needed a new case and all was well.
 
Power failure in our datacenter... not so bad, you say...


The diesel gen failed to start, then its backup failed to start...

That was one of the most stressful days of my life lol. We had numerous companies' entire infrastructure (servers and phones) in our DC and no power for 3 hours.
 
Biggest Challenge? That's a toughie. How about a 'biggest disaster' story instead? I had been on-site with a large medical organization for just over three months at this point, implementing VMware ESX on an enterprise scale. The very first site we stood up was ESX 2.5, as 3 had come out right as the first site started and they wanted to wait a little and see. Well, the time had come to upgrade the first site, since all future sites were being deployed as 3.x, and our other engineer on the job was helping them through that.

After a long day of prep work, our second engineer goes home, leaving the ESX 3 disk with the customer's engineer along with some instructions on upgrading the first host to practice with. Apparently, however, what wasn't made clear (or was forgotten) was that he should unplug the HBAs before booting off the CD. The idea was innocent enough: upgrade the first box overnight and then come in and perform the moves and VMFS migration the next day.

Now, the customer's engineer was a smart guy - very capable - and everyone makes mistakes. Well, when the ESX CD booted up, it detected the one LUN they had all of their VMs stored on as sda. See where this is going? Since he was comfortable installing ESX and the local hard drive is typically sda (when the HBAs are unplugged), he went on with the install, "upgraded" the server, then went home.

The next morning I arrive on-site to a crisis. The entire farm at that location was down. All of the VMs were showing as disconnected, and our engineer and the customer's engineer for that site were still at home. I spent the morning troubleshooting while the customer prepared press releases.

After a conference call between at least fifteen different companies and a WebEx with VMware support, we came to the conclusion that the LUN containing all of the VMs at that specific location appeared to have had ESX installed over it. The file creation dates all matched what we saw in the logs as far as when the hiccups started occurring; we just didn't want to face the truth until VMware confirmed it.

After we knew definitively what was wrong, we volunteered to help get things back up and running. Our second engineer decided that he didn't want to come in and subsequently left for another position elsewhere, leaving me there by myself. So...

I spent the next 24 or so hours straight helping them get the infrastructure rebuilt properly and bring VMs back from the dead. Backups? Well, they had been having trouble with that for some time... :( Old equipment everywhere. Heck, this site included VMs captured from Pentium Pros! Once their management came to relieve me and I had done all that I could do for them, I left for a two-hour drive home. Their engineer was still on-site and remained there for almost another day straight fighting with backups and getting everything back to where it was.

In the end they lost very little and were grateful for us going above and beyond. None of the patient-critical servers were virtualized at that point, only infrastructure, so no one was in harm's way either thankfully. It was a very hectic way to spend one's holiday weekend.
 
I worked on a computer once that had a possessed case. We replaced every single piece of hardware in it. Checked all the stand-offs, etc. Every time we installed a motherboard/CPU/RAM/PSU, it would not POST. The components would POST fine outside the case. This was true with two completely separate sets of components: one was Intel-based, the other AMD. We ended up telling the customer he needed a new case and all was well.

I've come across that once too. Replacing the case fixed the issue for me as well haha!
 
Complete server failure. The whole network was down.
16-hour shifts for 5 days to get everything up and running.

Set up a new server:
updates
Active Directory
DNS
user groups/permissions

SQL 05:
restore the database
set up jobs

Set up 14 workstations:
reconfigure the application to point to the new server (see the sketch after this list)
configure static IP addresses
test the application

Set up 14 lanes:
set up MSMQ
configure settings
test the application
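
The "reconfigure the application to point to the new server" step was the part worth scripting across 14 boxes instead of doing by hand. A minimal sketch of that idea, assuming the app keeps its server name in a plain-text config file - the path, file, and server names below are made up, not the actual setup:

```python
# Minimal sketch: swap the old server name for the new one in an app config file,
# keeping a backup copy first. The path and server names are hypothetical.
from pathlib import Path

CONFIG_PATH = Path(r"C:\PosApp\app.ini")         # hypothetical config location
OLD_SERVER, NEW_SERVER = "OLDSRV01", "NEWSRV01"  # hypothetical server names

def repoint_config(path: Path, old: str, new: str) -> None:
    """Replace the old server name with the new one, writing a .bak copy first."""
    text = path.read_text()
    path.with_name(path.name + ".bak").write_text(text)  # keep a backup
    path.write_text(text.replace(old, new))

if __name__ == "__main__":
    repoint_config(CONFIG_PATH, OLD_SERVER, NEW_SERVER)
    print(f"Repointed {CONFIG_PATH} from {OLD_SERVER} to {NEW_SERVER}")
```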
 
There's nothing wrong with RAID1.
It's just that RAID is not a backup.

Well, the power cut killed both the disks! And the gas generator didn't kick in. I never said RAID was a backup, but for data that is key to your operation, don't trust it to RAID 1.
 
Biggest disaster stories are much more interesting.

A few years back I was working for one of the larger casinos in Illinois as a server/helpdesk technician. All slot machines feed into a SQL cluster to record player activity and control the ticket-in/out system. The ticket-in/out system prints paper certificates with the dollar amount that can be redeemed at other slot machines or at the cage after a player cashes out. The install was completed by the software vendor, and shortly after they left I noticed that the quorum was set up as a single drive. No RAID 1. Not good, because if you lose the quorum, you lose the cluster.

Called Compaq to confirm that if I added the drive to the LUN it would automatically set up a RAID 1, copy the contents of the live drive, and do it without interruptions. So I added the drive and it started its copy. Let me tell you this: hardware failures come at the worst times. The live quorum drive decided to take a dump and the whole cluster came crashing down. No ticket-in/out system now. Patrons were slamming the cash-out button hoping to get their tickets. Cage people couldn't redeem tickets already created. It was chaotic. An utter mess. I was successful in recreating the quorum resource and getting the SQL cluster back online, but it took an hour.

The moral of the story: don't make changes on your company's busiest day, Thanksgiving.
 
My biggest challenge so far nearly pushed me to move far away and find a new job...

I am the Network Administrator for my company and work out of our main datacenter located in Baton Rouge. We are an insurance company and are in 8 states.

On Sept. 1, Hurricane Gustav hit Louisiana and obliterated Baton Rouge (and most everywhere else in Louisiana). There were so many trees down it took me 2 days to get to the office. Early estimates were that power might take up to 3 weeks to restore. That was great news when our generator had enough diesel for 3 days and it was impossible to get any here in town since there was no power. We had some shipped to us from out of state, which wasn't exactly easy to find.

We had a malfunction in our Halon system that meant it would not trigger if a fire started, so we had to keep a 24-hour watch just in case one did. Power to my office was not restored until 7 days after landfall. My wife and I slept at the office, and she would go home once a day or so to make sure we hadn't been looted and to check on the cats. We didn't get power back at our house until 10 days after landfall. I will just go on the record and say it was the worst time of my life so far, and I hope to never go through something like that again.

The good news though is that all the servers stayed up the whole time and our offices in the 7 other states were able to work as if nothing had happened. Looks like someone did a good job preparing.

Needless to say, I didn't move or change jobs. I should though.
 
Well, the power cut killed both the disks! And the gas generator didn't kick in. I never said RAID was a backup, but for data that is key to your operation, don't trust it to RAID 1.

The same thing could have happened with any RAID level. It is all about playing the odds with RAID; that is why it is not a backup.
 
Well, considering it's my job to fix disasters, I have too many.

This is a funny disaster.

We do support and managed services for a small local ISP. They have roughly 1500 cable modem customers and another 100 or so fiber customers. The fiber spans 4 counties, connecting offices (think Metro Ethernet) and providing internet. So, I'm at the NOC one day testing out a new VPN concentrator (a few years back), when one of the fiber techs comes in. We shoot the shit for a few minutes and he proceeds to check on a fiber problem he was working on. He bends over to look behind the rack, puts his hand on the big APC Symmetra UPS, and all of a sudden the room gets very quiet. I turn around and NONE of the equipment is running - no flashing lights, no fans... nothing. I start freaking for a second and ask the fiber tech wtf just happened. He goes, idk, and we proceed to look around for a good 10 minutes or so trying to figure out what happened. Then the fiber tech goes, "Oh shit!" and flips the power switch on the back of the UPS. Everything proceeded to power up. Then I spent the next 3-4 hours making sure everything came back up alright, calming customers down, etc.

Wasn't hard work, but funny nonetheless.
 
IMO the biggest challenges for me have not been hardware, network, or IT-related issues... they have been politics in all shapes, sizes, and forms.
 
Well, the power cut killed both the disks! And the gas generator didn't kick in. I never said RAID was a backup, but for data that is key to your operation, don't trust it to RAID 1.

The same thing could have happened with any RAID level. It is all about playing the odds with RAID; that is why it is not a backup.

Thank you Grentz!



I've come across that once too. Replacing the case fixed the issue for me as well haha!
That's pretty darn funny, I've not run into that one before ;)
 
The funniest one was when an ex-employee of one of my clients broke in and stole the server and all the backup tapes.
They were SOL for 48 hours until the cops caught up with him and got the server and tapes back.
 
The funniest one was when an ex-employee of one of my clients broke in and stole the server and all the backup tapes.
They were SOL for 48 hours until the cops caught up with him and got the server and tapes back.

Either a really pissed-off employee or a really horrible a$$hat employer... kinda funny.
 
Either a really pissed-off employee or a really horrible a$$hat employer... kinda funny.
The second - well, both; the second caused the first... a real PITA for me to work with also.
Always late paying their bills and never listened.
If they had listened, they would have had an off-site tape to restore from.
 