DNS Server Upgrade

Ehren8879

Supreme [H]ardness | Joined: Sep 24, 2004 | Messages: 4,499
Edit: The information below was posted before I knew my ass from my elbow :D

Edit #2 (1/8/2012): Upgrade complete. Fixed a number of problems and implemented DNSSEC. Update in post #22.

Currently we run a single primary and a single secondary DNS server. The way we have them handed out, our primary is getting hammered and our secondary gets very light usage. So the plan is to upgrade our boxes and use a master-slave relationship to distribute the load and make resolving more responsive during heavy load. I'm just trying to figure out what makes sense.

This is what I'm currently thinking about doing:

Primary master server with a slave.
Secondary master server with no direct slave. I don't feel the load justifies a secondary slave.

In the event the primary master fails, the primary slave will be pointed at the secondary master's IP so it can transfer zones from it. The slave should continue to work by itself if the primary master goes offline, but having a backup master to pull from should keep the load from suffering.
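
To make the plan concrete, the zone-transfer piece would look roughly like this in named.conf (the zone name and addresses below are just placeholders, not our real config):

On the primary master:

zone "example.net" {
    type master;
    file "master/example.net.db";
    allow-transfer { 192.0.2.11; };   // the slave's address
    notify yes;
};

On the slave:

zone "example.net" {
    type slave;
    file "slave/example.net.db";
    masters { 192.0.2.10; };   // primary master; the secondary master's address could be added here as a fallback
};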

We're currently providing DNS to around 6000 broadband customers via BIND9.


Wow, I was so dumb back then... :)
 
Not sure I follow. Are you using round robin with multiple servers per A record, or a single server per A record? I assume these are just public recursive servers, not authoritative servers?

i.e.:

ns1.company.com = 1.1.1.1
ns2.company.com = 1.1.1.2

or:

ns1.company.com = 1.1.1.1
ns1.company.com = 1.1.1.2

?
 
I'm currently learning this as I go, so thank you for bearing with me!

Our two DNS servers are public recursive servers, so:

ns1.company.com = 1.1.1.1
ns2.company.com = 1.1.1.2

Presently they do not communicate with one another in any fashion. Changes made to one server's config must be made manually to the other. Things are simple enough where this is not much of an issue.

That said, perhaps I should back up here a bit and simply ask for some direction.

What I'd like to achieve is spreading the load on ns1 and having some kind of failover should we experience hardware failure. I assume that currently, if ns1 suddenly stops answering requests, a timeout period must expire before ns2 is tried. If we had hardware failover then ns1 could keep on trucking and our customers would experience less latency in this circumstance. However, we would still maintain a second name server just in case.

(We hand out ns1 as primary and ns2 as secondary via DHCP.)
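
For reference, the DHCP side of that is just the standard name-server option; a minimal ISC dhcpd sketch with placeholder addresses:

# ISC dhcpd: ns1 is handed out first; clients only fall back to ns2 after a timeout
option domain-name-servers 1.1.1.1, 1.1.1.2;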
 
I am assuming all the hardware is in one location, yes?

You'll need some form of load balancing to spread the load. Not sure if your router or firewall has that built in. OpenBSD's pf works pretty well for that, and it's simple as hell to configure the load balancing. Failover can be accomplished with pf and relayd/CARP on OpenBSD as well.

BTW, recursive DNS for ~6000 users shouldn't really beat on a machine that badly. What sort of hardware are you using?

EDIT: Looks like pfSense does this pretty much out of the box with hardly any intervention.
http://doc.pfsense.org/index.php/Setup_Incoming_Load_Balancing

Obviously DNS is on 53, not 80.. Otherwise .. Bam done.
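
Roughly speaking, the pf side of it is a round-robin redirect on the DNS service address; a minimal pf.conf sketch using the newer rdr-to syntax, with made-up addresses:

# spray port 53 across both resolvers, alternating per state
dns_vip = "203.0.113.53"
resolvers = "{ 192.168.10.11, 192.168.10.12 }"
pass in on egress inet proto { udp, tcp } from any to $dns_vip port 53 rdr-to $resolvers round-robin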
 
We're due for hardware upgrades, but currently each server is a dual-core Opteron; I'm brain farting on the specifics. Perhaps I might be misjudging the actual load, but I've been told that the primary is "getting hammered". Our firewall is a Cisco ASA. Perhaps I could dig into that and see if I can load balance with our current servers. Unfortunately pfSense isn't an option. Boss man takes a little bit of convincing about things like this and I'm just trying to learn and make him happy with some results.

If we can make the ASA redirect some ns1 requests to ns2 (or redirect ns1 requests to ns2 if ns1 is unresponsive) then that would be great.

Edit: Hardware is all in the same rack
 
If boss man is a stick-in-the-mud, throw a quick quad-core machine with hyperthreading and 8+ GB of RAM at it; it'll do fine.

I'd suggest something BSD-ish for the OS (FreeBSD 8.2 amd64), and make sure BIND 9 is compiled with threads.

Hell, if you're really budget limited, max out the memory in the opteron, it may put off the inevitable.

Oh, and real network cards. Very important.
 
Our servers are primarily AMD/CentOS boxes.

Edit: I just ran namebench against our DNS servers and everything else out there absolutely crushed us. Google DNS had a mean of 35 ms, whereas our ns1 came in at 270 ms and ns2 at around 500 ms. We gotta do something about this.

I have to check to see if BIND is taking advantage of multithreading.
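
In case it helps anyone else, this is roughly how I plan to check it (output details vary a bit by BIND version):

named -V | grep -i enable-threads      # was the binary built with --enable-threads?
rndc status | grep -iE 'cpus|worker'   # running instance: "CPUs found" / "worker threads" lines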
 
You might want to look at Unbound, which is also very common and supposedly more secure and a lot faster than BIND. I'm not a user of it myself, though.
//Danne
 
Unbound is also what we run on our recursive machines.

I would think that BIND would be able to handle 6000 recursive clients. (I'd hope so anyway... it's not that much traffic!)
 
For what it's worth, I run an eCommerce website that averages 3k-4k unique visitors at any one time... our primary external DNS server is a Debian box running BIND 9.7; it has one CPU and 512 MB of RAM (that's right, megabytes). Currently the box has a load average of 0.05, 0.04, 0.08 and (aside from cache) is using 141 MB of RAM.

Are you sure your hardware is the issue? DNS is not a very intensive operation.
 
I'm certain it's not a hardware issue, nor a network issue for that matter. What I have here is a config issue, but I've yet to pinpoint it. When I run dig +trace domain @localhost I often get the following result.

dig: couldn't get address for x.root-servers.net

Every time it fails, it fails for a different root server (normal rotation through the list, I presume). Why it's failing, I'm not sure. On a rare occasion it doesn't fail.

I'm going to be testing Unbound fairly soon, which, if it works, will prompt a hardware upgrade anyway.
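
For anyone following along, these are roughly the checks I've been running from the box itself (using a.root-servers.net's address as the example):

dig @localhost . NS                  # can our resolver return the root NS set at all?
dig +trace example.com @localhost    # walk the delegation from the roots
dig @198.41.0.4 . NS                 # ask a root server directly, bypassing our resolver
dig @198.41.0.4 . NS +tcp            # same query over TCP, in case UDP works but TCP is blocked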
 
For what it's worth, I run an eCommerce website that averages 3k-4k unique visitors at any one time... our primary external DNS server is a Debian box running BIND 9.7; it has one CPU and 512 MB of RAM (that's right, megabytes). Currently the box has a load average of 0.05, 0.04, 0.08 and (aside from cache) is using 141 MB of RAM.

Are you sure your hardware is the issue? DNS is not a very intensive operation.

Is that box running recursive or authoritative? Those are fine stats (if not overkill) for authoritative DNS for a few domains. It's the same queries over and over again, all of which will be cached in practically no RAM at all.

I very highly doubt 3K concurrent users would be able to use that machine for recursive DNS.
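
If you want to check from the outside, ask the box for a name it isn't authoritative for and look at the flags (server name here is hypothetical):

dig @ns1.example.com www.google.com     # an answer with the "ra" flag set means it's doing open recursion
dig @ns1.example.com example.com SOA    # an authoritative answer comes back with the "aa" flag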
 
Update:

My ultimate goal is to transition from BIND to Unbound and NSD. Unbound is a dream to configure and is a performance champ. NSD is also fairly simple to configure once you understand zone files. I've got a lot of work to do. I'm not ready to detail everything just yet, but I will after the fact.
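
To give a sense of why I call it a dream to configure, a stripped-down recursive unbound.conf looks something like this (values are placeholders, not our production config):

server:
    interface: 0.0.0.0        # or the resolver's specific address
    num-threads: 4            # roughly one per core
    msg-cache-size: 128m
    rrset-cache-size: 256m
    prefetch: yes             # refresh popular records before they expire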

This has been one of the greatest learning experiences I've had since taking on these new responsibilities.
 
Outsource your DNS to OpenDNS.com. Just input their IPs at your DHCP server. Quick and dirty. And/or set up a local pfSense box to locally cache DNS records to reduce the amount of network traffic.

6000 customers? You also might benefit from a local BlueCoat transparent caching server such as a BlueCoat Proxy SG 8000 or 8100.
 
I think there is another issue here. I have seen ISPs with name servers with less power than that and they have no issues.

What is the actual load on the servers at the moment (RAM, CPU, etc.)?
 
There are many issues with our current DNS servers, too many to even list. I would have to reconfigure BIND from the ground up to fix the problem, so I've decided to abandon them completely. Currently the servers are "patched", which has improved their performance significantly, but the writing is on the wall...

We're very much a "keep it local" type of shop so I'll be rolling out new servers. Outsourcing DNS duties would have been a bigger headache.

What I can say at this point is we're moving away from BIND in favor of Unbound and NSD. This has definitely been a learning experience to say the least. Thanks go to mwarps for putting up with my many stupid questions.

I'll detail everything once the migration is complete. I suspect that will be a month or more still.
 
Not stupid questions. Good questions.
 
Mini update:

I was getting ready to put the new boxes in production today, only to find the recursive servers wouldn't communicate with the root servers when behind our ASA. Long story short, we had failed to update a TCP connection rule in our DNS group, which was the issue all along. Since the change our BIND servers have been running like champs. Haha, only took $4000 in hardware and two months of my time to learn this!

We were due for a hardware upgrade anyway and we've sorted out some other issues along the way, so not a loss.


More to come soon.
 
wow... that's brutal.

TCP is a requirement above a 512-byte response. Not many queries will be that large, but some can be, and it's going to ruin your day if the authoritative server only wants to answer over TCP.

Were you restricting incoming or outgoing traffic? Incoming I can see..

I've seen issues at the 0.005%-of-all-DNS-traffic level caused by fragmented TCP DNS queries, so... I feel your pain.

When you're handling upwards of 1,000,000,000 queries a day .. That hurts.
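
If you ever want to see the behavior for yourself, something like this shows it (using a DNSKEY set that's well over 512 bytes):

dig +noedns +ignore @8.8.8.8 org DNSKEY    # no EDNS, don't retry: the flags line shows "tc" (truncated)
dig +tcp @8.8.8.8 org DNSKEY               # same query over TCP returns the full answer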
 
I completed this DNS project last week. All in all, everything went smoothly despite a motherboard shortage, which actually gave me time to tweak the production configs and better stagger the rollout.

I figured out the source of our original DNS performance issues (already discussed in this thread) and implemented some security measures to help protect our customers. One thing I never mentioned until now is that our original servers were doing both recursive and authoritative duties. Apparently this is a big no no. :)

So we purchased twice the hardware we intended and separated those functions out. Now we are fully DNSSEC capable (ahead of the curve, it seems) and recursion is no longer wide open to the internet.
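
For anyone curious, closing down recursion and turning on validation in Unbound boils down to just a few lines (the prefix and key path here are placeholders, not our real config):

server:
    access-control: 203.0.113.0/24 allow       # our customer ranges get recursion
    access-control: 0.0.0.0/0 refuse           # everyone else is refused
    auto-trust-anchor-file: "/var/lib/unbound/root.key"   # DNSSEC root trust anchor, kept current via unbound-anchor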

I can't summarize all the details here, but if anyone has any questions I'll do my best to share what I've learned along the way.


Big thanks go out to Mwarps and all the info pages out on the internet that helped me piece together my understanding of the dark art of DNS.
 