Sometimes I take things for granted. For instance, the health and integrity of the lab environment. Although it is a “lab”, I do run some workloads which need to stay online on a regular basis: primarily the web server this blog is served from, the email server where I do a lot of collaboration, and the Active Directory Domain Controllers/DNS servers which provide authentication, mailbox access, external host name resolution to fetch resources on the internet, and internal host name resolution.
The workloads and infrastructure in my lab are 100% virtualized. The only “physical” items I have are the type 1 hypervisor hosts, storage, and network. By this point I’ll assume most are familiar with the benefits of consolidation. The downside is that when the wheels come off in a highly consolidated environment, the impacts can be severe as they fan out and tip over downstream dependencies like dominoes.
A few weeks ago I decided to recarve the EMC Celerra fibre channel SAN storage. The VMs which had been running on the EMC fibre channel block storage were all moved to NFS on the NetApp filer. Then last week, the Gb switch which supports all the infrastructure died. Yes, it was a single point of failure – it’s a lab. The timing couldn’t have been worse since all lab workloads were running on NFS storage. All VMs lost their virtual storage, and the NFS connections on the ESX(i) hosts eventually timed out.
The network switch was replaced later that day, and since all VMs were down and NFS storage had disconnected, I took the opportunity to gracefully reboot the ESX(i) hosts; a good time for a fresh start. Not surprisingly, I had to use the vSphere Client to connect to each host by IP address since at that point I had no functional DNS name resolution in the lab whatsoever. When the hosts came back online, I was about to begin powering up VMs, but instead I encountered a situation I hadn’t planned for – all the VMs were grayed out, essentially disconnected. I discovered the cause: after the host reboot, the NFS storage hadn’t come back online – both NetApp and EMC Celerra – on both hosts. There’s no way both storage cabinets and/or both hosts were having a problem at the same time, so I assumed it was a network or cabling problem. With the NFS mounts in the vSphere Client staring back at me in their disconnected state, it dawned on me – the lack of DNS name resolution was preventing the hosts from connecting to the storage. The hosts could not resolve the FQDNs of the EMC Celerra or the NetApp filer. I modified /etc/hosts on each ESX(i) host, adding the TCP/IP address and FQDN for the NetApp filer and the Celerra Data Movers. Shortly after, I was back in business.
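For reference, the fix amounted to nothing more than a couple of static entries per host. A minimal sketch is below; the IP addresses and hostnames are hypothetical placeholders, not my actual lab values:

```
# On each ESX host's service console (or Tech Support Mode on ESXi), append
# static entries for the NFS targets. Addresses and names are placeholders.
echo "192.168.1.50   netapp1.lab.local netapp1"         >> /etc/hosts
echo "192.168.1.60   celerra-dm2.lab.local celerra-dm2" >> /etc/hosts

# Verify the host can now resolve and reach the filers without DNS
ping -c 3 netapp1.lab.local
ping -c 3 celerra-dm2.lab.local
```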
What did I learn? Not much. It was more a reiteration of important design considerations which I was already aware of:
- 100% virtualization/consolidation is great – when it works. The web of upstream/downstream dependencies makes it a pain when something breaks. Consolidated dependencies which you might consider leaving physical or placing in a separate failure domain:
- vCenter Management
- Update Manager
- SQL/Oracle back ends
- Name Resolution (DNS/WINS)
- DHCP
- Routing
- Active Directory/FSMO Roles/LDAP/Authentication/Certification Authorities
- Internet connectivity
- Hardware redundancy is always key but expensive. Perform a risk assessment and make a decision based on the cost effectiveness.
- When available, diversify virtualized workload locations to reduce the failure domain, particularly to split workloads which provide redundant infrastructure support such as Active Directory Domain Controllers and DNS servers. This can mean placing workloads on separate hosts, separate clusters, separate datastores, separate storage units, maybe even separate networks depending on the environment.
- Static entries in /etc/hosts aren’t a bad idea as a fallback if you plan on using NFS in an environment with unreliable DNS (see the mount sketch after this list), but I think the better point to discuss is the risk and pain which will be realized in deploying virtual infrastructure in an unreliable environment. Garbage In – Garbage Out. I’m not a big fan of using IP addresses to mount NFS storage unless the environment is small enough.
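For what it’s worth, mounting by IP versus FQDN is a trivial difference at the command line; the real distinction is whether name resolution is available at mount and reconnect time. A minimal sketch, with a hypothetical IP, export path, and datastore label:

```
# Mount an NFS export by IP address so the host carries no DNS dependency
# (values here are made up for illustration).
esxcfg-nas -a -o 192.168.1.50 -s /vol/vm_datastore1 nfs_vms01

# The FQDN form works the same way, but only if the host can resolve the
# name via DNS or /etc/hosts:
# esxcfg-nas -a -o netapp1.lab.local -s /vol/vm_datastore1 nfs_vms01

# List the NFS mounts and confirm they show as mounted
esxcfg-nas -l
```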
At least one physical AD and DNS server always made me sleep better at night.
Realistically, your NFS network would be isolated from production, and normally you wouldn’t have DNS services on that separated network. Also, IP is the best practice on most storage arrays I’ve worked on because with VIF IPs you don’t always use the same hostname; you could use multiple IPs for different mounts on the same array for better load balancing.
NFS + ESX + DNS = disaster waiting to happen
@Tony With the right convergence of circumstances, the combination you mention can indeed be disastrous. However, I wouldn’t necessarily call foul on a design which uses FQDNs for NFS mounts. I know of some large enterprise environments which are architected this way along with the necessary redundancy to prevent any Achilles’ heel. FQDN is used not only with VMware infrastructure but with other operating systems as well (Windows, Linux, Unix, etc.). Ideally, NFS traffic will be isolated and there may not be DNS services ON that network, but that doesn’t mean there can’t be a DNS namespace created FOR that network on the DNS infrastructure. Thank you for your input!
Just remember that “manual input” in your /etc/hosts. A few times I’ve found people who added hosts to that file, and later nobody remembered it and problems just started.
nice “lab” btw….
@jason,
I agree with @Tony about relying on DNS for “service tier 0” storage applications. Chicken-egg scenarios are just one of the issues to face: what about DNS poisoning and network failures when the DNS server is more than 1 layer-2 hop from the reliant service end-point?
That said, I still find it best practice to use IP addresses for mount points in NFS and to populate DNS and /etc/hosts as well. Since IP addresses rely only on the supporting layer-2 and layer-3 infrastructure to function (and no other service dependencies), it is the only way to provision NFS/iSCSI storage at “service tier 0.” Client end-points – dependent upon a host of other services like DNS, DHCP, file servers, etc. – can use DNS-resolved NFS volumes, but not “service tier 0.” These guys are likely to be the underpinnings of your root file services, DNS servers and DHCP hosts.
In short (and in my opinion), you can’t have critical storage (service tier 0) tied to DNS – it simply opens you up to unnecessary race conditions. That includes ANY primary storage in virtual infrastructures. Sure, your ISO image store, backup volume or template store can be tied to DNS – anything you can stand to be without access to for some reasonable period of time – but not your VM, swap or production data volumes.
Once IP configurations and /etc/hosts updates are part of the “service tier 0” NFS configuration process, they won’t be forgotten: it’s process. If it’s a one-off “fix” you’re already in deep water…
By the way, dangerous to let this “secret” out of the bag – you’re spoiling the fun for the uninitiated! Once someone gets stung by bad DNS resolution, they will forever think twice about anchoring critical storage to it.
I’d hate to see a “green” consultant disregard the dangers only to leave a ticking time bomb for the end-user the next time an intermediate switch or server hiccups… This is a problem best encountered and successfully conquered in the lab… In fact, it ought to be something that is intentionally “pulled” on the trainee just to develop some respect for the potential problem it could cause.
BTW, I’ve had many questions about this from DIY clients in the past. It always seems to come up because host names are so easy to deal with and IP addresses are not thought of as physically tied to a service. I guess it’s easy coming from an ISP/DNS background to chalk it up this way:
How do Internet DNS servers work? In other words, how does a domain – call it myco.com – resolve when the authoritative DNS servers are dns1.myco.com and dns2.myco.com? Actually, at least one of these hosts (usually both) must be pinned to a host record in the root servers responsible for the domain registry. These glue records are resolvable regardless of whether dns1 and dns2 are reachable.
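You can see that delegation/glue behavior from any machine with dig installed (myco.com is just the placeholder domain from above):

```
# Ask a .com gTLD server directly for the delegation of the placeholder
# domain. The referral returns NS records plus glue A records for dns1/dns2,
# which breaks the chicken-and-egg of name servers living inside the zone
# they serve.
dig NS myco.com @a.gtld-servers.net +norecurse

# Or walk the entire delegation chain from the root servers down:
dig +trace myco.com
```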
In a 100% virtualized infrastructure, there is no equivalent to a DNS root server (i.e. no physical server sitting outside the virtual hosts) to resolve DNS for NFS. For that matter, no iSNS hosts either (for the iSCSI and FC wonks out there). Any IP storage at “service tier 0” needs to boot-strap these services with either “persistent” local records (like /etc/hosts) or by IP address directly.
Does all this create a special problem for “service tier 0” configuration management? Yep. In that case, shrinking “service tier 0” down to the smallest possible footprint would be wise, requiring a staggered start-up process: bring up service tier 0, then all other service platforms according to priority and serviceability. For my infrastructures, the switches, routers, storage and hypervisors that host mission critical workloads comprise “service tier 0” and are prioritized in that order (OK, I like OSI approaches).
It would be interesting to hear how other pros have dealt with this issue…
For what it’s worth, I’ve been moving away from using DNS names for my NFS mounts in vSphere. There are situations where changing the underlying DNS references can actually cause vCenter to lose track of the datastores (and bring them back in as (1) datastores).
At that point, you’ve lost any flexibility from using DNS and actually introduced some rather nasty dependencies/recovery scenarios.
For what it’s worth, the NetApp RCU uses IP addresses when deploying NFS datastores.
We have an NFS environment as well, and I hate being forced to go physical because of DNS, so I decided to set up two DNS servers on the local storage of the ESX hosts and used DRS rules to make sure they wouldn’t run on the same ESX host. So far this seems to have worked well, and there’s been no need to “revert” to a physical device for DNS.
We do use IP for NFS mapping, but our SUN storage device starts to cry if it can’t do a reverse lookup, and all of our main NFS storage is on the SUN device.
Hey Jason – I’ve seen NFS mounts done both with DNS and with IP. I prefer IP for the reason Andrew suggested. I have seen the (1) datastore; not pretty. Also, if you use DNS there is always the chance that the NFS traffic would pop up to Layer 3 instead of sticking to a nice speedy Layer 2 path. This happened at another customer once upon a time when they swore to me over and over that everything was Layer 2 only. Not a super big deal, but it can make a difference. Thanks!