Thanks to the help from blog sponsorship, I’m able to maintain a higher-performing lab environment than at any point before. One area I hadn’t invested much in, at least from a lab standpoint, is networking. In the past, I’ve always had some sort of small to mid density unmanaged Ethernet switch. And this was fine. Household name brand switches like Netgear and SMC from Best Buy and NewEgg performed well enough and survived for years in the higher temperature lab environment. Better yet, by virtue of being unmanaged, they were plug and play. No time wasted fighting a misconfigured network.
I recently picked up a 3Com SuperStack 3 Switch 3870 (48 1GbE ports). It’s not 10GbE, but it does fit my budget and brings a few other networking nice-to-haves like VLANs and Layer 3 routing. Because this switch is managed, I can now apply some best practices from the IP-based storage realm. One of those best practices is configuring Flow Control for VMware vSphere with network storage. This blog post is mainly to record some pieces of information I’ve picked up along the way and to open a dialog with network-minded readers who may have some input.
So what is network Flow Control?
NetApp defines Flow Control in TR-3749 as “the process of managing the rate of data transmission between two nodes to prevent a fast sender from overrunning a slow receiver.” NetApp goes on to advise that Flow Control can be set at the two endpoints (the ESX(i) host and the storage array) as well as at the Ethernet switch(es) in between.
Wikipedia is in agreement with the above and adds more meat to the discussion, including the following: “The overwhelmed network element will send a PAUSE frame, which halts the transmission of the sender for a specified period of time. PAUSE is a flow control mechanism on full duplex Ethernet link segments defined by IEEE 802.3x and uses MAC Control frames to carry the PAUSE commands. The MAC Control opcode for PAUSE is 0x0001 (hexadecimal). Only stations configured for full-duplex operation may send PAUSE frames.”
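For quick reference, here’s the anatomy of a PAUSE frame in comment form. The EtherType, opcode, and reserved multicast destination come straight from the 802.3x definition; I’m filling in the remaining field details from the standard, so treat this as a cheat sheet rather than gospel:

```
# 802.3x PAUSE frame at a glance
#   Destination MAC : 01:80:C2:00:00:01   (reserved multicast; bridges do not forward it)
#   Source MAC      : sending station's MAC
#   EtherType       : 0x8808              (MAC Control)
#   Opcode          : 0x0001              (PAUSE)
#   pause_time      : 2 bytes, 0-65535    (units of 512 bit times; 0 means resume)
#   Padding         : zeros out to the 64-byte minimum frame size
```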
What are network Flow Control best practices as they apply to VMware virtual infrastructure with NFS or iSCSI network storage?
Both NetApp and EMC agree that Flow Control should be enabled in a specific way at the endpoints as well as at the Ethernet switches which support the flow of traffic:
- Endpoints (that’s the ESX(i) hosts and the storage arrays) should be configured with Flow Control send/tx on, and receive/rx off.
- Supporting Ethernet switches should be configured with Flow Control “Desired” or send/tx off and receive/rx on.
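On the ESX(i) side, ethtool is the quickest way to see what was actually negotiated. A minimal sketch from the service console, assuming vmnic0 is the uplink carrying your storage traffic:

```
# Show the current Flow Control (pause) state negotiated on a vmnic
ethtool -a vmnic0

# Sample output -- actual values depend on the NIC and switch negotiation:
#   Pause parameters for vmnic0:
#   Autonegotiate:  on
#   RX:             on
#   TX:             on

# Force the vendor-recommended endpoint setting (tx on, rx off).
# Note: as covered below, this change does not survive a reboot.
ethtool -A vmnic0 autoneg off rx off tx on
```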
One item to point out here is that although both mainstream storage vendors recommend these settings for VMware infrastructures as a best practice, neither of their multiprotocol arrays ships configured this way. At least not the units I’ve had my hands on, which include the EMC Celerra NS-120 and the NetApp FAS3050c. The Celerra is configured out of the box with Flow Control fully disabled, and I found the NetApp configured with Flow Control set to full (in Data ONTAP terms, both send and receive enabled).
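Bringing the FAS in line with the documented recommendation is a one-liner per interface. A sketch assuming Data ONTAP 7-mode syntax of that era; e0a is a hypothetical interface name, and the same line should also go into the controller’s /etc/rc to persist across reboots:

```
# Check the interface; the output includes its current flowcontrol setting
ifconfig e0a

# Set the vendor-recommended endpoint behavior: send on, receive off
# (valid values are none | receive | send | full)
ifconfig e0a flowcontrol send
```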
Here’s another item of interest. VMware vSphere hosts are configured out of the box to auto negotiate Flow Control settings. What does this mean? Network interfaces advertise the features and protocols they were purpose built to understand (following the OSI model and RFCs, of course), and Flow Control is one of those features. VMware ESX ships with a Flow Control setting which adapts to its environment. Plug an ESX host into an unmanaged switch which doesn’t advertise Flow Control capabilities, and ESX sets its tx and rx flags to off. These flags tie specifically to the PAUSE frames mentioned above.

When I plugged my ESX host into the new 3Com managed switch and enabled Flow Control on the ports, I subsequently found, using the ethtool -a vmnic0 command, that both tx and rx were enabled on the host (the 3Com switch has just one Flow Control toggle: enabled or disabled). NetApp hints at this behavior in their best practice statement: “Once these [Flow Control] settings have been configured on the storage controller and network switch ports, it will result in the desired configuration without modifying the flow control settings in ESX/ESXi.”

Jase McCarty pointed out back in January a “feature” of ethtool in ESX. Basically, ethtool can be used to display current Ethernet adapter settings (including Flow Control, as mentioned above) and it can also be used to change them. Unfortunately, when ethtool is used to hard code a vmnic to a specific Flow Control configuration, that config lasts only until the next time ESX is rebooted. After reboot, the modified configuration does not persist and it reverts back to auto/auto/auto. I tested with ESX 4.1 and the latest patches and the same holds true. Jase offers a workaround in his blog post which allows the change to persist by embedding it in /etc/rc.local.
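For completeness, here’s a minimal sketch of that style of workaround. The vmnic names and the tx on/rx off setting are my assumptions to match the endpoint best practice above; see Jase’s post for his exact approach:

```
# Append to /etc/rc.local so the Flow Control setting is re-applied at boot
# (plain ethtool -A changes are lost when ESX restarts)
cat >> /etc/rc.local << 'EOF'
# Re-apply storage uplink Flow Control: send/tx on, receive/rx off
ethtool -A vmnic0 autoneg off rx off tx on
ethtool -A vmnic1 autoneg off rx off tx on
EOF
```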
Third item of interest. VMware KB 1013413 covers disabling Flow Control using esxcfg-module for Intel NICs and ethtool for Broadcom NICs. The article specifically addresses disabling Flow Control when PAUSE frames are identified on the network. If PAUSE frames indicate a volume of traffic a receiver isn’t able to handle, it would seem to me we’d want to leave Flow Control enabled (it exists precisely to mediate that congestion) and perform root cause analysis on exactly why we’ve hit a sustained scaling limit (and what to do about it long term).
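If you do need to follow the KB and disable Flow Control, the Broadcom path is plain ethtool; the Intel path sets a driver module option whose exact string varies by driver version, so I’ve left it as a placeholder rather than guess (pull the real value from KB 1013413 for your NIC):

```
# Broadcom NICs: turn pause frames off on the vmnic
ethtool -A vmnic2 autoneg off rx off tx off

# Intel NICs: Flow Control is a module option on the driver (e.g. e1000).
# <option-string> is a placeholder -- consult KB 1013413 for the real value.
esxcfg-module -s "<option-string>" e1000
esxcfg-module -g e1000    # verify the configured option string
```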
Fourth. Flow Control seems to be a simple mechanism which hinges on PAUSE frames to work properly. If the Wikipedia article is correct that only stations configured for full-duplex operation may send PAUSE frames, then at first read it would seem both network endpoints (in this case ESX(i) and the IP-based storage array) should be configured with Flow Control set to full, meaning both tx and rx ON. This conflicts with the best practice messages from EMC and NetApp, although it does align with the FAS3050 out-of-box configuration. The likeliest explanation is that I’m misinterpreting “full-duplex” here: the Wikipedia sentence appears to describe the duplex mode of the Ethernet link itself (simultaneous send and receive), not a Flow Control tx/rx setting, and the duplex requirement says nothing about which pause directions you enable.
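If that reading is right, the thing worth verifying is the link’s duplex mode rather than the pause flags themselves. A quick check on the ESX host:

```
# Confirm the storage uplink negotiated full duplex (a prerequisite for PAUSE)
ethtool vmnic0 | grep -i -e Speed -e Duplex
```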
Lastly, I’ve got myself all worked up into a frenzy over the proper configuration of Flow Control because I want to be sure I’m doing the right thing from both a lab and an infrastructure design standpoint. But in the end, Flow Control is like the Shares mechanism in VMware ESX(i): the values or configurations invoked apply only during periods of contention. In the case of Flow Control, this means that although it may be enabled, it serves no useful purpose until a receiver on the network says “I can’t take it anymore” and sends PAUSE frames to temporarily suspend traffic. I may never reach this tipping point in the lab, but I know I’ll sleep better at night knowing the lab is configured according to VMware storage vendor best practices.