boche.net – VMware vEvangelist

KB1008130: VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely

January 19th, 2009 by jason

I became aware of this issue last week by word of mouth and received the official Email blast from VMware this morning.

The vulnerability lies in a convergence of circumstances:

1. Fibre channel SAN storage with multipathing
2. A fibre channel SAN path failure or planned path transition
3. A metadata update occurring during the fibre channel SAN path failure. Metadata updates include, but are not limited to:

a. Power operations of a VM
b. Snapshot operations of a VM (think backups)
c. Storage VMotion (sVMotion)
d. Changing a file’s attributes
e. Creating a VMFS volume
f. Creating, modifying, deleting, growing, or locking of a file on a VMFS volume

The chance of a fibre channel path failure is slim. Metadata updates, however, happen more often than you might think. If a path failure does occur, chances are good that a metadata update will be in flight, which is precisely when disaster strikes. The vulnerability also undermines the safety that multipathing is supposed to provide.

Please be aware of this.

Dear ESX 3.5 Customer,

Our records indicate you recently downloaded VMware® ESX Version 3.5 U3 from our product download site. This email is to alert you that an issue with that product version could adversely affect your environment. This email provides a detailed description of the issue so that you can evaluate whether it affects you, and the next steps you can take to get resolution or avoid encountering the issue.

ISSUE DETAILS:
VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely. This occurs when VMFS3 metadata updates are being done at the same time that failover to an alternate path occurs for the LUN on which the VMFS3 volume resides. The affected releases are ESX 3.5 Update 3 and ESXi 3.5 U3 Embedded and Installable with either Active/Active or Active/Passive SAN arrays (Fibre Channel and iSCSI).

PROBLEM STATEMENT AND SYMPTOMS:
ESX or ESXi host may get disconnected from VirtualCenter
All paths to the LUNs are in standby state
esxcfg-rescan might take a long time to complete or never complete (hung)
VMkernel logs show entries similar to the following:

Queue for device vml.02001600006006016086741d00c6a0bc934902dd11524149442035 has been blocked for 6399 seconds.

Please refer to KB 1008130.
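If you want to check a host's VMkernel log for this symptom, here is a minimal Python sketch. It is not a VMware tool. The regular expression is modeled only on the single sample message quoted above, so the exact wording on other builds is an assumption, and the function name `find_blocked_queues` is made up for illustration. It takes any iterable of log lines, so you can feed it a file you have copied off the host.

```python
import re

# Pattern modeled on the one sample message in the advisory (assumption:
# other builds use the same wording). The character class tolerates a space
# inside the device ID in case the message was wrapped when it was pasted.
BLOCKED_RE = re.compile(
    r"Queue for device (?P<device>vml\.[0-9a-fA-F ]+?) "
    r"has been blocked for (?P<seconds>\d+) seconds"
)


def find_blocked_queues(lines):
    """Return a list of (device_id, blocked_seconds) for matching log lines."""
    hits = []
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            # Drop any embedded spaces so the same device always compares equal.
            device = match.group("device").replace(" ", "")
            hits.append((device, int(match.group("seconds"))))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_blocked.py <copy-of-vmkernel-log>
    with open(sys.argv[1], errors="replace") as handle:
        for device, seconds in find_blocked_queues(handle):
            print(f"{device} blocked for {seconds} seconds")
```

Any hit means the host is already in the blocked state described here, and per the advisory only a reboot clears it.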

SOLUTION:
A reboot is required to clear this condition.

VMware is working on a patch to address this issue. The knowledge base article for this issue will be updated after the patch is available.

NEXT STEPS:
If you encounter this condition, please collect the following information and open an SR with VMware Support:

1. Collect a vsi dump before reboot using /usr/lib/vmware/bin/vsi_traverse.

2. Reboot the server and collect the vm-support dump.

3. Note the activities around the time when the first “blocked for xxxx seconds” message is shown in the VMkernel log.
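For step 3, it can help to pull out the log lines leading up to the first blocked message before you open the SR. The following is a rough Python sketch under stated assumptions: the marker strings come from the sample message in this email, the function name `context_before_first_block` and the default of 20 lines are arbitrary choices of mine, and the input is any iterable of lines from a saved copy of the VMkernel log.

```python
from collections import deque

# Marker strings taken from the sample message in the advisory (assumption:
# the wording is the same on your build).
PREFIX = "Queue for device"
MARKER = "has been blocked for"


def context_before_first_block(lines, before=20):
    """Return up to `before` lines preceding the first blocked-queue message,
    followed by the message itself. Returns an empty list if none is found."""
    window = deque(maxlen=before)  # rolling buffer of the most recent lines
    for line in lines:
        if PREFIX in line and MARKER in line:
            return list(window) + [line]
        window.append(line)
    return []
```

The rolling buffer keeps memory use flat even on a large log, and stopping at the first match gives you the activity that preceded the hang rather than the repeated messages that follow it.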

Please consult your local support center if you require further information or assistance. We apologize in advance for any inconvenience this issue may cause you. Your satisfaction is our number one goal.

Update: A patch that resolves this issue has been released.
