That… Could Be A Problem…

30 Mar 2012

When installing the Dell Equallogic MEM goes wrong…

For those who don't know, Dell and Equallogic have been releasing firmware and updates at a massive pace, one which I, personally, prefer to the quarterly cadence of some other vendors.

However, after upgrading to the VMware version of MEM 1.1 things started to go very... very wrong.

I happened to take a screenshot of the successful install of 1.1 and removal of 1.0.9:
[Screenshot: MEM install]

After a reboot, none of the volumes returned. Not a single one. I checked the Management Console and everything looked fine, except for the connections tab, which showed zero connections instead of the normal 48.

I then connected to the host and could ping all of the storage array's iSCSI ports without problem.

Since I was already SSH'd into the host, I figured I'd do some log diving.

In the vmkwarning.log I'm hit instantly with a bunch of these:

2012-03-26T20:31:17.714Z cpu13:7053)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.6090a0a8c099b78b22e8d4f8d3904f66" - failed to issue command due to Not found (APD), try again...

2012-03-26T20:31:17.714Z cpu13:7053)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.6090a0a8c099b78b22e8d4f8d3904f66": awaiting fast path state update...

2012-03-26T20:31:18.714Z cpu2:6930)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.6090a0a8c099b78b22e8d4f8d3904f66".

2012-03-26T20:31:18.714Z cpu2:4786)WARNING: vmw_psp_rr: psp_rrSelectPath:1146:Could not select path for device "naa.6090a0a8c099b78b22e8d4f8d3904f66".




2012-03-27T01:32:31.371Z cpu7:5651)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba33:CH:1 T:15 CN:0: Failed to receive data: Connection closed by peer

2012-03-27T01:32:31.371Z cpu7:5651)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]

2012-03-27T01:32:31.371Z cpu7:5651)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.*.*.22:62322 R: 10.*.*.5:3260]
2012-03-27T01:32:31.623Z cpu7:5651)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)

2012-03-27T01:32:31.623Z cpu7:5651)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]

2012-03-27T01:32:31.623Z cpu7:5651)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.*.*.21:52273 R: 10.*.*.5:3260]



2012-03-27T03:20:39.154Z cpu4:4980)WARNING: NMP: nmp_SatpGetDefaultPspi:624:Default psp DELL_PSP_EQL_ROUTED for SATP: VMW_SATP_EQL Load Failed! [Not found]

2012-03-27T03:20:39.159Z cpu0:4980)WARNING: ScsiDeviceIO: 6235: The device naa.6090a0a8c099072610ef54ee86018060 does not permit the system to change the sitpua bit to 1.

Then, to add insult to injury, I check out the vmkernel.log and find more horribleness:

2012-03-27T03:17:23.611Z cpu5:5624)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x41002c0b7640 network resource pool netsched.pools.persist.iscsi associated

2012-03-27T03:17:23.611Z cpu5:5624)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x41002c0b7640 network tracker id 1 tracker.iSCSI.10.*.*.5 associated

2012-03-27T03:17:23.612Z cpu5:5624)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba33:CH:0 T:3 CN:0: Failed to receive data: Connection closed by peer

2012-03-27T03:17:23.612Z cpu5:5624)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]

2012-03-27T03:17:23.613Z cpu5:5624)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.*.*.21:56710 R: 10.*.*.5:3260]

2012-03-27T03:17:23.613Z cpu5:5624)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba33:CH:0 T:3 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound

2012-03-27T03:17:23.613Z cpu5:5624)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]

2012-03-27T03:17:23.613Z cpu5:5624)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 10.*.*.21:56710 R: 10.*.*.5:3260]



2012-03-27T03:19:19.127Z cpu14:5624)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x41002c09d020 network resource pool netsched.pools.persist.iscsi associated

2012-03-27T03:19:19.127Z cpu14:5624)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x41002c09d020 network tracker id 1 tracker.iSCSI.10.*.*.10 associated
2012-03-27T03:19:19.595Z cpu14:5624)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba33:CH:0 T:0 CN:0: iSCSI connection is being marked "ONLINE"

2012-03-27T03:19:19.596Z cpu14:5624)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess [ISID: 00023d000001 TARGET: iqn.2001-05.com.equallogic:0-8a0906-8bb799c0a-664f90d3f8d4e822-remotesystems TPGT: 1 TSIH: 0]

2012-03-27T03:19:19.596Z cpu14:5624)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn [CID: 0 L: 10.*.*.21:58840 R: 10.*.*.10:3260]



2012-03-27T03:19:20.344Z cpu22:5497)ScsiScan: 1098: Path 'vmhba33:C2:T0:L0': Vendor: 'EQLOGIC ' Model: '100E-00 ' Rev: '5.2 '

2012-03-27T03:19:20.344Z cpu22:5497)ScsiScan: 1101: Path 'vmhba33:C2:T0:L0': Type: 0x0, ANSI rev: 5, TPGS: 1 (implicit only)

2012-03-27T03:19:20.345Z cpu0:5497)ScsiScan: 1582: Add path: vmhba33:C2:T0:L0

2012-03-27T03:19:20.385Z cpu0:5497)VMKAPICore: 2204: Loading Module vmw_satp_eql

2012-03-27T03:19:20.386Z cpu21:4783)WARNING: UserObj: 3232: Unimplemented operation on 0x410017c05b60/RPC

2012-03-27T03:19:20.386Z cpu21:4783)WARNING: UserObj: 675: Failed to crossdup fd 9, cnxId: 0x80000000 type RPC: Not implemented

2012-03-27T03:19:20.544Z cpu4:6392)Loading module vmw_satp_eql ...

2012-03-27T03:19:20.544Z cpu4:6392)Elf: 1862: module vmw_satp_eql has license VMware

2012-03-27T03:19:20.544Z cpu4:6392)Mod: 4015: Initialization of vmw_satp_eql succeeded with module ID 62.

2012-03-27T03:19:20.544Z cpu4:6392)vmw_satp_eql loaded successfully.

2012-03-27T03:19:20.549Z cpu0:5497)NMP: nmp_LoadPlugin:3188: Plugin DELL_PSP_EQL_ROUTED is not registered!

2012-03-27T03:19:20.549Z cpu0:5497)WARNING: NMP: nmp_SatpGetDefaultPspi:624:Default psp DELL_PSP_EQL_ROUTED for SATP: VMW_SATP_EQL Load Failed! [Not found]

2012-03-27T03:19:20.549Z cpu0:5497)ScsiPath: 4541: Plugin 'NMP' claimed path 'vmhba33:C0:T0:L0'

2012-03-27T03:19:20.551Z cpu10:5497)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from NONE to vmhba33:C2:T0:L0 for device "Unregistered Device".



2012-03-27T03:41:46.169Z cpu14:4977)VMWARE SCSI Id: Id for vmhba33:C1:T14:L0
0x60 0x90 0xa0 0xb8 0x60 0x6e 0xd5 0x3a 0xaf 0xf0 0xd4 0x01 0x00 0x00 0xf0 0x0b 0x31 0x30 0x30 0x45 0x2d 0x30

2012-03-27T03:41:46.169Z cpu14:4977)VMWARE SCSI Id: Id for vmhba33:C0:T14:L0
0x60 0x90 0xa0 0xb8 0x60 0x6e 0xd5 0x3a 0xaf 0xf0 0xd4 0x01 0x00 0x00 0xf0 0x0b 0x31 0x30 0x30 0x45 0x2d 0x30

2012-03-27T03:41:46.171Z cpu14:4977)ScsiDeviceIO: 5843: QErr is correctly set to 0x0 for device naa.6090a0b8606ed53aaff0d4010000f00b.

2012-03-27T03:41:46.171Z cpu14:4977)WARNING: ScsiDeviceIO: 6235: The device naa.6090a0b8606ed53aaff0d4010000f00b does not permit the system to change the sitpua bit to 1.

2012-03-27T03:41:46.171Z cpu14:4977)VAAI_FILTER: VaaiFilterClaimDevice:270: Attached vaai filter (vaaip:VMW_VAAIP_EQL) to logical device 'naa.6090a0b8606ed53aaff0d4010000f00b'

2012-03-27T03:41:46.203Z cpu14:4977)ScsiDevice: 3121: Successfully registered device "naa.6090a0b8606ed53aaff0d4010000f00b" from plugin "NMP" of type 0

2012-03-27T03:41:46.205Z cpu14:4977)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from NONE to vmhba33:C2:T6:L0 for device "Unregistered Device".

These errors are all included in the following VMware KB articles which all point at the MEM plugin:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016381
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2004432

First and foremost, I restarted the management agents by doing this in the SSH console:
service mgmt-vmware restart

Still no volumes, so I had the system do a full restart, only to find that there were still no volumes present.

I'm still seeing similar errors in the logs, so I go through and uninstall the Dell Equallogic MEM plugin. To do that, do the following and then reboot:
esxcli software vib remove --vibname dell-eql-host-connection-mgr --vibname dell-eql-hostprofile --vibname dell-eql-routed-psp
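Before removing anything, it may be worth confirming which MEM VIBs are actually installed. The following is only a minimal sketch: `list_eql_vibs` is a hypothetical helper name, and it assumes the VIB name is the first column of the `esxcli software vib list` output.

```shell
#!/bin/sh
# Hypothetical helper: print the names of installed VIBs that start with
# "dell-eql". Assumes the VIB name is the first whitespace-separated column
# of `esxcli software vib list` output (an assumption, check on your host).
list_eql_vibs() {
    awk 'tolower($1) ~ /^dell-eql/ { print $1 }'
}

# Only call esxcli when it is actually available (i.e. on the ESXi host).
if command -v esxcli >/dev/null 2>&1; then
    esxcli software vib list | list_eql_vibs
fi
```

If the names printed differ from the three in the remove command above, the `--vibname` arguments would need adjusting to match.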

System comes back up, still no volumes. At this point, I did something I'm not sure I really had to do: I disabled the iSCSI software initiator, rebooted, and re-enabled it. It didn't solve anything, and might have caused more trouble. I mention it because I'm not entirely sure what the fix really was.

As a last-ditch effort, I took one of the volumes offline and then brought it back online. Amazingly enough, the ESXi host found it. It was given a path selection policy of "Fixed".

I wish I could find the KB article (or maybe it was a blog post) that talked about, essentially, a cached path selection policy. Evidence of that can be found in the bottom three lines of the log output above. By the 03:41 timestamp the MEM plugin had already been uninstalled, so the ESXi host shouldn't have seen that plugin anymore. On the bottom line, the path selection swaps over to "Fixed".
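That sequence is easier to follow with the log filtered down to just the PSP-related events. A rough sketch, assuming log lines shaped like the excerpts above; `psp_events` is a hypothetical helper name, and the log path in the guarded call is the usual ESXi 5.x location rather than something confirmed for this host.

```shell
#!/bin/sh
# Hypothetical helper: reduce vmkernel log lines on stdin to a timestamped
# list of PSP-related events (plugin load failures and path activations).
psp_events() {
    awk '
        /Plugin .* is not registered/ || /Default psp .* Load Failed/ || /Changing active path/ {
            ts = $1                  # first field is the ISO timestamp
            sub(/^[^)]*\)/, "")      # drop the "<timestamp> cpuN:world)" prefix
            sub(/^WARNING: /, "")    # drop the severity tag, if present
            print ts, $0
        }'
}

# Assumed log location; only read it if it exists and is readable.
if [ -r /var/log/vmkernel.log ]; then
    psp_events < /var/log/vmkernel.log
fi
```

Run against the excerpts above, this should show the DELL_PSP_EQL_ROUTED load failures first and the vmw_psp_fixed path activations afterwards.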

I haven't made another effort to reinstall the MEM plugin so far. I'd like to do some more testing on a smaller scale before I attempt it again.

Small follow up: I have created a support ticket with Dell. They are suggesting that I switch all of the path policies over to be Round Robin before upgrading.

I'm not exactly thrilled with that response, but will be going through and disabling the MEM Plugin, then trying the update again. Perhaps the plugin needs to be disabled beforehand.

I'll update with any results.

Update

Well, that seems to be the official word. The update process is not as streamlined as I would have hoped. The tech left me with "It is suggested to change all datastore multi pathing to round-robin on ESX 5.x before installing MEM."

I did happen to test it elsewhere, and it does work just fine: if you change all of the PSPs from DELL_PSP_EQL_ROUTED to either Round Robin or Fixed first, the install of the new MEM completes successfully.
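To check that nothing is still claimed by the Dell PSP before upgrading, the device list can be filtered by policy. This is a sketch under assumptions: `devices_using_psp` is a hypothetical helper, and it assumes `esxcli storage nmp device list` prints each device ID on an unindented line followed by an indented "Path Selection Policy: <name>" line.

```shell
#!/bin/sh
# Hypothetical helper: read `esxcli storage nmp device list` output on stdin
# and print the IDs of devices whose Path Selection Policy equals $1.
# Assumes each device block starts with an unindented device ID line and
# contains an indented "Path Selection Policy: <name>" line.
devices_using_psp() {
    awk -v want="$1" '
        /^[^ \t]/ { dev = $1 }    # unindented line: remember the device ID
        /^[ \t]+Path Selection Policy: / && $NF == want { print dev }
    '
}

# Only call esxcli when it is actually available (i.e. on the ESXi host).
if command -v esxcli >/dev/null 2>&1; then
    esxcli storage nmp device list | devices_using_psp DELL_PSP_EQL_ROUTED
fi
```

An empty result would suggest every volume has already been moved to a VMware PSP; any IDs printed are the ones still to be switched.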

Comments (6)
  1. Hello,

    I’m curious what’s the Dell case #?

    What’s the build number on the ESXi server?

    Here’s a little script to change all EQL volumes over to Round Robin and set the IOPs value to 3, to improve performance until you get this straightened out.

    esxcli storage nmp satp set --default-psp=VMW_PSP_RR --satp=VMW_SATP_EQL ; for i in `esxcli storage nmp device list | grep EQLOGIC | awk '{print $7}' | sed 's/(//g' | sed 's/)//g'` ; do esxcli storage nmp device set -d $i --psp=VMW_PSP_RR ; esxcli storage nmp psp roundrobin deviceconfig set -d $i -I 3 -t iops ; done

    After you run the script you should verify that the changes took effect.
    #esxcli storage nmp device list

    The offline/online fix is symptomatic of a SCSI reservation issue, where a SCSI-2 exclusive reservation is ‘stuck’ and prevents another node from being able to read the partition table/filesystem. ESX uses these reservations at times. Did you happen to check the EQL GUI to see if that node actually had an iSCSI connection to the volume?

    Regards,

    Don

    • Case: 00401565
      ESXi Build: 623860

      While the volumes were offline to the host, the EQL GUI showed no connections from it.

      Thanks for the script, that’ll be quite helpful in the future.

  2. Interesting, that case shows closed and resolved.

    Have you re-installed MEM yet?

    Thanks,

    • Yes.

      It took changing the PSP over to a VMware PSP and then installing the new MEM (which uninstalls the previous MEM) and a reboot followed by moving the PSP back to the DELL_PSP_EQL_ROUTED path.

    • Hello Akin,

      That KB isn’t related to this thread. The issue described there affects VMware ESX/ESXi users running Dell/EQL FW v6.0.6 or earlier. All EQL customers running VMware need to be at 6.0.7 or greater to avoid this problem.

      Regards,

      Don

