Friday, March 28, 2008

Wicked Problem with EMC NS, iSCSI and VMware

Wanted to let some engineers know about a “bug” that exists in the EMC DART code for the Celerra (NS) series of NAS.

The bug is this: if an iSCSI read request larger than 1 MB is made to a Celerra, the data mover may crash on the request. The iSCSI service cannot process the request, which results in a kernel panic on the data mover. The data mover then fails over, but because of the nature of the crash, meaning the greater-than-1 MB read request still has to be processed, the second data mover can crash as well, and in some cases there is complete data loss because the ESX server still has writes waiting to be processed.
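To make the trigger concrete, here is a minimal sketch of what capping reads at the 1 MB boundary looks like. This is purely illustrative: the device path and the idea of splitting requests at the application layer are my own assumptions for the example, and (per the escalation engineer quoted below) the ESX/QLogic HBA path offers no such knob, so this is not a workaround, just the request-size math.

```python
# Illustrative only: split a large read into chunks no bigger than 1 MiB so
# that no single request exceeds the size that panics the unpatched data mover.
# The path "/dev/sdb" is a hypothetical block device for the example.
MAX_REQUEST = 1024 * 1024  # 1 MiB ceiling per individual read request

def read_capped(path, offset, length, max_request=MAX_REQUEST):
    """Read `length` bytes starting at `offset`, never issuing a request larger than max_request."""
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(remaining, max_request))
            if not chunk:              # ran out of data early
                break
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)

# A 2.5 MB read becomes three requests: 1 MiB + 1 MiB + 0.5 MiB.
# data = read_capped("/dev/sdb", 0, int(2.5 * 1024 * 1024))
```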

There is also a HUGE possibility with ESX 3.5 that an HA event is triggered and the VMs on that 3.5 node are powered down. If you have a mixture of ESX 3.0.x and 3.5 nodes in an HA cluster, a HUGE amount of confusion can arise in the cluster because the HA event is not seen by the 3.0 hosts. In the case of one customer, this in fact led to 17 VMs being corrupted and more than 6 of them needing DR performed on them to recover.

According to VMware, they have made 3.5 much more sensitive to storage problems (???), and it will force an HA event. Now, the customer in this case was using QLogic iSCSI HBAs, so ESX is not aware of the underlying network calls being processed and, in VMware’s defense, treats the path the same as a Fibre Channel HBA. The timing is therefore as if it were an actual SAN, so ESX does not wait out the 45-second failover period of the data movers. According to an escalation engineer within VMware, this timeout cannot be changed.

The cure is to patch the DART code on the NS so that it doesn’t panic on read requests greater than 1 MB.

I’d like to point out how easily a read greater than 1 MB can be generated. VMFS formats a LUN with 1 MB blocks or larger. Windows NTFS formats its file system with 4k blocks. We all know that when a file larger than 4k is written to the file system, it must take up at least 2 blocks, so a 6k file actually takes up 8k of space on the file system. When the OS writes a file and the next contiguous block is occupied, Windows (Linux, whatever) writes to the next available free block. So when FileA is 6k, it gets written to block 23 and the next block available, say block 238,654. When the OS needs to read that file, it has to read both blocks. This is fragmentation. It takes a while to spin the disk, and hence we get slower performance.

Well, since those 2 blocks sit in different parts of the VMDK file on VMFS, ESX has to read two 1 MB blocks to service the request for its guest VM. Boom, kernel panic!
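Here is a rough sketch of that block math, assuming a simple flat mapping from guest block offsets to offsets inside the VMDK (a simplification of the real on-disk layout). The block numbers reuse the FileA example above.

```python
# Sketch of the block math: which 1 MB VMFS blocks do a guest file's 4 KB
# blocks land in? Assumes a flat guest-offset-to-VMDK-offset mapping, so treat
# it as an illustration rather than an exact on-disk layout.
NTFS_BLOCK = 4 * 1024        # 4 KB guest file-system block (NTFS cluster)
VMFS_BLOCK = 1024 * 1024     # 1 MB VMFS block backing the VMDK

def vmfs_blocks_for(guest_blocks):
    """Return the VMFS block numbers that a set of guest blocks falls into."""
    return sorted({(b * NTFS_BLOCK) // VMFS_BLOCK for b in guest_blocks})

# The fragmented 6 KB FileA, sitting in guest blocks 23 and 238,654:
print(vmfs_blocks_for([23, 238654]))   # -> [0, 932]: two different 1 MB VMFS blocks
```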

This problem was not found for this particular customer because the VMs had no fragmentation at the beginning of the implementation. However, now that the systems have run for a while, fragmentation has built up within both their file servers and database servers and has led to 2 major outages in exactly 10 days. They have thus decided, even though EMC claims the problem will not continue, to break out the CLARiiON CX backend and trash the Celerra NAS head. I really can’t say I blame them. EMC did not replace the first data mover that failed, nor was it able to determine the problem in the 10 days between the failures, so when they failed again yesterday there was no second data mover to fail over to. They were down for 10 hours. This customer is a bank and is very dependent on the services their infrastructure provides.

To make things worse, they did have a hot site configured for host-based replication, where their existing VMs were to be mirrored. However, about 2 months ago the NS in that facility suffered from this very same problem and has not been completely repaired. They are going to break the CX out there as well.

So at any rate, I thought you guys should know. I have to admit I am personally a bit hesitant about pushing ESX with iSCSI out on these devices. It’s not solid, and EMC is HIGHLY unresponsive about resolving the issues when they do occur. When the sales rep is the one calling to say “hey, I noticed your NAS was down” 2 hours after it occurred (which, I have to say, was awesome of him to do) and tech support still has not called the customer, there is a huge lack of communication, and of ability to fulfill the service requests EMC receives. I know this part is a rant, and everyone has some problems, but it’s not like we’re buying a $10k Kia Rio… OK, I’ll shut up now.

3 comments:

goktugy said...

Hi,

I have a Celerra also, but it is not able to fail over within 45 seconds. It takes 600 seconds to fail over to the standby. Do you have any idea?
My EMC is an NS40 with 5.6 DART.

Thanks.

Garrett Downs said...

That is definitely related to the DART code on your DMs. That is a huge timeout and obviously a huge problem.

goktugy said...

When setting EnableAptpl=1, the failover time was reduced to 60 seconds, which is acceptable.
It should be the default in the next DART release.