vCenter tasks time-out or ESX host disconnects
Duncan Epping over at Yellow Bricks has posted a most interesting article, which I read tonight while reviewing my RSS feeds. It instantly struck a chord with me because we have recently been having issues with an ESX host at a site remote from our recently upgraded vCenter 2.5 U3 server.
I just received an email from a fellow consultant about a customer which had vCenter tasks time-out every once in a while. At times also ESX hosts got disconnected for no apparent reason at all. He discovered the following article by Richard Blythe aka VMware Wolf: ESX disconnects randomly or when doing VI client tasks from VC, task randomly timeout after a long idle time. Richard created a list of issues/errors that might be related to this issue:
- ESX disconnects randomly from VirtualCenter
- ESX disconnects when performing VI Client tasks from VirtualCenter.
- Tasks randomly timeout after a long idle time
- “An error occurred communicating to the remote host” pops up.
The article refers to an issue with vCenter Update 3 in combination with firewalls using stateful inspection. The problem occurs because of SOAP timeouts, and this behavior did not exist in VC 2.0.x or 2.5 GA, as they used a different mechanism to communicate with ESX. The official KB article hasn’t been released yet, but a temporary workaround has been published by Richard. If you run into any of the aforementioned issues, head over to Richard’s website and try out the workaround until the fix or official KB article is released.
When carrying out operations on this particular host using the VI Client attached to the vCenter server, we get the “An error occurred communicating to the remote host” pop-up more often than not. I have been looking through the vCenter logs for this host, and it appears that, as well as manual tasks, our overnight PlateSpin protection replication jobs are also getting this message when executing. This might explain why some of our newer replication jobs have not been completing.
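If you want to check how widespread the message is across your own hosts, a rough tally from exported log text can help. The snippet below is only a minimal sketch: it assumes the log has been exported as plain text lines and that each relevant line mentions the host name somewhere. The host names and sample lines are made up for illustration and are not real vCenter log output.

```python
from collections import Counter

# The error text as it appears in the VI Client pop-up.
ERROR_TEXT = "An error occurred communicating to the remote host"


def count_errors_by_host(log_lines, hosts):
    """Count lines containing ERROR_TEXT, grouped by the first known host named on the line.

    Assumption: each log line of interest mentions the host name as plain text.
    """
    counts = Counter()
    for line in log_lines:
        if ERROR_TEXT not in line:
            continue
        for host in hosts:
            if host in line:
                counts[host] += 1
                break
    return counts


# Hypothetical sample lines, not taken from a real log.
sample_log = [
    "task-101 esx-remote-01: An error occurred communicating to the remote host",
    "task-102 esx-local-01: Task completed",
    "task-103 esx-remote-01: An error occurred communicating to the remote host",
]

if __name__ == "__main__":
    tally = count_errors_by_host(sample_log, ["esx-remote-01", "esx-local-01"])
    for host, n in tally.items():
        print(host, n)
```

A tally like this only shows which hosts report the message most often; it says nothing about the cause.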
I’ve had a quick look at VMwareWolf’s workaround and have asked Richard whether you need to create a dummy VM on each host or just on the hosts that experience the problem, as it’s not completely clear. If I get a response I’ll let you all know what it is. In the meantime, we look forward to an official KB hot fix release from VMware.
* UPDATE - 09/02/09*
Richard at VMwareWolf came back to me and informed me that the dummy VM only needs to be set up on the affected host. I set the workaround up yesterday and it appears to have resolved our issue for a job that was consistently reporting the error. A permanent fix is still outstanding from VMware.



