Recently I’ve noticed that the VMs I replicate with vSphere Replication occasionally get stuck in a “Sync” status for far longer than they should.
After pulling the logs, I was able to figure out what was happening: timeouts, lots of them. The vmware-dr.log file pulled from the remote site was full of lines like the following (local is the SRM server, peer is the vCenter server):
2012-04-02T07:35:04.077-04:00 [02784 verbose 'Default'] Timed out reading between HTTP requests. : Read timeout after approximately 50000ms. Closing stream TCPStreamWin32(socket=TCP(fd=2596) local=10.xx.xx.xxx:9085, peer=10.xx.xx.xxx:55039)
2012-04-02T11:54:34.159-04:00 [02744 verbose 'Licensing'] Asset in sync.
2012-04-02T11:58:12.527-04:00 [02868 info 'LocalVC' opID=ac2d1cb] [PCM] Received NULL results from PropertyCollector::WaitForUpdatesEx due to timeout of 900 seconds
2012-04-02T11:58:12.723-04:00 [02860 info 'LocalVC' opID=596971f7] [PCM] Received NULL results from PropertyCollector::WaitForUpdatesEx due to timeout of 900 seconds
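To get a feel for how many of these timeouts a log contains, a small script can tally them. This is only a rough sketch: the two patterns are copied from the excerpts above and may be worded differently in other SRM builds, and the function and pattern names are my own, not anything VMware ships.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts above. Treat them as assumptions:
# other SRM versions may phrase these messages differently.
TIMEOUT_PATTERNS = {
    "http_read_timeout": re.compile(r"Timed out reading between HTTP requests"),
    "property_collector_timeout": re.compile(r"WaitForUpdatesEx due to timeout"),
}


def count_timeouts(lines):
    """Count how many lines match each timeout pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in TIMEOUT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def count_timeouts_in_file(path):
    """Open a log file (hypothetical path supplied by the caller) and tally it."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_timeouts(log)
```

Calling `count_timeouts_in_file("vmware-dr.log")` returns a `Counter` keyed by pattern name, which makes it easy to compare a healthy day against a bad one.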
After a brief discussion with our network engineers, the consensus was that there was no problem with the connection between the local and remote sites. So I took a “when in doubt, reboot” approach. First I restarted the SRM service on the remote SRM server. No luck. Then I did a “Restart Guest” on the VRS system at the remote site. About 5 minutes later, the systems started to connect and replicate again.
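If you want a quick sanity check of your own before reaching for the reboot, a basic TCP reachability test is easy to script. This is a minimal sketch, not a real diagnosis: the function name is made up for illustration, and you have to supply the host and port yourself (9085 shows up as the local SRM port in the log above, but confirm the ports for your own deployment). A successful connect only proves the port is reachable; it says nothing about idle sessions timing out later.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures, and timeouts.
        return False
```

For example, `can_connect("srm-remote.example.local", 9085)` uses a placeholder host name; substitute the address of the server you actually want to test.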
I’ve seen this a lot, and I’ve heard from other people who also manage their own SRM deployments that a reboot is a pretty good first step in troubleshooting. Keep that in mind the next time an issue comes up.