That… Could Be A Problem…

10 Jan 2012

SRM Troubleshooting Fun!

We finally decided to put real disaster recovery and business continuity plans in place, and after deliberating between a couple of options, we went with Site Recovery Manager 5 with hypervisor-based replication.

There's been plenty of fun in setting it all up and starting the replications.

Database Configuration

Couple notes:
Default Instance is absolutely required.
Mixed Mode Authentication is also absolutely required.
Create a login, a database, and a schema within the database, all with exactly the same name.
[Screenshot: Database Settings]
As a precaution, the created login is also a sysadmin and db_owner for the created database.
Connect to the SQL Server by IP address; FQDN wouldn't work.
The vCenter connection should be entered the same way as in the site connection (i.e., if you connected the sites via FQDN, use FQDN for the VRMS configuration).

Only after those steps could I get VRMS to connect to the SQL server (which, as the screenshot shows, is the same SQL server used for the SRM connection).

Initial Replication Error

With green check marks across the board, I tried to configure replication on the VMs, but every attempt returned this error: Call "HmsGroup.OnlineSync" for object "GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" on Server "x.x.x.x" failed. An unknown error has occurred.
[Screenshot: Online Sync Error]

Going through the logs, I saw lots of SSL Handshake errors and general connectivity problems, so I had to go back to our networking people and have them alter the hardware firewalls to allow connectivity across the board to all the systems involved (ESXi host, vCenter, SRM server, VRMS, VRS). Once that was done, it would successfully configure the virtual machine for replication.
[Screenshot: Replication Success]

I have yet to go back and tighten up all of these port rules. I'll report back once I have it done.
Side note: I have no reason to doubt the VMware TCP/UDP KB, I just know that I was still having some connectivity problems after following it.
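
For anyone chasing the same kind of connectivity problem, here is a minimal sketch of a TCP reachability check that can be run from each of the systems involved (ESXi host, vCenter, SRM server, VRMS, VRS). The function names `check_tcp` and `check_all` are made up for illustration, and the hosts and ports you pass in are your own: take the port list from the VMware TCP/UDP KB rather than from this sketch. It only tests that a TCP connection opens; it does not validate the SSL handshake itself.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_all(targets, timeout=3.0):
    """Check a list of (label, host, port) tuples.

    Returns a dict mapping each tuple to True (reachable) or False.
    The labels are free text, e.g. "vCenter" or "VRMS".
    """
    return {
        (label, host, port): check_tcp(host, port, timeout)
        for label, host, port in targets
    }
```

Calling `check_all([("vCenter", "x.x.x.x", 443)])` from the SRM server, and again from the other direction, gives a quick picture of which firewall rules are still missing before you go digging through the logs.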

Replication Locking Failure

Now that the connections are all good, I have a couple of VMs replicating. I go to another VM, right-click, choose vSphere Replication, add in a schedule, and then I receive this error: Configuring Virtual Machine for Replication... failed.
VRM Server generic error. Please Check the documentation for any troubleshooting information. The detailed exception is 'Optimistic locking failure'.
[Screenshot: Optimistic Locking failure]

After searching the documentation, the only match I come up with is an error in the synchronize storage step. That does not fit this situation, since I have not synchronized anything yet.

I check the system and it's up, it's running, the VMware Tools are installed and functioning properly, everything looks good. So I go back and try to run the vSphere Replication wizard again, and I am instantly hit with this error: Call "HmsGroup.CurrentSpec" for object "GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" on Server "x.x.x.x" failed. An unknown error has occurred.
[Screenshot: HmsGroup.CurrentSpec]

So I started by rebooting the system. Once the VMware Tools were running, I tried again, only to see the same error. This time, I powered off the system, removed it from inventory, added it back by browsing the datastore for the vmx file, and powered it on. Once the VMware Tools were running, I tried it again and it worked perfectly! That was a little painful, and I wish I had noted the timestamps so I could go through the logs, but it was a success nonetheless.

Large VMDK Replication Problems

With that figured out, it was time to get the VMs replicated and SRM with VRMS worked wonderfully from that point... until we got to the file servers.

I know the "2TB - 512" disk sizing rule; we found that out the hard way when upgrading from 3.5 to 4 with some RDMs. It was not a fun experience. So all of the vmdks of our file servers are 2040GB in size.

The initial replication is successful, however the re-sync is not. It gives an event of: Replication operation error: Virtual machine is in an invalid state.

As before, the system is up, it's running, the VMware Tools are installed and functioning properly, everything looks good. So I go into the SRM plugin and tell it to "Synchronize Now", and I receive another error: Call "HmsGroup.OnlineSync" for object "GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" on Server "x.x.x.x" failed. An unknown error has occurred.
[Screenshot: HmsGroup.OnlineSync]

So I pull down the logs by going to the Sites button, then the Summary tab, and looking in the commands for "Export System Logs". The important part here is to get the logs for the site giving you the error, i.e., if a server at your remote site is the one named in the failure message, that's the site whose logs you want.

Unfortunately the event logs contain items such as:
2012-01-04T18:58:34.533Z [F5993B70 error 'Main' opID=hsl-a0edc478] [0] ExcError: exception N3Vim5Fault12FileTooLarge9ExceptionE: vim.fault.FileTooLarge
2012-01-04T18:58:34.533Z [F5993B70 error 'Main' opID=hsl-a0edc478] [1] Code set to: Generic storage error.
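
Digging through an exported log bundle by hand gets old quickly, so here is a rough sketch of how the error lines could be filtered out and the `vim.fault.*` names tallied. The regular expression is an assumption based only on the two lines shown above (timestamp, then a bracketed thread ID, level, source, and optional opID); other log files in the bundle may use a different layout. The names `parse_line`, `find_errors`, and `count_faults` are invented for this example.

```python
import re
from collections import Counter

# Layout assumed from the sample lines in this post:
# <timestamp> [<thread> <level> '<source>' opID=<id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<thread>\S+) (?P<level>\w+) '(?P<source>[^']*)'"
    r"(?: opID=(?P<opid>[^\]\s]+))?\] (?P<msg>.*)$"
)
FAULT_RE = re.compile(r"vim\.fault\.\w+")


def parse_line(line):
    """Return a dict of fields for a matching log line, or None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def find_errors(lines):
    """Return the parsed entries whose level is 'error'."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "error"]


def count_faults(lines):
    """Count each vim.fault.* name that appears in error-level messages."""
    counts = Counter()
    for entry in find_errors(lines):
        counts.update(FAULT_RE.findall(entry["msg"]))
    return counts
```

Feeding it an open log file, for example `count_faults(open(path))`, should surface faults like `vim.fault.FileTooLarge` without scrolling past everything else. Grouping by the `opid` field is a reasonable next step if you want to follow one operation through the log.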

I'm currently working with a VMware Support Engineer to fix this problem. A bug has been filed, so hopefully there will be some news soon. I'll update when I know something.

Large VMDK Replication Problems - Resolved

I received some unfortunate news from VMware Support yesterday. vSphere Replication uses snapshot technology to forward the deltas to the remote site. Well, unbeknownst to me, there is actually some overhead that is applied to the VMDK file itself! So really the 2TB minus 512B should be more like 2TB minus 512B minus 16GB, for a total maximum of around 2030GB.

Once I reduced the VMDK size down to 2030GB, it replicated and updated just fine. Now the problem is how to shrink the VMDK files...
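
The arithmetic is easy to sanity-check. The sketch below assumes binary units (1GB = 1024^3 bytes, 2TB = 2 x 1024^4 bytes) and takes the 16GB overhead figure from the support case as a given; the overhead for a different datastore configuration may not be the same, so treat `overhead_gb` as a parameter rather than a constant. The function names are just for illustration.

```python
GB = 1024 ** 3          # binary gigabyte, assumed throughout
TWO_TB = 2 * 1024 ** 4  # the 2TB file size ceiling
SECTOR = 512            # the "minus 512B" part of the rule


def max_vmdk_bytes(overhead_gb=16):
    """Largest VMDK size in bytes once the snapshot overhead is reserved."""
    return TWO_TB - SECTOR - overhead_gb * GB


def fits(size_gb, overhead_gb=16):
    """True if a VMDK of size_gb still leaves room for the overhead."""
    return size_gb * GB <= max_vmdk_bytes(overhead_gb)


if __name__ == "__main__":
    print(max_vmdk_bytes() / GB)  # just under 2032
    print(fits(2040))             # the original file server disks
    print(fits(2030))             # the reduced size
```

With these assumptions the ceiling lands just under 2032GB, which is why 2040GB fails and 2030GB leaves a little headroom.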

If you want to read more, check out the VMware Knowledge Base article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1012384 and more specifically the "Calculating the overhead required by snapshot files" section towards the bottom.

Disable Replication of VMDK = Delete VMDK

This was one of the things I learned the hard way while testing the larger VMDK files. I went through and disabled replication on one of the disks that had already been replicated. From the local site vCenter, it looks like the replication was simply turned off, right?
[Screenshot: Disable Replication]

Unfortunately, that was not the case. I pulled up the remote vCenter and was greeted with an event saying the virtual disk was deleted!
[Screenshot: VMDK Deleted]

That was certainly a surprise. I guess I understand why it does that, but the first time I did it, the VMDK that was deleted was 2040GB in size. It took me almost 2 days to copy all of that information down!

Comments (6)
  1. Hi, I have the same problem as you with replication:
    Call “HmsGroup.OnlineSync” for object “GID-e4605751-477b-47e6-b4c7-ca” on Server “ip” failed.
    An unknown error has occurred.

    This happens when I click Synchronize Now.

    Can you please Help!!

    Thank you Regards

    Denis

    • There are quite a few possible causes for the error you’re seeing…
      Can you set up the replication on any other VMs?
      If not, are your mappings set correctly for disk, network, etc.? Are your firewall ports set up correctly?
      If yes, have you powered off the system, removed it from inventory, re-added it to inventory, and tried again?
      This should probably be your first step, but have you gone through the logs to see at what point you start seeing errors? If not, you can export the logs via the Sites button, then the Summary tab, and then, on the right-hand side, look for “Export System Logs”.

  2. Hi,

    Any updates about this issue with replication?

    I get the same issue every time.

    Call “HmsGroup.OnlineSync” for object “GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx” on Server “x.x.x.x” failed

    • One other thing I’ve found which seems to work… rebooting the VRS and VRMS appliances.
      It’s not the best way, but it worked for me on two systems.
      The SRM logs don’t exactly give great detail, so use common sense and your best judgement when attempting fixes like this.

  3. Hi,
    I have a problem with replication between the protected and recovery sites. Replication status is “Not Active”, and when I click Sync Now, I receive the message: Call “HmsGroup.OnlineSync” for object “GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx” on Server “x.x.x.x” failed. An unknown error has occurred.
    Fifteen minutes ago I restarted all the key servers in the VMware environment, but I still have the same problem.
    Could you please help me with this issue? When I create a new VM, it replicates successfully.

  4. Connectivity from Site1 to Site2 is fine.
    I have opened all ports between the two sites for now.
    I am trying to replicate only 1 VM using vSphere Replication.
    In vSphere Replication, the replication status shows: Not Active
    Protection Groups have been set up. Protection status: Replication warning
    When I run the test plan, I get this warning:
    “Connection to the remote vSphere Replication site ‘xxvc1.xxx.local’ is down error 10/12/2014 9:47:24 a.m. Administrator”

