I recently worked with a customer who suffered an IMA service failure in their XenDesktop 4 farm.  The troubleshooting process taught me a number of things about XenDesktop, and I wanted to share our findings for the benefit of other XenDesktop 4 customers.

Customer Environment

This customer has a medium-sized XenDesktop deployment.  The environment is as follows:

  • VMware vSphere 4 hosting DDCs and VDAs
  • DDCs are Windows Server 2003 32-bit
  • XenDesktop 4 SP1, with all hotfixes released up to June 2011 deployed
  • Approximately 1500 Windows XP SP3 virtual desktops running a slightly older VDA version
  • Static Virtual Desktops (no Provisioning Services in use)
  • The DDC roles are divided according to our best practices, which I have blogged about in the past:
    • dedicated Farm Master server (maxworkers = 0, highest election preference)
    • dedicated backup farm master server, which is also the primary XML broker (maxworkers = 0, second election preference)
    • 4 “brokering” DDCs (maxworkers not set, default election preference)

 

Service Failure

Several weeks ago the customer experienced an IMA service failure on the farm master server.  This disrupted the operation of the farm and prevented new desktop sessions from being established.

The roles being carried out by the farm master server should have failed over to the backup farm master server.  However, this did not happen.  Understandably, the customer’s efforts focussed on service restoration, so diagnostic information was not captured at the time.  To understand why the farm master role did not fail over, we would have used CDF Control to capture traces across a number of the DDCs in the farm.  These traces would have shown us a line-by-line description of every IMA event and message, and given us a good understanding of the behaviour of the farm at the time the IMA Service failure occurred.

When we began examining the Farm Master server we could find no clues as to why the service failure occurred.  The customer of course wanted to know the root cause so as to avoid a recurrence of the problem.

We set a process in place so that if the problem occurred again we could capture enough diagnostic information to find the root cause.  The process was:

  • Log onto the Farm Master server and start a CDF trace using all Modules.
  • Make a note of, or take screenshots/videos of, any behaviour you can see, e.g. duplicate commands in vCenter, other errors, etc.
  • Stop the CDF trace after 5 minutes.
  • Extract the System and Application event logs from the DDC (a PowerShell sketch of this step follows the list).
  • Save the CDF Trace, event logs and any screenshots or other supporting data off the DDC for later uploading to Citrix for analysis.
  • Using the vSphere client, suspend the Farm Master server.
  • Use Veeam FastSCP, WinSCP or the vSphere client (or your tool of choice) to extract the VMSS file from the VMFS datastore.  Zip this file for later uploading to Citrix for analysis.
  • Power the VM back on and allow it to resume from suspend.
  • Perform a clean restart of the Farm Master server, using the OS restart command (not a forced shutdown).
  • Of course if the VM will not reboot then you will need to force a reboot.
  • Upload the VMSS file, the CDF Trace, event logs and any other information using our FTP service.
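
For the event-log step above, a minimal PowerShell sketch is shown below.  It assumes PowerShell is installed on the Server 2003 DDC, and the output folder is a hypothetical example:

    # Export the System and Application event logs to CSV for later upload.
    # C:\Support\IMA-Failure is a hypothetical output folder - change it to suit your environment.
    $dest = 'C:\Support\IMA-Failure'
    New-Item -Path $dest -ItemType Directory -Force | Out-Null

    foreach ($log in 'System', 'Application') {
        Get-EventLog -LogName $log |
            Export-Csv -Path (Join-Path $dest "$log.csv") -NoTypeInformation
    }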

While the customer put this process in place, we continued to look for clues as to why the service had failed originally.

As we discussed the precise steps the customer took to restore service, it emerged that they had stopped the IMA Service and used the “dsmaint recreatelhc” command to refresh the local host cache, on the assumption that a corrupt LHC might have caused the failure.  In this case this did not restore service; the customer had to reboot the Farm Master server.
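
For reference, the recovery sequence the customer ran looks roughly like this in PowerShell.  The IMA service short name (IMAService) is an assumption here; check the service name on your own DDCs:

    # Stop the IMA service, rebuild the local host cache, then restart the service.
    # 'IMAService' is assumed to be the service short name on the DDC - verify it first.
    Stop-Service -Name IMAService
    dsmaint recreatelhc
    Start-Service -Name IMAService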

Of course when this is done, the previous LHC is saved as a .bak file.  We looked at the .bak file and saw it was 2GB in size!

 

2GB Local Host Cache

What could account for such a large local host cache?  One of our Escalation engineers suggested checking for any use of scripting in the environment.

A quick call with the customer established that they ran scripts against the environment in the early hours of the morning and again throughout the day.  The customer observed the LHC file while the scripts were executing, and did indeed see the file grow.
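
If you want to observe this yourself, a simple polling loop like the sketch below will show the file growing while the scripts execute.  The LHC path is an assumption; adjust it for your own DDC build:

    # Print the LHC file size once a minute while the scripts run.
    # The path below is an assumption - locate imalhc.mdb on your own DDC first.
    $lhc = 'C:\Program Files\Citrix\Independent Management Architecture\imalhc.mdb'
    while ($true) {
        '{0:HH:mm:ss}  {1:N0} KB' -f (Get-Date), ((Get-Item $lhc).Length / 1KB)
        Start-Sleep -Seconds 60
    }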

A limitation of the JET database format (commonly called an Access database because of its .mdb file extension) is a maximum file size of 2GB.  There is a very good article about the JET database format on Wikipedia here: http://en.wikipedia.org/wiki/Microsoft_Jet_Database_Engine

We had actually identified a specific problem with an earlier version of the SDK where scripting caused large LHC growth; see here for the latest SDK, which includes the fix: http://support.citrix.com/article/CTX127167.  In this case my customer was already using the latest version of the SDK in their environment.

 

Tracing and Measuring the Problem

To quantify the amount of data being written to the LHC, the Escalation Engineer investigating the problem asked the customer to capture a CDF trace while the scripts were executing.  We limited CDF Control to capture only data being written into the Local Host Cache.

The traces were analysed, and they showed that the script was writing data to each row in the Local Host Cache that referenced a virtual desktop.  Performing some simple maths, the Escalation Engineer estimated that the LHC would grow by approximately 56 MB per day due to the customer’s scripts.  At that rate, a freshly recreated LHC would reach the 2GB JET limit in a little over a month (2048 MB ÷ 56 MB per day ≈ 37 days).

And of course we should remind XenDesktop admins that other administrative changes and general farm behaviour would also increase the size of the LHC.

 

Conclusion

We provided our findings to the customer and advised them that our SDK, the cmdlets and the LHC were working as designed; there was no specific bug in XenDesktop causing this behaviour.

The customer has implemented regular maintenance to recreate the local host cache, and thus reset its size, while they consider other options.

We suggested that they review their scripting design and implementation.  In particular, we recommended enabling automated monitoring of the size of the local host cache, so that if it grows towards the 2GB limit it will trigger an alert to the support team.
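
A minimal sketch of that check is shown below.  The LHC path and the alert threshold are assumptions; adjust both for your own DDC build and wire the alert into your monitoring tool of choice:

    # Alert when the local host cache approaches the 2GB JET limit.
    # The path and threshold are assumptions - verify the LHC location on your own DDC.
    $lhcPath = 'C:\Program Files\Citrix\Independent Management Architecture\imalhc.mdb'
    $limitMB = 1500   # warn well before the 2048 MB hard limit

    $sizeMB = [math]::Round((Get-Item $lhcPath).Length / 1MB)
    if ($sizeMB -ge $limitMB) {
        # Replace this with your monitoring tool's alerting mechanism (email, SCOM, etc.)
        Write-Warning "Local host cache is $sizeMB MB and approaching the 2GB limit."
    }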