Before I joined Citrix I worked in Infrastructure Support for a large UK financial services company. When we recruited new members of the team we gave each candidate a technical test covering the technologies the team supported: AD, DNS, WINS, DHCP, Clustering, Citrix, VMware and so on. The objective of the test was to measure a candidate’s technical ability and in particular to ensure that the certifications they had publicised were backed by real-world knowledge.
Within the Citrix section of the test was a simple question:
Under what circumstances would you run the following command “dsmaint recreatelhc”?
I used to read all sorts of interesting and complicated answers to this question, all of which were valid. However, no one ever wrote down what I was really looking for: “It is the first thing you should try when something is wrong with your XenApp farm.”
During a recent phone conversation with one of my customers, he told me that they had a problem with the European zone of their primary production XenApp farm. The replica DB that supported the zone was no longer replicating with the publisher DB, and they were going to recreate the replica. The process is pretty straightforward and the customer has loads of experience with this so I filed this away and didn’t think any more about it.
A few days later I got another call from the customer saying that users who connect to the European zone were no longer able to authenticate through the Web Interface with the following error:
And in the event log of the Web Interface server were a number of errors as below:
The customer was able to work around the issue temporarily by pointing the European Web Interface servers to XML brokers from another zone in the farm. This no doubt introduced some small delays to the user authentication and application resolution process, but was clearly better than having no service at all.
I have been working with the customer for some time, so I know their Citrix environment fairly well. They have a very large distributed XenApp farm with a number of zones, each supported by a replica IMA datastore. We had been troubleshooting a number of IMA and replica-related issues, so I was dreading yet another IMA/replica problem in their environment.
We set up a GoToMeeting and quickly began reviewing the history of the zone. As I said above, I was starting to assume this was a serious problem and was thinking about how to troubleshoot it. Then the customer reminded me that they had recreated the SQL replica DB for the zone just a few days previously, and that this seemed to coincide with the Web Interface problem occurring.
I had a sudden flash of inspiration which really ought to have been my first reaction to the customer logging the call: have you recreated the local host cache on the Zone Data Collector for the European zone?
I was told: no, this hasn’t been done.
Within a minute the cache was recreated, IMA was started, and WI authentication was tested and worked!
Knowing now that a simple corrupt LHC was causing the customer’s problems, we quickly performed the same operation on the backup ZDC for the zone.
Thinking about the longer term stability of the zone, I suggested that the customer perform a “dsmaint recreatelhc” on all zone member servers as soon as possible. I also suggested that they add these steps to their support procedures, so that if they have to recreate a replica database, they know to recreate the local host cache also.
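For a support procedure, the recreate steps on a single server could be written down roughly as follows. This is only a sketch in Python: the story above says that the cache was recreated and IMA was then started, so it assumes IMA is stopped first, and it assumes the IMA service is registered under the name imaservice. The function name recreate_lhc and its dry_run and runner parameters are made up for illustration. By default it only lists the commands, so the sequence can be reviewed before anything is run.

```python
import subprocess

# Assumed sequence: stop IMA, recreate the local host cache, start IMA again.
# "imaservice" is an assumption about the service name; check it on your servers.
LHC_RECREATE_STEPS = [
    ["net", "stop", "imaservice"],
    ["dsmaint", "recreatelhc"],
    ["net", "start", "imaservice"],
]


def recreate_lhc(dry_run=True, runner=subprocess.run):
    """List, and optionally run, the LHC recreate steps on the local server.

    With dry_run=False each command is executed in order; check=True makes
    the runner raise on the first failing command so later steps are skipped.
    Returns the commands as strings, in the order they were handled.
    """
    handled = []
    for step in LHC_RECREATE_STEPS:
        handled.append(" ".join(step))
        if not dry_run:
            runner(step, check=True)
    return handled


if __name__ == "__main__":
    # Dry run only: print the commands for review.
    for command in recreate_lhc(dry_run=True):
        print(command)
```

Running it with dry_run=False would need an elevated prompt on the XenApp server itself. Looping over all zone member servers is left out on purpose, since how commands are run remotely depends on the tooling the support team already uses.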