In XenApp & XenDesktop 7.7, we introduced zones, and I wrote a blog post about how they behaved with latency. The higher the latency, the worse things become. Well, we’ve been busy improving things…
The first area I want to cover is improvements for brokering with latency. I only have two new data points from 7.11, but you’ll see why I only have two.
| Metric | 7.7 (90 ms) | 7.11 (90 ms) | 7.7 | 7.7 (250 ms) | 7.11 (250 ms) |
|---|---|---|---|---|---|
| Average response time (s) | 12.9 | 3.7 | 26.7 | N/A | 7.6 |
| Brokering requests per second | 3.7 | 12.6 | 1.3 | N/A | 6.3 |
| Time to launch 10k users | 44m55s | 13m10s | 2h03m | N/A | 26m27s |
As you can see, at 250ms latency we now outperform the 7.7 code at 90ms, so rather than spending time collecting lots of data points, I tested one configuration that completely failed previously. So, if you're on 7.11 or later, users should see quicker brokering of resources, even with latency between the broker and the SQL Server.
We’re also not forgetting those on LTSR. If you have 7.6 CU3 DDCs, you have the same improvements. Although we don’t expect 7.6 to be deployed with latency, the change still improves performance even with no latency, and we know some customers do run 7.6 with a bit of latency.
So, what did we do?
We revisited the core brokering SQL code that determines which VDA is the least loaded and then sends a launch request to that VDA. We decided that rather than use a “perfect” load balance algorithm, we’d use a “good enough” load balance algorithm.
Previously, we wanted the least-loaded VDA and would lock/block waiting for that VDA to become available, which backed up (serialized/blocked) all other brokering requests. Instead, we now look for the least-loaded worker that isn’t currently locked. This means we may not get the absolute least loaded (perhaps the 2nd or 3rd least loaded), but we can proceed without blocking all the other launch requests.
If we can’t find a worker that isn’t currently locked, we’ll sit and wait for the locks as before. With enough VDAs it’s rare for all of them to be locked at the same time, and in that case the behaviour is the same as before this fix.
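As a rough illustration of this “good enough” selection (a minimal Python sketch, not the broker’s actual SQL code; the worker names and loads are made up), the idea is: walk the workers in load order, take the first one whose lock can be acquired without blocking, and only fall back to waiting if every worker is locked:

```python
import threading

class Worker:
    def __init__(self, name, load):
        self.name = name
        self.load = load          # hypothetical load index (e.g. session count)
        self.lock = threading.Lock()

def pick_worker(workers):
    """'Good enough' load balancing: prefer the least-loaded worker,
    but skip any worker whose lock is already held rather than block."""
    candidates = sorted(workers, key=lambda w: w.load)
    for w in candidates:
        if w.lock.acquire(blocking=False):   # skip locked workers
            return w
    # All workers locked: fall back to blocking on the least loaded,
    # which matches the pre-fix behaviour.
    least = candidates[0]
    least.lock.acquire()
    return least

workers = [Worker("vda-a", 5), Worker("vda-b", 2), Worker("vda-c", 9)]
workers[1].lock.acquire()        # simulate vda-b being mid-launch (locked)
chosen = pick_worker(workers)    # vda-b is least loaded but locked,
print(chosen.name)               # so we take the next least loaded: vda-a
chosen.lock.release()
```

The trade-off is exactly the one described above: occasionally a slightly more loaded worker is chosen, in exchange for launch requests not serializing behind one lock.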
In some scenarios admins may notice a slight difference in load balancing, but you’d have to be paying very close attention to spot that we didn’t use the least-loaded VDA.
This isn’t the only place in the core brokering code where we’ve been removing and resolving SQL blocking issues; a few others have been fixed as well. So, at this time I’d recommend that large sites use a 7.13 or 7.6 CU3 broker to get all the currently known improvements.
Registration Storm serialisations
Unfortunately, one big area where we know there is a lock is VDA registration. The lock exists to avoid deadlocks when registering workers. We now have a much better understanding of the cause of those deadlocks: sessions for a worker were not locked in a consistent order across multiple registration threads. We now lock sessions ordered by session id, which stops the VDAs deadlocking.
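The ordering fix is the classic deadlock-avoidance technique: if every thread acquires session locks in the same global order (here, sorted by session id), no two registration threads can each hold a lock the other needs. A minimal Python sketch under that assumption (illustrative only; the session ids are invented, and in the real broker these are SQL row locks, not in-process locks):

```python
import threading

# One lock per session; in the broker these would be SQL row-level locks.
session_locks = {sid: threading.Lock() for sid in ("s1", "s2", "s3")}

def lock_sessions(session_ids):
    """Acquire session locks in a single global order (sorted by id).
    Two registration threads sharing sessions can no longer deadlock:
    whichever thread reaches the first shared lock wins, and the other
    waits there instead of holding a lock the winner still needs."""
    acquired = []
    for sid in sorted(session_ids):      # the crucial consistent order
        lock = session_locks[sid]
        lock.acquire()
        acquired.append(lock)
    return acquired

def unlock_sessions(acquired):
    for lock in reversed(acquired):
        lock.release()

# Thread 1 registers a worker with sessions (s2, s1); thread 2 with (s1, s3).
# Both sort first, so both contend on s1 first: no circular wait is possible.
held = lock_sessions(["s2", "s1"])
unlock_sessions(held)
```

Without the sort, thread 1 could hold s2 while waiting for s1, and thread 2 hold s1 while waiting for s2, which is precisely the deadlock the registration lock was guarding against.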
However, we viewed it as a little risky to remove the lock entirely until we’ve had more time to test that out. But we know some customers have hit this lock, so we decided to provide a tuneable for its usage.
So where is the tuneable? Firstly, you need 7.12 or later. It lives in the database, and you have to update it directly there, in the chb_Config.Site table:
```sql
select SerializeMultiSessionAudits, SerializeMultiSessionDeregistrations
from chb_config.Site

SerializeMultiSessionAudits SerializeMultiSessionDeregistrations
--------------------------- ------------------------------------
1                           1
```
These flags can be set to 0 to remove the usage of the lock:
```sql
update chb_config.Site
set SerializeMultiSessionAudits=0, SerializeMultiSessionDeregistrations=0

select SerializeMultiSessionAudits, SerializeMultiSessionDeregistrations
from chb_config.Site

(1 row(s) affected)

SerializeMultiSessionAudits SerializeMultiSessionDeregistrations
--------------------------- ------------------------------------
0                           0
```
We’ve tested this internally and found it helped resolve some issues in our re-registration scale tests. But we also know that customers have very complex environments, and we didn’t want to make such a behaviour change without a little more time to test it out.
It’s expected that a future release will default to not doing this locking, but will keep the tuneable so the locking can be re-enabled should a customer hit any issues without it.