Similar to what I did in my farm & zone design article, I thought it was time to take a “fresh” look at XenApp scalability or user density (hence the “2013 version”), because many best practices in this area have changed as we’ve made some dramatic hardware improvements and hypervisor advancements, over the past 2-3 years in particular. And after all, one of the top questions I still get asked is “How many users can I get on a box?” Well, I had one answer 10 years ago when I started, then I had another answer about 3 years ago, and now I have another answer as we approach the year 2014. A lot has changed over the last decade, but I think in this area of XA scalability in particular, the most has changed in the last few years as we’ve introduced things like NUMA and rewritten hypervisor CPU schedulers. Before we dive in, if you haven’t read Andy Baker’s 3 part series on Hosted Shared Desktop Scalability, I suggest you do. It’s really great, and I want to show how much things have changed even in the year since he wrote those articles. And if you haven’t read Project VRC’s whitepapers, specifically “Phase 2” where they looked at XA scalability on all 3 major hypervisors in 2010, I suggest you do. They are very informative, and I’m going to comment on what’s changed since they conducted those tests.
I decided to do this article a little differently. I’m going to give you the “Results” or “Key Findings” first in Part 1, then explain how we arrived at those results and some of the concepts like NUMA and CPU Over-Subscription in Part 2 (Now Published!). I recently conducted a pretty strenuous XenApp scalability test at a customer, and some of these results are from that engagement (I also added a couple more so I could talk about 1 or 2 other things in this article). Without further ado:
The following conclusions were drawn from the XenApp Scalability Tests conducted at Company ABC:
- 130 Users Per Physical Host; CPU-Bound Workload. Exactly 130 users or XenApp sessions were able to run with an acceptable user experience on each physical host with the default or “Medium” Login VSI workload using Company ABC’s unique hardware. This is slightly lower than originally anticipated, but after further investigation into the new default workload used by Login VSI 4.0.x, this is expected due to the high activity ratio and intensity of the workload. The workload was also CPU-bound as expected and this is typical of XA workloads running on 64-bit operating systems.
- 4 vCPU VM Specification Resulted in Optimal User Density. Citrix and Company ABC tested various configurations with 2 vCPUs, 4 vCPUs and even 8 vCPUs to determine the optimal VM resource allocation for a CPU intensive workload on Company ABC’s chosen hardware. It was determined that the 4 vCPU VM specification resulted in the highest user density, while still providing a good user experience. Citrix believes this is due to the underlying NUMA architecture in the Intel chipset that was used in the testing – each socket (with 8 cores) is split into two NUMA nodes, each with 4 cores.
- Modest CPU Over-Subscription Resulted in Optimal User Density while Maintaining a Good User Experience. Citrix also tested various levels of CPU over-subscription, namely 1:1, 1.5:1 and 2:1. It was determined that a modest level of CPU over-subscription (1.5:1) resulted in the highest user density, while still providing a good user experience. This can largely be attributed to Hyper-Threading being enabled, which typically provides performance gains anywhere from 20-30%.
- Small Performance Gains from Hyper-Threading due to High Activity Ratio. The performance gains achieved through Hyper-Threading were relatively small (~10%) in Company ABC’s environment compared to other customers and industry standards (~20-30%). After further investigation into the default workload that Login VSI has implemented, it appears the “activity ratio” (the amount of active processing versus idle time) is very high – on the order of ~85%. In most XenApp environments that Citrix Consulting has seen, the activity ratio is closer to 50-60% in practice. Citrix believes this high activity ratio is negatively affecting Hyper-Threading performance on an already CPU-bound workload. If Company ABC so chooses, the Login VSI script can be modified to include more idle or sleep time and the 4 vCPU test can be re-run to likely achieve higher user density and more performance benefits from Hyper-Threading.
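To make the “activity ratio” idea a bit more concrete, here is a minimal Python sketch. It assumes the ratio is simply active time divided by total (active plus idle) time, which is my reading of the definition above and not necessarily how Login VSI calculates it. The function names and the 85/15 second split are made up purely for illustration.

```python
def activity_ratio(active_s, idle_s):
    """Share of total time spent actively processing (assumed definition)."""
    total = active_s + idle_s
    if total <= 0:
        raise ValueError("total time must be positive")
    return active_s / total


def extra_idle_needed(active_s, idle_s, target_ratio):
    """Extra idle/sleep seconds needed to bring the ratio down to target_ratio."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    # Solve active / (active + idle_needed) = target_ratio for idle_needed.
    idle_needed = active_s * (1 - target_ratio) / target_ratio
    return max(0.0, idle_needed - idle_s)


if __name__ == "__main__":
    # Hypothetical workload slice: 85 s of activity, 15 s of idle time.
    active, idle = 85.0, 15.0
    print(f"current ratio: {activity_ratio(active, idle):.2f}")

    # How much more sleep would bring it to the ~55% seen in the field?
    extra = extra_idle_needed(active, idle, 0.55)
    print(f"extra idle seconds needed: {extra:.1f}")
    print(f"new ratio: {activity_ratio(active, idle + extra):.2f}")
```

The point is simply that dropping from ~85% to the 50-60% range takes a lot of added idle time relative to the active work, which is why a modified script could behave quite differently.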
Pretty interesting, eh? Now let’s talk about the results a little bit. But first, let me provide some background on what hardware and software we used in these latest tests our Consulting team conducted at this particular customer.
We used a hardware spec that is fairly popular right now (and also a very good “sweet spot” for XA workloads in terms of CPU and RAM, I might add): Dell R810s with 2 sockets (8 cores each) and 128 GB RAM. We were using Intel chips and Hyper-Threading was enabled in all tests. We were using XenApp 6.5 on Windows Server 2008 R2 (both fully patched) and vSphere 4.1 for the hypervisor (fully patched at that level, I believe “Update 3a”). We used Login VSI 4.0.x as our load testing tool to conduct all tests. We did not customize the workload; we only used the default “Medium” workload, which is essentially a pretty heavy Office user (more on that later). We monitored everything and then monitored everything some more. We looked at a variety of different VM specs (2 vCPUs vs. 4 vCPUs vs. 8 vCPUs) and CPU over-subscription ratios (using only physical cores, using all virtual CPUs, and using somewhere in between the two). We measured the user experience in a variety of ways with the Login VSI Analyzer. The results confirmed a lot of the things we are preaching as a Consulting team these days and also seeing in the field:
- Hyper-Threading still provides performance gains – you should enable it 99.9% of the time when virtualizing XA.
- You shouldn’t just use physical cores when sizing XA workloads, but you also shouldn’t use all logical/virtual cores when sizing either.
- You should absolutely consider using 4, 6 or 8 vCPUs for your XA VM spec instead of 2 vCPUs.
- Users actually “work” less than you think – activity ratios should accurately reflect how users work and interact with apps in the real world.
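The sizing arithmetic behind those ratios and VM specs can be sketched in a few lines of Python. This is a rough illustration only: it assumes the over-subscription ratio is measured against physical cores (so 1:1 means 16 vCPUs and 2:1 means all 32 logical cores on this host), and it treats a VM as “fitting” a NUMA node when it has no more vCPUs than the node has cores. The function and constant names are my own, not anything from vSphere or XenApp.

```python
# Host from the tests: 2 sockets x 8 cores, with each socket split into
# two 4-core NUMA nodes (as described in the findings above).
SOCKETS = 2
CORES_PER_SOCKET = 8
CORES_PER_NUMA_NODE = 4

PHYSICAL_CORES = SOCKETS * CORES_PER_SOCKET  # 16


def vm_count(physical_cores, ratio, vcpus_per_vm):
    """Whole VMs that fit in the vCPU budget for an over-subscription ratio."""
    vcpu_budget = int(physical_cores * ratio)
    return vcpu_budget // vcpus_per_vm


def fits_in_numa_node(vcpus_per_vm, cores_per_node=CORES_PER_NUMA_NODE):
    """Simplification: a VM fits if it needs no more vCPUs than a node has cores."""
    return vcpus_per_vm <= cores_per_node


if __name__ == "__main__":
    for ratio in (1.0, 1.5, 2.0):
        for vcpus in (2, 4, 8):
            vms = vm_count(PHYSICAL_CORES, ratio, vcpus)
            print(f"{ratio}:1 with {vcpus} vCPU VMs -> {vms} VMs, "
                  f"fits in one NUMA node: {fits_in_numa_node(vcpus)}")
```

At 1.5:1 with 4 vCPU VMs, this works out to a 24 vCPU budget and 6 VMs per host. If the 130-user result came from that configuration (an assumption on my part here), that is roughly 21-22 users per VM.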
In the next article, we’re going to really dig into these results and talk about things like how NUMA affects VM specs, pCPUs vs. vCPUs, why we didn’t get 192 users per box like we did a year ago on this same hardware, and much more. Stay tuned and I’ll update this article with the link to “Part 2” when it’s published.
Nick Rintalan, Lead Architect, Citrix Consulting