Overview

In the first part of this article (which you should read first if you haven’t already), I revealed some interesting results from an in-depth scalability testing engagement we recently completed at a large customer.  And I’m sure some of the results made you scratch your head a bit…and maybe you had some of the same questions I initially did, such as:

  • Why was 4 vCPUs chosen as the VM spec?  Don’t Citrix and VMware always recommend 2 vCPUs for XA?!?
  • Why did we only get 130 users on each physical host?  Andy’s article written a year ago said we should expect 192 users with this hardware and a medium workload?!?
  • Why did we only over-commit “half” of the total available CPUs (i.e. 24 vCPUs)?  Shouldn’t you always use all logical vCPUs available in the box (i.e. 32 vCPUs)?
  • How did NUMA factor into this particular test?  Why does NUMA even matter when sizing large VMs these days?
  • Why do users “work” a lot less than we think?  (Kidding…I’m staying away from this one!)

If you have other questions beyond these, please feel free to drop me a comment below.  But let’s tackle the questions above first.

CPU Over-Subscription

Whether you use 2 vCPUs, 3 vCPUs, 4 vCPUs, 6 vCPUs or 8 vCPUs (or some other number) for each XA VM is a harder question that really requires testing and an understanding of NUMA (and I’ll get to that next).  But an easier question to address first is how much CPU over-subscription should you do?  It is easiest to explain with an example…so remember for this particular test we had a box with a “2×8” CPU configuration (that’s sort of my slang for 2 sockets and 8 cores per socket).  And remember we’re enabling hyper-threading.  So we have 16 “physical” CPUs or cores and 32 “virtual” or logical CPUs.  And most people refer to that as 16 pCPUs and 32 vCPUs.  So the question is: do I deploy enough XA VMs to use 16 CPUs, 32 CPUs or somewhere in between?  The short answer is somewhere in between, and most of the time the best “sweet spot” will be an over-subscription ratio of 1.5:1, meaning 24 vCPUs in this case.  So I might deploy 12 XA VMs on each host if I’m using a 2 vCPU spec, or 6 XA VMs if I’m using a 4 vCPU spec, or maybe just 3 XA VMs if I’m using an 8 vCPU spec.  You basically want the math to add up to 24 vCPUs, in case you didn’t catch that.  Now, the question…why is using a 1.5:1 over-subscription ratio or “splitting the difference” between pCPUs and vCPUs typically the best sweet spot?  For that, let’s look at some more detailed results from our test and then we’ll do our best to interpret the data.
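
To make that arithmetic concrete, here is a minimal sketch (Python, with this test’s host values hard-coded purely as assumptions) that turns a host’s physical core count and a target over-subscription ratio into a VM count for a given vCPU spec:

```python
# Rough CPU over-subscription sizing math (illustrative only).
# Assumes the "2x8" hyper-threaded host described above.

sockets = 2
cores_per_socket = 8
pcpus = sockets * cores_per_socket         # 16 physical cores
oversub_ratio = 1.5                        # the 1.5:1 "sweet spot"

target_vcpus = int(pcpus * oversub_ratio)  # 24 vCPUs to hand out

for vcpus_per_vm in (2, 4, 8):
    vms = target_vcpus // vcpus_per_vm
    print(f"{vcpus_per_vm} vCPU spec -> {vms} XA VMs ({vms * vcpus_per_vm} vCPUs total)")
```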

We tested various over-subscription ratios using LoginVSI to find the optimal sweet spot.  To keep things simple, I’ll just provide test results with the 4 vCPU VM configuration.  So in a nutshell, we tested 4 XA VMs, 6 XA VMs and 8 XA VMs (again, all using 4 vCPUs) to see if using 16, 24 or 32 total CPUs would yield the best results.  The VSI Max score for each test was 119, 130, and 122, respectively.  That means the test with 6 XA VMs at 4 vCPUs each delivered the best density while maintaining a great user experience (which is measured by LoginVSI in a ton of different ways during these tests).  Now the question becomes WHY did the 1.5:1 over-subscription ratio “win” or yield the best results?

It sort of makes sense if you think about it.  When you only use the 16 physical cores in the box, you’re really not taking advantage of hyper-threading (which should give you about a 20-30% performance bump).  So in this scenario with a 1:1 over-subscription ratio, you’re really not taking full advantage of the box.  On the other hand, when we try to use all 32 virtual cores in the box, we stress the CPU scheduler too much, so we get diminishing returns.  Remember, hyper-threading only gives you a 20-30% performance increase…not 100%.  Plus, we need to save a few resources (CPU cycles) for the CPU scheduler in the ESX host itself – the work of deciding which logical CPU gets the next slice of work doesn’t come “free”.  So in this scenario with a 2:1 over-subscription ratio, we’re really hammering the box and it’s not the optimal configuration either.  That’s where splitting the difference between the number of pCPUs and vCPUs makes a lot of sense and really shines – we’re leaving some precious resources for the scheduler itself, but we’re also taking advantage of hyper-threading.  And don’t just take my word for it – believe the data…the scenario with a 1.5:1 over-subscription ratio yielded the highest VSI Max score.
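
If you want to put rough numbers behind that reasoning, here is a back-of-the-envelope sketch (purely illustrative, assuming the ~25% hyper-threading bump mentioned above) showing how the effective compute available per vCPU shrinks as you hand out more vCPUs:

```python
# Back-of-the-envelope view of why a 2:1 over-commit hits diminishing returns.
# Assumes hyper-threading adds roughly 25% throughput on top of 16 physical cores.

physical_cores = 16
ht_bonus = 0.25
core_equivalents = physical_cores * (1 + ht_bonus)  # ~20 "cores' worth" of work

for total_vcpus in (16, 24, 32):
    share = core_equivalents / total_vcpus
    print(f"{total_vcpus} vCPUs scheduled -> ~{share:.2f} core-equivalents per vCPU")
```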

One other thing I wanted to share is this isn’t the first time I’ve seen this over-subscription ratio yield the best results.  In fact, one of our largest partners in the industry (who makes EMR/EHR software) always recommends a 1.5:1 over-subscription ratio.  And they have done more testing with LoadRunner on a variety of hardware configurations than anyone I know.  And this test we did really validated those findings.

Now does this mean that a 1.5:1 over-subscription ratio should always be used?  Not necessarily.  It depends on the activity ratio, hardware, hypervisor version and scheduler, etc.  For example, I might find a slightly different sweet spot if my activity ratio is 50% compared to 85%…if my hardware is a 4×6 compared to a 2×8…if I’m using XS compared to VMW.  The sweet spot might be 1.7:1 or 1.25:1…and the only way to find out is through proper testing with a tool like LoginVSI or LoadRunner.  But what I am saying is if you don’t know or don’t have time to test, then using a 1.5:1 CPU over-subscription ratio for XA is my recommendation.

NUMA

It still amazes me how many people don’t know what NUMA is or care to understand how it impacts sizing exercises like the one we’re doing.  And this is not specific to XA workloads – this concept applies across the board and you should take NUMA into account when virtualizing any large server workload such as Exchange, SQL, etc.  Now, I’m not going to explain what NUMA is in detail in this article since it’s already been done 100 times before on the Interwebs.  But in short, it stands for Non-Uniform Memory Access and I like to think of it as “keep things local!”.  The idea is pretty simple – when you have a box with multiple sockets, each socket has local memory that it can access very quickly…and remote memory (on the other socket across a bus or inter-socket connection) that it can access not-so-quickly.  And if a hypervisor’s CPU scheduler is not “NUMA aware” as some say, then bad things can happen – we could be sending processes and threads to remote CPU or memory as opposed to local CPU or memory.  And when we do that, it introduces some latency and it ultimately affects performance (user density in the case of XA).  How much of a performance impact can NUMA really have?  We did a study on this a couple of years ago when we introduced NUMA support in Xen, and NUMA awareness was worth roughly a 25% swing in user density!  Meaning that we could get 100 users on a box without NUMA awareness…and 125 with it.  Pretty significant.  Luckily for you and me, all major hypervisors these days are NUMA aware.  So that means they understand how the hardware is configured in terms of underlying NUMA nodes and their scheduler algorithms are optimized with NUMA in mind.  So that’s fantastic – but what the heck does that mean and how should I size my XA VMs? 😉

Notice above I said “underlying NUMA nodes” – what did I mean by that?  This is really important stuff.  Believe it or not, not all Intel chips are created equal.  And not all sockets have just one (1) underlying NUMA node.  What I mean by that is a socket with 8 cores (like in our example) might actually be split into multiple underlying NUMA nodes.  An 8-core socket might be carved into 4 NUMA nodes, 2 NUMA nodes or just 1.  And if the socket is split into, say, 4 NUMA nodes, then each node has its own “local” CPU and memory resources that it can access the quickest – in this case 2 cores each.  If it’s split into 2 nodes, then each node would have 4 cores.  So the concept of NUMA doesn’t just apply to multi-socket boxes…it also applies within each socket or die!  So the next obvious question is how do you know what the underlying NUMA configuration is for the hardware you’ve bought (or are thinking about buying, hopefully!)?  Well, there are tools like Coreinfo you can run on Windows operating systems…and there are commands like ‘xl info -l’ you could run on Xen or ‘numactl --hardware’ you could run on Linux.  Those will spit out all sorts of good information on the CPUs in the box, but most importantly the NUMA configuration to help you understand where the NUMA “boundaries” lie.  If you don’t like the CLI, you can simply ask the vendor or manufacturer (Intel…or HP, Dell, IBM, Cisco, etc.).  Sometimes the hardware spec sheets even have it on there.  But I can tell you from experience that a few years ago, sockets were almost always split into multiple NUMA nodes.  Today (i.e. on newer Intel chips), they are split less or maybe not at all.  This is one of the reasons why we always recommended 2 vCPUs for XA VMs back in the day – because most boxes we were using back then were 2×2’s or 2×4’s, and the underlying NUMA nodes typically had just 2 cores each (the quad-core sockets were usually split into 2 nodes).  So 2 vCPUs was really the sweet spot and yielded optimal performance on those older chips/boxes.  But fast-forward to today…and the sockets on the latest boxes aren’t split as much, or they are split into much larger nodes.  So you might get a 2×8 where each socket has only 1 node, making it super-awesome and easy to size stuff.  I’ll explain why it’s awesome next.
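
If you’d rather script it than eyeball CLI output, here is a small sketch (Linux-only, reading the kernel’s sysfs view of the topology; treat it as illustrative rather than a supported tool) that prints which CPUs sit behind each NUMA node – run it on the host itself, not inside a VM, so you see the physical boundaries:

```python
# List the CPUs that belong to each NUMA node on a Linux host (illustrative).
# Reads the kernel's sysfs topology files under /sys/devices/system/node.
import glob
import os

for node_path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_path)
    with open(os.path.join(node_path, "cpulist")) as f:
        cpulist = f.read().strip()
    print(f"{node}: CPUs {cpulist}")
```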

XA VM Spec – How many vCPUs?

So now that you know a thing or two about NUMA (and if you’re still a bit hazy on this subject, I highly recommend this and this article – they explain it better than I do), can you guess what the underlying NUMA configuration was on our 2×8 box used in our testing?  Well, since I already told you 4 vCPUs yielded the best results, you might deduce that each socket was in fact comprised of 2 underlying NUMA nodes (with 4 cores each at their disposal).  And when I initially found that out, I immediately looked up the specific Intel chip we were using in these Dell boxes (because I was under the impression they were newer Intel chips and would therefore not be split into multiple nodes within each socket), and sure enough, the Intel chips we were using were about 2 years old (circa 2011).  So it made sense.  But chances are if you’re buying hardware today with the latest Intel chips, the sockets will likely have a single node.  And if you’ve got a box that is a 2×8 with a single NUMA node in each socket, then chances are you’re going to get linear scalability with 8 vCPUs assigned to each XA VM.  Let’s dig into that a little bit more…

To better illustrate the concepts of “NUMA thrashing” and “linear scalability”, let’s again look at some real results from our tests.  To keep things simple, let’s assume we’re using an over-subscription ratio of 1.5:1 for all 3 of these scenarios.  We wanted to test 2, 4 and 8 vCPUs (with 12, 6 and 3 VMs, respectively) to see if underlying NUMA nodes really were a factor and if things scaled linearly or not.  We achieved a VSI Max score of 125, 130 and 90, respectively.  So how should we interpret those results?

Well, these results tell me that the difference between a 2 vCPU and 4 vCPU VM spec is almost negligible…and things scale linearly (i.e. we get about 10 users per XA VM with 2 vCPUs and 12 GB RAM and about 20 users per XA VM with 4 vCPUs and 24 GB RAM).  And that is really expected, knowing that all these VMs fit nicely into each NUMA node and we’re not crossing NUMA boundaries.  But what happens when we have bigger VMs with 8 vCPUs and they don’t fit nicely into these NUMA nodes with 4 cores each?  That’s when we have to use both local and remote resources across NUMA nodes or even sockets…or maybe the scheduler is moving things around too often trying to compensate for locality…and this can result in something called “NUMA thrashing”.  And what does NUMA thrashing mean for XA scalability and what we’re doing?  It means increased latency, more pressure on the scheduler, and ultimately pretty poor user density.  In our case, it meant not scaling even close to linearly – we took roughly a 30% performance hit when going with the 8 vCPU XA VM spec!  The moral of the story: size your XA VMs so each VM’s vCPU count matches, or fits evenly within, your physical host’s NUMA node size – and don’t let a single VM span NUMA nodes.
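
If you want a quick sanity check before committing to a VM spec, a rule-of-thumb test like the following sketch can help (illustrative Python; the 4-core node size matches the 2×8 box from our tests, and the candidate specs are just examples – swap in your own hardware’s values):

```python
# Sanity-check candidate XA VM specs against the host's NUMA node size.
# Illustrative only: a 4-core NUMA node matches the 2x8 box (2 nodes per socket)
# described above; replace with your own hardware's values.

numa_node_cores = 4

for vcpus_per_vm in (2, 3, 4, 6, 8):
    if vcpus_per_vm > numa_node_cores:
        verdict = "spans NUMA nodes -> risk of NUMA thrashing"
    elif numa_node_cores % vcpus_per_vm == 0:
        verdict = "fits evenly within a node -> good candidate"
    else:
        verdict = "fits in a node but packs unevenly -> test before committing"
    print(f"{vcpus_per_vm} vCPUs: {verdict}")
```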

And one other fun fact I wanted to mention on that note…if you have a 2×6 box (still quite popular) and it’s comprised of NUMA nodes with 3 cores each…then using a XA VM spec with an odd number of vCPUs might actually yield the best results!  So while 3 vCPU XA VMs might sound weird…NUMA is the reason why some folks might recommend that seemingly peculiar VM spec.  Going with 2 or 4 vCPUs might actually result in more NUMA thrashing and worse response times and user density (and this has been tested and proven by many of our large healthcare customers I might add!).

A Quick Word About Activity Ratio and LoginVSI

I know this article is getting long, but stay with me because there is one more thing I want to address quickly – and that is why we were able to “only” get 130 XA users on each physical piece of hardware (versus the 192 number we were getting a year ago on seemingly the same piece of hardware).  Well, that has to do with the new default or medium workload of LoginVSI 4.0 (the latest shipping version).  The scalability tests that Andy and team conducted a year ago (and even the Project VRC guys did a few years ago) were done with LoginVSI 3.x…and the “normal” or medium workload back then was pretty, well…medium’ish!  The script was a lot shorter and a lot less intense.  Fast-forward to LoginVSI 4.0 and the script is much longer and much more intense.  In fact, a colleague and I did a quick test and the average bandwidth used by a single user was about 260 kbps with the new v4 script!  And the script is also “active” (versus idle) almost 90% of the time!!!  I don’t know about you, but I wouldn’t classify that as a typical XA task worker or even a medium/normal user…that, to me, is a pretty heavy user who is working way more than the average bear.  I really hope LoginVSI takes this feedback to heart and changes their script – most XA users still average about 20 kbps today and most users only “work” about 50-60% of the time…so that’s what is really skewing the numbers in our test with LoginVSI 4.  Once we introduced more idle (sleep) time and “softened” the test script a bit, we were magically getting close to 200 users per box.  So again, it goes back to making sure you’re simulating what your real users are doing – shame on us for using the medium or default script that came with the free version of LoginVSI!

UPDATE: LoginVSI reached out to me after publishing this article and let me know that they do have some of this documented on page 9 of their new v4 Upgrade Guide.  But it appears they tested XD (as opposed to XA) and only found that the new script increases CPU utilization by 22%.  That is still a lot, but I found more like 33% in my tests.  LoginVSI is also thinking about modifying their default medium workload and I’ll be meeting with them in November to discuss further.  Good stuff…cheers to LoginVSI for listening to our feedback and +1 for the Community!

Wrap-Up and Key Takeaways

That was a long article…I know.  So let me try to summarize the main points:

  • If you can’t do any testing to find the optimal CPU over-subscription ratio, “split the difference” between the number of pCPUs and vCPUs.
  • Don’t forget to factor in NUMA when sizing your XA VMs – if possible, size each VM so its vCPU count matches, or fits evenly within, the physical host’s NUMA node size (don’t let a single VM span NUMA nodes).
  • Don’t be afraid to go with bigger specs for XA, such as 4 or 8 vCPUs each.  On the newer Intel chips, we’re seeing linear scalability or even slightly better scalability with bigger VM specs!  And this also means lower Windows OS licensing costs and significantly fewer total VMs to manage!!!  This is a big benefit I sort of failed to explicitly mention above.  I know one customer that used to manage about 800 XA VMs with 2 vCPUs each – they now manage about 200 VMs with 8 vCPUs!
  • Don’t be afraid to use an odd number of vCPUs for your XA VMs – on a 2×6 box, 3 vCPUs might be the best bet.  On the newer 2×10 boxes with E7 procs, 5 vCPUs might be best if the die is split into 2 underlying NUMA nodes.  Or maybe you use 10 vCPUs each and turn on vNUMA if you’re using VMware (a topic for another day, but see page 42 of the latest vSphere 5.5 performance best practices whitepaper for more info on vNUMA and 8 vCPU+ VMs).
  • Be careful with the new default/medium workload that ships with LoginVSI 4.0 – it’s pretty heavy and can skew your results by anywhere from 22-33%!  This means you could really over-buy hardware – it’s always best to customize your workload or test script to match what your users are really doing.

Hope you enjoyed this refresher on XenApp scalability – the 2013 edition!  Please drop me a comment below if you learned something or if you have other questions I didn’t manage to answer.  Who knows…maybe there is a Part 3 down the road.
Cheers, Nick
Nick Rintalan, Lead Architect, Citrix Consulting