It’s time to talk about one of my favorite subjects again: scalability.

And what I really mean by “scalability” is single server scalability (SSS) of XenApp or RDSH-based workloads. If you haven’t read my first couple of articles on XenApp Scalability (Part 1 and Part 2), I highly recommend them so you’re not lost when I’m talking about CPU over-subscription ratios and NUMA. Because that was sooo 2 years ago.

Today we’re going to revisit CPU over-subscription and talk about a new concept called “Cluster on Die” and how that impacts XenApp scalability.

CPU Over-Subscription: 1.5x or 2x?

If you read those scalability articles from 2013, you saw that I recommended “1.5x” or 150% CPU over-commitment for XenApp workloads. The only way to determine the optimal CPU over-subscription ratio is to test all the variations with a tool such as LoginVSI. But if you don’t have time to test, then 1.5x is a pretty good “sweet spot” in terms of risk and performance.
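
If the “1.5x” shorthand is new to you, here’s the back-of-the-napkin math. This is just a quick sketch with made-up host numbers (not from any particular test or vendor spec) to show how the ratio is calculated:

```python
# CPU over-subscription ratio = total vCPUs assigned to XenApp VMs / physical cores in the host.
# The host and VM numbers below are hypothetical, purely to illustrate the math.
physical_cores = 16          # e.g. a 2-socket, 8-core-per-socket box
vms_per_host = 4
vcpus_per_vm = 6

total_vcpus = vms_per_host * vcpus_per_vm
ratio = total_vcpus / physical_cores
print(f"{total_vcpus} vCPUs on {physical_cores} cores = {ratio:.2f}x over-subscription")
# -> 24 vCPUs on 16 cores = 1.50x over-subscription
```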

I must say that the majority of our research and testing (in coming up with that 1.5x number) was based on older MSFT platforms, legacy Intel chips and legacy versions of ESXi.  In other words, it was largely based on XenApp 6.5 workloads on 2008 R2 on ESXi 4.x and 5.x.

We did most deployments back then on Sandy Bridge and Ivy Bridge chips. I point that out because something weird happened over the last couple of years: as we did more load testing and real-world deployments with XA 7.x on 2012 R2 and with ESXi 5.x and 6, we saw that “2x” was winning out just as often as (or more often than) 1.5x! And it sort of makes sense, as the hardware becomes more efficient and the software becomes smarter and more virtualization-aware.

Anyway, this is kind of a big deal because we have been basing most of our deployments on 1.5x for years … and we might have been leaving some money on the table by not going with 2x.

As I pointed out in my BriForum session in Denver earlier this year, in one test we conducted that difference amounted to as much as 9-12% in terms of SSS. And when we’re routinely getting 300 users on a box these days, that’s an extra 36 users! I’m not saying 2x is the way to go all the time, but 1.5x is no longer my golden rule of thumb. The new CPU over-subscription sweet spot is likely somewhere between 1.5x and 2x.

Cluster on Die

So, what is Cluster on Die (COD) and why should you care? Well, for the last year or so we’ve seen our customers buy new boxes with Intel Haswell-EP chips. And these chips are pretty awesome, but they also present some unique challenges in terms of how to carve up XenApp VMs on an ESXi host, for example.

Why? Let’s take a step back first and specify which chips we’re talking about, and then I think it will make more sense. The Haswell-EP chips I’m talking about are Intel’s “Segmented Optimized High Core Count” chips within the E5-2600 v3 family. So that means the 2683, 2695, 2697, 2698 and 2699 models. These basically come in 3 flavors (14, 16 or 18 cores per socket), and they all live in dual socket boxes.

The first 3 models in that list are the 14 core parts (i.e. a “2×14” box) and that’s what I’ve run into in the field 3 or 4 times now. So that’s what I’m going to use as an example, because it will help illustrate the Cluster on Die concept better.

Imagine your customer just bought some new blades and they’re equipped with these dual socket, 14 core Haswell-EP chips. And they ask you how they should carve up the XenApp VMs on each host to optimize performance and scalability – so what should you do? I’d argue that with our workload “type” you should enable COD in the BIOS to change the default snooping method, and configure 6 or 8 VMs on each host with 7 vCPUs each. Easy. Done. 😉

But why? Well, the first thing to understand is that these Haswell-EP chips are unique in the way they are manufactured. The 14 cores within each socket are actually laid out as a 6 core and an 8 core cluster under the covers. It’s more complicated than that, with home agents and the new microprocessor architecture, but I’m trying to keep things simple and I believe that is the easiest way to think about it.

This is important to understand because the NUMA nodes are “uneven” by default. If you assumed each socket was broken down into 2 even NUMA nodes with 7 cores each and configured a 7 vCPU XA VM spec, you’d get sub-optimal performance because you’d be crossing NUMA boundaries on that 6 core node. With the default BIOS, chip and NUMA config, you’re probably best off going with a 6 vCPU XA VM spec.
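
To make the “uneven nodes” point concrete, here’s a rough sketch of the fit check using the simplified 6/8 split I just described (the helper function and numbers are purely illustrative):

```python
# Simplified model from above: each 14-core Haswell-EP socket is physically a
# 6-core and an 8-core cluster. With COD enabled it is presented instead as two
# even 7-core NUMA nodes. A VM only avoids spanning if it fits the smallest node.

def fits_without_spanning(vcpus_per_vm, node_sizes):
    """True if the VM fits inside even the smallest node, wherever it lands."""
    return vcpus_per_vm <= min(node_sizes)

default_split = [6, 8]   # uneven clusters (COD disabled)
cod_split     = [7, 7]   # even NUMA nodes (COD enabled)

for vcpus in (6, 7):
    print(f"{vcpus} vCPU VM -> default: {fits_without_spanning(vcpus, default_split)}, "
          f"COD: {fits_without_spanning(vcpus, cod_split)}")
# 6 vCPU VM -> default: True, COD: True
# 7 vCPU VM -> default: False, COD: True
```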

But is there a better way to squeeze out a few more users per box? Yes! And that is where COD comes into play. COD is a new thing Intel shipped along with these HSW-EP chips last year. It is an advanced BIOS CPU configuration that alters the default “snooping” mode from “Early Snoop” to “Cluster on Die” (check out this awesome Intel presentation if you want to learn more about snooping modes and COD).

Once enabled, Intel basically “steals” a core from the 8 core cluster and presents it to the 6 core cluster, resulting in 2 evenly split NUMA nodes (or clusters) with 7 cores each per socket. And now a 7 vCPU XA VM spec fits perfectly! We can then configure 6 or 8 VMs on each host, which means a few fewer VMs to manage per host (vs. the 6 vCPU spec or something smaller), and that aligns perfectly with the 1.5-2x CPU over-commitment sweet spot I just talked about.

So you might be asking (as I did at first): that “theft” has to come at a price, right? Sure, there are always trade-offs. And in this case we’re essentially trading latency for bandwidth. It’s a tad more complicated than that due to locality trade-offs, so try to stay with me for a minute. The default snooping mode (Early Snoop) is designed for general purpose workloads that are not NUMA aware, and therefore it is designed to optimize remote memory latency. But what if you have a workload that is NUMA aware, and the goals are to minimize LLC hit and local memory latency while maximizing local memory bandwidth? That is precisely where COD comes in – it is designed for workloads like Windows VMs and ESXi hypervisors that are highly NUMA optimized or “aware”. It effectively gives up some remote memory latency and bandwidth in exchange for better local memory latency and bandwidth. If I’ve totally lost you, slide 34 of that Intel presentation has a nice summary table covering the pros and cons of the various snooping modes. But it all comes down to maximizing performance for NUMA-aware workloads, and COD is the best way to do that.

To sum things up, I think there are 3 or 4 viable options you could theoretically go with (ordered from worst to best, in my opinion, for maximum SSS). Let’s assume 1.5x is the target for these options to keep things simple. I’ve also included a quick sketch of the math right after the list.

Option 1 – COD disabled, 6 vCPUs, 7 VMs per host

  • This will work and aligns with our 1.5x CPU over-commitment rule of thumb, but we never like to have an odd number of VMs per host with an even number of NUMA nodes (with COD disabled that’s 1 node per socket and 2 total nodes per host). It presents some challenges for the hypervisor scheduler and can impact performance.

Option 2a and 2b – COD disabled, 6 vCPUs, 6 or 8 VMs per host

  • Since we don’t like an odd number of VMs per host, we can simply go with 1 more or 1 fewer VM per host. But with 8 VMs we could stress the box or scheduler since we’re in the ~1.7x range now (remember, for the sake of argument, 1.5x was deemed optimal for this example), and with 6 VMs we leave some money on the table by going with only ~1.3x. I personally don’t like either option.

Option 3 – COD enabled, 7 vCPUs, 6 VMs per host

  • As I mentioned above, I prefer this config since COD is designed for “highly NUMA optimized workloads” according to Intel, it gives us an even number of VMs per host, and it aligns with our 1.5x rule of thumb. And if you determine that 2x is the optimal CPU over-subscription ratio, then you simply go with 8 VMs per host. What’s not to like?!
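
And here’s that quick sketch of the math behind the options, assuming a 2×14 (28 physical core) host. I’ve listed Option 3 twice to show both its 1.5x and 2x variants; these are just the ratios implied by the configs above, not test results:

```python
# Over-subscription for each option on a 2-socket, 14-core-per-socket host.
physical_cores = 2 * 14

options = [
    ("Option 1  (COD off, 6 vCPUs x 7 VMs)", 6, 7),
    ("Option 2a (COD off, 6 vCPUs x 6 VMs)", 6, 6),
    ("Option 2b (COD off, 6 vCPUs x 8 VMs)", 6, 8),
    ("Option 3  (COD on,  7 vCPUs x 6 VMs)", 7, 6),
    ("Option 3  (COD on,  7 vCPUs x 8 VMs)", 7, 8),
]

for name, vcpus, vms in options:
    total = vcpus * vms
    print(f"{name}: {total} vCPUs / {physical_cores} cores = {total / physical_cores:.2f}x")
# Option 1:  42 vCPUs / 28 cores = 1.50x
# Option 2a: 36 vCPUs / 28 cores = 1.29x
# Option 2b: 48 vCPUs / 28 cores = 1.71x
# Option 3:  42 vCPUs / 28 cores = 1.50x (6 VMs) or 56 / 28 = 2.00x (8 VMs)
```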

All you have to do is enable COD in the BIOS and you’ll be off and running. And this same sizing approach can be applied to the 18 core boxes as well (I’ve run into these only once in the field, but expect to see them a lot more in the near future). I’d recommend enabling COD and configuring 9 vCPU XA VMs (with vNUMA enabled) and 6 or 8 VMs on each host.
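
And for what it’s worth, the same arithmetic on the 2×18 boxes (again, just the ratios, not measured results):

```python
# 2-socket, 18-core-per-socket host: with COD enabled that's four 9-core NUMA nodes.
physical_cores = 2 * 18
for vms in (6, 8):
    total = vms * 9   # 9 vCPU XA VMs, per the recommendation above
    print(f"{vms} VMs x 9 vCPUs = {total} / {physical_cores} cores = {total / physical_cores:.2f}x")
# 6 VMs x 9 vCPUs = 54 / 36 cores = 1.50x
# 8 VMs x 9 vCPUs = 72 / 36 cores = 2.00x
```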

The other thing you’re probably wondering is how many users you can actually get on a box with the 3 or 4 options above (i.e. does COD really mean 20 more users per box, or just 1 or 2?). Well, that’s why this is Part 1. 😉

I’m going to unveil some fascinating test results in Part 2 early next year.  So stay tuned and have a great holiday break! I hope this guidance helps in the meantime.

Cheers, Nick

Nick Rintalan, Lead Architect & Director
Americas, Citrix Consulting Services (CCS)