I have sort of been waiting for this – after all, it’s been almost a year since our friends in Palo Alto announced Remote Desktop Session Host (RDSH) capabilities with Horizon (their XenApp-like product if you’re not familiar).  But a couple weeks ago, they finally published some performance and sizing “best practices.”

And since I have done a lot of performance and scalability testing with TS/RDS/XenApp over the last decade, I found it pretty darn interesting, mainly because VMW is new to this “game” and this is their first stab at documenting the best practices associated with sizing guest VMs and determining how many RDSH sessions can be hosted on a single ESXi host.

So I’m going to give you a quick run-down of some of their key findings: what I thought they did well (or what aligns with what Citrix and MSFT have been preaching for years), and what they could probably do better in the future (or what was a “curious” finding or result that doesn’t align with what we have been doing for a long time in this space).

By the way – if you haven’t read my two-part article on XenApp Scalability, please start there, because I am going to be touching on a lot of concepts from that series, such as CPU over-subscription and NUMA.

The Good (Consistent Findings or Results)

  • Bigger is Better.  I commend VMW for testing different vCPU configurations and CPU over-subscription ratios.  This is really the only way to figure out what the optimal guest size is and how many guests should be squeezed onto a single ESXi host.  And what VMW found is that 8 vCPU VMs perform the best, which is exactly what I have been preaching for the last few years.  Going with bigger guest VM sizes has the added advantage that you have fewer nodes overall to manage in your environment, not to mention there could be additional cost savings due to Windows OS licensing.
  • Activity Ratio.  I was happy to see that VMW built about 5 seconds of what they call “think time” in between each operation the test script executed.  I have often commented that users work a lot less than most people think, so some sleep or think time should be added to every script (see the first sketch after this list for what that looks like in practice).  In fact, when we do testing with LoginVSI and XA now, we often tweak the script to add additional think time to mimic a real-world XA user.  Just check out what Dan Allen did here with LoginVSI.
  • User Density.  VMW found that a 4×8 box can serve or host 288 RDSH-based users on a single ESXi 5.5 host with acceptable performance.  That is refreshing because we have seen similar numbers in the field with such boxes and chipsets (~300 users/host).  And even if you look at AndyB’s table from 2012 where we published some XA scalability numbers for a 4×8 server, the medium or “normal” number of users we said we could support on a single host was 320.   So VMW’s density numbers are actually right in line with what we are seeing and preaching.
  • Protocol Comparison.  One of the weird (but cool in my opinion) things VMW did in this study was compare the CPU and bandwidth utilization of PCoIP, RDP and ICA/HDX.  They found that guest CPU usage was slightly better with ICA and RDP vs. PCoIP, and bandwidth usage was slightly better with PCoIP vs. ICA and RDP.  But honestly, the differences are negligible if you ask me (72% vs. 71% CPU and 45 Kbps vs. 48 Kbps, respectively, for PCoIP and ICA).  The more interesting thing I keyed in on was that they found the average bandwidth per user or session to be about 45-50 Kbps.  This all depends on the script and what kind of work you’re doing, but we’ve said roughly 20-75 Kbps per user for ICA traffic for almost 15 years now (the second sketch after this list shows what that adds up to per host).  So VMW’s findings (and the test script they used) are right in line with the numbers we see in the field.
  • NUMA Matters.  VMW also paid homage to non-uniform memory access (NUMA) at the end of the whitepaper and basically echoed what I said in my two-part series – try to size your guest VMs with NUMA in mind and don’t cross any node boundaries if possible (the third sketch after this list shows the simple arithmetic).
  • RDSH Tuning and Server Optimizations.  Similar to what we always recommend with XA (or XD) workloads, VMW used their Horizon 6 Optimization Tool to tune a few Windows items and achieve maximum user density.  It was good to see that they ported the tool over from View/VDI and are also using it for Horizon/RDSH workloads now.
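Since I keep hammering on think time, here is a minimal sketch of where that ~5 seconds of pause sits in a session-simulation loop. It is plain Python and purely illustrative (not the actual View Planner or LoginVSI workload, and the operation names are made up):

```python
import random
import time

# Made-up operations a session-simulation script might run; real workloads
# (View Planner, LoginVSI) are far richer. This only shows where the
# "think time" pause belongs in the loop.
OPERATIONS = ["open_document", "type_paragraph", "browse_web_page", "save_file"]

def run_operation(name: str) -> None:
    # Stand-in for the actual application action (Word, IE, etc.).
    print(f"executing: {name}")

def simulate_session(iterations: int = 3, think_time_s: float = 5.0) -> None:
    """Run the workload with a pause between operations, mimicking a real
    user who reads and thinks instead of hammering the server non-stop."""
    for _ in range(iterations):
        for op in OPERATIONS:
            run_operation(op)
            # ~5 seconds of think time, with a little jitter so simulated
            # sessions don't all wake up at exactly the same moment.
            time.sleep(think_time_s + random.uniform(-0.5, 0.5))

if __name__ == "__main__":
    simulate_session()
```

The sleep is what keeps the simulated user honest; strip it out and every session pegs the CPU back-to-back, and your density numbers end up looking nothing like a real-world XA user.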
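And to put those per-session bandwidth numbers in host-level terms, here is a quick back-of-the-napkin calculation (again just illustrative Python) using the 288-session density result from the whitepaper:

```python
# Rough per-host bandwidth arithmetic using the whitepaper's numbers:
# 288 sessions on the host and the per-session rates quoted above.
sessions_per_host = 288

for protocol, kbps_per_session in (("PCoIP", 45), ("ICA/HDX", 48)):
    total_mbps = sessions_per_host * kbps_per_session / 1000
    print(f"{protocol}: ~{total_mbps:.1f} Mbps for {sessions_per_host} sessions")
```

Either way you are looking at roughly 13-14 Mbps of steady-state traffic for a fully loaded host with this kind of workload.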
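Finally, since NUMA still trips people up, the arithmetic behind “don’t cross any node boundaries” is simple enough to show in a few lines, using the 4×8 host from the whitepaper as the example:

```python
def crosses_numa_boundary(vm_vcpus: int, cores_per_node: int) -> bool:
    """True if a guest of this vCPU size cannot fit inside one NUMA node."""
    return vm_vcpus > cores_per_node

# The whitepaper's host: 4 sockets x 8 cores, so each NUMA node has 8 cores
# (plus roughly a quarter of the host's RAM as local memory).
cores_per_node = 8

for vcpus in (4, 8, 16):
    spans = crosses_numa_boundary(vcpus, cores_per_node)
    print(f"{vcpus:>2} vCPU guest crosses a NUMA boundary: {spans}")
```

An 8 vCPU guest lines up exactly with one node on this box, which is part of why that guest size performs so well; a 16 vCPU guest would span two nodes and pay the remote-memory penalty.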

The Bad (“Curious” Findings or Results)

  • View Planner vs. LoginVSI.  VMW used their own capacity planning tool (View Planner 3.5) rather than something a bit more industry-standard such as LoginVSI.  And while I understand their reasoning, and they briefly shared what the workload was doing and how much “think time” they added to the script, they did not share all the details, which makes it difficult to compare their numbers to anyone else’s, even on identical hardware or with the same ESXi version.  I wish VMW would use LoginVSI going forward; that would really help the Community.
  • Response Time.  I have to admit I found it extremely odd how VMW went about calculating “acceptable performance”.  A year or so ago in their View performance and scalability whitepaper, they said that 1 second was the threshold for acceptable performance (see the graph on page 14).  But now they are saying the acceptable amount of response time is 6 seconds?!?  Seriously, on page 9 they say that “anything above 6 seconds is too high and therefore unusable”.  Maybe this is apples and oranges, or the Planner tool has seen some significant updates, but it certainly doesn’t seem like it.  And since when is 6 seconds acceptable to a user?  How did VMW come up with that number when it was 1 second a year ago?  We usually use 1 or 2 seconds for response time at Citrix when doing scalability testing, so I found this to be very curious.  But here is what is even stranger – after I read that 6 seconds was their threshold, I thought they were going to say that each ESXi box could support 1,000 users or something.  But no – only 288, which is actually slightly lower than we typically see in the field with this hardware and a medium/Office’ish user like the one used in their script.  So who knows what is going on there – all the more reason to explain how they arrived at this 6-second threshold in future whitepapers, or simply make the switch to LoginVSI, since that tool does a great job of looking at dozens of metrics associated with response time to truly determine what is acceptable to a user and what is not.
  • Host Metrics and Hardware.  The very first thing that struck me as interesting in VMW’s test setup was the box they used to run the test, because it is actually not a config I’d recommend for RDSH-based workloads or users.  This 4×8/512 R820 box is much better suited for VDI workloads in terms of bang for your buck – a dual-socket box with, say, 12 cores (like the R710) or a quad-socket box with probably half the memory are much better “sweet spots” in terms of cost and performance for RDS workloads.  So I thought it was interesting that VMW chose this monster box and only got 300 users or so on it.  They didn’t provide any host metrics (please do provide at least CPU and memory ESXi metrics in the future!), but I can guess that the box was probably CPU-bound and they ended up not using about half of the memory in each ESXi host.  I just thought this was worth mentioning since it shows VMW’s roots coming from the VDI world, and I’d probably never recommend this exact box for XA workloads unless I was forced to or I got all that memory for free.
  • CPU Over-Subscription Ratio.  VMW found that a 2:1 over-subscription ratio was the optimal config (basically 8 guest VMs @ 8 vCPU on each host, which equates to 64 vCPUs on a box with only 32 physical cores).  This actually doesn’t bother me that much, since we have used 2:1 configs at times in the past, especially when we want to get a little more aggressive due to cost or maybe we have a lighter or less-critical workload.  But I wanted to point it out since I’ve said to use 1.5:1 in the past and this result is obviously slightly different.  The optimal CPU over-subscription ratio depends on a number of factors and the only way to figure it out is through testing, just like VMW did (see the quick sizing sketch after this list).  So, again, props to VMW on this one for figuring it out the right way.  I think the sweet spot for CPU over-subscription with most workloads these days on modern hypervisors is between 1.5x and 2x.
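If you want to sanity-check that 2:1 math yourself, here is a quick sizing sketch. It is just back-of-the-napkin Python; the 36 users per VM figure is simply 288 users divided by 8 VMs from the whitepaper, not a number to plan on without your own testing:

```python
def vms_for_ratio(physical_cores: int, vcpus_per_vm: int, ratio: float) -> int:
    """How many guests of a given vCPU size hit a target vCPU:pCPU ratio."""
    return int((physical_cores * ratio) // vcpus_per_vm)

physical_cores = 32   # 4 sockets x 8 cores, the box used in the whitepaper
vcpus_per_vm = 8      # the guest size VMW found to perform best
users_per_vm = 36     # derived from 288 users / 8 VMs, purely illustrative

for ratio in (1.5, 2.0):
    vms = vms_for_ratio(physical_cores, vcpus_per_vm, ratio)
    print(f"{ratio}:1 over-subscription -> {vms} VMs -> ~{vms * users_per_vm} users/host")
```

At 2:1 you land right on VMW’s 8 VMs and 288 users; at the more conservative 1.5:1 I have recommended in the past, the same box would top out around 6 VMs and roughly 216 users, which is exactly why this ratio is worth testing for your own workload rather than taking anyone’s number on faith.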

Anyway, that’s all I got.  Some pretty fascinating stuff in there and I’m actually glad VMW has joined the party and published some test results.  It definitely is a great thing for the Community and we can certainly learn from each other in the future if we’re willing to put our small differences aside.

-Nick

Nicholas Rintalan

Lead Architect, Americas

Citrix Consulting