My colleague Miguel Contreras and I have done quite a bit of testing with the new PVS Cache in RAM with Hard Disk Overflow feature, and the results are amazing!  If you haven’t already read it, I would recommend that you check out Part One of this series.

In part two of this series, I will walk you through more detailed analysis and results from testing in our labs as well as in real-world customer environments. I will also provide recommendations on how much memory you should plan to allocate in order to take advantage of this new feature.

The exciting news is that we can reduce the IOPS for both XenApp RDS workloads and Windows VDI workloads to less than 1 IOPS per VM, regardless of the number of users!  When properly leveraged, this feature eliminates the need to buy fast and expensive SSD storage for XenApp and VDI workloads.  Storage and IOPS requirements have been a major pain point in delivering VDI at large scale, and now, with PVS and this new feature, we eliminate IOPS as an issue without buying any SSD storage!

IOPS Overview

Before I jump into the tests and numbers, I think it is important to give a quick overview of IOPS (I/O Operations per Second) and how I present some of the numbers.  Whenever I try to explain IOPS, I always tell people that IOPS calculations are a bit of a voodoo art, calculated using very “fuzzy” math.

For example, a VM might boot in 2 minutes and consume an average of 200 IOPS during boot when placed on a SATA disk.  That same VM, when placed on an SSD, might consume 1,600 IOPS and boot in 15 seconds.  So how many IOPS do I need to boot the VM?  The reality is that I need about 24,000 TOTAL I/O operations, but the number per second will vary greatly depending upon what the acceptable boot time is for the VM.

Using the previous example, if a 4 minute boot time is acceptable and the VM needs 24,000 I/O operations to boot, then the VM requires access to 100 IOPS during the boot phase.  To determine the minimum IOPS actually required, it is important to run more than one VM simultaneously and to use a tool such as LoginVSI to find the point at which a VM no longer provides an acceptable user experience.  This is why I used LoginVSI to run multiple concurrent sessions, and why I also provide the total I/O operations used during heavy phases such as boot and logon in addition to the IOPS.
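
To make that arithmetic concrete, here is a minimal sketch of the calculation; required_iops is an illustrative helper of mine, not part of any Citrix tooling, and the numbers come straight from the example above.

```python
def required_iops(total_io_ops: int, acceptable_seconds: float) -> float:
    """IOPS the storage must sustain to finish a phase in the given amount of time."""
    return total_io_ops / acceptable_seconds

# The SATA example: ~200 IOPS sustained over a 2-minute boot
total_ops = 200 * 120                  # about 24,000 total I/O operations
print(required_iops(total_ops, 15))    # ~1,600 IOPS gets the boot done in 15 seconds (the SSD case)
print(required_iops(total_ops, 240))   # a 4-minute boot only needs ~100 IOPS
```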

If you want some more gory details about IOPS, then I recommend you check out the BriForum 2013 presentation delivered by Nick Rintalan and me; find it on YouTube here.

Additionally, I would recommend you also check out the IOPS white paper Jim Moyle wrote several years ago.  It does a great job of explaining IOPS; get the PDF here.

Since XenApp with RDS workloads and Windows VDI workloads have completely different usage profiles, I have separate results and recommendations for each.  I will start with Windows VDI below.

Windows 7 and XenDesktop VDI

In testing the new PVS feature with VDI workloads, we ran the tests with three different scenarios. For all scenarios we used LoginVSI with a Medium workload as the baseline test.  For the first test we used Machine Creation Services (MCS) as the provisioning technology.  MCS does not offer any enhanced caching capabilities, so this test gives us the baseline number of IOPS that a typical Windows 7 desktop would consume.

Here are some more details on this baseline test…

Windows 7 VDI Baseline with MCS on Hyper-V 2012 R2

  • Single Hyper-V host with hyper-threaded Quad Core CPU and 32 GB RAM
  • A single dedicated 7200 RPM SATA 3 disk with 64 MB cache was used for hosting the Windows 7 VMs
  • Windows 7 x64 VMs: 2 vCPU with 2.5 GB RAM
  • UPM and Folder Redirection were properly configured and fully optimized such that the profile was less than 10 MB in size.  Refer to this blog post or watch the BriForum 2013 presentation delivered by Nick Rintalan and yours truly on YouTube.

Below are the Boot IOPS numbers for the MCS test.

# of VMs | Boot Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
1 VM     | 2 minutes     | 24,921 per VM       | 213         | 184              | 29                | 86% / 14%
5 VMs    | 6 minutes     | 25,272 per VM       | 70          | 60               | 10                | 86% / 14%

As you can see from the above table, whether booting 1 VM or 5 VMs, approximately the same number of total I/O operations is consumed per VM.  The IOPS per VM are lower when booting multiple VMs because the VMs are sharing the same disk, and the amount of time required to boot each VM increases proportionally.  For this baseline test I used 5 VMs because that is the approximate number of traditional Windows 7 VMs you can run on a single SATA 3 disk and still get acceptable performance.  Also, my definition of “boot” is not simply how long it takes to get to the Control-Alt-Delete logon screen, but how long it takes for most services to fully start up and for the VM to successfully register with the Citrix Desktop Delivery Controller in addition to displaying the logon screen.
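
The IOPS per VM column in these tables is simply the total I/O operations divided by the phase duration.  As a quick sanity check of the boot rows above (per_vm_iops is a throwaway helper of mine; the durations are rounded, which is why the first result lands slightly below the table’s 213):

```python
def per_vm_iops(total_ops_per_vm: float, duration_seconds: float) -> float:
    """Average IOPS a single VM consumed during a measured phase."""
    return total_ops_per_vm / duration_seconds

print(per_vm_iops(24_921, 120))   # 1 VM booting in ~2 minutes  -> ~208 IOPS, close to the 213 in the table
print(per_vm_iops(25_272, 360))   # 5 VMs booting in 6 minutes -> ~70 IOPS per VM, matching the table
```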

The next baseline MCS test I ran was to determine the IOPS consumed during the logon and initial application start-up phase, which includes the logon itself and the initial launch of several applications.

# of VMs | Logon Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
1 VM     | 25 seconds     | 4,390 per VM        | 175         | 103              | 72                | 59% / 41%
5 VMs    | 90 seconds     | 4,249 per VM        | 48          | 23               | 25                | 48% / 52%

Just like the boot phase, the total I/O operations generated are pretty much the same whether logging on 1 user or 5 users.  The overall logon duration was a little longer because the 5 session launches were spread out over 60 seconds.  Each individual logon session averaged 30 seconds to complete the logon and launch the first applications.

The final baseline MCS test run was to determine the steady state IOPS generated during the LoginVSI Medium workload.

# of VMs | Session Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
1 VM     | 45 minutes       | 22,713 per VM       | 8.5         | 3                | 5.5               | 35% / 65%
5 VMs    | 45 minutes       | 20,009 per VM       | 7.5         | 2                | 5.5               | 27% / 73%

The steady state IOPS number is the one that is typically of most interest and is what is used in most sizing equations when dealing with a large number of VMs.  It is not that we do not care about IOPS during the boot or logon phases; however, these phases typically represent only a very small percentage of the overall load generated throughout the day during production time.  If properly designed, well over 80% of all VDI boot phases should occur during non-production time, such as overnight or very early in the morning.  Additionally, as long as we have properly designed and implemented our profile strategy, we can keep the logon IOPS confined to a short period of time and at a manageable level as well.
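
To show the kind of back-of-the-envelope sizing I mean, here is a hedged sketch.  The function name, the 10% concurrent-logon figure, and the 50 IOPS logon default are all my own illustrative assumptions, not a Citrix sizing formula; plug in your own measured values.

```python
def storage_iops_budget(vm_count: int, steady_iops_per_vm: float,
                        concurrent_logon_pct: float = 0.10,
                        logon_iops_per_vm: float = 50.0) -> float:
    """Rough daytime IOPS budget: steady state for everyone plus a small slice of
    users logging on at any one time (boot storms are assumed to run overnight)."""
    steady = vm_count * steady_iops_per_vm
    logons = vm_count * concurrent_logon_pct * logon_iops_per_vm
    return steady + logons

# 500 traditional (MCS) Windows 7 VMs at ~8 steady-state IOPS each
print(storage_iops_budget(500, 8))   # ~6,500 IOPS the storage would need to sustain
```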

Now that we have the baseline number of IOPS that a standard Windows 7 desktop will consume, let’s see how PVS 7.1 with the new caching feature performs.

Windows 7 – PVS 7.1 RAM Cache with 256 MB on Hyper-V 2012 R2

This test was configured just like the MCS baseline test and run on the same hardware.

  • Single Hyper-V host with hyper-threaded Quad Core CPU and 32 GB RAM
  • A single dedicated 7200 RPM SATA 3 disk with 64 MB cache was used for hosting the write cache disk for the Windows 7 VMs
  • Windows 7 x64 VMs: 2 vCPU with 2.5 GB RAM
  • PVS 7.1 Standard Image with RAM Cache set at 256 MB (PVS on separate host)
  • Windows Event Logs were redirected directly to the write cache disk so that they persist and their I/O would not be cached in RAM
  • The profile was fully optimized with UPM and Folder Redirection (profile share on separate host)

For this test I went straight to booting 11 VMs, which was the maximum number of VMs I could run on my host with 32 GB RAM.

# of VMs | Boot Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
11 VMs   | 3.5 minutes   | 536 per VM          | 2.5         | 1.4              | 1.1               | 56% / 44%

These results are amazing!  With only 256 MB of RAM used per VM for caching, we are able to reduce the dreaded boot storm IOPS to only 2.5 sustained IOPS!  We have always known that PVS essentially eliminates the read IOPS due to the PVS server caching all the read blocks in RAM, but now we also have the PVS target driver eliminating most of the write IOPS as well!

Now let’s see what happens during logon phase.

# of VMs | Logon Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
11 VMs   | 3 minutes      | 101 per VM          | .56         | .05              | .51               | 9% / 91%

I launched the 11 sessions at 15-second intervals.  It took only 15 seconds for each session to fully log on and launch the initial set of applications.  As a result, I tracked the logon IOPS for the 11 VMs over a total duration of 180 seconds.  During this time, we generated less than one IOPS per VM!

Now let’s see what happened during the steady state.

# of VMs | Session Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
11 VMs   | 45 minutes       | 290 per VM          | .1          | .001             | .1                | 1% / 99%

Yes, you are reading it correctly.  We generated one tenth of one I/O operation per second per VM.  The total IOPS generated by all 11 VMs was only 1.1!  The read IOPS were so low that they could effectively be considered zero: with 11 VMs actively running for 45 consecutive minutes, we generated roughly 1 read I/O per minute in total.

I know that some of you are probably thinking that a fully optimized profile solution, where the profile is only 10 MB in size and everything is redirected, might be hard to implement. Sometimes customers get stuck keeping the AppData folder in the profile instead of redirecting it, which significantly increases logon load and could also overrun the RAM cache. For this reason, I reran the tests with a bloated, non-redirected AppData folder to see how it would impact the results.  I stopped redirecting the AppData folder and bloated it in the UPM share for each user to over 260 MB, containing 6,390 files and 212 subfolders.  Additionally, I disabled UPM profile streaming so that the logon would have to wait for 100% of the profile to download at logon.  Since the RAM cache is limited to 256 MB and the users’ profiles are now over 270 MB in size, this guaranteed that we would overrun the RAM cache before the logon completed and the LoginVSI tests began.
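
To see why this setup is guaranteed to spill out of the 256 MB cache, here is a rough back-of-the-envelope check.  The helper is hypothetical and the arithmetic deliberately ignores other write activity (event logs, pagefile, application writes), which only push the overflow further.

```python
def ram_cache_overflow_mb(cache_mb: int, writes_mb: float) -> float:
    """MB that spill to the VHDX overflow file once the RAM cache fills up."""
    return max(0.0, writes_mb - cache_mb)

profile_mb = 10 + 260   # optimized base profile plus the bloated, non-redirected AppData folder
print(ram_cache_overflow_mb(256, profile_mb))   # at least ~14 MB overflows before logon even finishes
```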

Since my VMs were running on Hyper-V with the Legacy Network Adapter (100 Mbps limit), the logons definitely increased in duration with the bloated profile.  It took approximately 3 minutes and 30 seconds per user to complete the logon process and launch the first applications.  For this reason, I staggered the sessions to log on at a rate of 1 per minute.

Here are the logon and steady state results for this test with the bloated profile.

# of VMs | Logon Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
11 VMs   | 16 minutes     | 4,390 per VM        | 4.56        | .22              | 4.34              | 5% / 95%

Even with a bloated profile, we generated less than 5 IOPS per user during the logon phase.  Below are the steady state results.

# of VMs | Session Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
11 VMs   | 45 minutes       | 2,301 per VM        | .85         | .02              | .83               | 2% / 98%

The steady state numbers with the bloated profile were also higher than with the optimized profile; however, we still maintained less than 1 IOPS per VM!

Now let’s see what happens when we run a similar test using a much larger server with VMware as opposed to my meager lab server running Hyper-V.

Windows 7 – PVS 7.1 RAM Cache with 512 MB on VMware vSphere 5.5

Here are some details on the server and environment.

  • 48 Core AMD Server with 512 GB RAM
  • NFS LUN connected to SAN
  • vSphere 5.5
  • 150 Windows 7 x64 VMs on host: 2 vCPU with 3 GB RAM
  • McAfee Anti-virus running within each Windows 7 guest VM
  • Write Cache disk for each VM on NFS LUN
  • PVS 7.1 Standard Image with RAM Cache set at 512 MB (PVS on separate host)
  • Profiles fully optimized with UPM and Folder Redirection

I simultaneously initiated a boot of all 150 Windows 7 VMs on the host.  For those of you who have tried booting a lot of VMs simultaneously on a single host, you know that this typically crushes the host and is not something we would normally do. However, I was feeling brave, so I went for it!  Amazingly, all 150 VMs fully booted and registered with the XenDesktop Delivery Controller in just under 8 minutes!  It took 8 minutes to boot because the CPUs on the host were pegged.

Here are the IOPS results during the 150 VM boot phase.

# of VMs | Boot Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
150 VMs  | 8 minutes     | 655 per VM          | 1.36        | .5               | .86               | 37% / 63%

For the logon test, I configured LoginVSI to launch a session every 12 seconds.  At that rate it took just over 30 minutes to log on all of the sessions. Here are the logon results.

# of VMs | Logon Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
150 VMs  | 32 minutes     | 1,144 per VM        | .59         | .01              | .58               | 2% / 98%

Now let’s look at the steady state.

# of VMs | Session Duration | Total IO Operations | IOPS per VM | Read IOPS per VM | Write IOPS per VM | Read/Write Ratio
150 VMs  | 30 minutes       | 972 per VM          | .535        | .003             | .532              | 1% / 99%

As you can see from both the logon and steady state phases above, we are able to keep the total IOPS per Windows 7 VM to less than one!

The results of the new PVS 7.1 RAM Cache with Hard Disk Overflow feature are simply amazing.  Even with a RAM cache of only 256 MB, we are able to significantly reduce IOPS, even in situations where users have large, bloated profiles.

So, what kind of benefits does this new feature provide for XenApp workloads?  I ran several tests on both Hyper-V and vSphere as well, so let’s see how it worked.

XenApp 6.5 on Windows 2008 R2

For XenApp I ran tests very similar to the ones I ran for VDI.  I based all the tests on the LoginVSI Medium workload and used the same user accounts with the same profile settings.  For the XenApp tests I only tested the fully optimized profile scenario.  With XenApp we typically have fewer VMs with larger memory configurations on our hypervisor hosts, so I tested the PVS RAM cache feature with 1 GB, 3 GB, and 12 GB of RAM allocated for caching.  The results are below.

XenApp 6.5 2008 R2 hosted on Hyper-V 2012 R2

I used the same Hyper-V host in my personal lab for testing XenApp that I used for the VDI tests.  My host was configured as follows.

  • Hyper threaded Quad Core CPU with 32 GB RAM
  • Hyper-V 2012 R2
  • A single dedicated 7200 RPM SATA 3 disk with 64 MB cache was used for hosting the write cache disk for the XenApp VMs.
  • 2 Windows 2008 R2 XenApp 6.5 VMs configured as:
    • 4 vCPU with 14 GB RAM
    • 60 launched LoginVSI Medium sessions (30 per VM)

Test#1 PVS XenApp Target with 1 GB RAM Cache

For this test I configured LoginVSI to launch 1 session every 30 seconds.  The logon duration lasted 30 minutes.

# of Users | Logon Duration | Total IO Operations | IOPS | Read IOPS | Write IOPS | Read/Write Ratio
60 Users   | 30 minutes     | 19,687              | 10.9 | .1        | 10.8       | 1% / 99%

The average IOPS during the logon phase was less than 11 IOPS total for 60 users! That was a little over 5 IOPS per XenApp VM during the peak logon period.

Here are the steady state values.

# of Users | Session Duration | Total IO Operations | IOPS | Read IOPS | Write IOPS | Read/Write Ratio
60 Users   | 45 minutes       | 16,411              | 6    | 3.7       | 2.3        | 61% / 39%

Our average IOPS during the 45-minute steady state was 6, which works out to 3 IOPS per XenApp VM, or only .1 IOPS per user.

These are very impressive results for only 1 GB of cache on a VM that has 14 GB of RAM.  At the peak point in the test, when each XenApp VM had 30 active LoginVSI Medium sessions running, the VM had committed only a little over 10 GB of RAM.  So I decided to increase the amount of RAM cache to 3 GB and rerun the test.

Test#2 PVS XenApp Target with 3 GB RAM Cache

I ran the 3 GB RAM test with the same settings as the previous test with the exception of having LoginVSI launch a session every 20 seconds instead of every 30 seconds.

# of Users | Logon Duration | Total IO Operations | IOPS | Read IOPS | Write IOPS | Read/Write Ratio
60 Users   | 20 minutes     | 1,673               | 1.4  | .1        | 1.3        | 7% / 93%

Here are the steady state results.

# of Users | Session Duration | Total IO Operations | IOPS | Read IOPS | Write IOPS | Read/Write Ratio
60 Users   | 45 minutes       | 7,947               | 2.95 | .05       | 2.9        | 2% / 98%

The IOPS results with only 1 GB of cache per XenApp VM were quite impressive; however, by increasing the cache to 3 GB, we were able to reduce the IOPS even further, bringing the average for an entire XenApp VM hosting 30 users to less than 2 total IOPS.

Now that we have seen the results from the small Hyper-V server in my personal lab, what happens when we run a larger XenApp workload on a real production-quality server?

XenApp 6.5 2008 R2 hosted on VMware vSphere 5.5

  • 48 Core AMD Server with 512 GB RAM
  • NFS LUN connected to SAN
  • vSphere 5.5
  • 10 Windows 2008 R2 XenApp 6.5 VMs on host configured as:
    • 6 vCPU with 48 GB RAM
    • Write Cache disk for each VM on NFS LUN
    • PVS 7.1 Standard Image with RAM Cache set at 12 GB (PVS on separate host)
    • Profiles fully optimized with UPM and Folder Redirection

For this test I had 10 XenApp VMs on the host, and each VM had 48 GB RAM with the PVS RAM cache configured to allow up to 12 GB for caching.  I launched 300 LoginVSI Medium workload sessions so that each VM hosted 30 sessions.  I configured LoginVSI to launch 1 user every 8 seconds, so launching all of the sessions took about 40 minutes.

Here are the results for the logon phase.

# of Users | Logon Duration | Total IO Operations | IOPS | Read IOPS | Write IOPS | Read/Write Ratio
300 Users  | 43 minutes     | 18,603              | 7.2  | .1        | 7.1        | 1% / 99%

Here are the results for the steady state.

# of Users | Session Duration | Total IO Operations | IOPS | Read IOPS | Write IOPS | Read/Write Ratio
300 Users  | 45 minutes       | 16,945              | 6.27 | .02       | 6.25       | 1% / 99%

As you can see from the numbers above, the logon and steady state phases are nearly identical, with approximately 7 total IOPS generated for 300 users across 10 XenApp VMs.  That is less than 1 IOPS per XenApp VM.  It is obvious from these results that the PVS RAM cache never overflowed to disk, so virtually all of the I/O remained in cache.

Summary and Recommendations

As you can see from the results, the new PVS RAM Cache with Hard Disk Overflow feature is a major game changer when it comes to delivering extreme performance while eliminating the need to buy expensive SAN I/O for both XenApp and pooled VDI desktops delivered with XenDesktop.  One of the reasons this feature gives such a performance boost, even with modest amounts of RAM, is that it changes the profile of how I/O is written to disk.  A XenApp or VDI workload traditionally sends mostly 4K random write I/O to the disk. This is the hardest I/O for a disk to service and is why VDI has been such a burden on the SAN.  With this new cache feature, all I/O is first written to memory, which is a major performance boost.  When the cache memory is full, it overflows to a VHDX file on the disk, and the data is flushed using 2 MB page sizes.  VHDX with 2 MB page sizes gives us a huge I/O benefit because instead of 4K random writes, we are now asking the disk to do 2 MB sequential writes.  This is significantly more efficient and allows data to be flushed to disk with fewer IOPS.
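
To put the 4K-random versus 2 MB-sequential point in concrete terms, here is a simple illustration of how many write operations it takes to flush the same amount of overflow data under each pattern.  The helper is my own simplification: it ignores alignment and partially filled blocks, and it does not capture the additional benefit that sequential writes avoid most seek time on a spinning disk.

```python
import math

def write_ops_needed(data_mb: float, write_size_kb: float) -> int:
    """Number of write operations required to flush data_mb using a fixed write size."""
    return math.ceil(data_mb * 1024 / write_size_kb)

print(write_ops_needed(100, 4))      # 100 MB as 4 KB random writes    -> 25,600 operations
print(write_ops_needed(100, 2048))   # 100 MB as 2 MB sequential writes -> 50 operations
```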

With the exception of the 150 Windows 7 VM test above, there was no virus protection or other security software running on my VMs.  Additionally, there was no 3rd party monitoring software.  3rd party monitoring, virus protection, and other software packages are sometimes configured to write directly to the D: drive, or whatever letter is assigned to the PVS write cache disk, so that the data is not lost between reboots.  If you configure such software in your environment, you need to calculate the additional IOPS it generates and factor that into your planning.

Here are the key takeaways.

  • This solution can reduce your IOPS required per user to less than 1!!!
  • You no longer need to purchase, or even consider purchasing, expensive flash or SSD storage for VDI. This is a HUGE cost savings!  VDI can now safely run on cheap tier 3 SATA storage!
  • Every customer using PVS should upgrade and start using this new feature ASAP no matter how much RAM you have. This feature is actually required to fix issues with ASLR.
  • This feature uses non-paged pool memory and will only use it if it is free.  If your OS runs low on RAM, it will simply not allocate any more RAM for caching.  Also, a great improvement with this cache type is that it gives back memory and disk space as files/blocks in cache are deleted. In previous versions of PVS, once a block was committed to RAM or disk, it was never freed; this is no longer the case!  If you size properly, the risk of running out of RAM is very low.
  • For VDI workloads even a small amount of RAM can make a HUGE difference in performance.  I would recommend configuring at least 256 MB of cache for VDI workloads.
    • For Windows 7 32-bit VMs you should allocate 256 MB RAM for caching.
    • For Windows 7 x64 VMs with 3 – 4 GB RAM, I would recommend allocating 512 MB RAM for caching.
    • For Windows 7 x64 VMs with more than 4 GB, feel free to allocate more than 512 MB.
  • For XenApp workloads I would recommend allocating at least 2 GB for caching.  In most configurations today, you probably have much more RAM available on XenApp workloads, so for maximum performance I would allocate even more than 2 GB of RAM if possible.  For example, if you have 16 GB RAM in your XenApp VM, you should safely be able to allocate at least 4 GB for the PVS RAM Cache. Of course, you should always test first!  (See the sizing sketch after this list.)
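
As a convenience, the sizing guidance above could be captured in something like the following helper.  This is my own hypothetical encoding of the bullets, not a Citrix tool; the 1 GB value for x64 VDI VMs with more than 4 GB RAM and the "quarter of VM RAM" ceiling for XenApp are illustrative choices on my part, so always validate against your own workload.

```python
def recommended_pvs_ram_cache_mb(workload: str, vm_ram_gb: float) -> int:
    """Starting-point PVS RAM cache size (MB), loosely following the guidance in this post."""
    if workload == "vdi_x86":                      # Windows 7 32-bit
        return 256
    if workload == "vdi_x64":                      # Windows 7 x64
        return 512 if vm_ram_gb <= 4 else 1024     # >4 GB RAM: feel free to go higher (1 GB is my own pick)
    if workload == "xenapp":
        # at least 2 GB; roughly a quarter of VM RAM as an upper bound is my own assumption
        return max(2048, int(vm_ram_gb * 1024 // 4))
    raise ValueError(f"unknown workload type: {workload}")

print(recommended_pvs_ram_cache_mb("vdi_x64", 3))    # 512 MB
print(recommended_pvs_ram_cache_mb("xenapp", 16))    # 4096 MB (4 GB), matching the 16 GB example above
```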

I wish you luck as you implement this amazing new PVS feature and significantly reduce your IOPS while boosting the user experience!

Cheers,

Dan Allen