When implementing a Provisioning Services infrastructure, the decision about the Write Cache location is one of the most important and therefore one of the most discussed. As I have already written two blogs about this topic (you can find them here and here), I won’t bore you with any more theory. Instead, this time I’d like to share some practical real-world data from an environment where the PVS Write Cache of six virtual XenApp servers is written to the local disks of a single XenServer. This configuration is discussed quite often, as it is a very cost-effective solution (Alex Danilychev wrote a nice blog about it), but it also carries the risk of the local disks becoming a bottleneck. So in order to understand whether this configuration works for your environment as well, you need to perform some in-depth testing and analysis. Check out CTX130632 for some guidance.

For the customer in my example this configuration has worked pretty well for about three years now. Initially the customer chose this configuration for cost reasons, as a shared-storage configuration would have required them to upgrade the SAN controllers in addition to adding a large number of disks to the existing shared storage infrastructure.

The Environment

–       All XenApp servers are virtualized using XenServer 5.6 SP2

–       Rack-Mounted HP 2U Servers

  • 4 x 140GB 15k SAS disks
  • Smart Array P410i – RAID Controller
  • 512MB Battery Backed Write Cache
  • RAID 5

–       Running XenApp 6

–       30 concurrent users per XenApp server during office hours

–       More than 1,500 concurrent users in total

–       Ratio of 6 XenApp servers per XenServer Host

Gathering the Performance Data

We gathered the performance data using the standard Linux command iostat. In order to filter out the unnecessary data right away, we piped the output into awk. This gave us the following command string:

iostat -xk 15 5760 | awk 'BEGIN {print "Time Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util"} /c0d0/ {print strftime("%H:%M:%S"),$0}' >iostat.csv

In detail this command means:

–       iostat -xk 15 5760 – Runs iostat with extended statistics (output in KB) at a 15-second interval, 5760 times (= 24 hours)

–       /c0d0/ – Filters the output for block device c0d0 (the awk action also prefixes each matching line with a timestamp via strftime)

–       >iostat.csv – Writes the iostat output into the file iostat.csv
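As a quick sanity check before building any graphs, the captured file can be summarized right on the console. The snippet below is only a sketch: it assumes the column layout printed by the header in the command above, with %util as the last field of every data line, and it reads the same iostat.csv file the redirect produces:

```shell
# Average and peak disk utilization from the capture
# (assumes %util is the last field, as in the header printed by the awk filter)
awk 'NR > 1 { sum += $NF; n++; if ($NF > max) max = $NF }
     END { if (n) printf "samples: %d  avg %%util: %.1f  peak %%util: %.1f\n", n, sum/n, max }' iostat.csv
```

If the peak value reported here is already close to 100%, there is little point in prettying up the data in Excel first.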

After running this command we got back a 1 MB csv file which could be imported into Excel. From there it was pretty easy to create the following graphs (please note that we captured the data overnight, so the middle of each graph is midnight):
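One small caveat: despite the .csv extension, the file is actually space-separated. If Excel does not split the columns on import, a one-liner along these lines (illustrative, using the same file name as above) converts the delimiters to commas first:

```shell
# Squeeze runs of spaces into single commas so Excel splits the columns cleanly
tr -s ' ' ',' < iostat.csv > iostat_comma.csv
```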

The Performance Graphs

The first graph shows the IOPS caused by the PVS write cache activity of the virtual XenApp servers. As we can see, the majority of the I/O activity is writes, which is also typical of virtual desktop environments. (Please click on the graphs to enlarge them)

The second graph shows the read/write ratio in even more impressive detail.
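For anyone who wants the exact numbers behind such a graph, the overall read/write split can be computed from the same capture. This is again just a sketch that assumes the field positions of the header printed by the awk filter (r/s in column 5, w/s in column 6) and the iostat.csv file name:

```shell
# Aggregate read vs. write IOPS over the whole capture
# (r/s is field 5 and w/s field 6, per the header printed by the awk filter)
awk 'NR > 1 { r += $5; w += $6 }
     END { if (r + w) printf "reads: %.0f%%  writes: %.0f%%\n", 100*r/(r+w), 100*w/(r+w) }' iostat.csv
```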

The third graph shows the throughput in KByte/s, i.e. the actual amount of data that is read and/or written:

Finally we need some data to put the graphs into perspective. So here is the utilization chart of the local disk subsystem:

As we can see, the disk subsystem is nowhere near saturation, although it has to handle more than 800 I/Os per second during the peak logon period. This is only possible because the RAID controller is equipped with a battery-backed write cache, which allows the controller to acknowledge the write operation back to XenServer almost immediately. The actual write to disk is performed whenever utilization permits. While I don’t have any data without such a cache, I’d assume we would see a utilization of at least 90%, if not higher.

Maintenance approach

I’m sure some of you will now ask how the customer does hardware / hypervisor maintenance and how they cope with outages. The reason behind this question (for those who couldn’t follow) is that in such a configuration the virtual XenApp servers are tied to an individual XenServer, because the virtual disks attached to the virtual machines cannot be moved. So a XenMotion is not possible, and if the XenServer needs to be powered down, all resident VMs need to go down as well. The customer solved this issue with a very pragmatic approach: they simply bought two more XenServers than required to hold 100% of the load. So in case of any maintenance work they just disable logons on the respective XenApp servers. Once the last user has logged off, they are able to do whatever needs to be done without impacting any user. Of course this is not as flexible as using XenMotion, but given the cost savings mentioned earlier it was OK for them to accept that level of inflexibility.

Customer happiness

Besides showing the pure technical metrics, it is very important to answer whether the customer is happy with this configuration. Basically the answer I got was: “Yes, we’re happy with this configuration and we would recommend this solution to other customers as well. But of course this depends on the requirements of the individual environment.”

So I think this shows that a PVS write cache on local disks is a valid configuration for XenApp environments that works not only in theory. I would like to prove the same for a XenDesktop environment, but I could not get my hands on any real-world data that I also have permission to publish. So in case you would like to help me here, please leave a comment below or mail me directly.