Citrix XenServer – Setting more than one vCPU per VM to improve application performance and server consolidation e.g. for CAD/3-D graphical applications

Many rich graphical applications such as CAD applications may benefit from allocating more than 1 vCPU per VM (1 vCPU is the default) via XenCenter or the command line. This will be of interest to many of those evaluating XenDesktop and XenApp HDX 3D Pro GPU pass-through and GPU sharing technologies and can lead to noticeable performance enhancements. The performance gains possible though are highly application specific and users should evaluate and benchmark for their particular application and user workflows.

Overprovisioning

Virtualisation allows users to define more virtual vCPUs than there are physical CPUs (pCPUs) available. A key driver for server consolidation is to virtualise workloads which spend much of their time idle. When virtualising, the expectation is often that when one of the VMs requires increased processing resources, several other of the VMs on the same host will be idle. This type of statistical multiplexing enables the total number of vCPUs across all VMs on a host to be potentially much greater than the number of physical cores.

In the context of virtualising graphical workloads (such as those that benefit from GPU-accelerations) this means you really should look at the demographics and usage patterns of your user base – I blogged about this a few months ago. If you are streaming video or similar constantly to all users there may be little scope to overprovision but if your users use applications in ways that use the CPUs in bursts there is often significant scope to consolidate resources.

Reservation of Capacity for Dom0

Before proceeding, it is worth noting that the XenServer scheduler ensures that dom0 cannot be starved of processing resources, given that it is crucial for normal operation of the platform. The scheduler ensures that dom0’s CPUs are allocated at least as much real CPU cycles as they would receive if they had a dedicated physical core. In other words, today, we ensure that dom0 uses the entirety of 4 physical cores, if it needs them. The maximum amount of CPU resource that dom0 can use is the equivalent of half the number of physical cores on the system.

This means that on a heavily loaded host, capacity for VMs will be the number of physical cores, minus 4.

vCPU Consolidation Ratio

As a rule of thumb, the ideal consolidation ratio will mean that ~80% of physical CPU utilisation is achieved under normal operation, to leave a 20% margin for bursts of activity. Therefore, the consolidation ratio that can be achieved is highly variable. It is conceivable that a 10:1 consolidation ratio is not _theoretically_ unreasonable.

Having said this, very high consolidation ratios can result in unexpected performance impacts. For example, when attempting to use 150 vCPUs on a 54 physical core box, 4 cores will go to dom0. This then means that we have a consolidation ratio of 3:1 (quite low). However, if all the CPUs are reasonably heavily loaded, this means that each vCPU will receive a 30 millisecond time slice, then have to wait for 90 milliseconds before its next time slice. The higher the consolidation ratio, the longer the interval between the time slices.

This waiting can impact aspects such as TCP network throughput, because packets appear to take far longer than expected to arrive at the VM. In the worst case, with very high consolidation ratios, if a VM receives a time slice extremely infrequently (once every few seconds) it will appear to freeze.

There is no absolute limit/number for the vCPU consolidation ratio. Clearly the greater the ratio, the greater the expectation is that the VMs are going to spend a greater fraction of their lives idle. The way to understand whether a consolidation ratio is appropriate on a running system is to examine the overall physical utilisation of the host.

cores-per-socket

Windows client operating systems support a maximum of two CPU sockets e.g. Windows 7 can only use 2 sockets. XenServer’s default behaviour is to present each vCPU with 1 core per socket; i.e. 4 vCPUs would appear as a 4 socket system. Within XenServer you can set the number of cores to present in a socket. This setting is applied per guest VM. With this setting you can tell XenServer to present the vCPUs as a 4 core socket.

[root@xenserver ~]# xe vm-param-set uuid=<vm-uuid> platform:cores-per-socket=4

VCPUs-max=4 VCPUs-at-startup=4

Further background can be found in http://support.citrix.com/article/CTX126524

Licensing

Users will need to assert whether any additional constraints are imposed by the licensing conditions of their graphical software and operating systems.

Maximum number of vCPUs per VM

The underlying Xen Hypervisor ultimately limits the number of vCPUs per guest VM, see here. Currently the Citrix XenServer limit is lower and defined by what is auto and regression tested, this limit is currently 16 vCPUs, you should reference the XenServer Configuration Guide for the version of XenServer you are interested in, e.g. for XS6.2, see here. It is therefore likely that higher numbers of vCPUs will work which maybe of interested to unsupported users, however for those with support we advise you stick to the supported limits. If there was sufficient demand raising the supported limit is something we would evaluate.

The XenCenter GUI currently allows users to set up to 8 vCPUs, for higher numbers users must use the command line CLI.

Anecdotal information

I’ve collected some of the feedback I’ve had from those involved with vGPU deployments and although it cannot be considered official Citrix endorsed best practice, I felt it was useful to share:

Most of the major CAE analysis apps will probably benefit from >2 cores although in some cases it depends on licensing. So Ansys, Abaqus, SolidWorks analysis module etc. all usually take as many cores as they find. Some applications will limit you to 2 cores unless you license an HPC pack or similar though.
Similarly ray tracing apps even when they are GPU accelerated will often use multiple cores (in addition to the CPU) although in current versions they may not take all cores since they want to leave some for interactivity.
We’ve had some feedback from customers doing analysis they expect applications use all cores so it runs faster, and they have noticed in a virtualized environments when CPU cores get over subscribed (please be aware of the need to understand overprovisioning above).
Our general rule is that to do good 3D interactive graphics you need a minimum of 4 vCPU cores, and for workstation cases 8 vCPU cores not because 8 will be used during graphics but because most high-end workstations are 8 cores and users and apps expect it.
Likewise during graphics all 4 vCPU cores may not be 100% busy but it’s likely that 2 probably are between the various needs of application code, graphics driver executions, operating system, virus checkers etc. So if you only have 2 vCPU cores we frequently observe a noticeable impact on performance.

How to investigate vCPU usage and server consolidation

One useful tool for investigating the performance of a system is xentop,details of xentop and other useful tools are available here. This allows you to understand the overall load on the system, and hence whether you are below the 80% load recommended above.

Over the last few versions of XenServer we’ve been increasing the number of metrics available and also improving the documentation and alerting mechanisms around these metrics. These new metrics complement those comprehensively documented in Chapter 9 of the XenServer 6.2 Administrators Guide.If you are interested in measuring the vcpu overprovisioning from the point of view of the host, you can use the host’s cpu_avg metric to see if it’s too close to 1.0 (rather than 0.8, i.e. 80%): If you are interested in measuring the vcpu overprovisioning from the point of view of a specific VM, you can use the VM’s runstate_* metrics, especially the ones measuring runnable, which should be less than 0.01 or so. These metrics can be investigated via the command line or XenCenter.

Investigating cores per socket and thread per core

The Xen hypervisor utility xl can be of some use for debug and diagnosis in XenServer environments, particularly the inquiry options such as info [-n, --numa], by which you can query hardware information such as cores_per_socket and threads_per_core and similar data that you might want to log or keep in benchmarking assessments. Further details and links to this utility are detailed on this page, alongside other useful tools for use in XenServer environments.

Fine tuning

This type of fine tuning is indeed fiddly, and we are open to user feedback on how this could be improved, including the documentation of this type of information. In fact our documentation team are specifically watching this post to see what comments it solicits. A large number of you have already written blogs and articles of your experiences of best practice in your own “real life” XenServer deployments and they are of interest to us and many customers, see:

Further Improvements for graphical applications – CPU pinning and GPU pinning

For many of these applications a pinning strategy that considers the NUMA architecture of the server can also lead to performance enhancements. It is highly application and server specific and you really do need to investigate this with your particular applications, user demographic profile and usage patterns. There is some information linked to at the bottom of this page, but really it’s one best addressed in a future blog….. in the meantime I’d be interested to hear user anecdotes if anyone has tried this though.

Topics

Products