Few IT infrastructure projects are funded with unlimited money, so they inevitably involve a few compromises to ensure the infrastructure can deliver the desired service without exceeding the project budget.
Peak load in a VDI environment is generated by the booting or rebooting of virtual desktops (which I will refer to as VDAs hereafter). The boot phase of a VDA generates the vast majority of the disk IO and CPU load. A no-compromise VDI infrastructure could be deployed, where the hypervisor and storage layers have sufficient capacity to cope with the expected load from the boot phase of multiple VDAs. However, this would leave the hardware under-utilised for much of the day and cannot reasonably be justified. The VDI designer therefore has to balance the peak load during the VDA boot phase against the average performance required to support the user base once all the users are logged in.
With this in mind, I recently finished working on an interesting support case for an Enterprise customer who was experiencing repeated XenDesktop “boot storm” issues. I thought it would be useful to document the changes we made to their XenDesktop infrastructure that stabilised the infrastructure and allowed it to scale as originally planned.
The customer environment was as follows:
- NetApp storage for the VMFS datastores
- Hewlett Packard Blade servers
- VMware ESX 3.5 Update 4 as the hypervisor
- vCenter 3.5 U5 for hypervisor management
- XenDesktop 3 as the broker
- Windows Server 2003 SP2 as the server operating system
- Windows XP SP3 as the VDA operating system
- Wyse terminals as the client device
- The customer had 8 Desktop Delivery Controllers (DDCs) in the XenDesktop farm
- The customer had approximately 2000 VDAs, with plans to increase this to 2500
We first performed a full run-down of the infrastructure and the issues. We identified a number of areas that needed to be investigated:
- get some reporting from the NetApp controllers so that we had some measurement of the actual disk IO and could find out if the disk was the bottleneck
- investigate the benefits of updating vCenter to Update 6, as the customer suspected there was a known issue with Update 5
- figure out if we can reduce the load that XenDesktop places on the hypervisor
- look at the roles performed by the DDCs in the XenDesktop farm
Analysis of the performance of the NetApp controllers showed that at peak logon times the controllers were very heavily loaded. CPU utilisation on the ESX hosts was also high. Clearly we needed to tackle these items urgently.
Idle Pool, Maximum Transition Rate and Registration Timeout
Viewing the activity in the vSphere client we could see that a huge number of VDAs were being powered on simultaneously during the peak logon time of 8am to 9am. The question was: what could we do to control it?
The idle pool setting exists on each Desktop Group to ensure that users logging onto the XenDesktop environment receive the best experience by being connected to a VDA that has already booted and is idly waiting for an incoming connection. In this way the user can be connected to their desktop in just a few seconds. As incoming users log on to idle VDAs, the Pool Management service on the farm master DDC will ask the hosting infrastructure to power on more VDAs in order to maintain the idle pool at the value the administrator has specified.
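The maintenance behaviour described above can be illustrated with a toy model. This is a conceptual sketch only, not Citrix's actual implementation; the function name and parameters are invented for illustration, and the cap corresponds to the transition rate limit discussed below.

```python
def vdas_to_power_on(idle_target, idle_count, powered_off_count, max_rate):
    """Toy model of one Pool Management maintenance cycle: top the idle
    pool back up to its target, but never issue more concurrent power-on
    commands than the transition rate cap allows."""
    shortfall = max(0, idle_target - idle_count)
    return min(shortfall, powered_off_count, max_rate)

# 60 users have just logged on against a target idle pool of 200, but
# the cap of 50 throttles how many power-on commands go out this cycle.
print(vdas_to_power_on(idle_target=200, idle_count=140,
                       powered_off_count=1000, max_rate=50))  # 50
```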
The customer knew the working patterns of the user community and we agreed that the idle pool settings were optimal for the expected user load. However there is another setting which is related to idle pool that we next reviewed: MaximumTransitionRate.
The MaximumTransitionRate setting is a semi-hidden setting that is configured in “C:\Program Files\Citrix\VmManagement\CdsPoolMgr.exe.config”.
This setting defines how many concurrent commands the Pool Management Service sends to the hypervisor. If no value is defined in the configuration file, a default of 10% of the total pool size is used. With a desktop pool of 2000 desktops (and growing), the Pool Management Service could send 200 concurrent commands to the vCenter server, placing a high CPU load on the ESX servers and a high IOPS load on the storage.
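For reference, the setting lives in the appSettings section of that .NET config file. A sketch of what the entry looks like (the value shown is the one chosen for this environment; in my experience the Citrix Pool Management service needs a restart before it picks up the change):

```xml
<configuration>
  <appSettings>
    <!-- Cap concurrent hypervisor power-on commands at 50 -->
    <add key="MaximumTransitionRate" value="50"/>
  </appSettings>
</configuration>
```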
We knew that we needed to power on 2000 VDAs between 8am and 10am, and we knew that each VDA took about 3 minutes to boot and register (see RegistrationTimeout below). That gave us 40 “power-on” slots in the 2-hour window. A bit of simple maths then told us that setting the MaximumTransitionRate to 50 would meet the needs of the incoming users (giving the best user experience) without overloading the hypervisor and storage.
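The sizing arithmetic above, written out with the figures from this case:

```python
import math

window_min = 120    # the 8am-10am power-on window
boot_min = 3        # measured boot + registration time per VDA
total_vdas = 2000

slots = window_min // boot_min        # sequential power-on slots
rate = math.ceil(total_vdas / slots)  # VDAs to start per slot

print(f"{slots} slots -> MaximumTransitionRate of {rate}")  # 40 slots -> 50
```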
When the Pool Management service sends a power-on command for a VDA, it watches for that VDA to register with one of the DDCs in the farm. If the VDA does not register within 3 minutes, XenDesktop assumes that the VDA has had a problem booting or registering and will power on another VDA in its place. As you can imagine, in a scenario where the hypervisor and storage are under heavy load, the time taken to power on and register a VDA increases to the point where the VDA no longer registers before the timer expires, and another VDA is powered on in its place. You can see how this rapidly snowballs into what some call a “boot storm” scenario, where the entire environment becomes essentially unusable.
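A toy model makes this feedback loop concrete. The numbers and the linear slowdown are invented for illustration, and this is not how Pool Management is actually implemented; the point is only that when the timeout sits below the load-inflated boot time, replacement commands compound the load.

```python
def power_on_commands(initial, timeout_min, base_boot_min, slowdown_per_vm, max_cycles):
    """Toy model of the boot-storm feedback loop: boot time rises with
    the number of VDAs booting concurrently, and every VDA that misses
    the registration timeout triggers a replacement power-on command,
    adding yet more load."""
    booting = initial
    commands = initial
    for _ in range(max_cycles):
        boot_time = base_boot_min + slowdown_per_vm * booting
        if boot_time <= timeout_min:
            break  # everything registers in time; no replacements needed
        commands += booting   # a replacement for every timed-out VDA...
        booting += booting    # ...while the stuck ones keep loading the storage
    return commands

# With a 3-minute timeout the command count explodes; one extra minute
# of headroom and the batch registers before any replacements fire.
print(power_on_commands(200, 3, 2.5, 0.005, 5))  # 6400
print(power_on_commands(200, 4, 2.5, 0.005, 5))  # 200
```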
We knew we had to tune the RegistrationTimeout value to give the VDAs a bit more time to boot, so as to avoid unnecessary commands being sent to the hypervisor.
We measured the boot and registration time of a VDA at a time when the hypervisor and storage were lightly loaded; it was consistently around 3 minutes. Note that we must include the time for the VDA to register with the DDCs, as it is VDA registration that we watch for, not just the VDA coming up on the network.
We didn’t want to set this value too high, because of course we might then mask a problem where VDAs genuinely were failing to register with the DDCs. Therefore we set it to 4 minutes.
That’s all for part 1! In the next part of this series I will discuss how we fixed various DDC farm roles to stabilise the environment. In the final part of this series I will discuss a few of the other more minor items we changed to completely resolve this boot storm problem.
You can view part 2 of this series here, and part 3 of this series here.