In my second post about redundancy mistakes, I'd like to discuss the peer benchmarks of the Desktop Transformation Accelerator – Storage and Provisioning layer. One major aspect of this layer is the Provisioning Services (PVS) technology, which the majority of organizations using the Accelerator rely on for desktop image delivery (as indicated below):
(In case you’re not familiar with PVS yet, check out our Provisioning Services Product Overview)
Let’s take a look at what the Accelerator tells us about the redundancy options chosen in real-world implementations.
Chart 1 – Number of Provisioning Servers per site
In 19% of projects, sites are implemented with only a single PVS server.
Why is this bad?
If this single server fails, all active target devices within that site will stop responding (a popup is shown indicating that the system is paused until the connection has been reestablished). Furthermore, no inactive target devices can be booted. One could argue that the target devices could fail over to an online PVS server in another site of the same farm. Unfortunately, that's a common misconception.
PVS was architected such that a site within a farm represents a data center, with the assumption that data centers are connected by low-bandwidth WAN links. Since streaming a vDisk is quite bandwidth-intensive, you'd never want to stream across a WAN connection (although it technically works). Therefore a site is also a failover boundary, and target devices will never automatically fail over across sites.
That's why it is highly recommended to implement a minimum of two Provisioning Servers per site and to enable the load balancing algorithm (there is no failover if a vDisk is provided by a single server only).
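The site-as-failover-boundary behavior described above can be made concrete with a short sketch. This is purely illustrative (the server and site names are hypothetical, and PVS's actual load balancing algorithm is internal to the product); it shows why a target device in a site with a single failed server is stranded even when another site has idle capacity:

```python
# Illustrative sketch: target devices only ever rebalance to online servers
# in their OWN site, never across sites (a site is a failover boundary).
# Server/site names are hypothetical; PVS's real algorithm is internal.

def pick_server(sites, site_name):
    """Return the least-loaded online server in the given site, or None."""
    candidates = [s for s in sites[site_name] if s["online"]]
    if not candidates:
        return None  # no cross-site failover: targets in this site stall
    return min(candidates, key=lambda s: s["connections"])

sites = {
    "DC-A": [
        {"name": "pvs-a1", "online": False, "connections": 0},   # failed server
        {"name": "pvs-a2", "online": True,  "connections": 120},
    ],
    "DC-B": [
        {"name": "pvs-b1", "online": True,  "connections": 10},  # idle, but in another site
    ],
}

# With two servers in DC-A, the surviving one takes over:
print(pick_server(sites, "DC-A")["name"])  # pvs-a2

# With only one (failed) server in DC-A, targets stall even though DC-B is idle:
sites["DC-A"] = [{"name": "pvs-a1", "online": False, "connections": 0}]
print(pick_server(sites, "DC-A"))  # None
```

This is exactly why a single PVS server per site leaves no failover path at all.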
Chart 2 – Database redundancy
In 31% of projects no fault tolerance has been implemented, and in 19% of projects the fault tolerance is weak (VM-level HA only).
What’s the problem here?
All configuration data of a PVS farm is stored in the farm database. An active connection to this database is required for normal PVS operation, because PVS servers do not cache any information locally by default. So if the database connection goes down, active target devices will continue to function, but no new targets can be booted and active targets cannot fail over to another server. In addition, no management functions are available.
To mitigate this risk, PVS offers a feature called "Offline Database Support", which allows PVS to use a snapshot of the database in the event that the connection to the database is lost. This option is disabled by default and is only recommended for a stable farm running in production; it is not recommended for evaluation environments or when reconfiguring farm components 'on the fly'. Even then, this option is not a complete safety net, since the following features, options, and processes remain unavailable while the database connection is lost, regardless of whether Offline Database Support is enabled:
- AutoAdd target devices
- vDisk updates
- vDisk creation
- Active Directory password changes
- Stream Process startup
- Image Update service
- Management functions: PowerShell, MCLI, SoapServer, and the Console
Therefore it is best practice to implement either SQL Clustering or SQL Mirroring, in order to ensure automatic failover and continuous service. While VM-level HA can be seen as a fault tolerance solution, its major downside is that it only kicks in if the whole SQL server fails (e.g. a blue screen), but not if "just" the SQL service fails. Furthermore, the SQL service remains down until the automatic reboot of the SQL server has completed. More information about this topic can be found in eDocs – Managing for Highly Available Implementations.
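With SQL Mirroring, automatic reconnection works because the database client is told about both partners up front via the standard "Failover Partner" connection string keyword. The sketch below only builds such a connection string (the server and database names are hypothetical examples, and the function is ours, not part of any PVS API):

```python
# Sketch: a client connection string for a mirrored farm database.
# "Failover Partner" is the standard SQL Server connection string keyword
# that lets the client re-establish the session against the mirror if the
# principal becomes unavailable. Names below are hypothetical examples.

def mirrored_conn_str(principal, mirror, database):
    return (
        f"Server={principal};"
        f"Failover Partner={mirror};"
        f"Database={database};"
        "Integrated Security=True;"
    )

print(mirrored_conn_str("SQL01", "SQL02", "ProvisioningServices"))
```

Contrast this with VM-level HA, where the client has no second address to try: it simply waits for the single SQL server VM to reboot.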
Chart 3 – Backup of the SQL DB, by % of projects
In 10% of projects, the PVS DB is not backed up.
Why is this an issue, and do I need to back up my DB if I use SQL clustering?
As with all Citrix products, the DB is a vital piece of the infrastructure, which needs to be protected and handled with care. While clustering or mirroring the DB helps keep the service up in case of a single server outage, it does not protect against logical errors. So in order to be able to recover in case a hotfix or a SQL script damages the contents of the DB, a recent backup is required. Citrix best practice is to perform a full backup every day and to keep backups for up to six months, following the Grandfather-Father-Son principle.
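To make the Grandfather-Father-Son idea tangible, here is a minimal sketch of such a retention policy. The exact counts (7 daily "sons", 4 weekly "fathers", 6 monthly "grandfathers") are illustrative policy choices, not a Citrix specification:

```python
# Minimal sketch of Grandfather-Father-Son retention over daily full backups:
# keep every backup from the last week (sons), the newest backup of each of
# the last few weeks (fathers), and the newest backup of each of the last
# six months (grandfathers). Counts are example policy values.
from datetime import date, timedelta

def gfs_keep(backup_dates, today, sons=7, fathers=4, grandfathers=6):
    keep = set()
    # Sons: every daily backup from the last `sons` days
    keep.update(d for d in backup_dates if (today - d).days < sons)
    # Fathers: newest backup per ISO week, for the `fathers` most recent weeks
    by_week = {}
    for d in sorted(backup_dates):
        by_week[d.isocalendar()[:2]] = d   # later dates overwrite earlier ones
    keep.update(sorted(by_week.values(), reverse=True)[:fathers])
    # Grandfathers: newest backup per month, for the most recent months
    by_month = {}
    for d in sorted(backup_dates):
        by_month[(d.year, d.month)] = d
    keep.update(sorted(by_month.values(), reverse=True)[:grandfathers])
    return sorted(keep)

# Example: 200 consecutive daily backups ending "today"
today = date(2013, 6, 30)
backups = [today - timedelta(days=i) for i in range(200)]
kept = gfs_keep(backups, today)
print(len(backups), "->", len(kept), "backups retained")
```

The point of the scheme is that storage stays roughly constant while you still keep restore points reaching months into the past, which is what protects you from a logical error that went unnoticed for weeks.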
If you're about to start a XenDesktop project and would like to accelerate your decision-making process, create a project in the Desktop Transformation Accelerator and benefit from the input of your peers.