As technologies advance, it becomes easier to shrink the footprint of large-scale desktop virtualization solutions. Processors evolve to deliver more compute power; innovations in server, storage, and networking connectivity increase I/O speeds and throughput; and software advances bring greater efficiencies and scale. On the whole these improvements can help to reduce reference architecture footprints as solutions scale to support large user populations.
Recently our team of Cisco, Citrix and EMC solution engineers built and validated a new turnkey desktop virtualization reference architecture that scales to 1000 mixed-use seats. What’s amazing is that the end-to-end solution takes less than a single rack — only 34 rack units — conserving both power and data center floor space.
The solution runs Citrix XenDesktop 7.5 on Cisco UCS B200 M3 blade servers, with EMC VNX5400 storage and the VMware vSphere ESXi 5.5 hypervisor platform. Since Citrix XenDesktop 7.5 unifies the functionality of earlier XenApp and XenDesktop releases, the architecture easily supports a mix of pooled hosted virtual Windows 7 desktops (30%) and hosted shared desktops (70%). Cisco UCS blade servers, combined with the latest generation EMC VNX arrays, create a compact, powerful, and reliable platform for the desktop virtualization mix.
To validate the architecture, we conducted extensive performance and stress testing, and documented the solution as a Cisco Validated Design (CVD). This blog gives a brief overview of the architecture, the testing we performed, and the test results. For more about the architecture, its components, and our validation tests, see the complete CVD.
Building A Compact and Scalable Architecture
The reference architecture incorporates latest generation hardware and software technologies: Citrix XenDesktop 7.5 and Provisioning Services 7.1, VMware ESXi 5.5, the Cisco Unified Computing System™ (UCS) B-Series Blade Servers, and an EMC storage system. The Cisco Unified Computing System integrates state-of-the-art x86 servers with storage interfaces and networking fabric in a fully converged data center platform. Wire-once cabling and flexible configuration capabilities ease deployment, management, and infrastructure changes that frequently take place after an initial install. The solution we tested used Cisco UCS B200 M3 blades with dual 10-core Intel® Xeon® E5-2680v2 (“Ivy Bridge”) processors and 384GB of memory. The blade servers booted via iSCSI from an EMC VNX5400 storage array. The infrastructure was 100% virtualized on VMware ESXi 5.5.
To support enterprise-level service levels, the architecture follows best practices to deliver a highly available design. As shown in the diagram below, the solution features redundant Cisco UCS blade chassis, access switches, and fabric interconnects for high throughput and greater availability. Blades are configured using an N+1 design to support XenDesktop and infrastructure servers in the case of a single blade failure. The dual chassis were populated with 2 blades for infrastructure servers, 3 blades for the VDI workload (supporting 300 users), and 5 blades for the RDS hosted shared desktop workload (supporting 700 users).
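As a rough illustration of how an N+1 design leads to these blade counts, the sketch below divides each user population by a per-blade density and adds one spare blade. The function name `blades_needed` is our own, and the densities (160 HVD and 200 HSD users per blade) are the maximum recommended densities reported in Phase 2 of this post. Treat it as a back-of-the-envelope check, not the sizing method used in the CVD.

```python
import math


def blades_needed(users: int, users_per_blade: int, spares: int = 1) -> int:
    """Blades required to host `users` at the given density, plus spare blade(s)."""
    if users_per_blade <= 0:
        raise ValueError("users_per_blade must be positive")
    return math.ceil(users / users_per_blade) + spares


# Per-blade densities come from the Phase 2 results in this post.
hvd_blades = blades_needed(users=300, users_per_blade=160)  # 2 active + 1 spare
hsd_blades = blades_needed(users=700, users_per_blade=200)  # 4 active + 1 spare
infra_blades = 2  # stated in the post, not derived here

total_blades = hvd_blades + hsd_blades + infra_blades
print(hvd_blades, hsd_blades, total_blades)
```

Under these assumptions the result matches the tested configuration: 3 blades for the VDI workload, 5 for the RDS workload, and 10 blades in total across the two chassis.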
The CVD defines a compact, affordable, and flexible solution that hosts Citrix XenDesktop 7.5 software. In addition to enhancements that improve the user experience for hosted Windows apps on mobile devices, XenDesktop 7.5 combines the functionality of previous XenApp and XenDesktop releases. It provides a single management framework and common policies to deploy both hosted virtual desktops (HVDs) and hosted shared desktop sessions (HSDs). Citrix Provisioning Services 7.1 supplies a single wizard that administrators can use to define and provision both types of desktop images.
To validate the architecture, we conducted a series of performance and stress tests. We used Login VSI 3.7 software from Login VSI Inc. (http://www.loginvsi.com) to generate load within the test environment. The software generates desktop connections, simulates application workloads, and tracks application responsiveness. During each test run, we capture metrics across the end-to-end virtual desktop lifecycle: during virtual desktop boot and user desktop login (ramp-up), user workload simulation (steady state), and user log-off.
To begin the testing, we start performance monitoring scripts to record resource consumption for the hypervisor, virtual desktop, storage, and load generation software. We then take the desktops out of maintenance mode, start the virtual machines, and wait for them to register. The Login VSI launchers initiate the desktop sessions and begin user logins, which constitutes the ramp-up phase. Once all users are logged in, the steady state portion of the test begins in which Login VSI executes the application workload (the default Login VSI Medium workload). The Medium workload represents office productivity tasks for a “normal” knowledge worker and includes operations with Microsoft Office, Internet Explorer with Flash, printing, and PDF viewing.
Login VSI loops through specific operations and measures response times at regular intervals. The response times determine Login VSImax, the maximum number of users that the test environment can support before performance degrades consistently. Because baseline response times can vary depending on the virtualization technology used, using a dynamically calculated threshold based on weighted measurements provides greater accuracy for cross-vendor comparisons. For this reason, we also configure the Login VSI software to calculate and report a VSImax Dynamic response time.
We conducted both single server and multiple server scalability tests, performing three test runs during each test cycle to verify the consistency of our results. The test phases included:
1. Determining single server scalability limits. This phase calculates Login VSImax for each scenario (HVDs or HSDs) on a single blade. In each case, user density scales until Login VSImax is reached, which occurs when CPU utilization reaches 100%.
2. Validating single server scalability under a maximum recommended density with HVD or HSD loads. The maximum recommended load for a single blade occurs when the Login VSI Average Response and VSI Index Average do not exceed the baseline plus 2000 ms and CPU utilization averages no more than 90-95% during steady state.
3. Validating multiple server scalability on a combined workload. After determining the maximum recommended density for each workload type, we combine the workloads on the full test configuration to achieve a full-scale, mixed workload result.
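The maximum recommended load criterion in phase 2 can be written as a simple check. This is a minimal sketch: the function and parameter names are hypothetical, the sample values are made up for illustration, and we take the upper end of the 90-95% CPU range as the default limit. It is not how Login VSI itself computes its scores.

```python
def within_recommended_load(
    baseline_ms: float,
    avg_response_ms: float,
    index_avg_ms: float,
    cpu_avg_pct: float,
    margin_ms: float = 2000.0,
    cpu_limit_pct: float = 95.0,
) -> bool:
    """Return True if a run stays within the recommended-load criteria.

    Both response-time averages must stay at or below baseline + margin,
    and average steady-state CPU must stay at or below the CPU limit.
    """
    limit_ms = baseline_ms + margin_ms
    return (
        avg_response_ms <= limit_ms
        and index_avg_ms <= limit_ms
        and cpu_avg_pct <= cpu_limit_pct
    )


# Made-up sample values, for illustration only (not test results).
ok_run = within_recommended_load(1500, 2900, 3100, cpu_avg_pct=92)
hot_run = within_recommended_load(1500, 2900, 3100, cpu_avg_pct=97)
print(ok_run, hot_run)
```

A stricter reading of the 90-95% range would pass `cpu_limit_pct=90.0` instead.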
Phase 1: Single Server Scalability Tests
In the first set of tests, the single server scalability tests, we determined Login VSImax first for hosted virtual desktops (HVD) and then for hosted shared desktop sessions (HSD) on a single blade. The table below summarizes the VM configurations for the HVD and HSD tests.
To find Login VSImax for hosted virtual desktops on a single blade, we used a test workload of 202 users running Windows 7 SP1 sessions under a Medium workload (including Adobe Flash content). As Figure 1 shows, Login VSImax was reached at 187 users.
Figure 1: Hosted Virtual Desktops, Single Server Results
We then looked at the scalability of hosted shared desktops (using Windows Server 2012 desktop sessions) on a single blade. We launched 256 user sessions on a single Cisco UCS B200 M3 blade and achieved a VSImax score of 235 users (Figure 2).
Figure 2: Hosted Shared Desktops, Single Server Results
With any optimally configured scalability test, CPU resources are typically the limiting factor, which is what we experienced in both the HVD and HSD single server tests. Compared to past tests of previous generation Intel Xeon “Sandy Bridge” processors, the same Cisco UCS B200 M3 blade servers with dual 10-core 2.8 GHz Intel Xeon E5-2680v2 “Ivy Bridge” processors supported about 25% greater user density. Thus, Cisco blades with the newer Intel Xeon E5-2680v2 processors enable a compact solution that supports up to 1000 users.
Phase 2: Single Server Scalability, Maximum Recommended Density
After testing single server scalability, the next step was to find the maximum recommended density for a single blade. This density represents the load that a single blade can sustain in the event of a server outage while still delivering a positive end-user experience.
For the HVD workload, the maximum recommended density was 160 hosted virtual desktops on a single Cisco UCS B200 M3 blade with 384GB of RAM. Figure 3 shows that VSImax was not reached.
Figure 3: Hosted Virtual Desktops, Single Server Results under Recommended Load
Hosted shared desktops (HSD) run on a single installed instance of a server operating system, such as Microsoft Windows Server 2012, which is shared by multiple users simultaneously. Each user receives a desktop “session” and works in an isolated memory space. For hosted shared desktops, the maximum recommended workload was 200 users for each Cisco UCS B200 M3 blade (with dual Intel Xeon E5-2680v2 processors and 256GB of RAM). On each blade we configured eight Windows Server 2012 virtual machines, each with 5 vCPUs and 24GB RAM to host 25 user sessions. Figure 4 shows that VSImax was not reached.
Figure 4: Hosted Shared Desktops, Single Server Results under Recommended Load
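The per-blade HSD configuration and the recommended densities can be cross-checked with a little arithmetic. The sketch below uses only figures quoted in this post; the variable names are ours, and the "fraction of VSImax" ratio is simply our way of expressing how far the recommended densities sit below the single server limits.

```python
# HSD blade configuration quoted in this post.
VMS_PER_BLADE = 8
VCPUS_PER_VM = 5
RAM_PER_VM_GB = 24
SESSIONS_PER_VM = 25
BLADE_RAM_GB = 256
PHYSICAL_CORES = 2 * 10  # dual 10-core processors

sessions_per_blade = VMS_PER_BLADE * SESSIONS_PER_VM
vcpus_per_blade = VMS_PER_BLADE * VCPUS_PER_VM
vm_ram_gb = VMS_PER_BLADE * RAM_PER_VM_GB
unassigned_ram_gb = BLADE_RAM_GB - vm_ram_gb  # RAM not assigned to the HSD VMs
vcpus_per_core = vcpus_per_blade / PHYSICAL_CORES

# Recommended density as a fraction of the single server VSImax.
hvd_fraction = round(160 / 187, 2)
hsd_fraction = round(200 / 235, 2)

print(sessions_per_blade, vcpus_per_blade, vm_ram_gb, unassigned_ram_gb)
print(vcpus_per_core, hvd_fraction, hsd_fraction)
```

By this arithmetic, eight VMs at 25 sessions each give the 200 users per blade, and both recommended densities land at roughly 85% of their respective VSImax scores.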
Phase 3: Full-Scale, Mixed Workload Scalability Testing
In the final phase of testing, we validated the solution at scale by launching Login VSI sessions against both HVD and HSD clusters concurrently. Cisco’s testing protocol requires that all sessions launch within 30 minutes and that all launched sessions become active within 32 minutes.
Our validation testing imposed aggressive boot and login scenarios for the 1000-seat mixed desktop workload. All HVD and HSD VMs booted and registered with the XenDesktop 7.5 Delivery Controllers in under 15 minutes, demonstrating how quickly this desktop virtualization solution could be available after a cold-start. Our testing simulated a login storm of all 1000 simulated users, yet all users logged in and started running workloads (achieving steady state) within 30 minutes without exhausting CPU, memory, or storage resources.
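To get a feel for how aggressive this login storm is, the sketch below works out the average launch interval for 1000 sessions in a 30-minute window and encodes the two protocol limits as a check. The function names are hypothetical, and evenly spaced launches are an assumption on our part; the post does not say how the launchers pace the sessions.

```python
def launch_interval_seconds(sessions: int, window_minutes: float = 30.0) -> float:
    """Average seconds between session launches, assuming evenly spaced launches."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return window_minutes * 60 / sessions


def meets_launch_protocol(
    launch_minutes: float,
    active_minutes: float,
    launch_limit: float = 30.0,
    active_limit: float = 32.0,
) -> bool:
    """True if all sessions launched and became active within the stated limits."""
    return launch_minutes <= launch_limit and active_minutes <= active_limit


interval = launch_interval_seconds(1000)
print(f"one new session roughly every {interval:.1f} s")

# Made-up timings, for illustration only (not measured results).
print(meets_launch_protocol(launch_minutes=29.5, active_minutes=31.0))
print(meets_launch_protocol(launch_minutes=29.5, active_minutes=33.0))
```

At that pace, a new simulated user logs in about every 1.8 seconds for half an hour.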
Figure 5 shows the VSImax results of the mixed workload test across the full configuration. Figures 6, 7, and 8 show CPU, memory, and network metrics collected on a representative HVD server, respectively. Figures 9, 10, and 11 display CPU, memory, and network metrics for a representative HSD server, respectively. (Storage metrics are discussed in a subsequent section.)
Figure 5: Full-Scale, Mixed Workload Scalability Results for 1000 users
Figure 6: Representative UCS B200 M3 XenDesktop 7.5 HVD Blade CPU Utilization
Figure 7: Representative UCS B200 M3 XenDesktop 7.5 HVD Blade Memory Utilization
Figure 8: Representative UCS B200 M3 XenDesktop 7.5 HVD Blade Network Utilization
Figure 9: Representative UCS B200 M3 XenDesktop 7.5 HSD Blade CPU Utilization
Figure 10: Representative UCS B200 M3 XenDesktop 7.5 HSD Blade Memory Utilization
Figure 11: Representative UCS B200 M3 XenDesktop 7.5 HSD Blade Network Utilization
The EMC VNX5400 Storage Array is a flash-optimized hybrid array. It combines disk spindles and solid state drives (SSDs) and automates tiered data movement to optimize performance. EMC Multicore FAST Cache (MCx™) technology distributes VNX data services across all of the cores available in the system, boosting storage performance for transactional applications (like desktop virtualization) at a low price point. Using MCx has an impact during the boot and login phases, dramatically decreasing IOPS.
For the test environment, the storage configuration included an EMC VNX5400 dual controller storage system with 4 disk shelves containing a mix of 200GB SSD, 600GB SAS 15K RPM, and 2TB NL-SAS 7.2K RPM drives. User home directories were configured on CIFS shares while PVS write caches for the VDI and RDS workloads were hosted on NFS volumes.
During the full-scale testing of combined VDI and RDS workloads for 1000 desktop users, we recorded average IOPS during boot, login, steady state, and log-off, shown below. Overall, the storage configuration easily handled the 1000-user density, showing an average read latency of less than 3 ms and a write latency of less than 1 ms.
After initial boot, Citrix Provisioning Services (PVS) generates the vast majority of the I/O workload as it records changes for the OS and applications in the write cache vDisks. The average write cache I/O was 8 KB in size, with more than 90% of write cache I/Os being writes. The PVS write cache showed a peak average of 10 IOPS per desktop during the login storm, with the steady state showing 15-20% fewer I/Os. Adding CIFS profile management took some of the workload off the write cache.
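These per-desktop figures can be scaled up for a rough sense of the aggregate write cache load. The sketch below multiplies the numbers quoted above across 1000 desktops. Treating the per-desktop peak average as uniform across all desktops, and every I/O as exactly 8 KB, are simplifying assumptions of ours, so the results are estimates rather than measured values.

```python
DESKTOPS = 1000
PEAK_IOPS_PER_DESKTOP = 10   # peak average during the login storm
IO_SIZE_KB = 8               # average write cache I/O size
MIN_WRITE_PCT = 90           # "more than 90%" of write cache I/Os are writes

peak_iops = DESKTOPS * PEAK_IOPS_PER_DESKTOP

# Steady state showed 15-20% fewer I/Os than the login storm peak.
steady_iops_low = peak_iops * (100 - 20) // 100
steady_iops_high = peak_iops * (100 - 15) // 100

# At least this many of the peak I/Os are writes.
min_write_iops = peak_iops * MIN_WRITE_PCT // 100

# Approximate peak throughput in MiB/s, assuming every I/O is 8 KB.
peak_throughput_mib_s = peak_iops * IO_SIZE_KB / 1024

print(peak_iops, steady_iops_low, steady_iops_high, min_write_iops)
print(f"{peak_throughput_mib_s:.1f} MiB/s")
```

Under these assumptions the write cache sees on the order of 10,000 IOPS at peak and 8,000 to 8,500 IOPS at steady state, which is a modest throughput of under 80 MiB/s given the small I/O size.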
The testing validated this Cisco, Citrix, VMware, and EMC reference architecture under a mixed XenDesktop workload of 300 VDI and 700 RDS users. The tested configuration easily supported 1000 seats without overwhelming CPU, memory, network, or storage resources. The architecture defines a highly reliable and fault-tolerant design, using N+1 blade servers for hosted virtual desktops, hosted shared desktops, and infrastructure services. Since the design takes advantage of new and evolving technologies, the entire solution can deliver enterprise levels of performance and availability for 1000 users within a space-saving footprint of just 34 rack units.
To read more about the architecture and test results, see the full CVD posted on the Cisco web site. Stay tuned! Follow our blogs to learn more about how technology innovations are helping to produce more scalable and cost-effective Citrix and Cisco desktop virtualization solutions!
— Rob Briggs, Principal Solutions Architect with Citrix Worldwide Alliances
- Mike Brennan, Manager Technical Marketing, Cisco Desktop Virtualization Solutions Team
- Frank Anderson, Senior Technical Marketing Engineer, Cisco Desktop Virtualization Solutions Team
- Ka-Kit Wong, Consultant Solutions Engineer, EMC Strategic Solutions Engineering