Developing Products for Citrix Hypervisor

Explore SDKs and Tools

This article was initially published by Felipe Franciosi in July, 2014:

OVERVIEW

The latest development builds of XenServer (check out the Creedence Alpha Releases) enjoy significantly superior storage performance compared to XenServer 6.2 as already mentioned by Marcus Granado in his blog post about Performance Improvements in Creedence. This improvement is primarily due to the integration of tapdisk3.

This blog post will introduce and discuss this new storage virtualisation technology, presenting results for experiments reaching about 10 GB/s of aggregated storage throughput in a single host and explaining how this was achieved.

INTRODUCTION

A few months ago I wrote a blog post on Project Karcygwins, which covered a series of experiments and investigations we conducted around storage IO. These focused on workloads originating from a single VM and applied to a single virtual disk. We were particularly interested in understanding the virtualisation overhead added to these workloads, especially on low latency storage devices such as modern SSDs. Comparing different storage data paths (e.g. blkback, blktap2) available for use with the Xen Project Hypervisor, we explained why and when any overhead would exist as well as how noticeable it could get. The full post can be read here: http://xenserver.org/blog/entry/karcygwins.html

Since then, we expanded the focus of our investigations to encompass more complex workloads. More specifically, we started to focus on aggregate throughput and what circumstances were required for several VMs to make full use of a storage array’s potential. This investigation was conducted around the new tapdisk3, developed in XenServer by Thanos Makatos. Tapdisk3 was written to have a simpler architecture, implemented entirely in user space, and leading to substantial performance improvements.

WHAT IS NEW IN TAPDISK3?

There are two major differences between tapdisk2 and tapdisk3. The first one is in the way this component is hooked up to the storage subsystem: while the former relied on blkback and blktap2, the latter connects directly to blkfront. The second major difference lies in the way data is transferred to and from guests: while the former used grant mapping and “memcpy”, the latter uses grant copy. For further details, refer to the section “Technical Details” at the end of this post.

Naturally, other changes were required to make all of this work. Most of them, however, are related to the control plane. For these, there were toolstack (xapi) changes and the appearance of a “tapback” component to connect everything up. Because of these changes (and some others regarding how tapdisk3 handles in-flight data), the dom0 memory footprint of a connected virtual disk also changed. This is currently under evaluation and may see further modifications before tapdisk3 is officially released.

 

PERFORMANCE EVALUATION

In order to measure the performance improvements achieved with tapdisk3, we selected the fastest host and the fastest disks we had available. This is the box we configured for this measurements:

 

•   Dell PowerEdge R720

•                     64 GB of RAM

•                     Intel Xeon E5-2643 v2 @3.5 GHz

•                                       2 Sockets, 6 cores per socket, hyper threaded = 24 pCPUs

•                     Turbo up to 3.8 GHz

•                     Xen Project Hypervisor governor set to Performance

•                                       Default is set to "On Demand" for power saving reasons

•                                       Refer to Rachel Berry's blog post for more information on governors

•                     BIOS set to Performance per Watt (OS)

•                     Maximum C-State set to 1

•   4 x Micron P320 PCIe SSD (175 GB each)

•   2 x Intel 910 PCIe SSD (400 GB each)

•                     Each presented as 2 SCSI devices of 200 GB (for a total of 4 devices and 800 GB)

•   1 x Fusion-io ioDrive2 (785 GB)

 

After installing XenServer Creedence Build #86278 (about 5 builds newer than Alpha 2) and the Fusion-io drivers (compiled separately), we created a Storage Repository (SR) on each available device. This produced a total of 9 SRs and about 2.3 TB of local storage. On each SR, we created 10 RAW Virtual Disk Images (VDI) of 10 GB each. One VDI from each SR was assigned to each VM in a round-robin fashion as in the diagram below. The guest of choice was Ubuntu 14.04 (x86_64, 2 vCPUs unpinned, 1024 MB RAM). We also assigned 24 vCPUs to dom0 and decided not to use pinning (see XenServer 6.2.0 CTX139714 for more information on pinning strategies).

We first measured what aggregate throughput the host would deliver when the VDIs were plugged to the VMs via the traditional tapdisk2-blktap2-blkback data path. For that, we got one VM to sequentially write for 10 seconds on all VDIs (at the same time). We observed the total amount of data transferred. This was done with requests varying from 512 bytes up to 4 MiB. Once completed, we repeated the experiment with an increasing number of VMs (up to ten). And then we did it all again for reads instead of writes. The results are plotted below:

In terms of aggregate throughput, the measurements suggest that the VMs cannot achieve more than 4 GB/s when reading or writing. Next, we repeated the experiment with the VDIs plugged with tapdisk3. The results were far more impressive:

This time, the workload produced numbers on a different scale. For writing, the aggregate throughput from the VMs approached the 8.0 GB/s mark. For reading, it approached the 10.0 GB/s mark. For some data points in this particular experiment, the tapdisk3 data path proves to be faster than tapdisk2 by ~100% when writing and ~150% when reading. This is an impressive speed up on a metric that users really care about.

 

TECHNICAL DETAILS

To understand why tapdisk3 is so much faster than tapdisk2 from a technical perspective, it is important to first review the relevant terminology and architectural aspects of the virtual storage subsystem used with paravirtualised guests and Xen Project Hypervisors. We will focus on the components used with XenServer and generic Linux VMs. Note, however, that the information below is very similar for Windows guests when they have PV drivers installed.

Traditionally, Linux guests (under Xen Project Hypervisors) load a driver named blkfront. As far as the guest is concerned, this is a driver for a normal block device. The difference is that, instead of talking to an actual device (hardware), blkfront talks to blkback (in dom0) through shared memory regions and event channels (Xen Project’s mechanism to deliver interrupts between domains). The protocol between these components is referred to as the blkif protocol.

Applications in the guest will issue read or write operations (via libc, libaio, etc) to files in a filesystem or directly to (virtual) block devices. These are eventually translated into block requests and delivered to blkfront, being normally associated with random pages within the guest’s memory space. Blkfront, in turn, will grant dom0 access to those pages so that blkback can read from or write to them. This type of access is known as grant mapping.

While the Xen Project developer community has made efforts to improve the scalability and performance of grant mapping mechanisms, there is still work to be done. This is a set of complex operations and some of its limitations are still showing up, especially when dealing with concurrent access from multiple guests. Some notable recent efforts were Matt Wilson's patches to improve locking for better scalability.

In order to avoid the overhead of grant mapping and unmapping memory regions for each request, Roger Pau Monne implemented a feature called “persistent grants” in the blkback/blkfront protocol. This can be negotiated between domains where supported. When used, blkfront will grant access to a set of pages to blkback and both components will use these pages for as long as they can.

The downside of this approach is that blkfront cannot control, which pages are going to be associated with requests that come from the guest’s block layer. It therefore needs to copy data from/to these requests to this set of persistently granted pages before passing blkif requests to blkback. Even with the added copy, persistent grants is a proven method for increased scalability in concurrent IO.

 

Both approaches presented above are entirely implemented in kernel-space within dom0. They also have something else in common: requests issued to dom0’s block layer refer to pages that actually reside in the guest’s memory space. This can trigger a potential race condition when using network-based storage (e.g. NFS and possibly iSCSI); if there is a network packet (which is associated to a page grant) queued for retransmission and an ACK arrives for the original transmission of that same packet, dom0 might retransmit invalid data or even crash (because that grant could either contain invalid data or have already been unmapped).

To get around this problem, XenServer started copying the pages to dom0 instead of using grants directly. This was done by the blktap2 component, which was introduced with tapdisk2 to deliver other features such as thin-provisioning (using the VHD format) and Storage Motion. In this design, blktap2 copies the pages before passing them to tapdisk2, ensuring safety for network-based back ends. The reasoning behind blktap2 was to provide a block device in dom0 that represented the VDI as a full-provisioned device despite its origins (e.g. a thin-provisioned file in an NFS mount).

As we saw in the measurements above, this approach has its limitations. While it works well for a variety of storage types, it fails to scale in terms of performance with modern technologies such as several locally-attached PCIe SSDs. To respond to these changes in storage technologies, XenServer Creedence will include tapdisk3 which makes use of another approach: grant copy.

With the introduction of the 3.x kernel series to dom0 and consequently the grant device (gntdev), we were able to access pages from other domains directly from dom0’s user space (domains are still required to explicitly grant proper access through the Xen Project Hypervisor). This technology allowed us to implement tapdisk3, which uses the gntdev and the event channel device (evtchn) to communicate directly with blkfront. However, instead of accessing pages as before, we coded tapdisk3 to use a Xen Project feature called “grant copy”.

Grant copying data is much faster than grant mapping and then copying. With grant copy, pretty much everything happens within the Xen Project Hypervisor itself. This approach also ensures that data is present in dom0, making it safe to use with network-attached backends. Finally, because all the logic is implemented in a user-space application, it is trivial to support thin-provisioned formats (e.g. VHD) and all the other features we already provided such as Storage Motion, snapshotting, fast clones, etc. To ensure a block device representing the VDI is still available in dom0 (for VDI copy and other operations), we continued to connect tapdisk3 to blktap2.

Last but not least, the avid reader might wonder why XenServer is not following the footsteps of qemu-qdisk, which implements persistent grants in user space. In order to remain safe for network-based backends (i.e. with persistent grants, requests would be associated with grants for pages that actually lie in guests’ memory space -- just like in Approach 2 above), qemu-qdisk disables the O_DIRECT flag to issue requests to a VDI. This causes data to be copied to/from dom0’s buffer cache (hence guaranteeing safety as requests will be associated with pages local to dom0). However, persistent grants imply that a copy has already happened in the guest and the extra copy in dom0 is simply adding on the latency of serving a request and CPU overhead. We believe grant copy to be a better alternative.

 

CONCLUSIONS

In this post I compared tapdisk2 to tapdisk3 by showing performance results for aggregated workloads from sets of up to ten VMs. This covered a variety of block sizes over read and write sequential operations. The experiment took place on a modern and fast Intel-based server using state-of-the-art PCIe SSDs. It showed tapdisk3’s superiority in terms of design and consequently performance. For those interested in what happens under the hood, I went further and compared the different virtual data paths used in Xen Project Hypervisors with focus on XenServer and Linux guests.

This is also a good opportunity to thank and acknowledge XenServer Storage Engineer Thanos Makatos’s brilliant work and effort on tapdisk3 as well as everyone else involved in the project: Keith Petley, Simon Beaumont, Jonathan Davies, Ross Lagerwall, Malcolm Crossley, David Vrabel, Simon Rowe, and Paul Durrant.

Additional Citrix Developer Learning Resources

Create your Citrix Developer account today
An account gives you access to all of the benefits of the Citrix Developer community.

You built a great solution integrating with Citrix APIs, now continue the next step of your journey with Citrix Ready.