In a recent post I shared data showing that we are getting terrific performance results for XenServer on Intel Nehalem based servers. In that first formal set of tests we found that the performance bottleneck lies in the fact that the hypervisor still has to perform I/O on behalf of all guests, so the system's scaling limit is the rate at which we can scale the internal I/O stack. I postulated that we would see some impressive numbers on Nehalem based platforms using IOV-enhanced 10Gb/s NICs, and contacted our friends at Solarflare to ask if they would help run some numbers using their 10Gb/s NICs, which offer a powerful direct hardware-to-guest acceleration path. That path lets guests interact with the hardware directly, avoiding the need for the hypervisor to process I/O on their behalf.

Below is a summary of the initial findings for the Nehalem tests using XenServer 5.0 and Solarflare I/O acceleration. Thanks to Steve Pope of Solarflare for his help. It turns out that with a smart I/O architecture such as the Solarflare offload stack, where guests interact directly with I/O-safe hardware, we can dramatically change system performance and basically saturate a 10Gb/s link in both directions at the same time!
Here’s how the experiment is set up. We have 2 physical servers, A and B, connected back to back with Solarflare 10G Ethernet gear. Each server is running XenServer 5.0 Update 3 with a single Nehalem CPU exposing 8 logical cores.
To create a traffic workload between the servers we ran NetPerf TCP_STREAM pairs between Linux RHEL 5 guests (each pair spans server A and server B) and measured the aggregate throughput both with and without acceleration.
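As a rough sketch of the methodology, here is what one TCP_STREAM pair looks like; the guest IP address and run length are illustrative, not taken from our actual test harness:

```shell
# In the receiving guest (the server B side of a pair),
# start the netperf daemon:
netserver

# In the corresponding transmitting guest (server A side), run a
# TCP_STREAM test against the peer guest's IP for 60 seconds:
netperf -H 10.0.0.2 -t TCP_STREAM -l 60
```

Each stream reports its own throughput in Mbps, and the aggregate for a direction is simply the sum across that direction's streams.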
Without acceleration, the configuration used 4 guests transmitting from A to B and 4 guests from B to A. Raw results were:
- (A -> B) 1094 + 1068 + 1046 + 1128 = 4336 Mbps
- (B -> A) 1019 + 1028 + 1050 + 1021 = 4118 Mbps
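To put the unaccelerated numbers in perspective, here is a quick back-of-the-envelope check (the per-stream figures are copied from the results above):

```python
# Per-stream TCP_STREAM throughputs (Mbps) from the unaccelerated run
a_to_b = [1094, 1068, 1046, 1128]
b_to_a = [1019, 1028, 1050, 1021]

link_capacity_mbps = 10_000  # the link is 10Gb/s in each direction

for label, streams in [("A -> B", a_to_b), ("B -> A", b_to_a)]:
    total = sum(streams)
    print(f"{label}: {total} Mbps aggregate, "
          f"{total / link_capacity_mbps:.0%} of line rate")
```

Each direction sits at well under half of the 10Gb/s line rate, which is what points the finger at the software I/O path rather than the wire.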
Bottleneck: Hypervisor CPU
In other words, we confirmed the hypothesis: there is plenty more system capacity, but the hypervisor's I/O processing on behalf of the guests is the bottleneck.
With Solarflare acceleration enabled, the configuration again used 4 guests transmitting from A to B and 4 guests from B to A. Raw results were:
- (A->B) 2355 + 2318 + 2296 + 2289 = 9258 Mbps
- (B->A) 2285 + 2295 + 2315 + 2350 = 9245 Mbps
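Comparing the two runs gives a feel for how much headroom the accelerated path unlocks (again using the aggregate figures above):

```python
# Aggregate throughputs (Mbps) from the two runs above
unaccel = {"A -> B": 4336, "B -> A": 4118}
accel   = {"A -> B": 9258, "B -> A": 9245}

for direction in unaccel:
    speedup = accel[direction] / unaccel[direction]
    print(f"{direction}: {speedup:.2f}x with acceleration")
```

On top of the better-than-2x speedup, both accelerated directions are running at roughly 93% of the 10Gb/s line rate, simultaneously.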
In the accelerated scenario we have basically maxed out bidirectional I/O on a single 10Gb/s link, with only 4 guests per direction! This is awesome. I should also mention that the Solarflare architecture is remarkably clean and avoids much of the pain of dealing with SR-IOV (which deserves a full post in its own right, and I’m halfway through noodling on one).