Some of the questions most often asked by cloud admins are:
- How many resources can one create in a cloud managed by Citrix CloudPlatform?
- How far can I stretch my cloud and yet provide quality of service for my clients?
- How many Virtual Machines can I create with a set of hosts?
- How many accounts can be managed?
- How many zones can I have?
- How many VPCs can CloudPlatform handle?
- What will the response time be to list, say, 10,000 Virtual Machines spread across 500 hosts?
And so on. You get the idea.
This blog series will address some of these questions over the course of several installments. The idea is to provide hints and information to Cloud Admins to configure the cloud such that CloudPlatform can efficiently orchestrate resources and at the same time address incoming API requests with acceptable response time.
The first part focuses on the performance of the most common and basic use cases in a scaled-up environment of up to 2000 hosts. Examples include the time taken to deploy a virtual machine and the response time of important, commonly used List API queries.
Arranging real infrastructure for a setup of about 2000 hosts is obviously impractical. Hence, the most dependable way to test a cloud of this scale for performance is to use the built-in CloudPlatform Simulator. The Simulator can be used to mock resources including hosts, storage pools, virtual machines, etc., and behaves just like the actual resources in most cases.
As far as the CloudPlatform Management server is concerned, there is no major difference between an actual resource and a simulated resource. For most tests which are hypervisor independent, this serves our purpose.
The configuration considered here will be a scaled up environment with about 2000 simulator Hosts and more than 4000 accounts. Let’s consider the Redundant Virtual Router offering so that we get two routers per network.
Use Case 1
Given the above configuration what is the time taken to deploy a virtual machine?
I deployed a total of 12000 Simulator Virtual Machines and monitored the time taken to deploy each VM, from the first to the last. This test was done on version 4.3.0 of CloudPlatform and uses the following configurations for Management Servers:
Metric: Time to deploy Virtual Machine
This is the time taken for the deploy VM Async Job to complete and bring the virtual machine to running state.
Here’s a chart which shows the trend of the time taken to deploy 12000 Virtual machines. It shows the Time Taken in seconds to Deploy Virtual Machines – starting from the first to 12000th.
As seen from the graph above, the management server takes about 5-10 seconds to choose a deployment destination and deploy the VM. The spikes in the initial part of the graph are accounted for by the time taken to deploy the virtual routers in each network (two routers per network, since the RVR offering is chosen).
The other observation is that beyond around 10K VMs, the time taken is higher than for the first few VMs. This is expected, since most of the hosts are already full of running VMs and the management server spends more time looking for available hosts. The result is still well within the baselines established earlier.
Metric: Time for deployVirtualMachine API response
I’ve also measured the response time of the deploy VM API. This is different from the Async Job response time in the sense that the API response time is essentially the time taken for the management server to do the initial processing of the API and respond with a job id.
Here’s a chart which shows the response time in seconds for the deployVirtualMachine API, from the first to the 12000th VM.
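To make the difference between the two metrics concrete, here is a minimal Python sketch of how they could be measured side by side. It is not the harness used for these tests. The `submit` and `query_status` callables are hypothetical stand-ins for whatever client issues deployVirtualMachine and queryAsyncJobResult; the sketch only assumes that the first returns a job id and the second returns the job status code (0 for pending, 1 for success, 2 for failure).

```python
import time
from typing import Callable, Dict

# Async job status codes as reported by queryAsyncJobResult.
JOB_PENDING, JOB_SUCCEEDED, JOB_FAILED = 0, 1, 2


def time_async_deploy(
    submit: Callable[[], str],            # hypothetical: sends deployVirtualMachine, returns job id
    query_status: Callable[[str], int],   # hypothetical: returns the job status code for a job id
    poll_interval: float = 1.0,
    timeout: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, float]:
    """Return both metrics for one deployment, in seconds."""
    start = clock()
    job_id = submit()
    # Metric 2: time until the API call itself returns a job id.
    api_response_s = clock() - start

    while True:
        status = query_status(job_id)
        elapsed = clock() - start
        if status == JOB_SUCCEEDED:
            # Metric 1: time until the async job completes and the VM is running.
            return {"api_response_s": api_response_s, "job_completion_s": elapsed}
        if status == JOB_FAILED:
            raise RuntimeError(f"async job {job_id} failed")
        if elapsed > timeout:
            raise TimeoutError(f"async job {job_id} still pending after {timeout}s")
        sleep(poll_interval)
```

Note that the job completion time measured this way is only accurate to within one polling interval, so a shorter `poll_interval` gives finer resolution at the cost of more API calls against the management server.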
Use Case 2
Another important use case in a scaled-up environment is the time taken for the different List APIs to respond. This also directly impacts UI performance, given that these are the APIs most commonly triggered when users are viewing the UI.
A set of important List APIs were considered for this test and here’s the result.
The graph below shows the Response Time in seconds for various List API queries.
The above data is for a different effective pagesize for each List API query, depending on how many objects of that type exist.
For example, there were 12K VMs, 4K Accounts, 8K Routers, 20K Events, 2K Hosts, 4K Users and 12K Volumes in the test setup. The list queries were run without any maximum limit on the pagesize, so that each query fetches all the objects.
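A measurement like this can be sketched in a few lines of Python. This is an illustration only, not the script used for the numbers above. The `call_api` callable is a hypothetical stand-in for an API client that takes a command name and returns the list of objects in the response; the command names are the List APIs matching the object types mentioned above.

```python
import time
from typing import Callable, Dict, Sequence, Tuple

# List API commands for the object types in the test setup.
LIST_COMMANDS = (
    "listVirtualMachines",
    "listAccounts",
    "listRouters",
    "listEvents",
    "listHosts",
    "listUsers",
    "listVolumes",
)


def time_list_queries(
    call_api: Callable[[str], Sequence],   # hypothetical client: command -> list of objects
    commands: Sequence[str] = LIST_COMMANDS,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Tuple[int, float]]:
    """Run each list command once and record (object count, response time in seconds)."""
    results = {}
    for command in commands:
        start = clock()
        items = call_api(command)
        elapsed = clock() - start
        results[command] = (len(items), elapsed)
    return results
```

Recording the object count next to the response time matters here, because a query returning 20K events and one returning 2K hosts are not directly comparable on time alone.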
To set up a cloud of this scale using CloudPlatform, it is important to note that a few configurations need to be tuned so that CloudPlatform can effectively orchestrate the cloud.
A few of the tuning parameters are mentioned below.
- The test setup had 3 Management servers, each a 16 GB RAM server with Dual Core 4 Processors. It is recommended to add one management server for every 6-7K VMs, since load redistribution in case of a failover should also be taken into account.
- The Cloud Database is hosted on a remote server with 32 GB RAM and 8 processors.
- The Database buffer pool size (innodb_buffer_pool_size) was set to 80% of the RAM
- To deploy this many hosts and VMs, and for the management servers to orchestrate this many resources, the Java heap size needs to be set to at least 8 GB.
- The maximum number of active database connections (cloud.maxActive) was set to 1000 in db.properties.
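Two of the rules of thumb above are simple arithmetic, and a short Python sketch may make them easier to apply to a different setup. The function names are made up for illustration, and the default of 6000 VMs per server is just the conservative end of the 6-7K range quoted above.

```python
import math


def management_servers_needed(total_vms: int, vms_per_server: int = 6000) -> int:
    """One management server per `vms_per_server` VMs, rounded up, with a minimum of one."""
    if total_vms < 0 or vms_per_server <= 0:
        raise ValueError("total_vms must be >= 0 and vms_per_server must be > 0")
    return max(1, math.ceil(total_vms / vms_per_server))


def buffer_pool_mb(db_ram_gb: float, fraction: float = 0.8) -> int:
    """innodb_buffer_pool_size in MB, as a fraction of the database server's RAM."""
    if db_ram_gb <= 0 or not 0 < fraction <= 1:
        raise ValueError("db_ram_gb must be > 0 and fraction must be in (0, 1]")
    return int(db_ram_gb * 1024 * fraction)


if __name__ == "__main__":
    print("management servers for 12000 VMs:", management_servers_needed(12000))
    print("innodb_buffer_pool_size for 32 GB RAM: %dM" % buffer_pool_mb(32))
```

For the 12,000 VMs in this test, the rule works out to a minimum of two management servers at 6K VMs each; the test setup ran three. For the 32 GB database server, 80% of RAM comes to 26214 MB.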
This brings us to the end of the first part of this blog series. Watch out for the upcoming parts which have more metrics on the latest CloudPlatform Performance, tuning and hints to make your cloud perform better!