In Part 1 of this series, we looked at how Citrix CloudPlatform performs in high-scale environments and how certain configurations can be tuned to achieve the desired response times. In this installment, we will look at a few more advanced scenarios.
Batch VM Deployment
Batch VM deployment is a common scenario in a cloud deployment, especially in a “Desktop as a Service” use case, where a large number of virtual machines are deployed or started in a very short span of time. A typical example is people starting up their desktops when they come in to the office in the morning. So how does CloudPlatform deal with this?
Let’s consider deploying 12000 virtual machines across 2000 hosts, in batches of 1000 each, and see how the system resources respond. So what are our top concerns?
- How do CPU utilization and load behave?
- Do the DB connections spike?
- Does the CPU load increase beyond what the server can handle?
These questions need to be answered so that the CloudPlatform parameters are configured appropriately and the servers are provisioned with enough capacity to handle this kind of load.
My test setup had 2000 simulator hosts spread across 250 clusters and 120 pods. I captured the CPU utilization on one of the management servers while the 12000 VM deployment was in progress and for a few hours after the deployment. Below are two charts showing CPU utilization and load on one of the management servers.
As seen from the charts, during the deployment of 12000 virtual machines (in batches of 1000), the CPU utilization stays well within an acceptable range, given that I was using a 4-core processor for the management servers.
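To make the batching concrete, here is a minimal Python sketch of how 12000 VM deployments could be split into batches of 1000. The function names are hypothetical and the sketch only plans the batches; it does not call CloudPlatform or reproduce the actual test harness.

```python
from typing import Iterator, List


def batches(total_vms: int, batch_size: int) -> Iterator[range]:
    """Yield one range of VM indices per batch, in deployment order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_vms, batch_size):
        # The last batch may be smaller if total_vms is not a multiple
        # of batch_size.
        yield range(start, min(start + batch_size, total_vms))


def plan_deployment(total_vms: int = 12000, batch_size: int = 1000) -> List[range]:
    """Return the batches for the scenario discussed in this post."""
    return list(batches(total_vms, batch_size))


if __name__ == "__main__":
    plan = plan_deployment()
    print(f"{len(plan)} batches, {sum(len(b) for b in plan)} VMs in total")
```

In a real run, each batch would be handed to whatever client fires the deploy calls, while CPU utilization and load are sampled on the management server in parallel.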
Steady state is the state where no external APIs are fired and the management servers are busy just orchestrating the cloud. This is an important point: the management servers are not serving external requests all the time, but they are orchestrating the cloud all the time. How efficiently this is done directly impacts the performance of external APIs.
For a deployment of this scale, each management server needs:
- At least 5GB RAM for the Java process (to be set in /etc/cloudstack/management/tomcat6.conf)
- A 4-core CPU at 2GHz
- cloud.maxActive set appropriately as the number of compute nodes increases. We normally increase it by 250 for every 2000 hosts
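The cloud.maxActive rule of thumb can be written as a small calculation. This is a sketch only: the starting value of 250 is an assumption (check the value your installation ships with), and the rule is applied in whole steps of 2000 hosts, which is one possible reading of the guideline.

```python
def suggested_max_active(
    hosts: int,
    base: int = 250,            # assumed starting value, not from this post
    step: int = 250,            # increase per block of hosts (from this post)
    hosts_per_step: int = 2000  # size of a block of hosts (from this post)
) -> int:
    """Return a suggested cloud.maxActive value for a given host count."""
    if hosts < 0:
        raise ValueError("hosts must be non-negative")
    # Add one step for every full block of hosts_per_step hosts.
    return base + step * (hosts // hosts_per_step)


if __name__ == "__main__":
    for n in (500, 2000, 4000):
        print(n, "hosts ->", suggested_max_active(n))
```

Treat the result as a starting point and confirm it against the DB connection counts you actually observe under load.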
Hardware failures are inevitable, and they always seem to happen at the wrong time. The good news for CloudPlatform users is that failures are easy to manage and failovers are automatic. It’s always advisable to have at least two management servers running, to share the load of course, but also to handle failovers.
For a cloud of this scale (2K hosts and 15K running VMs, managed by 2 management servers), how long does it take for one management server to take over if the other goes down? Here’s what I saw:
- When one of the management servers went down, it took close to 5 minutes for its hosts to rebalance to the other management server (MS).
- Once the MS was brought back online, the hosts were rebalanced again between the 2 management servers; this took around 13 minutes.
Note, though, that the cloud stays up and keeps serving requests throughout; the only operations impacted are API calls that hit a host which is being rebalanced. Even in that case, the job is retried and completed. Everything else in the cloud runs just fine during failovers.
List APIs with user-defined parameters
As a final topic in this installment, I would like to discuss slightly more advanced list API response-time metrics than the ones we covered in the last post. In Part 1 of this series, we saw the response times for various list API calls with pagesize set to the maximum number of objects of that resource in the cloud. A more common scenario, however, is calling list APIs with specific parameters depending on the user’s requirements, for example:
- Listing all VMs running in a host
- Listing all storage pools in a cluster
- Listing the networks of an account
and so on…
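As an illustration, the three filtered calls above could be expressed as query URLs like the ones built below. This is a hedged sketch: the endpoint, IDs and account name are placeholders, the parameter names (hostid, clusterid, account, domainid) should be verified against the API reference for your release, and a real request also needs an API key and signature, which are left out here.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only; replace with your own.
API_ENDPOINT = "http://mgmt-server.example.com:8080/client/api"


def build_list_url(command: str, **filters: str) -> str:
    """Build an unsigned list API URL with the given filter parameters."""
    params = {"command": command, "response": "json", **filters}
    # Sort the parameters so the generated URL is deterministic.
    return f"{API_ENDPOINT}?{urlencode(sorted(params.items()))}"


if __name__ == "__main__":
    urls = [
        # All VMs running on one host
        build_list_url("listVirtualMachines", hostid="HOST-UUID"),
        # All storage pools in one cluster
        build_list_url("listStoragePools", clusterid="CLUSTER-UUID"),
        # The networks of one account
        build_list_url("listNetworks", account="demo-account",
                       domainid="DOMAIN-UUID"),
    ]
    for url in urls:
        print(url)
```

Timing each of these calls against a large setup gives the kind of per-filter response time discussed in this section.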
In a large-scale setup such as the one we are discussing, these calls could expose potential issues. I ran some tests on this and here are the results:
As we can see, the results look fantastic with CloudPlatform! Most of the calls took around 1 second to respond. This matters because it is not just about plain listing of resources, but also about retrieving resources based on various conditions depending on the use case. A large part of the CloudPlatform UI also relies on these filtered APIs, so they directly impact UI performance too.
As is evident from the above tests and charts, you can stretch your cloud any way you want, and CloudPlatform makes it seamless and easy. And we here at Citrix strive to make that possible in every release!
Watch this space for the next part of this series, where I’ll talk about network performance, specifically the performance of the ever-so-popular Virtual Router. We’ll discuss various fine-tuning options and configurations to make your VR shine!