Hi all. Since this is my first post, let me introduce myself: I'm Dave Stone, a coder here at Citrix for the past 10 years or so. In that time I've worked on the Windows CE client, Program Neighborhood, the initial versions of NFuse (now Web Interface), CGP and Session Reliability, and many other features and products – some successful, some…not so much. I've also spent a lot of long days in front of a debugger figuring out why WinFrame occasionally hung on logoff, why the ICA protocol stream got corrupted (only sometimes, of course) during shadowing stress tests, etc. Ah, the memories.
Anyway, I thought it would be a good idea to share some of my technical knowledge with the Citrix community and get some interaction and feedback from customers and resellers, which I usually miss out on. I'm going to write a series of posts on one of the murkier CPS topics: Load Balancing. Load Balancing has been part of Presentation Server for a long time, and was one of the first features by which we tried to define CPS (or WinFrame, or MetaFrame) as an enterprise solution, in contrast to simpler products and technologies for desktop remoting. Load Balancing is theoretically covered by its own document (the Load Manager Administrator's Guide), but it is frankly lacking in some pretty relevant details…it may be fine as a high-level introduction, but for anyone curious about how things work more specifically or under the covers, it isn't sufficient. I've worked on a number of features that interact with CPS Load Balancing and have also fixed bugs in that subsystem, so I probably know as much about it as anyone. In this series of posts, I hope to explain some of the confusing aspects of our Load Balancing implementation, shine some light on some of the finer details, and expose how things work under the hood. I'll also pose some questions about how to improve Load Balancing to those who have more experience than I do in setting up and managing CPS deployments. This first installment will be an overview of the Load Balancing system, though hopefully one with some details that might not be found elsewhere. In later installments I'll cover more specific CPS Load Balancing topics and issues, and dig deeper into how the Load Balancing implementation works.
CPS Load Balancing is implemented as an IMA subsystem – the plugin is named Lmsss.dll. As you would expect, the purpose of CPS Load Balancing is to distribute CPS sessions among the available and applicable member servers in a CPS farm. Load Balancing typically takes place when a user launches a published application – for example, by clicking on its icon in a Web Interface page. Within Citrix engineering this point in time is usually called app resolution time, to distinguish it from the later point in time when the ICA connection is actually made to the target server and the end-user account is logged on. Web Interface (or PN Classic, or PN Agent) makes a connection to the Citrix XML Service on a designated member server. Using an XML-over-HTTP protocol, WI asks XTE which server the client should connect to in order to run the desired published application. XTE in turn queries the Load Management subsystem within IMA.
It is therefore ultimately the subsystem's responsibility to determine which member server the ICA client should connect to. This decision is actually very straightforward: first, member servers that cannot possibly host the published application are excluded – for instance, the application may simply not be published to a member server, or a member server may be temporarily disabled due to a Health Check. Among the remaining set of candidate member servers, the one that is least loaded is selected. This server's name is returned to XTE, which in turn sends it back to WI via the XML protocol. WI places the name in a .ica file and returns it to the client browser, which launches the ICA client and connects to the specified server.
So how does the Load Management subsystem determine which candidate member server is least loaded? Clearly any time-consuming technique – for example, making remote calls to each member server to determine its CPU utilization level – is not feasible; CPS application launches take too long as it is. The solution is to refer to pre-calculated load levels. Each server's load is represented by a single number, ranging from 0 (no load) to 10,000 (full load), called the Load Index. The Load Management subsystem simply picks the smallest Load Index from the set of candidate servers and directs the ICA client to the member server with that Load Index. The range 0 to 10,000 is of course arbitrary – the point is that load is reduced to a single unitless number, allowing a quick load balancing decision.
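To make the resolution decision concrete, here is a toy sketch in Python. All the names (`resolve`, the dictionary fields, the sample servers) are my own illustrations, not the actual IMA internals – but the logic (filter out ineligible servers, then pick the smallest Load Index) is exactly the algorithm described above:

```python
# Illustrative sketch of app resolution: exclude ineligible servers,
# then direct the client to the one with the smallest Load Index.
FULL_LOAD = 10_000  # Load Index ranges from 0 (no load) to 10,000 (full load)

def resolve(servers, app):
    # Exclude servers the app isn't published to, servers disabled by a
    # Health Check, and servers already reporting full load.
    candidates = [s for s in servers
                  if app in s["published_apps"]
                  and s["healthy"]
                  and s["load_index"] < FULL_LOAD]
    if not candidates:
        return None  # no member server can host this session
    # Least-loaded candidate wins.
    return min(candidates, key=lambda s: s["load_index"])["name"]

farm = [
    {"name": "SERVER1", "published_apps": {"Notepad"}, "healthy": True, "load_index": 4100},
    {"name": "SERVER2", "published_apps": {"Notepad"}, "healthy": True, "load_index": 900},
    {"name": "SERVER3", "published_apps": {"Word"},    "healthy": True, "load_index": 100},
]
print(resolve(farm, "Notepad"))  # SERVER2
```

Note that SERVER3 has the smallest Load Index in the farm, but it is never considered because the requested application isn't published to it – exclusion happens before the load comparison.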
You can see a server's Load Index in the CMC, or by using the Citrix qfarm utility. If, like me, you prefer the latter:
| Server Name    | Server Load |
| -------------- | ----------- |
| DAVIDS_SERVER1 | 200         |
| DAVIDS_SERVER2 | 800         |
The output of qfarm shown above indicates that DAVIDS_SERVER1 is more lightly loaded than DAVIDS_SERVER2. More directly, you can query the CPS dynamic store table containing each server's load statistics using the queryds utility:
| name | 414c-000c-000000c6 |
The load indices are represented in hexadecimal above. I'll explain some of the other fields in a future post.
As described above, Load Balancing seems pretty trivial. The complexity comes in when we ask the questions prompted by the simple algorithm described:
- How does a CPS server calculate its Load Index?
- When and how does a CPS server update its Load Index in the dynamic store?
The first question determines how CPS sessions will be balanced across a farm; the second question impacts the scalability of the Load Balancing subsystem. I hope to dig into more of the details of these questions in future posts; for now I'll just start with the basic source of load as CPS defines it – Load Rules.
CPS Load Rules track some underlying source of information – most commonly a performance counter – and translate its state into a load number. Multiple Load Rules can be used at the same time on the same server, though the way they are combined has always seemed arbitrary to me. The set of available Load Rules is evident in the CMC and listed in the documentation; however, both these sources are light on detail – for example, for some annoying reason they don't mention exactly which performance counter a given Load Rule is based on (it may be obvious for some, but not for others). So, I'll repeat them here with my comments:
- CPS-specific – These Load Rules measure characteristics of a server related to its function as a CPS member server.
- Server User Load – This is the most basic of the Load Rules. It derives the server load from the count of Terminal Services sessions (not just ICA sessions but RDP sessions too!) on the server. Disconnected sessions are counted the same as active sessions.
- Application User Load – This Load Rule derives the server load from the count of CPS sessions running a particular published application. Sessions not running that application are ignored. This rule can be useful for applications that make use of hardware in a particular way – for example, say an application called gizmo-writer needs exclusive access to a device on the server, and there are exactly 3 gizmos attached to the server. You could use this rule to ensure that no more than 3 instances of gizmo-writer run concurrently. More conventionally, an app may have a high and very predictable need for RAM; the Application User Load rule could be used to ensure that each instance of such an app would get its minimum requirement.
- Load Throttling – This Load Rule was introduced recently, in CPS 4.5. It is rather unfortunately named – what it actually throttles is concurrent logons, and so it really should be named Logon Throttling. Session initialization can be very demanding on a CPS server and in some scenarios leads to the sinister-sounding black-hole problem. I'll get into the details of the black-hole problem in a later post, but at a high level the Load Throttling rule just distributes new sessions evenly across the farm, rather than distributing total sessions evenly throughout the farm the way the Server User Load rule does.
- General System Load – These Load Rules are based on performance counters that measure general system activity.
- Context Switches – This Load Rule is based on the Context Switches/sec counter of the System performance object. It is probably pretty tricky to tune this Load Rule, for a number of reasons. For one, it measures context switches across all CPUs, so it is hard to come up with an absolute number that corresponds to a heavily loaded system. Similarly, it doesn't differentiate between switches between threads in the same process and switches between threads in two different processes.
- CPU Utilization – This Load Rule is based on the % Processor Time counter of the Processor performance object. This counter is averaged across the set of all logical processors on the system – because of this, systems with different numbers of CPUs can be reasonably compared to each other using this counter. This is a pretty straightforward Load Rule, though I do have a question about how it is used in deployments, which I'll ask below.
- Disk Data I/O – This Load Rule is based on the Disk Bytes/sec counter of the PhysicalDisk performance object (totaled for all disks in the system). This counter tracks disk accesses from paging activity as well as normal application activity, so it can be used to measure I/O bottlenecks.
- Disk Operations – This Load Rule is based on the sum of the Disk Reads/sec and Disk Writes/sec counters of the PhysicalDisk performance object (totaled for all disks in the system). Again, these counters track accesses from paging activity as well as normal application activity, so they can be used to measure I/O bottlenecks.
- Memory Usage – This Load Rule is based on the % Committed Bytes In Use counter of the Memory performance object. The counter measures the percentage of the total available virtual memory on the system that is currently allocated. This is a topic a lot of people get confused about: the total available virtual memory on the system is limited by the sum of RAM and the maximum size of the paging file(s). This is an important Load Rule, because if this limit is reached, subsequent memory allocations in any application will simply fail – and few applications even fail gracefully in that case.
- Page Faults – This Load Rule is based on the Page Faults/sec counter of the Memory performance object. The counter includes both soft faults (in which the requested page is still resident in RAM and can be quickly remapped) and hard faults (in which the requested page must be read in from disk, and usually the victim page it replaces must be written out to disk). Soft page faults are fairly innocuous, while hard page faults can present a real performance bottleneck – so this Load Rule is of questionable utility.
- Page Swaps – This Load Rule is based on the Pages/sec counter of the Memory performance object. The counter includes only hard page faults and so is more useful than the Page Faults Load Rule. Just to be pedantic, I'll note that the counter counts an in-page and an out-page separately, so "Swaps" isn't quite accurate, since a swap includes both operations…
- Utility Load Rules – These Load Rules implement their functionality in the Load Balancing subsystem only as a convenience; logically they belong in the CPS Policy engine. These Load Rules crank a server's load all the way to 10,000 when their conditions are met, thereby removing the server from consideration for the ICA session being load balanced.
- IP Range – This Load Rule allows or prevents clients within certain IP address ranges from establishing sessions on a CPS server.
- Scheduling – This load rule allows servers to reject all new connections at certain times of the day or week.
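The performance-counter-based rules above share a common shape: a raw counter reading has to be translated onto the 0 to 10,000 Load Index scale using the rule's configured "no load" and "full load" thresholds. Here is a minimal sketch of one plausible mapping – linear interpolation between the two thresholds. The function name and the interpolation itself are my assumptions for illustration, not necessarily what Lmsss.dll actually does:

```python
def counter_to_load(value, no_load, full_load):
    """Map a raw performance-counter reading onto the 0-10,000 Load
    Index scale using a rule's configured thresholds. Linear
    interpolation between the thresholds is an assumption here."""
    if value <= no_load:
        return 0          # at or below the "report no load" threshold
    if value >= full_load:
        return 10_000     # at or above the "report full load" threshold
    # Interpolate linearly between the two thresholds.
    return round((value - no_load) / (full_load - no_load) * 10_000)

# e.g. a CPU Utilization rule configured as "no load below 10%,
# full load above 90%": a 50% reading lands exactly mid-scale.
print(counter_to_load(50, 10, 90))  # 5000
```

The utility rules (IP Range, Scheduling) fit the same scale trivially: when their condition fires they simply report 10,000, which is why a tripped rule removes the server from consideration.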
I'll wrap this up for now…it was a bit longer than I intended. Next time I'll get into the details of exactly how a server's Load Index is calculated based on the Load Rules assigned to it (including the mysterious Load Throttling rule), and the details of when and how servers' load indices are updated in the dynamic store. For now, I'd like to ask a couple of questions of folks who have experience setting up and managing CPS deployments in the field:
- Are there any obvious Load Rules we are missing? Would it make sense to create a Load Rule that acts on an arbitrary performance counter?
- This one is a question about using CPU utilization measurements to measure and manage quality of service on a server. I don't quite understand why having CPU utilization at, say, 90% is a problem, at least in and of itself. CPU utilization of 90% means that the system is idle 10% of the time, which implies all threads that need to run are able to run, and almost immediately. Wouldn't Processor Queue Length (the number of threads waiting to run) be a more direct and meaningful measurement of degrading service on a system?