Session Reliability, Frozen Screens and The Hourglass of Death

Introduction

I’ll start by saying I should have probably published this article a long time ago. But I didn’t because I usually only get this question about twice a year…and it was easy enough to dig up the last email from my archives and send it to the person who was asking about Session Reliability this time. But then just this past week, I got asked about it twice. And that’s when I take my response “public” and write an article about it. So here it is…the intent of this article is to talk about the Common Gateway Protocol (CGP aka Session Reliability), when it should be enabled, when it should be disabled and why you probably hate staring at hourglasses and frozen screens.

The History of CGP

I always like to understand the history of things in order to understand them better, so I thought a brief trip down memory lane was in order before we dive into CGP. As Jeff Muir describes in his “Two Port ICA” article, we developed CGP over a decade ago when Citrix was originally looking at extending the ICA protocol. Specifically, we needed a way to wrap ICA traffic and maintain the session if a network link fails. As it turns out, network speeds and connections were pretty crappy over 10 years ago and our customers were tired of constantly being disconnected from their session and having to reconnect whenever there was any type of network blip. So we requested a port from IANA, they assigned us 2598, we wrote CGP (and Secure Gateway) and the rest is history.

How CGP Works

The intent of this article is not to explain how CGP or Session Reliability works – we do a pretty good job of explaining how it works in this technote that’s been around for 9 years. So I’d review that article but I’ll also give you the quick version. When the SR feature is enabled, the ICA Client or Receiver tunnels ICA traffic inside CGP via TCP 2598. The infamous “Citrix XTE Service” is the server-side piece of code that acts as the relay, stripping away the CGP layer and then forwarding the ICA traffic to the ICA listener on 1494. The XTE Service is responsible for “buffering” traffic if the network link between the client and XA or XD server is broken or momentarily interrupted. And when the network link is down and we’re buffering traffic, we indicate this to the end-user by showing a SPINNING HOURGLASS or FROZEN SCREEN. (That was probably a mistake on our part but we’re starting to change that now with the latest releases of Receiver by using informative pop-up messages or tooltips – more on that later.) Finally, once the network connection is restored, we stop showing that wonderful “hourglass of death” (as I’ve lovingly called it for years), we flush the buffer and the end user can continue working as normal.

So we solve the problem of dropped sessions and having to constantly reconnect if there’s a network blip, but what did we do in the process by enabling CGP? Should you always enable CGP since it’s a DEFAULT setting? Or should it only be used sparingly when it makes sense? What are those scenarios where it makes sense to enable? These are the questions I get most often and what I really want to talk about in this article.
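If it helps to picture the “buffer while the link is down, flush when it comes back” behavior described above, here is a tiny Python toy model. To be clear, this is only an illustration and not how the XTE Service is actually written: the class name `BufferingRelay`, its methods and the “frame” strings are all made up for this sketch, and a real relay deals with sockets, timeouts and protocol framing that are ignored here.

```python
from collections import deque


class BufferingRelay:
    """Toy model (hypothetical, not Citrix code) of a relay that holds
    traffic while the client link is down and flushes it on recovery."""

    def __init__(self):
        self.link_up = True
        self.buffer = deque()   # data held while the link is down
        self.delivered = []     # data that has reached the client

    def send(self, data):
        """Deliver immediately if the link is up, otherwise hold the data."""
        if self.link_up:
            self.delivered.append(data)
        else:
            self.buffer.append(data)

    def link_down(self):
        """The network blip: from now on, traffic is buffered."""
        self.link_up = False

    def link_restored(self):
        """Flush buffered data in its original order, then resume as normal."""
        self.link_up = True
        while self.buffer:
            self.delivered.append(self.buffer.popleft())


relay = BufferingRelay()
relay.send("frame-1")         # link is up, delivered right away
relay.link_down()             # this is when the user would see the hourglass
relay.send("frame-2")
relay.send("frame-3")
held = len(relay.buffer)      # frames waiting for the link to come back
relay.link_restored()         # buffer is flushed, the user keeps working
```

The point of the toy is simply that nothing is lost and the session never ends from the user's point of view, but somebody has to spend memory and CPU holding that data in the meantime.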

Drawbacks of CGP and Why You Might Disable the Feature

Don’t get me wrong – there is a time and place for SR and I’ll talk about those scenarios in a minute. But I’ll also have you know that there are some (largely) undocumented downsides to using SR or enabling CGP. And our Consulting team ends up disabling it (or recommending that customers disable it) from time to time. Why?

Enabling SR increases network traffic by sending keep-alive packets much more frequently compared to the standard ICA keep-alive feature. My colleague, Brendan Lin, and I are in the process of capturing some Wireshark traces with various Receivers and XA and XD versions and we’ll do a follow-up article once we have all the data. But the last time we took a trace a few years ago, we found that enabling CGP actually sent keep-alives every 4 seconds compared to 60 seconds (default setting of ICA keep-alive). Sure, these keep-alives aren’t huge and that’s not that much extra traffic if you have 100 users. But if you have 5000 users, that can really add up and we’ve seen it reach the Mb range…not the Kb range as you might expect. We’ve also seen CGP’s overhead cause more reconnect events on high-latency networks than we see with it disabled! I also want to reiterate that things are changing in the latest versions of Receiver and XD (where we re-wrote the XTE Service), so this network traffic overhead is becoming negligible – stay tuned for Brendan’s follow-up article with more details but we are kind of amazed by the initial results.
The XTE Service consumes additional CPU cycles on the host or workstation when CGP is enabled and especially when it’s “buffering”. I’ve seen several instances where the XTE Service has practically pegged a box by itself, and after looking into it, CGP was the culprit (which leads me to my next bullet). We’ve also published technotes over the last couple years after XD debuted with guidance like “use 2 vCPUs instead of 1 vCPU if CGP/SR is enabled on the VDAs”, etc. The bottom line is it adds a bit of overhead and these extra CPU cycles can lead to decreased scalability and poor performance in certain situations.
Users don’t like pop-ups, hourglasses or frozen screens. 😉 And while I joke around about that, the real point here is that by enabling CGP we often end up masking REAL network problems. I’ve seen people mistakenly set TcpMaxDataRetransmissions too high and it’s almost impossible to detect when CGP is enabled. I’ve also seen “bullet proof” networks with redundant links and multiple ISPs have issues that are ridiculously difficult to troubleshoot or detect when CGP is enabled. The fact of the matter is Citrix often gets a bad name because users complain of “freezing” and they are staring at hourglasses while CGP is kicking in (and working by design!). And most of the time it’s not a Citrix problem at all – it very well might be a REAL network problem that needs to be taken care of. Disabling CGP usually exposes the problem much faster and results in a quick and proper resolution. Thank God we are moving away from frozen screens and spinning wheels to more informative tooltip messages in the latest versions of Receiver!
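To put the keep-alive point above into rough numbers, here is a quick back-of-the-envelope calculation in Python. The 4-second and 60-second intervals and the 5000-user count come from the discussion above; the 60 bytes per keep-alive on the wire is purely an assumption for illustration (we have not published a measured size here), and the math counts only one keep-alive per interval per user, ignoring any replies or ACKs.

```python
def keepalives_per_second(users, interval_seconds):
    """Aggregate keep-alive rate if every user sends one per interval."""
    return users / interval_seconds


def keepalive_mbps(users, interval_seconds, bytes_per_keepalive):
    """Aggregate keep-alive bandwidth in megabits per second."""
    rate = keepalives_per_second(users, interval_seconds)
    return rate * bytes_per_keepalive * 8 / 1_000_000


USERS = 5000
CGP_INTERVAL = 4              # seconds, from the older trace mentioned above
ICA_INTERVAL = 60             # seconds, default ICA keep-alive
ASSUMED_BYTES = 60            # ASSUMPTION: on-the-wire size, not a measured value

cgp_rate = keepalives_per_second(USERS, CGP_INTERVAL)   # keep-alives per second
ica_rate = keepalives_per_second(USERS, ICA_INTERVAL)
ratio = cgp_rate / ica_rate                             # how much chattier CGP is

cgp_mbps = keepalive_mbps(USERS, CGP_INTERVAL, ASSUMED_BYTES)
ica_mbps = keepalive_mbps(USERS, ICA_INTERVAL, ASSUMED_BYTES)

print(f"CGP: {cgp_rate:.0f} keep-alives/s, about {cgp_mbps:.2f} Mbps")
print(f"ICA: {ica_rate:.0f} keep-alives/s, about {ica_mbps:.2f} Mbps")
print(f"CGP sends {ratio:.0f}x as many keep-alives")
```

Whatever the real packet size turns out to be, the ratio is what matters: 60 divided by 4 means 15 times as many keep-alives with the older CGP behavior. Plug in your own user count and a packet size from your own traces before drawing any conclusions.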

So, in general, if a customer has a seemingly stable network and most users are on the LAN, it might be best to disable CGP. And if you’re using an older version of XA and legacy ICA Clients (i.e. XA 4.5 with 10.x, etc.), I’d feel even more strongly about disabling it. We’ve been doing this for years at some of our largest customers without issue. But that doesn’t mean there aren’t scenarios where it’s beneficial or where we would recommend enabling it. What are some of those scenarios?

Benefits of CGP and the Specific Scenarios Where it Should be Enabled

Roaming users on unreliable wireless/3G/4G connections. This is the most obvious use case or scenario and why we invented it – for police officers that roam between different cellular networks while driving…or nurses who use COWs (computers on wheels) in hospitals to move to different rooms with different WiFi networks. Having the data on the screen at all times is important in these scenarios (even if there is an hourglass covering part of the screen). And for these users, the benefit of not being constantly disconnected and reconnected outweighs the overhead of CGP. NOTE/UPDATE: Before saying to yourself, “I have WAN or wireless scenarios so I should enable CGP”, please read my response to Shawn Bass below in the Comments section. Shawn brings up a good point about wWANs becoming more and more common, and while I agree, I think there are some things you can do before simply defaulting to CGP.
Anonymous access. This is a use case my colleague, Dan Allen, brought up that is quite interesting. He’s enabled CGP when he had to use anonymous published apps in a couple environments. Without CGP, reconnects were not possible since the connections were anonymous. But with CGP, we can identify a previously disconnected session and reconnect to the right session.
Seamless AGEE failover. This was a hotly debated topic internally a few months back…but the bottom line is if you want no downtime whatsoever when a NetScaler fails over, you need CGP to buffer the connections. Without CGP, users that are being proxied through the gateway will have to reconnect. I must say that this is a pretty rare event once you have your NetScalers up and running in production, but it does happen from time to time, and if this is critically important, then CGP may be for you.
Multi-Stream ICA. Maybe one of the worst names we’ve ever come up with since it requires CGP (without Repeater anyway). It should be called “Multi-Stream CGP” instead! But if you’re going to use this new feature to splice virtual channels and provide true QoS, then CGP is required. This may change in the future, but as of now, it does not work with regular ICA unless you have Repeater.
Latest Versions of Receiver and XenDesktop. More details will be available in Brendan’s article, but we have really improved the protocol in the latest versions of XD and Receiver – i.e. XD 5.6, 7 and Receiver 4. If you are using our latest products, we believe the overhead is almost negligible and the informative tooltips provide a better user experience if there is a network issue.

So if you are in one of the above situations, then CGP does offer some value and you probably should keep it enabled (remember, it’s enabled by default on both XA and XD so you have to explicitly disable it!). But if you don’t find yourself in one of those scenarios, it might be best to eliminate the “middleman…and go straight to the source” as Jeff Muir puts it, and disable CGP to save precious resources.

Conclusion / Wrap-Up

I hope this article helps clear up some of the confusion around Session Reliability. I’m also (selfishly) glad I finally captured all this information on paper so I can simply point to a URL instead of digging through my email archives in the future. But please let me be clear and allow me to make one final point – I am not recommending that you disable CGP. In fact, one of the other reasons I wanted to write this article was because I saw several of our field Consultants “blindly” recommending that customers disable CGP. And I would always fire back at them and say that we should not be preaching that, especially without proper justification. There is a time and place for CGP as I said earlier (and as many of you have already stated in the Comments section below). And we have really improved CGP over the last year in particular. So I hope this article creates some awareness and makes people THINK about whether CGP should be used or not. I also hope this makes more of our customers TEST with and without CGP to see how it works and if it makes sense to enable in their particular environment. That’s my dream anyway – a world where people think critically and test properly. 😉

Thanks for reading and stay tuned for a follow-up article where we dig into some traces with CGP enabled vs. disabled…I had originally planned to publish that information as part of this article, but after our first few traces, things have changed a bit and CGP looks a bit different than it did a couple years ago. Much of this has to do with the IMA-less XD architecture (5.x+) where we completely re-wrote the XTE service and supporting protocols.

Cheers, Nick

Nick Rintalan

Senior Architect, Citrix Consulting
