Citrix’s rapid move to the cloud brings with it accelerated adoption of the practices and processes needed to support the cloud services we provide. Two practices at the core of a cloud culture are automation and monitoring. They are intimately linked and mutually dependent: you can’t automate what you can’t monitor, and you can’t monitor without automation.
Monitoring therefore matters not just for automating the services we consume internally, but also for automating the infrastructure that powers our customers’ sites at scale and for meeting our Service Level Agreements (SLAs).
Central to Citrix’s mobility strategy is XenMobile, a fully functioning cloud service with many customers already signed up. To scale in the cloud and be proactive, we need an automated monitoring strategy that can alert us in real time when a customer is having difficulties. In this post, we’ll take a 10,000-foot view of XenMobile’s cloud service monitoring strategy.
If you’re interested in the XM deployment architecture and how to sign up, please read Justin’s blog post. Our monitoring strategy has many moving pieces: a combination of open-source, homegrown, and third-party solutions that support customers of different tiers. In this post, I’ll talk about a few of these tools.
At the center of it all is our command executor service. The executor is a fully distributed, highly available web application that provides hooks for pluggable modules, which let us easily add custom checks such as ICMP, SSL certificate, port, and LDAP checks. Think of it as our first line of defense: it probes a customer’s infrastructure from the outside. I’m leaving a lot of details out of this post, but suffice it to say that we check everything from the reachability of the XenMobile servers and NetScaler traffic to a customer’s internal AD/LDAP connectivity into Citrix Cloud. This list grows with every release as we continue to map out every potential point of failure. This diagram shows a super-simplified architecture of the components this post focuses on:
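Setting the diagram aside, the pluggable-check idea described above can be illustrated with a minimal Python sketch. This is not the executor's actual code: the names `CheckResult`, `CHECKS`, `register`, `port_check`, and `run_check` are all hypothetical, and only a simple TCP port check is shown. The point is just that new checks can be added by registering a function, without touching the code that runs them.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CheckResult:
    """Outcome of one check run (hypothetical shape, for illustration)."""
    check: str
    target: str
    ok: bool
    detail: str


# Registry mapping a check name to the function that implements it.
CHECKS: Dict[str, Callable[..., CheckResult]] = {}


def register(name: str):
    """Decorator that plugs a check function into the registry."""
    def wrap(func):
        CHECKS[name] = func
        return func
    return wrap


@register("port")
def port_check(host: str, port: int, timeout: float = 2.0) -> CheckResult:
    """Report whether a TCP connection to host:port can be opened."""
    target = f"{host}:{port}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return CheckResult("port", target, True, "connection accepted")
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return CheckResult("port", target, False, str(exc))


def run_check(name: str, **params) -> CheckResult:
    """Look up a registered check by name and execute it."""
    if name not in CHECKS:
        raise KeyError(f"no such check: {name}")
    return CHECKS[name](**params)
```

With this shape, something like `run_check("port", host="example.com", port=443)` returns a `CheckResult`, and an ICMP, SSL certificate, or LDAP check would be one more function decorated with `@register`.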
Every customer’s environment in the cloud is locked down for security reasons, so how can we reach into a customer’s internals to gather core metrics on their infrastructure? Our second source of information originates from within the customer’s environment via a Local Instance Collector Agent (LIMA). LIMA runs as a service on the XenMobile servers and reports telemetry out to our centralized log analyzer in the cloud. After examining this information and comparing it against thresholds set by engineering, we can determine whether a service is in an up, down, or degraded state and alert our cloud teams and the customer accordingly. The monitoring system also sends every incident that occurs to our log analyzer for historical purposes, which lets us visualize, troubleshoot, and measure trends over time. Here’s a small snippet of the dashboard our Cloud Operations engineers use:
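Apart from the dashboard, the threshold comparison described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the metric names, the threshold values, and the `classify` function are hypothetical and do not reflect the thresholds engineering actually uses.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


# Hypothetical thresholds: metric name -> (degraded_at, down_at).
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu_percent": (80.0, 95.0),
    "response_ms": (500.0, 2000.0),
}


def classify(sample: Dict[str, float],
             thresholds: Dict[str, Tuple[float, float]] = THRESHOLDS) -> State:
    """Map one telemetry sample to an up, degraded, or down state."""
    worst = State.UP
    for metric, (degraded_at, down_at) in thresholds.items():
        value = sample.get(metric)
        if value is None:
            # Assumption: a metric missing from the sample is skipped.
            # A real system might treat missing telemetry as a failure.
            continue
        if value >= down_at:
            return State.DOWN  # any metric past its hard limit wins
        if value >= degraded_at:
            worst = State.DEGRADED
    return worst
```

For example, a sample with `cpu_percent` at 85 and `response_ms` at 120 would be classified as degraded, because one metric crossed its lower threshold and none crossed its upper one.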
So, at a high level, we have check commands being executed against every customer, with the results stored in a centralized logging system for processing. The executor service also contains a component that acts as the brains of monitoring, working from this data. More on this in future posts; I’m only scratching the surface here. Stay tuned as I continue to unveil the ins and outs of our XenMobile Service monitoring strategy.