Citrix Analytics uses machine learning (ML) to detect anomalies in user behavior and flag potentially risky users. To data scientists and cybersecurity experts, this is not rocket science. Machine learning has made it easy to spot patterns and solve problems that would otherwise have taken years to solve. As with any other hot innovation, machine learning is now touted as the panacea for everything from fighting crime to finding partners. Scratch the surface a little and you walk away with a plethora of esoteric algorithm names.
Many infrastructure vendors, especially in the storage and networking space, jumped on the machine learning bandwagon a few years ago. They hoped to use their proximity to data to provide value-added services to their customers. While some of them have done a good job of creating and capturing value through ML, most pay it little more than lip service. In this post, my second in a series on Citrix Analytics, I talk about how we are different from other analytics vendors from a data processing perspective.
BIG DISCLAIMER: I am not going to name names.
WARNING: I’ve used metaphors quite liberally to simplify my message. If you want access to our ML blueprints, join Citrix.
ADDITIONAL WARNING: If you perceive any shade in my writing, check your conscience.
The First Hurdle: Getting the Data
Building a good machine learning-based solution needs a large, dynamic, and realistic dataset. Machine learning models are trained using this dataset. Getting a large dataset is hard, but what’s harder is finding a realistic dataset. In the infrastructure space, the reason why many networking vendors have been the first to implement ML is their access to vast volumes of real data. As traditional analytics vendors need sensors and agents to gather data, some of them are acquiring or building networking capabilities to enter the lucrative business analytics market via the WAN or networking route.
Most early movers used logs and other readily available system data to train their models. Over the years, standardization of log formats, cheap computing and storage, and maturity of machine learning models have made ML-based systems very reliable. Hence, the claim to fame for most analytics vendors is their ability to ingest logs – and “do some ML” while they are at it.
The Tough Hurdle: Needle in the Haystack
ML is all about spotting and studying patterns and trends — much like looking for a needle in a haystack. For an ML-based system to be accurate and fast, data needs to be “clean.” Data points should be easily identifiable and not masked by noise, i.e., more needle and less hay. Most analytics vendors only have access to noisy data with extraneous data points. To analyze this data, they resort to brute force means, like using sophisticated machine learning algorithms and burning up a ton of compute and storage — and charging their customers for it!
Many ML-based analytics tools in the infrastructure space have the following problems:
Accuracy: Many tools have a “the boy who cried wolf” problem, meaning they raise too many false positives. This is largely because they rely on large volumes of logs, traces, crash dumps, etc. These data formats, while very valuable for troubleshooting, are very noisy. They contain too many data points, many of them irrelevant to the problem at hand: few needles, lots of hay.
Time to value: This follows from the previous point. Analyzing logs and traces can be time-consuming, both for training models and for running them on live data. Most analytics vendors take weeks to build a baseline, and even those baselines are often inaccurate, leading to false positives.
Cost: The two preceding points lead to an inevitable outcome: cost. Ingesting large logs and traces is expensive. Many vendors charge a handsome premium for ingesting and storing customer data. (Yes, I see those heads nodding in agreement.)
The Best (and Most Expensive) Option: Custom-built Data Pipelines
For infrastructure vendors entering the machine learning-based analytics space, the best approach is to build custom data pipelines for their ML stacks. This gives them complete control over the data payload and hence quick time to value, accuracy, and low operating costs. Building a custom data pipeline is not simple or cheap. Products need to be re-instrumented in a secure manner, and telemetry platforms need to be built or redesigned.
This is what we did a couple of years ago. Citrix Analytics is a labor of love and a lot more. We spent a lot of time redefining our payloads and rebuilding our data platform from scratch. The typical Citrix Analytics payload is a “tiny” 5-10 KB Azure event encoded in JSON. Compared with large log files, this payload is not just lighter; it is also more efficient to process. Every event contains at least 5 well-defined data points that the ML algorithms can process quickly. The deterministic nature of the payload eliminates the needle in the haystack problem. It also eliminates the need for compute-intensive and sophisticated ML algorithms.
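To make the contrast concrete, here is a minimal Python sketch. Everything in it is hypothetical: the field names, the sample log line, and the event shape are made up for illustration and are not the actual Citrix Analytics schema. It only shows why a small, well-defined JSON event is easier to turn into model inputs than a free-form log line carrying the same facts.

```python
import json
import re

# Hypothetical structured event. Field names are illustrative only,
# not the actual Citrix Analytics schema.
EVENT_JSON = """
{
  "event_type": "file_download",
  "user": "alice",
  "device_id": "laptop-42",
  "timestamp": "2019-03-01T10:15:00Z",
  "bytes": 1048576
}
"""

# Hypothetical free-form log line: the same facts, buried in extra fields.
LOG_LINE = (
    "2019-03-01T10:15:00Z INFO [worker-7] req=0x3fa2 cache=miss "
    "user=alice dev=laptop-42 action=file_download size=1048576 latency=12ms"
)


def features_from_event(raw: str) -> dict:
    """Structured event: every field is a named data point, so just read it."""
    event = json.loads(raw)
    return {
        "user": event["user"],
        "device_id": event["device_id"],
        "event_type": event["event_type"],
        "bytes": event["bytes"],
    }


# For the log line we first have to dig the needles out of the hay.
# This pattern is tied to one log format and breaks if that format changes.
LOG_PATTERN = re.compile(
    r"user=(?P<user>\S+) dev=(?P<device_id>\S+) "
    r"action=(?P<event_type>\S+) size=(?P<bytes>\d+)"
)


def features_from_log(line: str) -> dict:
    """Log line: locate the relevant fields, then convert their types."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("log line did not match the expected pattern")
    fields = match.groupdict()
    fields["bytes"] = int(fields["bytes"])
    return fields
```

Both functions end up with the same four features, but the log version needs a hand-written pattern per log format, while the event version is a plain dictionary lookup.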
The “tiny” payload is cost-effective for our customers as well. On average, we ingest a few hundred MB per customer per day, compared to the several GB ingested by most other analytics vendors. As a result, customers accrue huge savings on their internet bills.
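If you want to do the math yourself, the back-of-the-envelope version looks something like this. The specific figures are assumptions picked to match the rough ranges above (300 MB per day for "a few hundred MB", 5 GB per day for "several GB", and 7.5 KB as the midpoint of the 5-10 KB event size); plug in your own numbers.

```python
KB = 1000
MB = 1000 ** 2
GB = 1000 ** 3


def monthly_ingest_gb(bytes_per_day: int, days: int = 30) -> float:
    """Total data ingested over `days` days, in GB."""
    return bytes_per_day * days / GB


# Assumed daily volumes per customer (illustrative, not measured figures).
event_based_per_day = 300 * MB  # "a few hundred MB"
log_based_per_day = 5 * GB      # "several GB"

event_monthly = monthly_ingest_gb(event_based_per_day)
log_monthly = monthly_ingest_gb(log_based_per_day)
ratio = log_monthly / event_monthly

# Rough number of events per day at an assumed 7.5 KB per event.
assumed_event_size = 7500  # bytes, midpoint of 5-10 KB
events_per_day = event_based_per_day // assumed_event_size

print(f"event-based: {event_monthly:.0f} GB/month")
print(f"log-based:   {log_monthly:.0f} GB/month")
print(f"ratio:       {ratio:.1f}x")
print(f"events/day:  {events_per_day}")
```

Under these assumptions the log-based approach moves roughly 17 times more data per month, and that is before any storage or retention costs are counted.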
The bottom line here is: don’t let fancy algorithm names fool you. If you hear the words “log analysis” in a sales pitch, stop and do the math. If your analytics vendor is simply ingesting logs and applying fancy ML, chances are you will run out of your storage quota long before you see any value.
We welcome you to give Citrix Analytics a try and give us feedback. We want to know what you think! We can assure you that we do a few things, and we do them really well, in a secure and cost-effective way.