Big Data… making sense of the information explosion

Industry analysts routinely tout big data as one of the hottest technologies enterprises need to leverage to stay competitive in a world awash in information.

By Madhav Chinta and Bruce Franson

The amount of data in the world has been expanding rapidly and is expected to continue growing exponentially for the foreseeable future. Along with cloud and mobility, big data has become one of the buzzwords permeating the IT industry airwaves in recent years. Industry analysts like IDC, Gartner, and Forrester routinely tout big data as one of the hottest technology areas that enterprises need to understand and leverage to be competitive and successful in a world awash in information. In this article we explore the world of big data, including its business value, the market landscape, the technology associated with big data, and what it all means for Citrix.

So, just what is Big Data?

Big Data is often defined as the real-time collection, analysis, and visualization of the vast amounts of information being generated in today’s digital world. One may ask how what we now call Big Data differs from the traditional concepts of Business Intelligence and Data Mining that have been around for years. The difference is really the sheer volume of data that companies now have access to, the increasing speed at which that data is accumulating, and the huge variety of data types being generated. In addition, the traditional technologies and tools of years past were simply not able to handle this new data reality.

Coupled with this explosive increase in information, the convergence of vastly cheaper data storage arrays and continually increasing compute power capable of driving sophisticated analytical software is ushering in a new era of real-time, predictive data analysis and data visualization. Data volume, velocity, and variety are the key characteristics that industry experts often use to describe big data:

  • Volume – IDC claims that more data now traverses the internet every second than was stored across the entire internet 20 years ago.
  • Velocity – The speed at which data is created is often more significant than the amount of data, and extracting value from real-time or near-real-time data can provide huge business value to corporations.
  • Variety – As an example, the various types of unstructured data generated by social network users, such as photos, videos, tweets, status updates, and location information, are relatively new and increasing at a staggering pace.

In the IDC Predictions 2012: Competing for 2020 report, IDC Chief Analyst Frank Gens comments:

"Big Data will join mobile and cloud as the next "must have" competency as the volume of digital content grows to 2.7ZB (1ZB = 1 billion terabytes) in 2012, up 48% from 2011, rocketing toward 8ZB by 2015. There will be lots of Big Data-driven mergers and acquisitions (M&A) activity."

IDC suggests that, going forward, over 90% of this data will be unstructured… millions of connected people, billions of sensors, and trillions of transactions are now creating unprecedented amounts of data. While this unstructured data is full of valuable business information, it can be highly challenging to find, understand, and analyze. This is where the technologies, advanced analytic tools, and services around Big Data can turn information overload into competitive advantage for companies and highly valuable intelligence for business, science, medical research, and meteorology.

Big Data Market Landscape

Big Data drives rapid infrastructure changes & $232B in IT spending through 2016 – Gartner

At the Gartner Symposium/ITxpo 2012 conference this past October, Gartner projected that Big Data will drive $28 billion of worldwide IT spending in 2012, growing to $34 billion in 2013. Gartner further commented that most of the current spending goes toward adapting traditional solutions to big data demands (rapidly growing volume, unpredictable velocity, and wide variety) rather than net new software sales. Gartner Research VP Mark Beyer adds:

"Despite the hype, big data is not a distinct, stand-alone market, but represents an industry-wide market force that must be addressed in products, practices and solution delivery. In 2011, big data formed a new driver in almost every category of IT spending. Because big data's effects are pervasive, they will evolve to become a standardized requirement in leading information architectural practices, forcing older practices and technology into early obsolescence. As a result, big data will once again become 'just data' by 2020 and architectural approaches, infrastructure and hardware/software that does not adapt to this 'new normal' will be retired. Organizations resisting this change will suffer severe economic impacts."

Who are the players?

As Gartner and others have noted, Big Data is not a monolithic market space, but rather a dynamic mash-up of related disciplines and technologies that include business intelligence, data mining, data analytics, data visualization, data storage, operational infrastructure, and structured databases.

IDC predicts that the market for Big Data technology and services will reach $16.9 billion by 2015, up from $3.2 billion in 2010. This represents a growth rate of roughly 40% a year, around seven times the estimated growth rate of the overall information technology and communications business.

What value can Big Data actually deliver?

In its extensive report, Big Data: The Next Frontier for Innovation, Competition, and Productivity, the McKinsey Global Institute projects that organizations that master the use of big data will generate significant financial and competitive value across a variety of industry and public sectors.

Big Data… some real world use cases

The National Oceanic and Atmospheric Administration (NOAA) National Weather Service has been gathering and using big data for years and manages 30 petabytes of new data each year. It collects more than 3.5 billion observations per day from ships, satellites, aircraft, buoys and other sensors. NOAA uses these measurements of atmospheric, oceanographic and terrestrial data in conjunction with complex, predictive modeling techniques to generate weather-related guidance and warnings, which are used each day by private and public sector forecasters, including government agencies like the US Department of Defense and NASA.

During the recent US presidential election, one “geek” was widely recognized for his use of “big data” and advanced analytics to correctly predict the 2012 election outcome for all 50 states on his FiveThirtyEight blog. Nate Silver, a New York Times columnist and formerly a successful baseball statistician, also accurately predicted the outcome of the 2008 presidential election. He writes about his predictive analytic models and use of big data in his recent book, The Signal and the Noise: Why So Many Predictions Fail – but Some Don't.

There is an ever-growing number of medical, scientific, and business use cases emerging around big data… these include fine-tuning retail prices and inventories down to the store and customer level, optimizing the timing and routes for product shipments, predicting crime outbreaks for local police forces, analyzing product support metrics, and predicting storm damage for insurance companies.

Following are some additional examples of use cases that rely on big data concepts and technologies:

  • Financial institutions are constructing more elaborate fraud prevention models. When you receive an alert from a credit/debit card company about “unusual activity” on your card, the call arrives because these firms are churning through billions of transactions looking for user anomalies that are potentially associated with fraud or identity theft.
  • Pharmaceutical companies are tracking prescription drug side effects… when data reveals severe safety concerns, as it did with Merck’s Vioxx, the drug is withdrawn from the market.
  • Genetic scientists are cleansing and processing vast amounts of genomics data to understand the human genome and better detect, prevent, and cure life-threatening diseases.
  • Electronics manufacturers can store and analyze usage data from the devices they ship to better understand product issues occurring in the field.
  • Every time Google suggests a spelling correction for a search query, it is because millions of previous queries on the same subject used a different spelling that proved more useful.
  • Netflix takes all of its customers’ viewing habits and movie ratings and runs them through a sophisticated algorithm to power the 5-star recommendation system tailored to each subscriber.
  • Retailers are building learning models that predict individual customers’ buying preferences and make recommendations… Amazon.com does this constantly, based on the items you view and your purchase history.

Data-rich companies like Amazon, Google, and Facebook have long benefited from leveraging big data. As these lessons are learned, more and more companies will see productivity and competitive advantages by investing in big data analytics capabilities and easy-to-use decision support tools.

Big Data Technology Overview

There is a growing number of Big Data technologies used to aggregate, manipulate, manage, and analyze big data. Due to the complexity and variety of Big Data, there is no “one-size-fits-all” solution. The choice of technology depends largely on the nature of the data that needs to be stored and the kind of analysis that needs to be performed.

Apache Hadoop Ecosystem

Apache Hadoop has been the driving force behind the growth of the big data industry. Inspired by Google's MapReduce and Google File System (GFS) papers, Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is highly fault-tolerant and is designed with the assumption that hardware failure is the norm rather than the exception. It stores each file as a sequence of blocks distributed across different nodes of the cluster, and the blocks are replicated for fault tolerance. It is tuned to support very large files and high-throughput streaming access to such files.

MapReduce is a programming model for processing large data sets, introduced by Google. In Hadoop, MapReduce is a software framework that supports this programming model and makes it easy to develop applications that process vast amounts of data in parallel on large clusters in a fault-tolerant manner. The computation on a large data set is split into independent chunks and distributed across the nodes in the cluster, and the sub-results are aggregated to produce the final result. If any node is unable to perform its task due to a failure, the framework restarts the task on another node. This paradigm is being used by a number of companies to achieve massively parallel computations on their data sets.
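To make the model concrete, here is a minimal, single-machine sketch of a MapReduce-style word count in Python. Real Hadoop jobs are typically written against the Java MapReduce API or submitted via Hadoop Streaming; this only illustrates the map, shuffle, and reduce phases on made-up data.

# A minimal, self-contained sketch of the MapReduce model in plain Python.
# It only illustrates the map -> shuffle -> reduce flow on a single machine.
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs for every word in a document (the 'map' step)."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values for a key (the 'reduce' step)."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["big data is big", "data about data"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}

In a real cluster, the map and reduce functions run on many nodes in parallel, and the framework handles the shuffle, scheduling, and failure recovery.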

Apache Pig is a popular platform that makes it easy to develop applications that analyze large data sets. It consists of a high-level language known as Pig Latin for expressing data analysis and an infrastructure for evaluating these programs. Pig’s infrastructure consists of a compiler that converts a Pig Latin program into a sequence of MapReduce programs. It also performs optimizations under the covers, allowing developers to focus on semantics rather than on the intricacies of MapReduce.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL makes it easy for people with a SQL background to run queries and extract patterns and meaningful information from huge volumes of data.
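As a rough illustration, the following Python sketch submits a HiveQL query through the PyHive client (one of several ways to talk to HiveServer2); the server address, the web_logs table, and its columns are assumptions made for the example.

# Illustrative only: assumes a reachable HiveServer2 instance and the PyHive
# client (pip install pyhive); the 'web_logs' table and columns are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL reads like SQL but is compiled into MapReduce jobs over data in HDFS.
cursor.execute("""
    SELECT status_code, COUNT(*) AS hits
    FROM web_logs
    WHERE year = 2013
    GROUP BY status_code
    ORDER BY hits DESC
""")
for status_code, hits in cursor.fetchall():
    print(status_code, hits)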

The Apache Mahout project provides free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform. Mahout is a work in progress, and the number of supported algorithms is growing over time. Currently, Mahout mainly supports four use cases: (a) recommendation mining takes users’ behavior and from that tries to find items users might like; (b) clustering takes items (e.g., text documents) and arranges them into groups related by topic; (c) classification learns from existing categorized documents what documents of a specific category look like and assigns unlabeled documents to the (hopefully) correct category; (d) frequent item-set mining takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.
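The recommendation-mining use case can be illustrated with a toy, single-machine sketch in Python. This is not Mahout's API or algorithm; it just shows the underlying idea of recommending items that frequently co-occur with what a user already has, using made-up data.

# A toy illustration of recommendation mining: recommend items that frequently
# co-occur with items the user already has. Mahout provides far more
# sophisticated, distributed versions of this; the baskets below are made up.
from collections import Counter
from itertools import combinations

baskets = [
    {"XenDesktop", "XenApp", "NetScaler"},
    {"XenDesktop", "NetScaler"},
    {"XenApp", "ShareFile"},
]

# Count how often each pair of items appears together across baskets.
co_occurrence = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(owned, top_n=2):
    """Score candidate items by co-occurrence with items the user already owns."""
    scores = Counter()
    for item in owned:
        for (a, b), count in co_occurrence.items():
            if a == item and b not in owned:
                scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend({"XenDesktop"}))  # e.g. ['NetScaler', 'XenApp']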

NoSQL Databases

NoSQL (“Not Only SQL”) databases are a new class of database management systems that have risen to prominence in recent years to address the shortcomings of traditional relational database systems for storing and processing Big Data. Some of the early NoSQL databases were developed by internet companies like Google, Amazon, and Facebook, which faced challenges dealing with massive amounts of data using traditional RDBMS solutions.

Unlike traditional relational databases, NoSQL databases typically don’t give ACID (Atomicity, Consistency, Isolation, Durability) guarantees, providing only eventual consistency. Other common characteristics of NoSQL databases include a distributed architecture, fault tolerance, and strong support for scale-out. NoSQL databases are typically well suited for storing and managing large amounts of data when performance is more important than consistency.

Some of the popular NoSQL databases are MongoDB, Apache CouchDB, Apache Cassandra, Apache HBase, and Redis.
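As a small illustration of the document-oriented style many of these databases use, here is a sketch that stores and queries semi-structured records in MongoDB through the pymongo driver; the connection string, database, and collection names are hypothetical.

# A minimal sketch using MongoDB via the pymongo driver (pip install pymongo);
# the connection string, database, and collection names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["product_telemetry"]

# Documents are schemaless, JSON-like records, so varied semi-structured
# events can be stored side by side without a fixed table definition.
db.events.insert_one({
    "device_id": "abc123",
    "event": "crash",
    "component": "display-driver",
    "timestamp": "2013-01-15T10:42:00Z",
})

# Query without defining a schema up front.
for event in db.events.find({"event": "crash"}).limit(10):
    print(event["device_id"], event["component"])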

NewSQL Databases

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems while still maintaining the ACID guarantees of a traditional single-node database system. NewSQL databases can be loosely grouped into three categories:

  • New Architectures – Completely new database platforms, designed to operate on a distributed cluster of shared-nothing nodes. Examples include VoltDB, NuoDB, Clustrix, and VMware’s SQLFire.
  • MySQL Engines – Highly optimized storage engines for MySQL that provide the same interface as MySQL but scale better. Examples include TokuDB and Akiban.
  • Transparent Sharding – Systems that provide a sharding middleware layer to automatically split databases across multiple nodes. Examples include ScaleDB and ScaleBase.

Cloud Services for Big Data

For those who want to process massive amounts of data quickly but don’t want to worry about the cost or complexity of managing large clusters of machines and technologies like Hadoop, several cloud services are available, including the following.

Google BigQuery

Google BigQuery is a cloud web service that supports interactive analysis of massive data sets, up to billions of rows. It works in conjunction with Google Storage. BigQuery is based on Google’s Dremel technology and can reportedly analyze billions of rows in seconds. With this service, one can upload massive datasets and perform interactive queries using a SQL-like language.
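As a rough sketch, the following Python snippet runs an aggregate query against one of Google's public sample datasets using the google-cloud-bigquery client library; it assumes Google Cloud credentials and a default project are already configured in the environment.

# Illustrative only: requires the google-cloud-bigquery client
# (pip install google-cloud-bigquery) and configured Google Cloud credentials.
from google.cloud import bigquery

client = bigquery.Client()

# SQL-like queries run against data stored in BigQuery; no cluster to manage.
query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.corpus, row.total_words)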

Amazon Elastic MapReduce

Amazon Elastic MapReduce (EMR) is a PaaS offering Hadoop as a managed service. The managed Hadoop framework runs on Amazon EC2 and the data is stored in Amazon S3. With the Amazon EMR service, one can focus more on utilizing Hadoop to perform data analysis and not worry about setting up, managing, or tuning the Hadoop environment.

Windows Azure HDInsight

The HDInsight Service is a Windows Azure offering that provides Hadoop as a service. HDInsight also supports integration with Microsoft Office Excel 2013 and other Microsoft Business Intelligence tools.

Beyond Hadoop

Hadoop is often regarded as the de facto standard for Big Data. Although it is well suited for many kinds of analytical applications, it falls short in some areas. While many large organizations are turning to Hadoop, Google (the creator of the technologies that inspired Hadoop) and others have already moved on to newer technologies. Following is a brief description of some of the areas where newer big data technologies are emerging.

Real-time query on Hadoop

Hadoop was designed primarily for batch processing and is not really suited for real-time, ad-hoc queries. Google built a system called Dremel that supports interactive analysis of very large datasets over a shared cluster of commodity machines. According to Google’s paper on Dremel, one can run queries on multiple petabytes of data in a matter of seconds. It can work as a complement to Hadoop.

Google Dremel has inspired the development of open-source technologies like Cloudera Impala and Apache Drill which also support ad-hoc, interactive real-time queries on large data sets.

Incremental Processing

Analyzing data using Hadoop typically requires streaming over the entire data set. This is very inefficient and poses a challenge if the dataset is constantly updated. Google’s Percolator system supports large-scale incremental processing and is used by Google for its revamped search indexing system. Some open-source technologies considered comparable alternatives to Google Percolator are HBase Coprocessors and HAcid.

Big Data “In Motion”

Efficiently processing extremely large streams of continuous data and providing insights in real time is critical for many businesses. Storm is an increasingly popular open-source technology in this space. It is a distributed real-time computation system that makes it easy to reliably process unbounded streams of data. Storm is scalable, fault-tolerant, and guarantees that data will be processed. Use cases for Storm include real-time analytics, online machine learning, continuous computation, and more. Twitter acquired BackType, the company that originally built Storm. Another comparable technology in this space is Apache S4, originally developed by Yahoo.

Graph Processing

Graphs are everywhere. The Internet can be considered one big graph in which web pages are vertices and links from one page to another are edges. For social networking companies like Facebook, Twitter, and LinkedIn, social graphs, in which people and their relationships are represented as graphs, are of great importance. Graphs are also commonly used in recommendation engines. The sizes of these graphs have been growing rapidly in recent years, and efficient processing of these graphs is essential for providing relevant and personalized information to users.

Graph processing platforms that run large-scale algorithms (such as PageRank, shared connections, etc.) are becoming quite popular. Google Pregel was one of the first and most popular graph processing systems. It follows a bulk synchronous parallel model of computation for processing petabyte-scale graphs on distributed commodity machines. However, Google Pregel is not open source and is not available for use by other companies. Some open-source technologies inspired by Google Pregel are Apache Giraph, Apache Hama, and GoldenOrb. Apache Giraph leverages existing Hadoop infrastructure for graph processing.
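To give a feel for the kind of computation these platforms distribute, below is a small, single-machine PageRank sketch in Python over a made-up three-page graph; Pregel and Giraph run essentially this sort of iterative algorithm across clusters, on graphs with billions of vertices.

# A tiny, single-machine PageRank sketch over a made-up graph. Systems like
# Pregel and Giraph distribute this kind of iterative computation per vertex.
damping = 0.85
graph = {  # page -> outbound links
    "home": ["products", "support"],
    "products": ["support"],
    "support": ["home"],
}

ranks = {page: 1.0 / len(graph) for page in graph}
for _ in range(20):  # power iteration until ranks settle
    new_ranks = {page: (1 - damping) / len(graph) for page in graph}
    for page, links in graph.items():
        share = damping * ranks[page] / len(links)
        for target in links:
            new_ranks[target] += share
    ranks = new_ranks

print(ranks)  # pages with more inbound weight end up with higher rank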

Citrix and Big Data

There are applications for Big Data in pretty much every industry sector and government organization where a mountain of different kinds of data is being accumulated. The benefits of Big Data are not limited to Internet or social networking companies like Google, Amazon, Facebook, or Twitter… numerous companies and organizations are seeing benefits.

At Citrix, we have several products, services, and internal business functions that deal with vast amounts of data. Today, we store and analyze structured data in traditional relational databases. However, a lot of valuable insight could likely be extracted from some of our semi-structured or unstructured data assets… these include log files, network trace files, customer case notes, customer forums, customer communications, user preferences, user actions, user profile information, etc.

Using Big Data techniques and technologies to derive insights from these various kinds of data assets, we could better understand our customers, make data-driven decisions, and take action more quickly to improve our business and evolve our products and services.

While the following list is not by any means exhaustive, some areas where we and/or our customers could benefit from Big Data include:

  • Improving customer support
  • Better understanding customers’:
    • Pain points
    • Usage patterns
    • Unmet product requirements
    • Preferences, etc.
  • Building a complete customer profile
  • Identifying common trends and patterns across multiple customers
  • Helping customers optimize Citrix product deployments in their environments
  • Better-personalized recommendations for our customers
  • Customer churn analysis
  • Marketing
  • Optimizing data center operations
  • Improving Social and Collaboration products

On a day-to-day basis, we work in different functional areas and come across many different data assets. Big Data analysis could be used to enhance existing products and services, improve our internal functions, and build new features that give us a competitive advantage.

Answering a few questions, and putting the answers in the context of Big Data techniques and technologies, can help us and our customers a great deal across different functional areas:

  • What do we want to know about our customers?
  • What do we want to know about our product and service usage?
  • What data assets do we have (structured, semi-structured and unstructured)?
  • How are we leveraging the data assets today?
  • Could there be trends and patterns in these data assets? Trend and pattern analysis using Big Data could help!
  • Could predicting the future help? Try predictive modeling using Big Data!
  • What risks do we deal with? Risk analysis using Big Data could help mitigate them!
  • Does the data asset have customer sentiment in it? Does better understanding of customer sentiment help? Customer Sentiment Analysis using Big Data could help!
  • Are the data assets searchable? A more personalized and relevant search experience could be provided using Big Data!
  • Is it possible to provide recommendations to customers? Recommendations engines built using Big Data could help!
  • Are there possible relationships amongst seemingly unrelated things in our data assets? Big Data graph analysis could help!
  • What other data do we need to capture?

If you have any thoughts or ideas on how to make data work for our benefit and would like to discuss them further, the authors of this article, Madhav Chinta and Bruce Franson, would love to hear from you… please send us an email!