Big Data Technologies Ecosystem

In terms of ‘forces’ affecting the CIO Agenda, Big Data Technologies are vitally important. This is due to explosive growth in the number and variety of data sources: applications, digital media, mobile devices, users, customers, unstructured data sets, sensors, emails, blogs and more.


Data is complex and arrives in mixed formats (text, video, audio). Delivering Big Data capabilities requires on-demand infrastructure scalability (including massively scalable storage), together with robust analytics and visualisation tools and techniques for distributed, parallel systems. Increasing bandwidth availability has also driven exponential data growth, e.g. through social networks, video and microblogging.


Figure 1: A (simplified) Big Data Technologies Ecosystem by Steve Nimmons

Where do you start in formulating a reference architecture for Big Data and sourcing suppliers for a Big Data ecosystem?

Should you believe the Big Data Hype?


The Gartner Hype Cycle places Big Data on ‘the upslope’ towards the ‘peak of inflated expectations’. Big Data is, of course, already underpinning many of the web giants’ architectures (typically because necessity has been the mother of invention).


Figure 2: Gartner Hype Cycle for Emerging Tech (2011), Source: Gartner

  • Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters: an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2400 cores and about 3 PB of raw storage.
  • Yahoo! deploys more than 100,000 CPUs in over 40,000 computers running Hadoop. The biggest cluster has 4,500 nodes (2×4-CPU boxes with 4×1 TB disks and 16 GB RAM). This is used to support research for Ad Systems and Web Search, and to run scaling tests supporting the development of Hadoop on larger clusters.
  • eBay uses a 532-node cluster (8 × 532 cores, 5.3 PB) with Java MapReduce, Pig, Hive and HBase.
  • Twitter uses Hadoop to store and process tweets, log files and other data generated across Twitter, running Cloudera’s CDH2 distribution. Both Scala and Java are used to access Hadoop’s MapReduce APIs, alongside Pig, Avro, Hive and Cassandra.

Other Hadoop users include:  1&1, A9.com, About.com, Amazon.com, American Airlines, AOL, Apple, Booz Allen Hamilton, Cerner, ChaCha, comScore, EHarmony, Federal Reserve Board of Governors, foursquare, Fox Interactive Media, Freebase, Hewlett-Packard, IBM, InMobi, ImageShack, ISI, Joost, Last.fm, LinkedIn, Microsoft, Meebo, Mendeley, Metaweb, Netflix, The New York Times, Ning, Outbrain, Playdom (now part of Disney Interactive Media Group), Powerset (now part of Microsoft), Rackspace, Razorfish, StumbleUpon and Twitter.


Hadoop Overview


Figure 3: Apache Hadoop Ecosystem

The Apache Hadoop Ecosystem is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
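
To illustrate that ‘simple programming model’, here is the canonical word-count job written against Hadoop’s Java MapReduce API. This is a minimal sketch: the input and output HDFS paths are passed as command-line arguments, and class names are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework handles splitting the input, scheduling map and reduce tasks across the cluster, and re-running tasks on node failure; the developer supplies only the two functions above.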

Apache Hadoop comprises Hadoop Common, MapReduce and the Hadoop Distributed File System (HDFS), as well as sub-projects: HBase, Cassandra, Avro, Hive, Mahout, Pig, ZooKeeper and Chukwa.

Given the pervasive nature of Hadoop, it is a strong contender for any Big Data implementation. HBase is the Hadoop database; Cassandra is another NoSQL database. Mahout provides data mining and machine learning, Hive and Pig provide querying, and ZooKeeper provides coordination.
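
To make the HBase role concrete, below is a minimal sketch against the classic HBase Java client API; the ‘tweets’ table, ‘d’ column family and row key are invented for illustration, and the table is assumed to exist already.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "tweets");        // hypothetical, pre-existing table

        // Write one cell: row key -> column family 'd', qualifier 'text'
        Put put = new Put(Bytes.toBytes("row-001"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("text"),
                Bytes.toBytes("hello big data"));
        table.put(put);

        // Read the cell back by row key
        Get get = new Get(Bytes.toBytes("row-001"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("text"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }

Note the data model: rows are keyed byte arrays, and values live in column families, which is quite different from a relational table.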

Apache Hadoop distributions, such as Cloudera’s, bundle Apache Hadoop with other open source tools to create a more feature-rich ‘platform’. The Cloudera distribution is definitely one to evaluate.

A simple Big Data Reference Model

In terms of implementing ‘Big Data’ architectures there are a number of choices, particularly in the visualisation and analytics space (refer to Figure 1). A (highly) simplified reference model is provided in Table 1.

Function                     Candidate Options
Storage                      NoSQL databases, e.g. Cassandra, HBase, Voldemort, Membase
Processing                   MapReduce
Query                        Hive, Pig (assuming Hadoop is being used)
Analytics & Visualisation    Refer to Figure 1 (and Mahout for Data Mining)

Table 1: Simplified Big Data Reference Model.
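
To illustrate the Query row of Table 1, the sketch below submits a HiveQL aggregation over JDBC. It assumes a running HiveServer2 instance on localhost:10000 and a hypothetical ‘page_views’ table; host, credentials and schema are all placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
      public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to an assumed HiveServer2
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // HiveQL is compiled down to MapReduce jobs over data held in HDFS
        ResultSet rs = stmt.executeQuery(
            "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
        while (rs.next()) {
          System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
        }

        rs.close();
        stmt.close();
        conn.close();
      }
    }

The appeal of Hive is exactly this: analysts can express distributed aggregations in SQL-like syntax without writing MapReduce jobs by hand.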

Data loaders (e.g. Sqoop) and log collection tools (e.g. Flume, Scribe) could also be included in the reference model/ecosystem.

Further Reading

  • Big Data: Using Smart Big Data, Analytics and Metrics to Make Better Decisions and Improve Performance
  • MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems
  • Hadoop: The Definitive Guide
  • Apache Flume: Distributed Log Collection for Hadoop