IN TERMS of ‘forces’ affecting the CIO Agenda, Big Data technologies are vitally important, thanks to explosive growth in the number of data source types: applications, digital media, mobile devices, users, customers, unstructured data sets, sensors, emails, blogs and so on.
Data is complex and arrives in mixed formats (text, video, audio). Delivering Big Data capabilities requires on-demand infrastructure scalability (including massively scalable storage), together with robust analytics and visualisation tools and techniques for distributed, parallel systems. Increasing bandwidth availability has also driven exponential data growth rates and new capabilities, e.g. social networks, video and microblogging.
Where do you start in formulating a reference architecture for Big Data and sourcing suppliers for a Big Data ecosystem?
Should you believe the Big Data Hype?
The Gartner Hype Cycle places Big Data on ‘the upslope’ towards the ‘peak of inflated expectations’. Big Data is of course already underpinning many of the web giants’ architectures (typically because necessity has been the mother of invention).
Figure 2: Gartner Hype Cycle for Emerging Tech (2011), Source: Gartner
- Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters: a 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2,400 cores and about 3 PB of raw storage.
- Yahoo! deploys more than 100,000 CPUs in over 40,000 computers running Hadoop. The biggest cluster has 4,500 nodes (dual quad-core boxes with 4 × 1 TB disks and 16 GB RAM each). This is used to support research for Ad Systems and Web Search, and to run scaling tests supporting the development of Hadoop on larger clusters.
- eBay uses a 532-node cluster (8 × 532 cores, 5.3 PB) with Java MapReduce, Pig, Hive and HBase.
- Twitter uses Hadoop to store and process tweets, log files and other data generated across Twitter, running Cloudera’s CDH2 distribution. They use both Scala and Java to access Hadoop’s MapReduce APIs, as well as Pig, Avro, Hive and Cassandra.
Other Hadoop users include: 1&1, A9.com, About.com, Amazon.com, American Airlines, AOL, Apple, Booz Allen Hamilton, Cerner, ChaCha, comScore, EHarmony, Federal Reserve Board of Governors, foursquare, Fox Interactive Media, Freebase, Hewlett-Packard, IBM, InMobi, ImageShack, ISI, Joost, Last.fm, LinkedIn, Microsoft, Meebo, Mendeley, Metaweb, Netflix, The New York Times, Ning, Outbrain, Playdom (now part of Disney Interactive Media Group), Powerset (now part of Microsoft), Rackspace, Razorfish, StumbleUpon and Twitter.
Figure 3: Apache Hadoop Ecosystem
The Apache Hadoop Ecosystem is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
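The ‘simple programming model’ referred to above is MapReduce: a map phase emits key/value pairs from each input split, the framework shuffles and groups them by key, and a reduce phase aggregates each group. The following is a toy single-process sketch of that data flow (Hadoop’s real API is Java and runs the phases in parallel across the cluster):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

def word_count(documents):
    # Hadoop processes the splits in parallel across many machines;
    # here we iterate sequentially to show the data flow only.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data big clusters", "big storage"]))
# {'big': 3, 'data': 1, 'clusters': 1, 'storage': 1}
```

Because map and reduce operate on independent keys, the framework can scale this pattern from a single server to thousands of machines, which is precisely what the cluster figures above illustrate.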
Apache Hadoop comprises Hadoop Common, MapReduce and the Hadoop Distributed File System (HDFS), as well as sub-projects: HBase, Cassandra, Avro, Hive, Mahout, Pig, ZooKeeper and Chukwa.
Given the pervasive nature of Hadoop, it is a strong contender for any Big Data implementation. HBase is the Hadoop database; Cassandra is another NoSQL database. Mahout provides data mining and machine learning, Hive and Pig provide querying, and ZooKeeper provides coordination.
Apache Hadoop distributions, such as that from Cloudera, bundle Apache Hadoop with other open-source tools to create a more feature-rich ‘platform’. The Cloudera distribution is definitely one to evaluate.
A simple Big Data Reference Model
In terms of implementing ‘Big Data’ architectures there are a number of choices, particularly in the visualisation and analytics space (refer to Figure 1). A (highly) simplified reference model is provided in Table 1.
| Layer | Components |
|---|---|
| Storage | NoSQL databases, e.g. Cassandra, HBase, Voldemort, Membase |
| Query | Hive, Pig (assuming Hadoop is being used) |
| Analytics & Visualisation | Refer to Figure 1 (and Mahout for data mining) |
Table 1: Simplified Big Data Reference Model.
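The storage layer above is dominated by wide-column NoSQL stores such as HBase and Cassandra, whose data model is a sparse map: row key → (column family, column) → value, so different rows can carry entirely different columns. A toy in-memory sketch of that model (class and method names are illustrative, not any product’s API):

```python
from collections import defaultdict

class ToyWideColumnStore:
    """Toy in-memory sketch of the wide-column data model used by
    HBase/Cassandra: each row key maps to sparse
    (column family, column) -> value cells, so rows need not share a schema."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, family, column, value):
        # Writes create cells on demand; there is no fixed table schema.
        self._rows[row_key][(family, column)] = value

    def get(self, row_key, family, column):
        # Absent cells simply return None rather than raising an error.
        return self._rows[row_key].get((family, column))

store = ToyWideColumnStore()
store.put("user:42", "profile", "name", "Ada")
store.put("user:42", "activity", "last_login", "2011-10-01")
print(store.get("user:42", "profile", "name"))  # Ada
```

The real systems add persistence, replication and partitioning of row keys across the cluster; the sparse cell model, however, is the reason they cope well with the mixed, evolving data formats discussed earlier.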
Data Loaders (e.g. Sqoop) and log management (e.g. Flume, Scribe) could also be included in the reference model / ecosystem.
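Sqoop’s core job is importing RDBMS tables into HDFS, by default as delimited text files. As a toy stand-in for that pattern (not Sqoop’s actual API), the sketch below dumps a table from an in-memory SQLite database to a delimited stream; in production the destination would be HDFS and the source a production RDBMS:

```python
import csv
import io
import sqlite3

def export_table_as_delimited(conn, table, out, delimiter=","):
    """Dump every row of an RDBMS table as delimited text, mimicking the
    default file format a Sqoop import writes to HDFS. 'out' is any
    file-like object standing in for an HDFS file."""
    writer = csv.writer(out, delimiter=delimiter)
    for row in conn.execute(f"SELECT * FROM {table}"):
        writer.writerow(row)

# Toy source database standing in for the production RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

buf = io.StringIO()
export_table_as_delimited(conn, "orders", buf)
print(buf.getvalue())
```

Flume and Scribe cover the complementary path: continuously collecting log events from many hosts and landing them in the same storage layer for querying.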
Visualisation tools worth evaluating include:
- Processing: http://processing.org (advanced visualisations)
- Protovis: http://vis.stanford.edu/protovis/
- Gephi: http://gephi.org/ (focused on Social Network Analysis)
- Tableau: http://www.tableausoftware.com/public/
- ManyEyes: http://www-958.ibm.com/software/data/cognos/manyeyes/ (from IBM)
Further reading:
- Big Data: A Revolution That Will Transform How We Live, Work and Think
- Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
- Data Smart: Using Data Science to Transform Information into Insight
- Predictive Analytics, Data Mining and Big Data: Myths, Misconceptions and Methods (Business in the Digital Economy)
- Doing Data Science: Straight Talk from the Frontline