In the hype surrounding “Big Data” and data analytics, Hadoop gets much attention as it has fast become the technology of choice. However, little space is often given to describing the product itself, analysing its impact on the Big Data market, or examining the realities behind the hype.
Hadoop’s initial purpose was parallel processing; however, it has developed into a series of projects that do more than just manage large datasets. With these projects, Hadoop can now be used to write complex database queries for real-time BI applications or to write data analysis scripts in a variety of languages.
Examples of organisations that have benefitted from Hadoop include BT (formerly known as British Telecom), BlackBerry and the National Cancer Institute in Maryland, US. BT has used Hadoop within its data architecture to reduce costs and to mine new insights about customers, while BlackBerry similarly uses Cloudera’s Hadoop distribution, CDH, within its data architecture. The National Cancer Institute’s Frederick National Laboratory also uses CDH to conduct biostatistical research on relationships between genes and cancers and how these impact drug trials.
Nowadays when people refer to Hadoop often they mean the Hadoop Ecosystem, i.e. the term given to the network of various projects involved with Hadoop. This includes the standard distribution, commercial distributions, additional projects such as Hive and HBase and alternatives to Hadoop.
Where It Started – Hadoop Core
Hadoop Core consists of HDFS and MapReduce. Hadoop Core is found in the Apache Standard Distribution and all commercial distributions of Hadoop.
HDFS is the file system under which Hadoop organises Big Data across clusters and runs user-specified programs. HDFS divides files into blocks of data, typically 64MB in size, then stores them across a cluster of data nodes, which can be set up across multiple machines. Increasing data capacity is generally cheaper with Hadoop than with traditional databases, as new Hadoop clusters can be installed on ordinary x86 machines instead of dedicated RDBMS servers.
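The block-splitting idea is simple enough to sketch. Below is a toy illustration (not HDFS itself) of how a file of a given size divides into fixed-size blocks; the 64MB block size is the common default mentioned above:

```python
# Toy sketch of HDFS-style block division (not actual HDFS code).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, a common HDFS default


def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the size of each block a file would be stored as."""
    blocks = []
    remaining = file_size_bytes
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block,
# and each block would then be replicated across several data nodes.
sizes = split_into_blocks(200 * 1024 * 1024)
```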
MapReduce is Hadoop’s main data processing framework, written in Java. In its most basic form, a MapReduce job consists of a Mapper and a Reducer program. For instance, if an online bookstore wants to quickly compile a list of the books most poorly reviewed by users, excluding books that are rarely reviewed, MapReduce first divides up the data. A Mapper process is then applied to filter out the meaningful data (i.e. bad reviews) while the Reducer process makes sense of it (i.e. counts and sorts books by number of bad reviews).
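The map/shuffle/reduce flow just described can be sketched in plain Python. This is a single-process simulation of the bookstore example, not Hadoop itself, and the review data and thresholds are invented for illustration:

```python
from collections import defaultdict

# Each record is (book_title, rating out of 5) -- sample data only.
reviews = [
    ("Book A", 1), ("Book A", 2), ("Book A", 1),
    ("Book B", 1), ("Book C", 5), ("Book C", 4),
]


def mapper(record):
    """Emit (book, (review_count, bad_count)) for each review."""
    book, rating = record
    yield (book, (1, 1 if rating <= 2 else 0))


def reducer(book, values, min_reviews=2):
    """Sum per-book totals; drop rarely reviewed or well-reviewed books."""
    total = sum(n for n, _ in values)
    bad = sum(b for _, b in values)
    if total >= min_reviews and bad > 0:
        yield (book, bad)


# Shuffle phase: group the mapper output by key, as Hadoop would.
grouped = defaultdict(list)
for record in reviews:
    for book, value in mapper(record):
        grouped[book].append(value)

# Reduce phase, then sort books by number of bad reviews.
results = sorted(
    (pair for book, values in grouped.items()
     for pair in reducer(book, values)),
    key=lambda kv: -kv[1],
)
```

Here only "Book A" survives: "Book B" has too few reviews and "Book C" has no bad ones. In real Hadoop, the mapper and reducer run in parallel across many nodes, with the framework handling the grouping in between.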
Beyond MapReduce – Related Products
Initially Hadoop consisted only of MapReduce and HDFS, with MapReduce programs written in Java. Given the complexity of MapReduce and the difficulty of finding Java programmers (particularly amongst data scientists), Hadoop has evolved to include specialist projects. For instance, Hadoop Streaming enables users to implement MapReduce in languages other than Java, Hadoop Mahout provides a machine learning library and Hadoop YARN decouples resource management from data processing, allowing engines other than MapReduce to run on a cluster.
Hadoop Streaming allows MapReduce programs to be written in any language that can read from standard input and write to standard output, including scripting languages. This is particularly useful for data scientists who are more comfortable with data-friendly languages like R or Python but not proficient in Java.
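A Streaming job is just a pair of scripts connected by standard input and output. Below is a hedged sketch of a word-count mapper and reducer, written as two functions and driven locally; in a real job they would be separate scripts passed to the streaming jar, with Hadoop performing the sort between them:

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_records(records):
    """Reducer: records arrive sorted by key; sum the counts per word."""
    keyed = (r.split("\t") for r in records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of the streaming pipeline: map -> sort -> reduce.
sample = ["big data big hadoop", "hadoop big"]
counts = dict(r.split("\t")
              for r in reduce_records(sorted(map_lines(sample))))
```

On a cluster, the equivalent invocation would pass the two scripts via the streaming jar’s `-mapper` and `-reducer` options (paths and jar name vary by distribution).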
Hadoop Hive allows SQL-like queries to be run using MapReduce, via a language known as HQL (Hive Query Language), which works on structured data. HQL is constantly being updated and the latest language manual is available on the Apache website. Hadoop Pig is an alternative data processing layer to Hive which uses Pig Latin, a Hadoop-based language developed at Yahoo. Unlike Hive, Pig can handle unstructured data and is designed as a procedural, dataflow-style language as opposed to being query based.
Hadoop HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop in addition to updates, inserts and deletes. Tables in HBase can serve as the input and output for MapReduce jobs, and HBase is particularly useful for real-time read/write access to Big Data, one reason why eBay and Facebook use HBase heavily.
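HBase’s data model can be pictured as nested maps. The sketch below is a conceptual illustration only (not the HBase API, and the row keys and column families are invented): a table maps row key to column family to qualifier to value, and point lookups and single-cell writes are cheap. In HBase itself, rows are additionally kept sorted by key for fast range scans, a detail a plain dict ignores:

```python
# Conceptual model: row key -> column family -> qualifier -> value.
table = {
    "user#1001": {"info": {"name": "alice"}, "stats": {"logins": "42"}},
    "user#1002": {"info": {"name": "bob"}, "stats": {"logins": "7"}},
}


def get(row_key, family, qualifier):
    """Low-latency point lookup by row key, in the spirit of HBase's Get."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def put(row_key, family, qualifier, value):
    """Insert or update a single cell, in the spirit of HBase's Put."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


put("user#1001", "stats", "logins", "43")  # an in-place update
```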
Data Loading & Serialisation
Hadoop Avro is a framework that uses JSON-defined schemas, with a compact binary format, for data serialisation and remote procedure calls. This has two uses in Hadoop – data input/output formatting and communication between nodes in Hadoop clusters or between client programs and Hadoop services. Using Avro, Big Data can be exchanged between programs written in any language.
Hadoop Oozie is a workflow scheduler system that allows users to define a series of jobs and link them together. These jobs can be of multiple types, such as MapReduce, Pig and Hive jobs, and using Oozie users can specify conditions relating to the order and timing of tasks, e.g. a job might only be initiated after the jobs it relies on for data have completed.
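The core idea – run a job only after its inputs exist – is a dependency ordering problem. The toy sketch below illustrates that idea with Python’s standard-library topological sorter; the job names are hypothetical and this is not Oozie’s XML-based workflow definition format:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs whose
# output it needs before it can start.
workflow = {
    "load_raw_data": set(),
    "pig_cleanup": {"load_raw_data"},
    "hive_aggregate": {"pig_cleanup"},
    "mapreduce_report": {"pig_cleanup"},
}

# A valid execution order that respects every dependency.
run_order = list(TopologicalSorter(workflow).static_order())
```

Oozie adds much more on top of this (time-based triggers, retries, forks and joins), but every workflow it runs must resolve to an ordering like the one computed here.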
Commercial Distributions and MapReduce Alternatives
A standard open source Hadoop distribution (Apache Hadoop) includes the Hadoop MapReduce framework, HDFS and Hadoop Common, a set of libraries and utilities used by other Hadoop modules. Vendor distributions tend to add value to customers through prompt fixes/patches, technical assistance and additional specialist tools. Major commercial distributors include Cloudera, Hortonworks and MapR.
Amazon’s AWS Elastic MapReduce (EMR) was one of the first commercial Hadoop offerings on the market. EMR is Hadoop in the cloud, leveraging Amazon’s EC2 computing capabilities and Amazon S3’s storage abilities. In addition to being used by commercial businesses, EMR is an ideal tool for learning Hadoop programming, given that issues such as product updates are handled by Amazon at the cloud level and that EMR includes copies of most Hadoop projects.
Storm was developed by Nathan Marz at Twitter and operates specifically on streaming data, which is particularly useful for Twitter given that it generated approximately 500 million tweets per day as of December 2014. As with Hadoop, applications in Storm can be written in any language. Storm also supports passing messages between tasks, a feature not seen in Hadoop, and it is fault tolerant thanks to its process management system, which IBM and Twitter consider to be more intelligent than Hadoop’s.
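The contrast with MapReduce is that Storm processes tuples one at a time as they arrive, rather than in large batches. The sketch below illustrates that streaming style in plain Python (it is not the Storm API, and the tweets are invented): a rolling hashtag count is updated incrementally with each incoming tweet:

```python
from collections import Counter


def rolling_hashtag_counts(tweet_stream):
    """Update counts incrementally; yield the state after each tweet."""
    counts = Counter()
    for tweet in tweet_stream:
        counts.update(word for word in tweet.split()
                      if word.startswith("#"))
        yield dict(counts)


# Results are available after every tweet, not only at the end --
# the essence of stream processing versus batch processing.
stream = ["#hadoop rocks", "learning #storm and #hadoop"]
snapshots = list(rolling_hashtag_counts(stream))
```

In Storm proper, such logic would live in a "bolt" receiving tuples from a "spout", distributed across a cluster with the fault tolerance described above.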
S4 was initially released by Yahoo! Inc. in October 2010 and has been an Apache Incubator project since September 2011, i.e. it has not yet been fully endorsed by Apache. However, S4 has been deployed in production systems at Yahoo! to process thousands of search queries per second. A key design feature is that nodes are symmetric, with no centralised service and no single point of failure, resulting in simpler deployments and cluster configuration changes than in standard Hadoop. It is also scalable, extensible (applications can easily be written and deployed using a simple API) and fault tolerant.
Apache and the Hadoop Market
Although Hadoop appears to be the product of choice within Big Data, the market is an immature one. An infographic published by Forrester Research on the state of the Hadoop market as of February 2014 showed a huge variety of quality products and competition, with Hadoop being developed significantly. A simple examination of the key Hadoop distributor, the Apache Software Foundation, is quite revealing.
The Apache Software Foundation is an American not-for-profit organisation that supports the Apache projects. Major sponsors include Microsoft, IBM, Facebook, Google, Yahoo and Intel, in addition to Hadoop software vendors Cloudera and Hortonworks. The Apache Software Foundation consists of a decentralised community of developers that produces free open source software distributed under the terms of the Apache License. Board members are elected annually and mainly consist of software developers at key technology firms, including Doug Cutting, the creator of Hadoop, who is currently employed by Cloudera.
The level of interest from such a broad variety of top-tier technology companies is comforting to a degree, as it suggests it isn’t in any company’s interest to develop a major rival product or for the project to fork, i.e. for various companies to create very distinct versions of Hadoop. Indeed, IBM have made it clear they do not want Hadoop to fork. But that holds only under current market conditions, and there is nothing to rule out a firm enacting a major change in investment strategy in the years to come.
Hadoop, Big Data and the Future?
Hadoop’s long-term success will depend upon how the ecosystem develops, as many organisations have Big Data solutions that are only partially implemented in Hadoop, while operations such as real-time analytics are implemented elsewhere (e.g. using more traditional RDBMS databases).
Big Data and Hadoop can cause unnecessary operating costs if implemented indiscriminately, without assessing the need for them. While enormous amounts of new knowledge can be extracted from unstructured data, not all of it is useful, and it can yield diminishing marginal returns in terms of revenue from the knowledge extracted. Indeed, in many cases machine learning might not be necessary and alternative ways to increase revenues may exist; e.g. recommending products at random can produce about the same impact as a recommender system when the number of products is small.
For many firms Big Data is a new idea and Hadoop is a new technology, so there remains uncertainty over what constitutes a standard Big Data business solution. The main questions that arise from this are: what percentage of the market will use Big Data once the market matures, and when will growth in data usage plateau? Such questions are difficult to answer, mainly because the need for Big Data isn’t necessarily restricted to large firms. The other danger is that a much more efficient and streamlined alternative to Hadoop is developed. As discussed above, although major technology firms don’t appear to want forking to occur with Hadoop, this stance can always change over long timeframes.
Given these risks, from an employment perspective anyone who uses Hadoop would do best to develop as many portable skills as possible. For instance, a data scientist who becomes an expert in applying machine learning and statistical techniques in industry is much more useful in other sectors than one who becomes an expert in Hadoop-specific programming techniques. Similarly, providing IT support across a diverse range of technologies is a far more portable skill set than becoming an expert only in Hadoop cluster management and troubleshooting problems with Hadoop nodes and memory failures.
Hadoop’s evolution from specialised Java software to enterprise software appears to have been quite successful. While it is difficult to predict how Big Data will develop or how many businesses will benefit, Hadoop’s success and the strong support for the Apache projects from a number of leading technology firms mean that, in the short term, Hadoop will continue to develop in its role as the software of choice for Big Data solutions.
Liam Murray is a data-driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years in the finance industry working on power, renewables & PFI infrastructure sector projects, with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organisation.