Hadoop, the technology behind most of the big data hype, is driving change in many areas of business, and the impact is only just starting to be felt by business intelligence and data warehouse professionals. It is short-sighted to think of Hadoop as just another data source feeding the enterprise data warehouse: it is fundamentally challenging the way organizations think about data management and business analytics. Here are three examples:
- Ubiquitous data capture – Hadoop, in combination with advances in hardware, allows the capture, storage and retrieval of data at an unprecedented scale. The digitization of many business processes leaves a digital exhaust trail that can be stored even when it has no immediate use, because the cost of storing it is low compared to its potential future value.
- The rise of data science – New software, frameworks and methodologies are quickly evolving for experimentation with both structured and unstructured data. The scale and variety of data available in Hadoop make this experimentation worthwhile for a much broader range of businesses and organizations.
- The democratization of data – A wider variety of professionals and entrepreneurs are taking an interest in data, knowing that big data analytics can give their business a competitive edge. It is no longer just statisticians and quantitative analysts; regular business staffers and managers are demanding access to the data that their digital business processes and customers generate.
Underlying these sweeping changes are some aspects of Hadoop that make it a very different technology from the relational databases that are the current backbone of enterprise data architecture. First, Hadoop is not a single technology but an ecosystem of interconnected open source software components. The Apache Software Foundation supports Hadoop, and two young software companies, Cloudera and Hortonworks, have taken the lead with commercial versions. The big enterprise software vendors either have Hadoop-based offerings, notably IBM and EMC, or, as in the case of Oracle and Microsoft, have partnered to complement their traditional relational databases. What is interesting is that no single vendor has yet been able to “own” the big data / Hadoop space in the way that Red Hat captured commercial Linux or VMware (now a subsidiary of EMC) came to dominate virtualization in enterprise data centers.
The fluid nature of Hadoop is also its strength, as it has allowed key components, like Spark and Kafka, to evolve quickly. Even the more stable parts (HDFS, YARN and MapReduce) are constantly being tweaked. With solid sponsorship, led by the Apache Software Foundation, and a wide variety of contributors and users, Hadoop will be a fast-moving platform for at least a few more years. All of this innovation provides an opportunity to modernize the data warehouse in several ways:
- Hadoop can be an effective landing zone for data, sometimes known as a bit bucket or data lake. Think of this like a pre-staging area for the data warehouse.
- It can serve as an archive for historical data that is queried rarely or is not immediately useful.
- It can act as a staging area for batch processing that runs faster than it would in a relational database.
- It provides a platform for combining structured, cleansed data from the data warehouse with voluminous raw and unstructured data for experimental data science, data mining and advanced analytics.
- It opens up large data sets to a broader audience of users, both internal and external (customers, suppliers and partners).
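MapReduce, one of the stable core components mentioned above, follows a simple map–shuffle–reduce model. As a rough, self-contained illustration (the log lines and function names here are invented for the example; a real Hadoop job distributes these phases in parallel across a cluster), a word count over raw log data might look like:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical raw log lines of the kind dumped into a data lake.
log_lines = [
    "error timeout on node 7",
    "error disk full on node 3",
    "warning timeout on node 7",
]

word_counts = reduce_phase(shuffle(map_phase(log_lines)))
print(word_counts["error"])    # 2
print(word_counts["timeout"])  # 2
```

The appeal of the model is that the map and reduce functions contain all the business logic, while the framework handles distribution, fault tolerance and the shuffle between phases.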
Taking advantage of these opportunities is not without challenges, and the BI/DW team has a key role to play in overcoming the technical and organizational roadblocks. The structured, cleansed data in the data warehouse will find a new purpose when combined with the messy, unstructured data collected and dumped into Hadoop. Likewise, the varied and voluminous data provided by Hadoop will breathe new life into mature BI reporting and dashboarding tools.
It is crucial to get ahead of the conformance and interoperability challenges. The BI/DW team is in the best position to define a strategy and a technical architecture that integrate the data warehouse with big data. Hadoop is mature enough now to support proof-of-concept projects that can quickly demonstrate the business value of big data. It is also the perfect opportunity to work with the business and become its best ally in leveraging the power of Hadoop. The alternative is a data warehouse headed for obsolescence.
David Currie has been helping businesses get the most out of Cognos Business Intelligence software since 1999, first as a Cognos employee and since 2008 as an independent consultant. He develops the solution architecture to satisfy complex business reporting and analytics requirements, sourcing data from operational databases, data warehouses and now big data repositories. He blogs about business intelligence and big data at davidpcurrie.com. Connect with him through the blog, LinkedIn or Twitter.