Excerpt from ebook, Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases, by Ben Sharma and Alice LaPlante.
If you use your data for mission-critical purposes, meaning purposes on which your business depends, you must take data management and governance seriously. Traditionally, organizations have relied on the enterprise data warehouse (EDW) because of the formal processes and strict controls that approach requires. But as we’ve already discussed, the growing volume and variety of data are overwhelming the capabilities of the EDW. The other extreme, using Hadoop to simply do a “data dump,” is out of the question because of the importance of the data.
In early use cases for Hadoop, organizations frequently loaded data without attempting to manage it in any way. Situations still exist in which you might want to take this approach, particularly because it is both fast and cheap, but in most cases this type of data dump isn’t optimal. When the data is not standardized, when errors are unacceptable, and when accuracy is a high priority, a data dump will work against your efforts to derive value from the data. This is especially true as Hadoop transitions from an add-on feature to a truly central part of your data architecture.
The data lake offers a middle ground. A Hadoop data lake is flexible, scalable, and cost-effective—but it can also possess the discipline of a traditional EDW. You must simply add data management and governance to the data lake.
Once you decide to take this approach, you have four options for action.
Address the Challenge Later
The first option is the one chosen by many organizations that simply ignore the issue and load data freely into Hadoop. Later, when they need to discover insights from the data, they attempt to find tools that will clean the relevant data.
If you take this approach, machine-learning techniques can sometimes help discover structures in large volumes of disorganized and uncleansed Hadoop data.
But there are real risks to this approach. To begin with, even the most intelligent inference engine needs to start somewhere in the massive amounts of data that can make up a data lake, which necessarily means ignoring some data. You therefore run the risk that parts of your data lake will become stagnant and isolated, containing data with so little context or structure that even the smartest automated tools, or human analysts, don’t know where to begin. Data quality deteriorates, and you end up getting different answers when you ask the same question of the same Hadoop cluster.
Adapt Existing Legacy Tools
In the second approach, you attempt to leverage the applications and processes that were designed for the EDW. Software tools such as Informatica, IBM InfoSphere DataStage, and Ab Initio can perform the same ETL processes you used when importing clean data into your EDW, and you can use them when importing data into your data lake. All of them require an ETL grid to perform transformation.
However, this method tends to be costly, and it addresses only a portion of the management and governance functions you need for an enterprise-grade data lake. Another key drawback is that the ETL happens outside the Hadoop cluster, which slows down operations and adds to the cost because data must be moved outside the cluster for each query.
Write Custom Scripts
With the third option, you build a workflow using custom scripts that connect processes, applications, quality checks, and data transformation to meet your data governance and management needs.
This is currently a popular choice for adding governance and management to a data lake. Unfortunately, it is also the least reliable. You need highly skilled analysts steeped in the Hadoop and open-source communities to discover and leverage open-source tools or functions designed to perform particular management or governance operations or transformations. They then need to write scripts to connect all the pieces. If you can find the skilled personnel, this is probably the cheapest route to take.
However, this process only gets more time-consuming and costly as you grow dependent on your data lake. After all, custom scripts must be constantly updated and rebuilt. As more data sources are ingested into the data lake and more purposes are found for the data, you must continuously revise complicated code and workflows. And as skilled personnel join and leave the company, valuable knowledge is lost over time. This option is not viable in the long term.
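To make the custom-script option concrete, here is a minimal sketch of the kind of glue code involved, written in Python. It is not taken from the book: the feed contents, column names, and quality rules are all hypothetical, and a real script would read from and write to Hadoop rather than an in-memory string.

```python
import csv
import io

# Hypothetical raw feed; the columns and values are illustrative only.
RAW_FEED = """customer_id,country,amount
1001,us,25.50
1002,DE,
,FR,13.00
1003,us,abc
"""


def check_quality(row):
    """Return a list of problems found in one row (empty means clean)."""
    problems = []
    if not row["customer_id"]:
        problems.append("missing customer_id")
    try:
        float(row["amount"])
    except ValueError:
        problems.append("amount is not numeric")
    return problems


def transform(row):
    """Standardize one clean row (assumed rules: typed IDs, upper-case country)."""
    return {
        "customer_id": int(row["customer_id"]),
        "country": row["country"].strip().upper(),
        "amount": float(row["amount"]),
    }


def run_pipeline(text):
    """Connect parsing, quality checks, and transformation in one pass."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        problems = check_quality(row)
        if problems:
            rejected.append((row, problems))
        else:
            accepted.append(transform(row))
    return accepted, rejected


accepted, rejected = run_pipeline(RAW_FEED)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Even in a toy like this, every new data source or business rule means another change to `check_quality` or `transform`, which is the maintenance burden described above.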
Deploy a Data Lake Management Platform
The fourth option involves emerging solutions that have been purpose-built to deal with the challenge of ingesting large volumes of diverse data sets into Hadoop. These solutions allow you to catalogue the data and support the ongoing process of ensuring data quality and managing workflows. You put a management and governance framework over the complete data flow, from managed ingestion to extraction. This approach is gaining ground as the optimal solution to this challenge.
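As a rough illustration of what cataloguing data at ingestion can mean, the following Python sketch records metadata and a basic quality result for each data set as it is loaded. It does not represent any particular product: `CatalogEntry`, `managed_ingest`, the field names, and the single quality rule are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Hypothetical metadata record kept for each ingested data set."""
    name: str
    source: str
    columns: list
    row_count: int
    checksum: str
    quality_passed: bool
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def managed_ingest(catalog, name, source, header, rows):
    """Register a data set in the catalog as part of ingesting it."""
    payload = "\n".join(",".join(r) for r in rows).encode("utf-8")
    entry = CatalogEntry(
        name=name,
        source=source,
        columns=list(header),
        row_count=len(rows),
        checksum=hashlib.sha256(payload).hexdigest(),
        # Illustrative rule: every row must have one value per column.
        quality_passed=all(len(r) == len(header) for r in rows),
    )
    catalog[name] = entry
    return entry


catalog = {}
entry = managed_ingest(
    catalog,
    name="orders",
    source="crm_export",
    header=["order_id", "country", "amount"],
    rows=[["1", "US", "25.50"], ["2", "DE", "13.00"]],
)
print(entry.name, entry.row_count, entry.quality_passed)
```

The point of the sketch is that metadata and quality status are captured at the moment of ingestion, so later users of the lake can find a data set and judge whether to trust it.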