Up Your Game: How to Rock Data Quality Checks in the Data Lake

Posted by Adam Diaz on Feb 7, 2017 2:52:06 PM

Common sense tells us one can’t use data unless its quality is understood. Data quality checks are critical for the data lake, but it’s not unusual for companies to gloss over this process in the rush to move data into less costly, scalable Hadoop storage, especially during initial adoption. After all, isn’t landing data in Hadoop with little definition of schema and data quality what Hadoop is all about? Once data lands in a raw zone, though, the reality quickly sets in: for data to be useful, both structure and data quality must be applied. Defining data quality rules becomes particularly important depending on what sort of data you’re bringing into the data lake – for example, large volumes of data from machines and sensors. Validation is essential for such data because it comes from an external environment and probably hasn’t gone through any quality checks.
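To make that concrete, here’s a minimal sketch of row-level quality rules for raw sensor readings, in plain Python. The field names and thresholds are illustrative assumptions, not a prescribed schema:

```python
# Minimal sketch of row-level data quality rules for raw sensor data.
# Field names and value ranges are assumptions for illustration only.

def validate_reading(row):
    """Return a list of rule violations for one sensor reading."""
    errors = []
    if row.get("sensor_id") is None:
        errors.append("missing sensor_id")
    temp = row.get("temperature")
    if temp is None or not (-40.0 <= temp <= 125.0):
        errors.append("temperature out of range: %r" % (temp,))
    if row.get("timestamp") is None:
        errors.append("missing timestamp")
    return errors

readings = [
    {"sensor_id": "a1", "temperature": 21.5, "timestamp": "2017-02-07T14:52:06"},
    {"sensor_id": None, "temperature": 900.0, "timestamp": None},
]
for r in readings:
    print(r, "=>", validate_reading(r) or "ok")
```

Rules like these would typically run as data moves from the raw zone into a curated zone, with failing rows quarantined for review.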

Read More

Topics: Hadoop, Big Data Ecosystem, Bedrock, Data Lake Solutions, Data Warehouse, Data Lake, Metadata Management

The Business Case for Data Lakes

Posted by Ben Sharma on Jan 3, 2017 12:58:46 PM

Excerpt from the ebook Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases, by Ben Sharma and Alice LaPlante.

Read More

Topics: Big Data Ecosystem, Data Lake Solutions, Data Warehouse, Data Lake

Train Your (Hadoop) Elephant with Fewer Data Lake Management and Governance Tools

Posted by Greg Wood on Nov 30, 2016 3:43:47 PM

In the past year, the focus of big data has expanded from creating new streaming and computing frameworks to creating ways to manage and control those frameworks. Unfortunately, none of the available tools provides a complete enough set of governance and management functionality to stand on its own.

Read More

Topics: Hadoop, Big Data Ecosystem, Bedrock, Data Lake Solutions, Data Management, Data Governance

Migrating On-Premises Data Lakes to Cloud

Posted by Kannan Rajagopalan on Oct 10, 2016 3:34:08 PM

Migration Objectives

In the first blog of this series, we discussed some of the key drivers for a Cloud Data Lake such as:

Read More

Topics: Hadoop, Data Lake Solutions, Next-Gen Data Lake

Next-Gen Data Lake: How to Make Cloud Work for You

Posted by Kannan Rajagopalan on Sep 26, 2016 9:11:55 AM

Read More

Topics: Hadoop, Data Lake Solutions, Next-Gen Data Lake

Chlorine for your Data Swamp: Four Key Areas for Automation

Posted by Adam Diaz on Sep 22, 2016 3:10:38 PM

Maybe we’re talking more about algaecide than chlorine, but microbiology aside, a data lake often gets rather cloudy and disorganized shortly after being opened for use. Hadoop’s promise of schema on read lures many in, but often ends up forcing a soul-searching reevaluation of one’s principles related to data management – not to mention a new strategy (and cost) for cleaning up a swampy data lake.
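For readers newer to the term, here’s a hedged PySpark sketch of what schema on read looks like in practice: the raw files land with no upfront modeling, and structure is imposed only at query time. The path and fields are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw JSON files were landed as-is; this schema is applied only now,
# at read time. The path and field names are illustrative assumptions.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", StringType()),
])

readings = spark.read.schema(schema).json("/data/raw/sensors/")
readings.where("temperature IS NOT NULL").show(5)
```

The flexibility is real, but so is the cleanup bill when nobody records which structure a given consumer imposed – which is where automation comes in.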

Read More

Topics: Hadoop, Big Data Ecosystem, Data Lake Solutions

Top Streaming Technologies for Data Lakes and Real-Time Data

Posted by Greg Wood on Sep 20, 2016 10:52:47 AM

More than ever, streaming technologies are at the forefront of the Hadoop ecosystem. As the prevalence and volume of real-time data continue to increase, the velocity of development and change in these technologies will likely do the same. However, as streaming technologies grow in number and complexity, Hadoop users face more and more choices with increasingly blurred lines between their functionality.

Read More

Topics: Hadoop, Big Data Ecosystem, Data Lake Solutions

Bedrock DLM: Big Data Lifecycle Management for the Data Lake

Posted by Scott Gidley on Sep 14, 2016 1:59:49 PM

The Apache Hadoop project knows there’s an urgent need for data lifecycle management for big data – and now offers Heterogeneous Storage for different storage types, as well as Archival Storage with hot, warm, cold and other storage policies.
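For a flavor of how those tiers get applied, here’s a small sketch that shells out to the standard hdfs storagepolicies command to move an aging path onto the COLD tier. It assumes the hdfs CLI is on the PATH and that the cluster has archive-tier volumes configured:

```python
import subprocess

def set_storage_policy(path, policy):
    """Apply an HDFS storage policy (e.g. HOT, WARM, COLD) to a path."""
    subprocess.run(
        ["hdfs", "storagepolicies", "-setStoragePolicy",
         "-path", path, "-policy", policy],
        check=True,
    )

# Example: tier last year's raw landing data down to COLD storage.
# The path is an illustrative assumption.
set_storage_policy("/data/raw/2015", "COLD")
```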

Read More

Topics: Big Data Ecosystem, Product Updates, Bedrock, Data Lake Solutions

Data Fracking: Going Deep into the Data Lake Using Drill

Posted by Greg Wood on Sep 14, 2016 10:25:07 AM

Your data lake is finally live. After months and months of planning, designing, tinkering, configuring and reconfiguring, your company is ready to see the fruits of your labor. There’s just one issue: the quarter close is coming up, and data analysts are asking for their functionality yesterday, not next week. That means there’s no time to go through the motions of setting up workflows, rewriting queries to function on Hive or HBase, and working through the kinks of a new architecture. The data lake may be the best, most flexible, and most scalable architecture available, but there is one thing it is not: quick to deploy. How can all of your hard-won socialization and hype for the data lake be saved? Enter Apache Drill.
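To give a taste of why Drill fits this bind, here’s a minimal sketch that queries a raw JSON file in place through Drill’s REST API – no Hive table, no ETL. The host, port and file path are assumptions for illustration:

```python
import requests

# Submit SQL to Drill's REST endpoint (8047 is Drill's default web port).
resp = requests.post(
    "http://localhost:8047/query.json",
    json={
        "queryType": "SQL",
        "query": "SELECT * FROM dfs.`/data/raw/events.json` LIMIT 10",
    },
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row)
```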

Read More

Topics: Big Data Ecosystem, Data Lake Solutions, All Open Source, Examples of Code

Zaloni, NetApp Partner for Data Lifecycle Management Solution

Posted by Scott Gidley, VP of Product, on Sep 8, 2016 4:15:02 PM

Right-size the enterprise data lake with policy-driven, highly scalable and cost-effective data lifecycle management and cloud tiering.

Today we announced a new solution developed in partnership with NetApp: Zaloni Bedrock DLM and NetApp StorageGRID. The solution addresses the rising need for big data lifecycle management as more enterprises deploy data lakes to manage a growing variety and volume of data, including data from mobile and cloud-based apps and the Internet of Things (IoT).

Read More

Topics: Product Management, Data Lake Solutions