Data Fracking: Going Deep into the Data Lake Using Drill

Posted by Greg Wood on Sep 14, 2016 10:25:07 AM

Your data lake is finally live. After months and months of planning, designing, tinkering, configuring and reconfiguring, your company is ready to see the fruits of your labor. There's just one issue: the quarter close is coming up, and data analysts are asking for their functionality yesterday, not next week. That means there's no time to go through the motions of setting up workflows, rewriting queries to run on Hive or HBase, and working through the kinks of a new architecture. The data lake may be the most flexible and scalable architecture available, but there is one thing it is not: quick to deploy. How can all of your hard-won socialization and hype for the data lake be saved? Enter Apache Drill.

Topics: Big Data Ecosystem, Data Lake Solutions, All Open Source, Examples of Code

Kafka in Action: 7 Steps to Real-Time Streaming from RDBMS to Hadoop

Posted by Rajesh Nadipalli on Aug 23, 2016 10:25:42 AM

For enterprises looking to ingest data into their Hadoop data lakes more quickly, Kafka is a great option. What is Kafka? Kafka is a distributed, scalable, and reliable messaging system that integrates applications and data streams using a publish-subscribe model. It is a key component in the Hadoop technology stack for supporting real-time data analytics and for monetizing Internet of Things (IoT) data.
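The publish-subscribe model above can be sketched in a few lines of Python. The topic name `orders`, the broker address, and the record shape are illustrative assumptions, not details from the post; only the serializer functions run as-is, while the actual broker interaction (via the kafka-python client) is shown in comments since it needs a running cluster.

```python
import json

# Serializers a Kafka producer would apply to each record before sending
# (pure functions, shown here so the wire format is explicit):
def key_bytes(key):
    return str(key).encode("utf-8")

def value_bytes(value):
    return json.dumps(value).encode("utf-8")

row = {"order_id": 1001, "status": "SHIPPED"}

# With the kafka-python client installed and a broker running (neither is
# assumed here), publishing the row to a topic is just:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                            key_serializer=key_bytes,
#                            value_serializer=value_bytes)
#   producer.send("orders", key=row["order_id"], value=row)
#   producer.flush()
#
# Any number of downstream consumers can subscribe to the same topic
# independently -- that decoupling is the publish-subscribe model:
#
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
```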

Topics: Hadoop, Big Data Ecosystem, Data Lake Solutions, All Open Source, Examples of Code

Using HBase to Create an Enterprise Key Service

Posted by Garima Dosi on Mar 8, 2016 1:10:22 PM

Systems often use keys as unique identifiers for transactions or entities within the firm; compact keys also save storage and reduce computing needs. HBase's ability to serve millions of requests with high concurrency makes it a good fit for an enterprise key service. Assuming you know HBase or ETL, you can work through the example in this post as long as you understand a few key points.
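The core pattern behind such a service is HBase's atomic increment: one counter cell per entity type, incremented on every request. A minimal sketch, using an in-memory stand-in for the HBase table so the logic is runnable here; the table name `keyservice`, column `cf:seq`, and key format are hypothetical, not from the post.

```python
import threading

# In-memory stand-in for an HBase table's atomic increment, used only to
# demonstrate the key-generation pattern.
class CounterTable:
    def __init__(self):
        self._counters = {}
        self._lock = threading.Lock()

    def counter_inc(self, row, column, value=1):
        # Mirrors the atomic read-modify-write HBase performs server-side.
        with self._lock:
            key = (row, column)
            self._counters[key] = self._counters.get(key, 0) + value
            return self._counters[key]

def next_key(table, entity):
    # One counter row per entity type; the incremented value becomes the
    # new unique identifier for that entity.
    seq = table.counter_inc(entity, "cf:seq")
    return f"{entity}-{seq:08d}"

# Against a real cluster, the happybase client exposes the same operation,
# backed by HBase's atomic Increment, so concurrent clients never collide:
#   table = happybase.Connection("hbase-host").table("keyservice")
#   seq = table.counter_inc(b"txn", b"cf:seq")
```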

Topics: Big Data Ecosystem, All Open Source, Examples of Code

Keeping It In-Memory: A Look at Tachyon/Alluxio

Posted by Siddharth Agarwal on Feb 24, 2016 11:38:04 AM

Overcoming output limitations

A core tenet of the Hadoop cluster architecture is the replication of data blocks across nodes: if an individual node fails, backup copies of its data exist elsewhere. The downside is that slow write speeds to HDFS can hurt the performance of a multi-step job, especially when a step's output is nearly as large as its input. Even with frameworks such as Spark, which exploit in-memory computation within individual jobs, sharing output between jobs is usually limited by network bandwidth or disk throughput.
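The write cost described above can be made concrete with some back-of-envelope arithmetic. The 3x factor below is HDFS's default replication; the 10 GiB figure is an invented example, and the comparison is a sketch of the argument, not a benchmark.

```python
# Rough bytes physically written when a job step materializes its output.
def hdfs_bytes_written(output_bytes, replication=3):
    # HDFS default replication is 3: every block lands on three nodes,
    # most of those copies crossing the network.
    return output_bytes * replication

def in_memory_bytes_written(output_bytes):
    # A memory-centric layer like Tachyon/Alluxio keeps a single
    # memory-speed copy for the next job step to consume.
    return output_bytes

step_output = 10 * 1024**3  # a hypothetical 10 GiB intermediate result
amplification = hdfs_bytes_written(step_output) / in_memory_bytes_written(step_output)
print(amplification)  # → 3.0
```

This is why pipelines whose steps hand large intermediate results to one another benefit most from a memory-speed storage layer between jobs.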

Topics: Hadoop, Big Data Ecosystem, All Open Source