The Best Ways to Get Started with HCatalog

Posted by Adam Diaz on Nov 16, 2016 9:29:24 AM

HCatalog, also called HCat, is an interesting Apache project. It has the rare distinction of being one of the few Apache projects that began as part of another project, became a project in its own right, and then returned to its original home, Apache Hive.


Topics: Hadoop, Big Data Ecosystem, Examples of Code

Part One: Hadoop Clusters from an Audit Perspective

Posted by Rupam Bora on Oct 25, 2016 9:12:20 AM

Accounting is a fundamental component of any Hadoop cluster's security model. Alongside identification, authentication, and authorization for users and services, it is audit logging that completes the security ecosystem. Hadoop components handle accounting differently depending on their purpose: HDFS and HBase are data repositories, whereas MapReduce, Hive, and Impala are query engines and processing frameworks, so the auditable events differ from component to component.


Topics: Hadoop, Big Data Ecosystem, Examples of Code

Pig vs. Hive: Is There a Fight?

Posted by Monoj Gogoi on Oct 5, 2016 2:57:20 PM

Pig and Hive came into existence out of enterprises' need to interact with huge amounts of data without writing complex MapReduce code. Though born of that necessity, both have come a long way and now run on top of other Big Data processing engines such as Spark. Both of these Hadoop ecosystem components provide a layer of abstraction over the core execution engines. Hive was invented to give people something that looked like SQL and would ease the transition from an RDBMS; Pig takes a more procedural approach and was created so people didn't have to write MapReduce in order to manipulate data.
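To make the contrast concrete, here is the same job sketched both ways: a hit count over a hypothetical `weblogs` dataset (the table name and schema are illustrative, not from either project's docs). Hive states *what* you want in one declarative SQL-like statement; Pig builds the result step by step.

```
-- HiveQL: declarative, looks like SQL
SELECT page, COUNT(*) AS hits
FROM weblogs
GROUP BY page;

-- Pig Latin: procedural, one named step at a time
logs    = LOAD 'weblogs' AS (user:chararray, page:chararray);
grouped = GROUP logs BY page;
hits    = FOREACH grouped GENERATE group AS page, COUNT(logs) AS hits;
DUMP hits;
```

Either way, the abstraction layer compiles the script down to MapReduce (or Spark) jobs for you.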


Topics: Hadoop, Big Data Ecosystem, Examples of Code

Data Fracking: Going Deep into the Data Lake Using Drill

Posted by Greg Wood on Sep 14, 2016 10:25:07 AM

Your data lake is finally live. After months and months of planning, designing, tinkering, configuring and reconfiguring, your company is ready to see the fruits of your labor. There’s just one issue: the quarter close is coming up, and data analysts are asking for their functionality yesterday, not next week. That means there’s no time to go through the motions of setting up workflows, rewriting queries to function on Hive or HBase, and working through the kinks of a new architecture. The data lake may be the best, most flexible, and most scalable architecture available, but there is one thing it is not: quick to deploy. How can all of your hard-won socialization and hype for the data lake be saved? Enter Apache Drill.


Topics: Big Data Ecosystem, Data Lake Solutions, All Open Source, Examples of Code

Kafka in action: 7 steps to real-time streaming from RDBMS to Hadoop

Posted by Rajesh Nadipalli on Aug 23, 2016 10:25:42 AM

For enterprises looking for ways to ingest data into their Hadoop data lakes more quickly, Kafka is a great option. What is Kafka? Kafka is a distributed, scalable, and reliable messaging system that integrates applications and data streams using a publish-subscribe model. It is a key component in the Hadoop technology stack for supporting real-time data analytics and the monetization of Internet of Things (IoT) data.
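The publish-subscribe model at Kafka's core is simple to sketch. The toy broker below is not Kafka (no partitions, persistence, or networking); it only illustrates the pattern the post describes: producers publish to named topics, and every subscriber of a topic receives each message, so one RDBMS change feed can fan out to several independent consumers.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-process broker illustrating the publish-subscribe model.

    Kafka itself adds partitioned, durable, replicated logs and consumer
    groups on top of this basic idea.
    """

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks
        self._log = defaultdict(list)          # topic -> ordered message log

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Append to the topic's ordered log, then deliver to every subscriber.
        self._log[topic].append(message)
        for callback in self._subscribers[topic]:
            callback(message)

# One "orders" change feed fanned out to two independent consumers.
broker = MiniBroker()
hdfs_sink, audit_sink = [], []
broker.subscribe("orders", hdfs_sink.append)
broker.subscribe("orders", audit_sink.append)
broker.publish("orders", {"id": 1, "status": "NEW"})
print(hdfs_sink == audit_sink == [{"id": 1, "status": "NEW"}])  # True
```

The decoupling is the point: the producer never knows who is listening, so new consumers (an HDFS sink, an audit trail, a real-time dashboard) can be added without touching the source system.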


Topics: Hadoop, Big Data Ecosystem, Data Lake Solutions, All Open Source, Examples of Code

Stream Data Processing - the Next ‘Big Thing’ in Big Data

Posted by Siddharth Agarwal on Apr 28, 2016 1:18:02 PM

Stream data processing seems to be the next 'big thing' in big data. With several open source projects advertising streaming engines (Flink, Beam, and Apex among them), we decided to jump in and test one of them for our data lake customers. Flink seems to be the most mature of the group, having just announced its 1.0.0 release.


Topics: Big Data Ecosystem, Examples of Code

Using HBase to Create an Enterprise Key Service

Posted by Garima Dosi on Mar 8, 2016 1:10:22 PM

Systems often use keys to provide unique identifiers for transactions or entities within the firm, yielding savings through compact storage and reduced computing needs. HBase's ability to service millions of requests with high concurrency makes it a good fit for an enterprise key service. Assuming you know HBase or ETL, you can work through the example in this post as long as you understand the following key points:


Topics: Big Data Ecosystem, All Open Source, Examples of Code