Friday, April 15, 2016

Learning so-called Hadoop: where to start?

It can be confusing to decide which book to read or which tutorial or course to take. The system has now been split into different modules, and material on Hadoop will often take you straight to HDFS, MapReduce/YARN and programming. This can be confusing for the analyst community, since you are trying to learn more about analysis, not system and infrastructure maintenance.

So then the question is: where does an analyst start? In my opinion, you can look at the following blocks and begin with any of the query tools.

In the current Hadoop ecosystem, HDFS is still the major storage option. On top of it, the Snappy compression codec and columnar formats such as RCFile, Parquet and ORC can be used for storage optimisation. Hadoop 2.0 replaced the core MapReduce engine with YARN for better performance and scalability. Spark and Tez, as solutions for low-latency processing, can run on YARN and work closely with Hadoop. HBase is a leading NoSQL database, especially when a NoSQL store is needed on the deployed Hadoop clusters. Sqoop is still one of the leading and most mature tools for exchanging data between Hadoop and relational databases. Flume is a mature, distributed and reliable log-collection tool for moving data into HDFS. Impala and Presto query the data on HDFS directly for better performance.
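Even if an analyst never writes a MapReduce job, the map/shuffle/reduce idea behind the stack is worth understanding conceptually. Here is a minimal sketch of that idea in plain Python, with no Hadoop involved; the input lines are made up for illustration, and a real job would of course be distributed across a cluster rather than run in one process.

```python
from collections import defaultdict

# Toy input: each element stands in for one line of a file stored on HDFS.
lines = ["big data", "big hadoop", "data data"]

# Map phase: emit (word, 1) pairs, as in the classic word-count mapper.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts)  # {'big': 2, 'data': 3, 'hadoop': 1}
```

Tools like Hive and Pig generate this kind of job for you from a query, which is exactly why they are the friendlier entry point for analysts.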

So if you are an analyst like me, then Hive, Pig, Impala, Presto, Sqoop and HBase make a good sequence for starting to tame the beast. Just like in the good ol' days, you can become an analyst first and then, depending on your interest in the infrastructure and admin side, jump into the other systems.
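Hive's query language (HiveQL) is close to standard SQL, which is what makes it such a natural entry point. To get a feel for the kind of query you would write in Hive, here is the same shape of aggregation run against Python's built-in sqlite3; the table and column names are invented for illustration, and in real Hive the table would be backed by files on HDFS rather than an in-memory database.

```python
import sqlite3

# In-memory database standing in for a Hive table backed by files on HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 10), ("about", 3), ("home", 5)],
)

# A typical analyst query; the same SELECT/GROUP BY shape works in HiveQL.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()

print(rows)  # [('about', 3), ('home', 15)]
```

The point is that the SQL skills an analyst already has carry over almost directly; what changes underneath is that Hive compiles such queries into distributed jobs.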

To start learning Hive, one first needs to install it. I would recommend following this URL (it is the best of the couple available out there).