I recently left my role at an Internet market research
company. At this company I helped manage pre-sales and post-sales for the
enterprise web-analytics platform business. I have worked with unstructured data
(collected from the web using GET requests) for the last 9 years and understand the
business needs and data collection methods in depth. In the past few years "Big
Data" has become a buzzword in the field of technology.
To keep this post relevant, I am going to avoid
writing things you can read elsewhere.
Genesis: The idea of big data, and projects in this area, gained
popularity after Google published a paper on their distributed file system and how
it could be used to collect, store and analyze large volumes of data on commodity
hardware.
Purpose: The need to store data in large quantities has existed since
banking, telecommunications, airlines and power transmission companies digitized
their records. Typically these cash-rich companies would spend on
mainframes and on highly available, costly machines to host this data, which
could add up to a couple of million dollars' worth of hardware, software
licenses and personnel. What justified spending so much was that the data had
to be stored: each transaction had commercial value or was recorded for
regulatory compliance, so losing the data was not an option.
http://www.vm.ibm.com/devpages/jelliott/pdfs/zhistory.pdf
(Short history of the IBM Mainframes)
Early 2000s: This period saw
the rise of the internet and online applications where end users interacted
with computers directly, so the purpose of having a computer went
beyond record keeping. It also resulted in an explosion in the volume of data
generated. The data in the logs was useful, but not every single action had
commercial value; instead, the value lay in what the collection of these logs
could reveal about the customer and their behavioral journey.
This was the challenge that large internet companies like
Yahoo!, Google, MSN et al. were trying to solve. It resulted in the creation of
systems such as GFS and its successors, which allowed commodity hardware to be
used for storing and querying data, thus reducing the cost of maintaining
a data collection and analysis system.
My encounter with big data
and the challenge in learning it: As a web analytics
consultant I helped companies collect, ingest and analyze web traffic
logs using software built by companies like Adobe (Omniture/WebSideStory/Visual
Sciences), WebTrends, Coremetrics, comScore and Google (Analytics). These
applications satisfied the reporting needs of executives and
worked as systems on the side, without interfering with the core
web services.
Change in recent years: The internet has become an inevitable part of daily
life, making companies like Facebook, Twitter, LinkedIn and Google
a major part of one's day. This also means these companies have access to
online audiences of 1bn+ who can be served online advertising, fueling the
online commerce channels. Instead of paying web analytics companies for an
analytics system, the engineers at these tech companies turned to
Hadoop (a.k.a. big data) systems to collect, store and analyze their traffic
logs.
What it sparked: Now that there was a way to collect, store and analyze vast
amounts of data, application engineers also found ways to capture data from
clinical research, operations or any other activity that generates logs.
This data is then mined using statistical analysis, predictive analytics,
natural language processing, artificial intelligence and machine learning.
Such applications give data analysts a magnifying glass for looking at large
volumes of data and finding macro trends and insights that weren't possible
earlier, because there was no cheap way to perform the analysis and the value
of the insights didn't generate savings or profits greater than the cost of
the systems.
Where it's going: The Internet of Things (IoT) and mobile technology have
automated data collection, further fueling growth in the amount of data
collected.
What exactly is big data? There is a lot of hoopla about what big data is and what it is not.
In simple words, it's a way to store large amounts of data and to process, query and
analyze it on cost-efficient hardware. The software that has become synonymous
with big data is Hadoop, along with the other utilities that allow manipulating or querying
the data.
How do you explain Hadoop? Hadoop is something of a catch-all name for a collection of software; anyone
who is knowledgeable about the components will be willing to speak
specifically about them. People who are bullshitting their way around
will stop at the keyword 'Hadoop'. The core modules are:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Once you have installed HDFS you have a cluster where the file
system looks like one big volume/drive but is actually sharded across the
units that form your cluster. The programs, usually written in Java or Python,
that query the shards and then aggregate the results are called
MapReduce programs. One may say that before the MapReduce logic is written
there is no data model for the data; it's the MapReduce job that defines the
data model and the query model for the underlying data.
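To make this concrete, here is a minimal sketch of a MapReduce job written for Hadoop Streaming in Python that counts hits per URL in raw web-server logs. The file name (logcount.py), the log layout and the field position of the URL are assumptions for illustration, not a prescription.

#!/usr/bin/env python
# logcount.py - minimal Hadoop Streaming sketch: count hits per URL.
# Assumes a space-delimited log line where the 7th field is the request
# path (as in the common Apache log format); adjust for your own logs.
import sys

def mapper():
    # Emit "url <tab> 1" for every log line; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print("%s\t1" % fields[6])

def reducer():
    # Input arrives sorted by key, so counts per URL can be summed in one pass.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url == current_url:
            count += int(value)
        else:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    # Run as "logcount.py map" for the map phase, "logcount.py reduce" for the reduce phase.
    mapper() if sys.argv[1] == "map" else reducer()

Such a script would be submitted through the hadoop-streaming jar, with the two phases wired up via the -mapper and -reducer options. The point is simply that the structure of the data (which field is the URL) lives in the job itself, not in the storage layer.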
Besides these, there are a couple of other utilities that help you
manage a big data system. They can be listed as follows:
· Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
· Avro: A data serialization system.
· Cassandra: A scalable multi-master database with no single points of failure.
· Chukwa: A data collection system for managing large distributed systems.
· HBase: A scalable, distributed database that supports structured data storage for large tables.
· Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
· Mahout: A scalable machine learning and data mining library.
· Pig: A high-level data-flow language and execution framework for parallel computation.
· Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation (see the sketch after this list).
· Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
· ZooKeeper: A high-performance coordination service for distributed applications.
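As a taste of the programming model Spark offers, here is a minimal PySpark sketch that produces the same per-URL hit counts as the Streaming example above in a handful of lines. The HDFS path and log layout are again assumptions for illustration.

# Minimal PySpark sketch: per-URL hit counts from web-server logs.
from pyspark import SparkContext

sc = SparkContext(appName="url-hit-counts")

counts = (sc.textFile("hdfs:///logs/access_log")       # read raw logs from HDFS (path is hypothetical)
            .map(lambda line: line.split())             # split each line into fields
            .filter(lambda fields: len(fields) > 6)     # skip malformed lines
            .map(lambda fields: (fields[6], 1))         # key by request path
            .reduceByKey(lambda a, b: a + b))           # sum counts per URL

for url, count in counts.take(10):                      # peek at a few results
    print(url, count)

sc.stop()

It is the same idea as the MapReduce job, expressed as a chain of transformations rather than separate map and reduce scripts, which is a large part of Spark's appeal.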
If you are a non-technical, data-savvy business person, then more than
likely the "How do you explain Hadoop?" section is where you lose interest
and dismiss the details as gibberish. The next thing you will want to do is hire a
person who takes care of all the details and runs the big data project for you.
And now you have a job requirement out in the market asking for 50 business
analytics skills, all the technical skills, and someone who has been in touch
with your business for the last 10 years. Well, if you believe one person
could do all of this, you have it wrong.
In general, from my understanding, this is how I would divide the big
data team:
1) Make the system work:
Traditionally these people have job titles like UNIX system administrator. They
will make the basic infrastructure work and make the so-called
Hadoop file system work with other applications. The KPI for these resources
is 'system availability'.
2) Business analysts: The
key skill for these social people is finding all the data sources and
detailing the information each one contains. They also have to be tech savvy
enough to understand the APIs and data models that allow marrying the datasets
into a holistic view of the KPIs on which the organization is run. These
resources are usually the old hands who have been in your company for a while,
understand the political boundaries and can negotiate their way to make things
happen. People like me, who have worked with web data and integrated offline
sources to create meaningful reporting frameworks, can be bucketed here.
3) Team of analysts: These
are the people who can write SQL and VBA scripts and have excellent skills in
creating spreadsheet dashboards and PowerPoint presentations.
I have spent some time trying to understand these mystery systems and
will continue to read more… As my post's title says, this is my 2 cents' worth.
Hopefully you enjoyed this post.