I recently left my role at an Internet market research
company, where I helped manage pre-sales and post-sales for its
enterprise web-analytics platform business. I have worked with unstructured data
(collected from the web using GET requests) for the last 9 years and understand the
business needs and data collection methods in depth. In the past few years "Big
Data" has become a buzzword in the field of technology.
To keep this post relevant, I am going to avoid
writing things you can read elsewhere.
Genesis: The idea of big data and projects in this area got
popular after Google published a paper on its distributed file system (GFS) and how
it could be used to collect, store and analyze large volumes of data on commodity
hardware.
Purpose: The need to store data in large quantities has been around since
banking, telecommunications, airline and power transmission companies digitized
their records. Typically these cash-rich companies would spend on
mainframes and costly high-availability machines to host this data,
which could add up to a couple of million dollars' worth
of hardware, software licenses and personnel costs. What made it still okay to
spend so much was that storing the data was essential: each transaction
had commercial value or was recorded for regulatory compliance, so
losing the data was not an option.
http://www.vm.ibm.com/devpages/jelliott/pdfs/zhistory.pdf
(Short history of the IBM Mainframes)
Early 2000s: These years saw
the rise of the internet and online applications where the end user actually
interacted with computers, so the purpose of having a computer went
beyond record keeping. This also resulted in an explosion in the volume of data
generated. While the data in the logs was useful, not every single action had
commercial value; instead the value lay in what a collection
of these logs could tell about the customer and their behavioral journey.
This was the challenge that large internet companies like
Yahoo!, Google, MSN et al. were trying to solve. It resulted in the creation of
systems similar to (and including) GFS, which allowed the use of commodity
hardware for storing and querying data, thus reducing the cost of maintaining
a data collection and analysis system.
My encounter with big data
and the challenge of learning it: As a web analytics
consultant I helped companies collect, ingest and analyze web traffic
logs using software built by companies like Adobe (Omniture/WebSideStory/Visual
Sciences), WebTrends, Coremetrics, comScore and Google (Analytics). These
applications satisfied the reporting needs of the executives nicely and
ran as systems on the side, without interfering with the core workings
of the web services themselves.
Change in recent years: In recent years the internet has become an inevitable part of the
lifestyle, making companies like Facebook, Twitter, LinkedIn and Google
a major part of one's day. This also means that these companies have access to
audiences of 1bn+ online users who can be served online advertising, fueling the
online commerce channels. Instead of paying web analytics companies for an
analytics system, the engineers at these tech companies have resorted to
Hadoop (a.k.a. big data) systems to collect, store and analyze their traffic
logs.
What it sparked: Now that there is a way to collect, store and analyze hordes of
data, application engineers have also figured out how to store data from
clinical research, operations or any other activity that produces data
as a pure byproduct of logging. This data is then mined
for statistical analysis, predictive analysis, natural language
processing, artificial intelligence and machine learning. Such applications
give data analysts a magnifying glass for looking at large volumes of data
and finding macro trends and insights. These weren't possible earlier because there
were no cheap ways of performing the analysis, and the value of the insights
didn't generate savings or profits greater than the cost of the systems.
Where it's going: The Internet of Things (IoT) and mobile technology have automated
data collection, further fuelling growth in the volume of data collected.
What exactly is big data? There is a lot of hoopla about what big data is and what it is not.
In simple words, it's a way to store large amounts of data and to process, query and
analyze them using cost-efficient hardware. The software that has become synonymous
with big data is Hadoop, along with the other utilities that allow manipulating or querying
the data.
How do you explain Hadoop? 'Hadoop' is something of a misnomer for a collection of software, and anyone
who is knowledgeable about the components will be willing to speak in
specifics about them. People who are bullshitting their way around
will stop at the keyword 'hadoop'.
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Once you have installed HDFS you have a cluster where the file
system looks like one big volume/drive but is actually sharded across the
units that form your cluster. The tasks, usually written in Java or Python, that
query the shards and then aggregate the results are called
MapReduce programs. One may say that before the MapReduce logic is written
there is no data model to the data; it's the MapReduce program that defines the data
model and query model for the underlying data.
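To make this concrete, here is a minimal sketch of the map, shuffle and reduce phases for the classic word-count job, written in plain Python with no Hadoop cluster required. The function names and the sample log lines are my own illustration, not Hadoop's actual API; a real job would run the mapper and reducer on separate nodes against HDFS blocks:

```python
from collections import defaultdict

def mapper(line):
    # Map step: emit a (word, 1) pair for each word in a raw log line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle step: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce step: aggregate all values seen for one key.
    return key, sum(values)

# Sample "unstructured" web log lines, playing the role of HDFS shards.
shards = ["GET /home", "GET /cart checkout", "GET /home"]

pairs = (pair for line in shards for pair in mapper(line))
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'GET': 3, '/home': 2, '/cart': 1, 'checkout': 1}
```

Notice that the "data model" (words and their counts) exists only because the mapper and reducer define it; the raw lines themselves carry no schema, which is the point made above.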
Besides these, there are a couple of other utilities that help you
manage a big data system. They can be listed as follows:
· Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
· Avro: A data serialization system.
· Cassandra: A scalable multi-master database with no single points of failure.
· Chukwa: A data collection system for managing large distributed systems.
· HBase: A scalable, distributed database that supports structured data storage for large tables.
· Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
· Mahout: A scalable machine learning and data mining library.
· Pig: A high-level data-flow language and execution framework for parallel computation.
· Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
· Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
· ZooKeeper: A high-performance coordination service for distributed applications.
If you are a non-technical, data-savvy business person, then more than
likely the "how do you explain Hadoop?" section is where you lose interest
and dismiss the details as gibberish. The next thing you will want to do is hire a
person who takes care of all the details and runs the big data project for you.
And now there is a requirement out in the market looking for a business
analyst with 50 technical skills who has also been in touch
with your business for the last 10 years. Well, if you believe one person
could do all this, then you have it wrong.
In general, from my understanding, this is how I would divide the big
data team:
1) Make the system work:
Traditionally these people have job titles like UNIX system administrator. They
will make the basic infrastructure work and will get the so-called
Hadoop file system working with other applications. The KPI for these resources
is 'system availability'.
2) Business analyst: These
people were traditionally called business analysts. The key skill for these social people is
to find all the data sources and detail the information contained in them.
They also have to be tech savvy enough to understand the APIs
and data models that allow marrying the datasets for a holistic view of the
KPIs on which the organization is run. These resources can usually be the old
hands who have been in your company for a while, understand the political
boundaries and can negotiate their way to make things happen. People like me,
who have worked with web data and integrated offline sources to create
meaningful reporting frameworks, can be bucketed here.
3) Team of analysts: These
are the people who can write SQL and VBA scripts and have excellent skills at
creating spreadsheet dashboards and PowerPoint presentations.
I have spent some time trying to understand these mystery systems and
will continue to read more... As my post's title says, this is my 2 cents' worth.
Hopefully you enjoyed this post.