Excerpt from the book The Rebel. Osho explains nicely why most analytics, data science, big data, and digital transformation initiatives fail: data alone can't change the organization. It is the culture of the organization that needs to change, and no, not after the new is built.
Old and New: Such Is the Human Mind
"I have heard about an old church: it was so ancient that people had stopped going there, because even a strong wind would set the church swaying. It was so fragile that any moment it could fall. Even the priest had started giving his sermons outside the church, far away in the open ground.
Finally, the board of trustees had a meeting; something had to be done. But the trouble was that the church was very ancient - it was the glory of the town; their town was famous far and wide because of the old church: perhaps it was the oldest church in the world. It was not possible to demolish it and to make a new one. But it was also dangerous to let it remain as it was - it was going to kill someone. Nobody had been going in for years; even the priest was not courageous enough to go in because who knew at what moment the church would simply collapse? So something had to be done.
The board was in a very great dilemma: something had to be done, and yet nothing should be done, because the church was so ancient, and man has such deep attachment to things that are ancient. So they passed a resolution with four clauses in it. The first was: "We will make a new church, but it will be exactly the same as the old. It will be made of the same material the old is made of - nothing new will be used in it, so it remains ancient. It will be made in the same place where the old church stands, because that place has become holy by ancientness."
The last thing in their resolution was: "We will not demolish the old church until the new is ready." They were all happy that they had come to a conclusion. But who was going to ask those idiots, "How are you going to do it?" The old should not be demolished till the new was ready. And the new had to be made of everything the old was made of, in the same place where the old was standing, with exactly the same architecture the old had. Nothing new could be added to it: the same doors, the same windows, the same glass, the same bricks - everything that needed to be used had to be of the old church.
And finally, they decided that the old should not be touched till the new was ready. "When the new is ready, then we can demolish the old."
Such is the human mind: it clings to the old, it also wants the new, and then it tries to find some compromise - that at least the new should be like the old. But a few things are impossible; nature just won't allow them.
Wednesday, October 14, 2015
My 2¢ Worth on Big Data
I recently left my role at an Internet market research company, where I helped manage pre-sales and post-sales for the enterprise web-analytics platform business. I have worked with unstructured data (collected from the web using GET requests) for the last 9 years and understand the business needs and data collection methods in depth. In the past few years "Big Data" has become a buzzword in the field of technology.
To keep this post relevant, I am going to avoid writing things you can read elsewhere.
Genesis: The idea of big data, and projects in this area, became popular after Google published a paper on its distributed file system and how it could be used to collect, store, and analyze large volumes of data on commodity hardware.
Purpose: The need to store data in large quantities has been around since banking, telecommunications, airlines, and power transmission digitized their records. Typically these cash-rich companies would spend on mainframes and other costly high-availability machines to host this data, which could add up to a couple of million dollars' worth of hardware, software licenses, and personnel costs. What made it still acceptable to spend so much was that the data was essential: each transaction had commercial value or was recorded for regulatory compliance, so losing the data was not an option.
http://www.vm.ibm.com/devpages/jelliott/pdfs/zhistory.pdf
(Short history of the IBM Mainframes)
Early 2000s: These years saw the rise of the internet and online applications where the end user was actually interacting with computers. The purpose of having a computer thus went beyond record keeping, which resulted in an explosion in the volumes of data generated. While the data in the logs was useful, not every single action was of commercial value; instead, the value lay in understanding what a collection of these logs could tell you about the customer and their behavioral journey.
This was the challenge that large internet companies like Yahoo!, Google, MSN et al. were trying to solve. It resulted in the creation of systems similar to (and including) GFS. These systems allowed the use of commodity hardware for storing and querying data, thus reducing the cost of maintaining a data collection and analysis system.
My encounter with big data, and the challenge in learning it: As a web analytics consultant I helped companies collect, ingest, and analyze web traffic logs using software built by companies like Adobe (Omniture/WebSideStory/Visual Sciences), WebTrends, Coremetrics, comScore, and Google (Analytics). These applications worked nicely to satisfy the reporting needs of executives, and they ran as a system on the side, without interfering with the core web services.
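To make the "collected from the web using GET requests" part concrete, here is a minimal sketch of how such a tracking hit might be turned into a usable record. The request path and parameter names (pageName, visitorID, and so on) are hypothetical stand-ins, not any vendor's actual schema.

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical tracking-pixel hit: the browser fires a GET request whose
# query string carries page and visitor details as key-value pairs.
hit = "/b/ss/track.gif?pageName=Home&visitorID=abc123&events=purchase&revenue=49.99"

def parse_hit(request_line):
    """Turn one GET-request log entry into a flat dict of tracking variables."""
    query = urlparse(request_line).query
    # parse_qs returns a list per parameter; keep the first value of each
    return {key: values[0] for key, values in parse_qs(query).items()}

record = parse_hit(hit)
print(record["pageName"])  # -> Home
print(record["revenue"])   # -> 49.99
```

An analytics platform essentially does this at scale: millions of such hits per day, parsed, sessionized, and aggregated into reports.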
Change in recent years: In recent years the internet has become an inescapable part of the lifestyle, making companies like Facebook, Twitter, LinkedIn, and Google a major part of one's day. This also means these companies have access to online audiences of a billion-plus users who can be fed online advertising, fueling the online commerce channels. Instead of paying web analytics companies for an analytics system, the engineers at these tech companies have resorted to Hadoop (a.k.a. big data) systems to collect, store, and analyze their traffic logs.
What it sparked: Now that there is a way to collect, store, and analyze hordes of data, application engineers have also figured out ways to store data from clinical research, operations, or any other activity that amounts to pure logging. This data is then mined with statistical analysis, predictive analysis, natural language processing, artificial intelligence, and machine learning. Such applications give data analysts a magnifying glass for looking at large volumes of data and finding macro trends and insights. This wasn't possible earlier: there was no cheap way to perform the analysis, and the value of the insights didn't generate savings or profits greater than the cost of the systems.
Where it's going: The Internet of Things (IoT) and mobile technology have automated the collection of data, further fuelling the growth in the volume of data collected.
What exactly is big data? There is a lot of hoopla about what big data is and what it is not. In simple words, it's a way to store, process, query, and analyze large amounts of data using cost-efficient hardware. The software that has become synonymous with big data is Hadoop, along with the other utilities that allow manipulating or querying the data.
How do you explain Hadoop? "Hadoop" is something of a misnomer for a collection of software, and anyone who is knowledgeable about the components will be willing to speak in specifics about them. People who are bullshitting their way around will stop at the keyword "Hadoop". The core modules are:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Once you have installed HDFS, you have a cluster where the file system looks like one big volume/drive but is actually sharded across the various nodes that form your cluster. The programs, usually written in Java or Python, that query the shards and then aggregate the results are called MapReduce programs. One may say that before writing the MapReduce logic there is no data model for the data; it's the MapReduce job that defines the data model and the query model for the underlying data.
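The map/reduce idea above can be sketched in a few lines of plain Python. This is a toy, single-machine illustration, not real Hadoop code: the map step emits key-value pairs from raw records, a shuffle groups them by key, and the reduce step aggregates each group. A real Hadoop job applies the same logic in parallel across the shards of an HDFS cluster. The log format here is made up for the example.

```python
from collections import defaultdict

def map_step(log_line):
    # Emit (page, 1) for every page-view record. Note that the "data model"
    # exists only because this function decides how to interpret the raw line.
    user, page = log_line.split()
    yield page, 1

def shuffle(pairs):
    # Group all emitted values by key (Hadoop does this between map and reduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Aggregate each key's values into a single result
    return key, sum(values)

logs = ["u1 /home", "u2 /home", "u1 /cart", "u3 /home"]
pairs = [pair for line in logs for pair in map_step(line)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'/home': 3, '/cart': 1}
```

Swap in a different map and reduce function and the same raw logs answer a different question, which is exactly what "the MapReduce defines the data model" means in practice.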
Besides this, there are a couple of other utilities that help you manage a big data system:
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
• Avro: A data serialization system.
• Cassandra: A scalable multi-master database with no single point of failure.
• Chukwa: A data collection system for managing large distributed systems.
• HBase: A scalable, distributed database that supports structured data storage for large tables.
• Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout: A scalable machine learning and data mining library.
• Pig: A high-level data-flow language and execution framework for parallel computation.
• Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
• Tez: A generalized data-flow programming framework, built on Hadoop YARN, that provides a powerful and flexible engine to execute an arbitrary DAG of tasks, for both batch and interactive use cases. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, as well as by commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
• ZooKeeper: A high-performance coordination service for distributed applications.
If you are a non-technical, data-savvy business person, then more than likely the "How do you explain Hadoop?" section is where you lose interest and tune out the details as gibberish. The next thing you will want to do is hire a person who takes care of all the details and runs the big data project for you. And now there is a job requirement out in the market asking for 50 business-analytics skills, all the technical skills, and someone who has been in touch with your business for the last 10 years. Well, if you believe one person can do all of this, you have it wrong.
In general, from my understanding, this is how I would divide the big data team:
1) Make the system work: Traditionally these people have job titles like UNIX system administrator. They will make the basic infrastructure work and get the so-called Hadoop file system working with other applications. The KPI for these resources is "system availability".
2) Business analysts: The key skill for these social people is finding all the data sources and detailing the information they contain. They also have to be tech-savvy enough to understand the APIs and data models that allow marrying the datasets into a holistic view of the KPIs on which the organization is run. These resources are usually the old hands who have been with your company for a while, understand the political boundaries, and can negotiate their way to make things happen. People like me, who have worked with web data and integrated offline sources to create meaningful reporting frameworks, can be bucketed here.
3) Team of analysts: These are the people who can write SQL and VBA scripts and who have excellent skills at building spreadsheet dashboards and PowerPoint presentations.
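To give a flavor of the day-to-day work of this third group, here is a small sketch of the kind of aggregation query an analyst might write. The table and column names are hypothetical stand-ins for whatever summarized data the big data system exports for reporting; sqlite3 is used here only so the example is self-contained.

```python
import sqlite3

# An in-memory table standing in for an exported reporting dataset
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (channel TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("search", 120.0), ("email", 45.0), ("search", 80.0)],
)

# Dashboard-style KPI: total revenue by channel, largest first
rows = conn.execute(
    "SELECT channel, SUM(revenue) FROM visits GROUP BY channel ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('search', 200.0), ('email', 45.0)]
```

The output of queries like this is what ends up in the spreadsheet dashboards and slide decks the executives actually see.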
I have spent some time trying to understand these mystery systems and will continue to read more. As the title says, this is my 2 cents' worth. Hopefully you enjoyed this post.