Unless you are in the business of selling data or analytics software or services, you have seen your fair share of data science projects fail to achieve their promised value. Notwithstanding the flood of success stories in the media, we know that disappointment in data science is far more common.
Many of us have sat through presentations with wonderful charts and convincing stories that make it all look as easy as pie, and perhaps wondered why achieving bottom-line value does not come as easily to us. Is data science only good in hindsight? Is the main product the pretty slide with an impressive chart? Do you have to be Amazon or Google or Facebook to actually benefit from it?
The truth is, the deck is stacked against you before you even begin a data science initiative; there are more reasons for it to fail than to succeed, and there are many opportunities for mistakes along the way.
You know the old saying: “lies, damned lies, and statistics.” The same can be said about data science broadly. We can easily, by mistake or not, use numbers to say or imply something that is not there in reality; predictive models can perform well on historical data but prove useless in a real-world situation; data analysis can yield findings that are interesting but not actionable, and therefore practically worthless. There are many reasons why this is so pervasive, and none of them are new; statisticians have known them well for decades. The problem is that the scale of data science today magnifies the pitfalls and their impact (as it does the benefits). Below are the big three from my experience.
Perhaps the biggest and most overlooked pitfall is sampling bias. For example, marketers often look at Twitter for trends around what is generating interest. But the Twitter data set is one of the most biased datasets out there for measuring interest. Active Twitter users are predominantly urban, young, white, and in the media, entertainment, or marketing industries. Any production or advertising decision based solely on Twitter data is bound to be extremely biased.
This is not to be critical of Twitter. In reality, any data set is biased, and we often fail to recognize this. We assume – or we just choose to believe – that our sample is a good representation of the population. For data science to work, analysts must understand the bias and factor it into the analysis, rather than ignore it, make poorly informed decisions, and build poor models.
Another big challenge is seeing correlations between datasets and assuming that they are real. A big part of data science is about finding correlations between different things. But the hunt for correlation can lead to pretty bad conclusions. For example, while Facebook was growing fast, Greece was rapidly descending into a debt crisis, and the correlation between the two was strong. In reality, there is no relationship between Facebook usage and the Greek debt crisis; they just happened to occur at the same time. This is an obvious example, but if you have enough data sets – and the vast scope of data capture today almost guarantees that you will – you are sure to find correlations that are not real; they are there purely by chance.
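The chance-correlation problem is easy to reproduce. The sketch below (illustrative numbers, pure Python, no real data) generates 100 completely unrelated random walks – trending series, like a growth curve or a debt level – and then hunts for the strongest pairwise correlation among them. It will almost always turn up a striking one:

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n_series, length = 100, 50

# 100 unrelated random walks -- every step is pure noise
walks = []
for _ in range(n_series):
    level, walk = 0.0, []
    for _ in range(length):
        level += random.gauss(0, 1)
        walk.append(level)
    walks.append(walk)

# "hunt for correlation": check every pair and keep the strongest
best = max(
    abs(pearson(walks[i], walks[j]))
    for i in range(n_series)
    for j in range(i + 1, n_series)
)
print(f"strongest correlation between unrelated series: {best:.2f}")
```

With 100 series there are 4,950 pairs to check, so even though no pair is actually related, the best-looking pair typically correlates very strongly. The more datasets you scan, the more impressive the spurious winner looks.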
Implying causation
Hand-in-hand with correlation, you have causation – another, possibly even more damaging, pitfall. Even when a correlation is real, it may not imply causation. Amazingly, many respected economists, analysts, doctors, lawyers, and journalists make this mistake; hardly a week passes without a major, respected news outlet making it in a published article. The presence of correlation, combined with anecdotal evidence and intuition, can present a very strong temptation to conclude causation. Such mistakes have been known to shorten lives and put innocent people behind bars.
A simple example: ice cream sales are highly correlated with drownings. This correlation is real, not chance – both rise in summer. Does that mean ice cream is causing the drownings? Of course not.
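This is a classic confounder, and it can be simulated directly. In the sketch below (every coefficient is made up purely for illustration), a seasonal temperature series drives both ice cream sales and drownings; the two end up strongly correlated even though neither has any causal effect on the other:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(7)

# The hidden confounder: a seasonal temperature cycle over one year
temp = [15 + 10 * math.sin(2 * math.pi * d / 365) for d in range(365)]

# Neither variable causes the other; both respond to temperature
# (coefficients and noise levels are invented for the example)
ice_cream = [20 + 3.0 * t + random.gauss(0, 10) for t in temp]
drownings = [1 + 0.2 * t + random.gauss(0, 1) for t in temp]

r = pearson(ice_cream, drownings)
print(f"ice cream vs. drownings: r = {r:.2f}")
```

The correlation is genuine and would replicate year after year, yet banning ice cream would not prevent a single drowning. Controlling for the confounder (here, temperature) is what exposes the absence of a causal link.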
Not all examples are this simple. As a matter of fact, proving causation is one of the most difficult tasks in data science, so we often jump to conclusions. And that is a recipe for a worthless predictive or statistical model at best, and a disaster at worst.
Perhaps you think that achieving statistical significance is enough. In reality, at the most commonly used threshold (a p-value below 0.05), roughly one out of every 20 tests of a nonexistent effect will come out statistically significant over the long run, purely by chance.
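A quick simulation makes the point concrete. This sketch (illustrative, pure Python) tests 2,000 perfectly fair coins for bias using a standard two-sided z-test at the 5% level; even though no coin is biased, roughly one in 20 gets flagged as "significantly" biased:

```python
import random

random.seed(0)

def is_significant(flips):
    """Two-sided z-test against a fair coin at the 5% level."""
    n = len(flips)
    heads = sum(flips)
    # Under a fair coin, heads has mean n/2 and std dev 0.5 * sqrt(n)
    z = (heads - n / 2) / (0.5 * n ** 0.5)
    return abs(z) > 1.96  # 5% two-sided critical value

trials = 2000
false_positives = sum(
    is_significant([random.random() < 0.5 for _ in range(100)])
    for _ in range(trials)
)
rate = false_positives / trials
print(f"fair coins flagged as biased: {rate:.1%}")
```

Every one of these "discoveries" is a false positive; the 5% threshold guarantees that rate by construction. Scan enough metrics for significant effects and chance alone will hand you publishable-looking results.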
The enemy and friends
So what do we do – give up? No, we just need to understand our enemy: randomness. Randomness is there, and we can either accept it, or ignore it and find patterns in data where none exist.
Our friends in data science are the tools that keep us honest by challenging our own findings and assumptions. That is really what separates good data science from data voodoo. By taking the steps to understand the pitfalls we face, we are better positioned to achieve the intended benefit of data science and make a difference, whether in selling more product or curing cancer.