I was reading a BusinessWeek article titled “Kill Your Desk Chair” that cites the following “fact”:
A recent study, from the Pennington Biomedical Research Center in Baton Rouge, LA, followed 17,000 Canadians over 12 years and found that those who sat for most of the day were 54% more likely to die of heart attacks than those who didn’t.
Now, being a person who spends a lot of time sitting behind a desk or on an airplane, I find this “54% more likely to die of heart attacks” fact very concerning. Should I throw out my current desk and buy one of those expensive “stand up and work” desks?
Then I started to think like a data scientist, and started to challenge the assumption that there is some sort of causation between sitting and heart attacks. Some questions that immediately popped to mind included:
- Are there other variables, like lack of exercise or eating habits or age or stress of the job, which might be the cause of the heart attacks?
- Was a control group used to test the validity of the study results?
- Is there something about Canadians that makes them more susceptible to sitting and heart attacks?
- Who sponsored this study? Maybe the manufacturer of these new expensive “stand up and work” desks?
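The first question above, whether some other variable is doing the real work, can be sketched with a small simulation. This is only an illustration under stated assumptions: every probability below is made up (none comes from the study), and the hypothetical population is built so that age drives both sitting and heart attacks while sitting itself has no effect at all.

```python
import random

def simulate(n=200_000, seed=0):
    """Build a made-up population where age is a confounder.

    Assumptions (all hypothetical): older people sit more, and heart-attack
    risk depends on age only. Sitting has NO effect in this simulation.
    """
    rng = random.Random(seed)
    counts = {}  # (older, sits) -> [people, deaths]
    for _ in range(n):
        older = rng.random() < 0.5
        sits = rng.random() < (0.7 if older else 0.3)
        dies = rng.random() < (0.10 if older else 0.02)
        cell = counts.setdefault((older, sits), [0, 0])
        cell[0] += 1
        cell[1] += int(dies)
    return counts

def rate(counts, sits, older=None):
    """Death rate among sitters or non-sitters, optionally within one age group."""
    people = deaths = 0
    for (is_older, is_sitter), (p, d) in counts.items():
        if is_sitter == sits and (older is None or is_older == older):
            people += p
            deaths += d
    return deaths / people

counts = simulate()

# Crude comparison: sitters vs. non-sitters, ignoring age.
crude_ratio = rate(counts, True) / rate(counts, False)

# Same comparison, but within each age group separately.
older_ratio = rate(counts, True, older=True) / rate(counts, False, older=True)
younger_ratio = rate(counts, True, older=False) / rate(counts, False, older=False)

print(f"crude risk ratio:        {crude_ratio:.2f}")
print(f"risk ratio, older only:  {older_ratio:.2f}")
print(f"risk ratio, younger only: {younger_ratio:.2f}")
```

In this toy population the crude comparison makes sitters look far more likely to die, yet within each age group the ratio sits close to 1. That does not say the real study made this mistake; it only shows why the question about other variables is worth asking.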
It pays to be a bit skeptical when you hear these sorts of “factoids.” We should know better than to believe such claims blindly. We’ve all heard the weatherman state that there is a 60% chance of rain on days when there isn’t a cloud in the sky (I guess the weatherman could have just flipped a coin and made as accurate a prediction). And the Governor Walker recall election in Wisconsin raised all sorts of concerns when the early exit polls incorrectly predicted the results of the recall vote (and caused some folks to demand a recount because the exit polls didn’t match the actual results).
Correlation Is Not Causality
A good data scientist knows that there is a big difference between correlation and causality. Causality means that one thing actually produces a change in another; correlation only means that two things move in tandem. Just because two items move in tandem does not mean that one causes the other. The relationship between the events may not even make logical sense. Here are some of my favorite examples:
Do we really think that the growth in the number of active Facebook users is actually driving up the yield on the 10-year Greek government bonds? Unless joining Facebook requires all subscribers to sell their 10-year Greek government bonds, there is no causality in this correlation.
Do we really think that this mountain range is driving the murder rate in the state of New York (maybe having to climb such a mountain puts one in a more murderous state of mind)?
Okay, this one might actually be true…
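To see how easily two unrelated trends can line up, here is a small Python sketch. The two series are invented for illustration (they are not the data behind the examples above) and are generated independently of each other; the only thing they share is that both drift upward over time.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
months = range(120)

# Two made-up series with independent noise; both simply trend upward.
series_a = [100 + 5 * t + rng.gauss(0, 10) for t in months]
series_b = [2.0 + 0.1 * t + rng.gauss(0, 0.3) for t in months]

# Correlation of the raw levels is dominated by the shared upward trend.
r_levels = pearson(series_a, series_b)

# Month-over-month changes strip out the trend and expose the lack of a link.
changes_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
changes_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]
r_changes = pearson(changes_a, changes_b)

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The levels correlate strongly even though neither series knows the other exists, while the month-to-month changes show little relationship. Comparing changes rather than levels is one common sanity check, not a proof of anything by itself.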
Thinking like a Data Scientist: Having A Dubious Attitude
Thinking like a data scientist requires imagination, curiosity and a lot of skepticism to question or challenge whatever analytic insights are derived out of the data. Don’t forget that common sense makes a good yardstick to apply against any analytic results. A good data scientist tends to:
- Be very clear and thorough on defining the hypothesis (and null hypothesis) they are testing; to clearly and articulately state the problem that they are trying to solve and what determines a statistically valid result.
- Embrace an exploratory, discovery, visually inspective analytic process to understand, validate and cleanse the data by throwing out incomplete, inappropriate or inaccurate data, and to not let outliers skew the results.
- Focus on identifying patterns and quantifying correlations in the data through statistical, descriptive and predictive analytics, and then test whether those correlations reflect actual cause and effect.
- Grab whatever data might be available, whether or not the data scientist is even sure that they will use that data, and worry about the data integration issues as they come up in the analytic process.
- Build data enrichment processes and algorithms to create new variables and metrics that might be better predictors of performance.
- Be tolerant of “good enough” data to fuel “good enough” decisions (taking into account the “costs” associated with “Understanding Type I and Type II Errors”).
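The first item in the list above, stating a hypothesis and a null hypothesis, can be made concrete with a permutation test. This is one simple technique chosen for illustration, not the only way to test a hypothesis, and the function name, the measurements and the number of shuffles below are all hypothetical. The null hypothesis is that the group labels do not matter, so the labels are shuffled repeatedly to see how often chance alone produces a gap as large as the one observed.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate how often shuffled labels give a mean gap >= the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up measurements: the second group is clearly shifted upward.
control = [12, 15, 11, 14, 13, 16, 12, 15]
treated = [18, 21, 17, 20, 19, 22, 18, 21]
p_shifted = permutation_p_value(control, treated)

# Made-up measurements: two groups that barely differ.
group_x = [12, 15, 11, 14]
group_y = [13, 16, 12, 15]
p_similar = permutation_p_value(group_x, group_y)

print(f"p-value, clearly shifted groups: {p_shifted:.4f}")
print(f"p-value, similar groups:         {p_similar:.4f}")
```

A small p-value says the gap would rarely arise from shuffling alone; it still says nothing about why the groups differ. Where to set the cutoff is the Type I versus Type II trade-off mentioned in the last item of the list.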
What Does This Mean To The Business User?
So what does this mean to you as a business user in support of your data science team?
- Properly set up your hypothesis and detail out your analytic plan. Be clear and precise on what you are trying to prove and the business objectives. Be as granular, transparent and thorough as possible. If you are not clear, you might end up with a “correct” answer that is actually unusable (see chart to the right).
- Thoroughly document your business assumptions. Allow others to review and challenge the reasonableness and validity of your assumptions. Constantly ask if the assumptions are reasonable and realistic. Don’t forget the importance of at least contemplating “black swan” events in your model assumptions.
- Plan for experimentation, especially to test those model assumptions that have the biggest influence on the analytic results. Use sample groups to ensure that you are comparing apples to apples. Determine if you have failed enough (explored enough other options) before declaring victory.
- Properly interpret and apply results. Apply the common sense test. Are the results reasonable and are they actionable?
In summary, don’t accept the analytic results blindly just because they come with precise-looking numbers and probabilities. Challenge the analytic results and the conclusions drawn from the analytic models, especially when they come from those who may not have the analytic credentials, the experience or even the context of the business case against which the hypothesis is being tested. There have been some classic bad decisions drawn from what looked like rock-solid statistical analysis.
To learn more about EMC’s unique approach to leveraging Big Data to drive business value, please check out EMC’s Big Data Vision Workshop offering.
The moniker “Dean of Big Data” may have been applied in a light-hearted spirit, but Bill’s expertise around data analytics is no joke. After being deeply immersed in the world of big data for over 20 years, he shows no signs of coming up for air. Bill speaks frequently on the use of big data, with an engaging style that has gained him many accolades. He’s presented most recently at STRATA, The Data Science Summit and TDWI, and has written several white papers and articles about the application of big data and advanced analytics to drive an organization’s key business initiatives. Prior to joining Consulting as part of EMC Global Services, Bill co-authored with Ralph Kimball a series of articles on analytic applications, and was on the faculty of TDWI teaching a course on designing analytic applications.
Bill created the EMC Big Data Vision Workshop methodology that links an organization’s strategic business initiatives with supporting data and analytic requirements, and thus helps organizations wrap their heads around this complex subject.
Bill sets the strategy and defines offerings and capabilities for the Enterprise Information Management and Analytics within EMC Consulting, Global Services. Prior to this, he was the Vice President of Advertiser Analytics at Yahoo at the dawn of the online Big Data revolution.
Bill is the author of “Big Data: Understanding How Data Powers Big Business” published by Wiley.
©Bill Schmarzo, 2015. Unauthorized use and/or duplication of this material without express and written permission from this site’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Bill Schmarzo and with appropriate and specific direction to the original content.