Data preparation is often considered a necessary precursor to the “real” work found in visualizing or analyzing data, but this framing sells data prep short. The ways in which we cleanse and shape data for downstream use have significant bearing on our final analytic output, and cutting corners on data prep can run up a huge cost for companies.
According to a report from the Harvard Business Review, bad data costs the U.S. roughly $3 trillion per year, primarily due to the time involved in correcting data and the consequences of errors leaking through to customers. Below, we’ve outlined what we consider a “data prep sin”: a practice that will surely change the end result for the worse.
Sin #1: Removing data
Removing records that are incomplete, erroneous, outlying, or extraneous is one of the most common transformations in data preparation. However, removing data can introduce bias or affect downstream results in meaningful ways.
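To make the bias concrete, here is a minimal sketch in Python. The records, the region names, and the helper functions are all hypothetical, invented purely for illustration. It assumes missing values cluster in one group, which is a common way that dropping incomplete rows skews a dataset.

```python
from statistics import mean

# Hypothetical records: (region, order_value). None marks a missing value.
# In this made-up data, the missing values all fall in the "south" region.
records = [
    ("north", 120.0), ("north", 130.0), ("north", 110.0), ("north", 140.0),
    ("south", 40.0), ("south", None), ("south", None), ("south", 60.0),
]


def drop_incomplete(rows):
    """Remove every record whose value is missing."""
    return [row for row in rows if row[1] is not None]


def share_by_region(rows):
    """Return each region's share of the total record count."""
    counts = {}
    for region, _ in rows:
        counts[region] = counts.get(region, 0) + 1
    return {region: n / len(rows) for region, n in counts.items()}


before = share_by_region(records)
cleaned = drop_incomplete(records)
after = share_by_region(cleaned)

# Overall mean computed on the rows that survived the filter.
naive_mean = mean(value for _, value in cleaned)

print("share before:", before)
print("share after: ", after)
print("mean after dropping rows:", naive_mean)
```

In this toy data, "south" makes up half of the original records but only a third of the cleaned ones, so any overall statistic computed afterward leans toward "north". Comparing group shares before and after a filter is one inexpensive way to notice that a removal step has changed who is represented.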
Author: Sean Kandel