Data Science Must Embrace Creating Not Just Collecting Data

One of the greatest failures of modern deep learning research, and the root cause of all its bias issues, has been its near-absolute reliance on free data that can be mass-harvested without cost, rather than paying to collect minimally biased data that accurately reflects society’s vast diversity.

In similar fashion, data science has come to be defined as collecting existing data and attempting to work around its limitations (or ignoring them entirely) rather than creating new datasets that actually bear on the question at hand. Why are data scientists so averse to creating new data? Why is it that data scientists no longer care how bad their data really is? Despite being fully aware of Twitter’s evolution over the last seven years, data scientists happily proceed with production analyses that rely on the very characteristics of Twitter that no longer exist.

Author: Kalev Leetaru
