One of the greatest failures of modern deep learning research and the root cause of all its bias issues has been its near-absolute reliance on free data that can be mass harvested without cost, rather than paying to collect minimally biased data that accurately reflects society’s vast diversity.
In similar fashion, the world of data science has come to be defined by collecting existing data and attempting to work around its limitations (or ignoring them entirely), rather than creating new datasets that actually bear on the question at hand. Why are data scientists so averse to creating new data? Why do data scientists no longer care how bad their data really is? Despite being fully aware of how Twitter has evolved over the last seven years, data scientists happily proceed with production analyses that rely on the very characteristics of Twitter that no longer exist.
Author: Kalev Leetaru