Social media has become a favorite source of data for all kinds of research. User experience designers working on social media platforms use their access to vast data sources to draw all kinds of conclusions. Content marketers in all industries use it to gauge customer sentiment about their products and services. Social science researchers use it to model societal values and relationships. I have used it myself when studying communication patterns in communities of practice.
So what happens if we find out that these data have serious validity flaws?
To anyone studying humanity, the big data generated by social media can be hard to resist. But that kind of data is often tainted by bias, argues a new paper published in Science, and data scientists and the public should be on alert.
The full paper is available here, but it is gated, so the FastCo article may be a more accessible source if you trust their reporting skills. Or you can just read on…
I suspect that you already knew that the user group we get from social media data is not representative of the wider population. And those who are most active are not even representative of the mass customer base of social media, so the data is skewed there too.
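To get a feel for how much a few very active users can dominate a data set, here is a minimal Python sketch. It is only an illustration: the function name `top_user_share`, the 10% cutoff, and the sample data are all made up for this example and do not come from the paper.

```python
from collections import Counter


def top_user_share(post_authors, top_fraction=0.1):
    """Return the share of posts written by the most active users.

    post_authors: one entry per post, naming the user who wrote it.
    top_fraction: hypothetical cutoff for "most active" (10% by default).
    """
    counts = Counter(post_authors)
    if not counts:
        return 0.0
    # Post counts per user, largest first.
    ranked = sorted(counts.values(), reverse=True)
    # Always include at least one user in the "top" group.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Made-up sample: one heavy poster and nine people who posted once.
authors = ["ann"] * 41 + ["user%d" % i for i in range(1, 10)]
share = top_user_share(authors)
print(f"Top 10% of users wrote {share:.0%} of the posts")
```

If a number like this comes out high for your own data, any average you compute mostly describes your heaviest posters, not your typical user.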
The authors also found that the data you get from scraping the traffic is not clean. Bots make up a significant amount of traffic, and they are not always removed. Do you want your conclusions to describe bot behavior or human behavior?
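As a rough idea of what removing bots before analysis might look like, here is a small Python sketch. Everything in it is an assumption for illustration: the `Account` fields, the `looks_automated` rule, and both thresholds are hypothetical, and real bot detection is considerably harder than two cutoffs.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    posts_per_day: float
    duplicate_ratio: float  # fraction of posts repeating an earlier post verbatim


def looks_automated(acct, max_posts_per_day=200.0, max_duplicate_ratio=0.8):
    """Crude heuristic with hypothetical thresholds, not a validated detector."""
    return (acct.posts_per_day > max_posts_per_day
            or acct.duplicate_ratio > max_duplicate_ratio)


def split_accounts(accounts):
    """Separate accounts into (likely human, flagged as automated)."""
    humans = [a for a in accounts if not looks_automated(a)]
    flagged = [a for a in accounts if looks_automated(a)]
    return humans, flagged


# Made-up sample accounts.
sample = [
    Account("casual_reader", posts_per_day=2.0, duplicate_ratio=0.05),
    Account("deal_blaster", posts_per_day=900.0, duplicate_ratio=0.10),
    Account("echo_echo", posts_per_day=30.0, duplicate_ratio=0.95),
]
humans, flagged = split_accounts(sample)
print("kept:", [a.name for a in humans])
print("flagged:", [a.name for a in flagged])
```

Keeping the flagged accounts in a separate list, rather than silently dropping them, lets you report how much of the traffic you excluded and check whether your conclusions change when you put them back.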
Many users misrepresent a lot of their actions, preferences, emotions, and relationships on social media. This is not just the mythical “no one knows you’re a dog” kind of deception. It can be as simple as responding to social pressure to like what is popular, embellishing your descriptions, or pretending to be more of something than you really are.
One of my favorite examples is Google Flu Trends. It became famous in 2008 when it accurately predicted where and when the flu had spread across the U.S. But then this relationship became public. In pure Heisenberg fashion, once people knew that their searches for “flu” and flu-related terms were being monitored and modeled, more of them began making those searches. So by 2012, flu prevalence was overpredicted by 95%.
Do you use social media data in your work? Do you rely on its validity? Will any of what I have reported here change the way you collect or use this data? It would be great if you could share some of your experience or ideas with us so we can all get a little better at using social media data in our work.
Image credit: kropekk_pl