Causal Modeling for Social Media Self-Selected Samples

Marketing has used data to answer questions for as long as the term has had its modern meaning. Its traditional methods rely on expensive data gathering and experimental procedures such as polls, focus groups, eye tracking, and click tracking, usually combined with rigorous sampling techniques.

Work with social media data (e.g., analyzing Twitter data) runs into the self-selection problem: the people talking about a particular movie online are the subset of people interested in the movie who are vocal enough to publish their views to a global audience. This is not a minor issue: recent studies found that 75% of adults in the US never tweet, which is in line with the 1% rule of Internet culture.

Interestingly, the problem of self-selected samples has been studied in causal modeling, and techniques are available to deal with it.

While the mathematics and the techniques exist, they are not being applied in data science practice. Practitioners are happy to take a collection of posts, run sentiment analysis over them, and carry graphs and results based on that sample forward into further analysis and decision making. A library or detailed tutorial on how to correct for these issues would be of great value to the field.
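
As a rough illustration of the kind of correction such a tutorial might cover, the sketch below reweights a self-selected sample of posts so that each group counts in proportion to its share of the target population. The text above does not name a specific technique; post-stratification (a simple form of inverse probability weighting) is used here only as an assumed example. The group labels, the sentiment scores, the population shares, and the function names naive_mean and poststratified_mean are all hypothetical.

```python
from collections import Counter

def naive_mean(posts):
    """Average sentiment over the posts exactly as collected."""
    return sum(score for _, score in posts) / len(posts)

def poststratified_mean(posts, population_share):
    """Average sentiment after reweighting each post by
    (population share of its group) / (sample share of its group).

    Assumes that, within a group, who posts is unrelated to sentiment.
    """
    n = len(posts)
    counts = Counter(group for group, _ in posts)
    weights = {g: population_share[g] / (counts[g] / n) for g in counts}
    weighted_sum = sum(weights[g] * score for g, score in posts)
    total_weight = sum(weights[g] for g, _ in posts)
    return weighted_sum / total_weight

# Hypothetical data: (group, sentiment) with sentiment 1 = positive, 0 = negative.
# Younger users are over-represented among the people who post.
posts = [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 1)] + [("old", 0)]

# Hypothetical population shares, e.g. taken from census or survey data.
population_share = {"young": 0.2, "old": 0.8}

naive = naive_mean(posts)
corrected = poststratified_mean(posts, population_share)
print(f"naive estimate:     {naive:.2f}")
print(f"corrected estimate: {corrected:.2f}")
```

In this toy data the vocal group is both over-represented and more positive, so the naive average overstates the sentiment of the wider population. The correction only addresses selection on observed characteristics: if the decision to post depends on the opinion itself, reweighting by group is not enough and a more explicit selection model would be needed.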