By Eric Van Meter
When Facebook announced in September that it would use all that personal data it collects to roll out a new ad platform to rival the almighty Google, privacy advocates groaned and marketers grinned. But what if all that intelligence could be used not to sell more widgets but to crack open one of today’s most pressing yet least understood public health issues?
That’s precisely the vision of the University of Arizona’s Daniel Zeng, MIS professor at the Eller College of Management, and Scott Leischow, adjunct faculty in the UA College of Medicine and professor of health services research at Arizona’s Mayo Clinic. Fusing cutting-edge informatics and public health, their plan to scrape social media to create the world’s best data on e-cigarette usage and marketing recently won a five-year, $2.7 million thumbs up from the NIH.
The project will tackle four distinct goals over the next five years:
- Creating a massive, real-time and continuously growing data set of what consumers and marketers say about e-cigarettes on sites like Facebook and Twitter as well as social media forums focused on e-cigarettes and “vaping”
- Mining that content for insights into why people use e-cigarettes, how they believe they affect their health and whether or not they help them quit smoking
- Documenting the marketing landscape — all the ways brands and vendors use these channels to promote their products and how consumers respond
- Integrating all of that information into the world’s first one-stop resource for wide-ranging social media data on e-cigarettes, built as a tool for other researchers, healthcare professionals and more
While e-cigarettes are relatively new in the U.S. (introduced in 2007), sales are doubling annually and were expected to reach $1 billion last year. Even so, any time public dollars fund research, two questions naturally arise: Why study this, and why study it this way?
“There’s so much we don’t know about e-cigarettes,” says Leischow. “The scientific community has found mixed data on whether they’re helpful for smoking cessation, we have questions about how different flavorings impact use, particularly among minors, and many health professionals worry that e-cigarettes may ultimately lead to more young people taking up smoking. All of these blind spots around a product that is still totally unregulated make this a top priority area for the FDA.”
As for why it makes sense to study e-cigarettes in this way, Zeng’s MIS expertise holds the key. “To date, most of the inquiry in this area has been through surveys or individuals personally combing through what people are saying online. Both methods carry inherent problems.”
In contrast, mining social media in real time, as Zeng and Leischow have proposed, offers a number of strategic advantages:
- Data comes from people interacting naturally in their day-to-day lives, thus removing the “presentation bias” problems intrinsic to surveys
- The data collection is automated, which means sample size is not constrained by how much money or how many eyeball hours researchers can muster
- That lack of constraint also makes anecdotal information scientifically relevant: one personal story is just that, but 10,000 or 100,000 personal stories over time equal robust statistical data
- Finally, because content is processed by algorithms, not people, data is available in near real time, not months or even years later after countless hours of labor-intensive review
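The claim that thousands of personal stories add up to robust statistics can be illustrated with a rough calculation. The sketch below is not part of the project; it assumes posts behave like independent samples (a simplification for social media) and uses a hypothetical 30 percent share of posts mentioning quitting to show how the approximate 95 percent margin of error shrinks as the number of posts grows.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a share p observed across n posts.

    Uses the normal approximation for a proportion; assumes independent posts.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical share of posts that mention quitting smoking (illustrative only).
share = 0.30

for n in (100, 10_000, 100_000):
    print(f"{n:>7} posts: +/- {margin_of_error(share, n):.4f}")
```

Because the margin scales with one over the square root of the sample size, going from 100 to 10,000 posts narrows it by a factor of ten.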
Making that vision a reality depends on a very elite class of technical wizardry. For starters, the world of e-cigarettes, like that of any niche product or interest, has its own specialized vocabulary of acronyms and slang, and so the research team will first need to construct a base lexical dataset for “training” the computers that will collect and process content.
It’s also one thing to scrape words but a far more complex challenge to automate the process of extracting meaning, so that a computer can spot when someone cites a reason for using e-cigarettes or mentions how the products affect their health (both of which first require a computer to detect who is or isn’t a user) or correctly catalog the marketing strategy used in an advertisement.
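To make the gap between scraping words and extracting meaning concrete, here is a deliberately naive sketch. It is not the team’s method: the `LEXICON` entries, the cue patterns and the `tag_post` function are all hypothetical, and simple keyword rules like these would misfire often on real posts, which is exactly why the researchers expect to need trained models.

```python
import re

# Hypothetical mini-lexicon mapping slang to a normalized concept (illustrative only).
LEXICON = {
    "vape": "e-cigarette",
    "vaping": "e-cigarette",
    "ecig": "e-cigarette",
    "e-cig": "e-cigarette",
    "e-juice": "e-liquid",
    "eliquid": "e-liquid",
}

# Crude cues, assumed for this sketch: first-person wording and quitting language.
FIRST_PERSON = re.compile(r"\b(i|my|me)\b", re.IGNORECASE)
QUIT_CUES = re.compile(r"\b(quit|quitting|stopped smoking|cut back)\b", re.IGNORECASE)

def tag_post(text):
    """Tag one post with lexicon concepts and two rough yes/no signals."""
    # Split into lowercase word-like tokens, keeping hyphens so "e-juice" survives.
    tokens = re.findall(r"[a-z][a-z'-]*", text.lower())
    concepts = sorted({LEXICON[t] for t in tokens if t in LEXICON})
    # A post counts as a likely user only if it names a product in the first person.
    likely_user = bool(concepts) and bool(FIRST_PERSON.search(text))
    return {
        "concepts": concepts,
        "likely_user": likely_user,
        "mentions_quitting": likely_user and bool(QUIT_CUES.search(text)),
    }

print(tag_post("I switched to vaping to quit smoking"))
print(tag_post("New e-juice flavors in stock, 20% off!"))
```

The first post is tagged as a likely user who mentions quitting, while the second, a vendor promotion, matches the lexicon without any first-person cue. A sarcastic post or a secondhand story would fool rules this simple.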
“We basically will be creating a suite of novel technologies for this study using both established building blocks of informatics and methods that have yet to be developed,” Zeng explains, “including analysis and visualization tools that were developed here at the U of A. Beyond that, we’re relying on proven tools for pattern mining, group behavior prediction, social network analysis and a lot more, but in ways that have never been combined for this level of research and in this topic area.”
For Leischow, the knowledge those tools will produce is invaluable. “There are all kinds of messages out there, from how effective e-cigarettes can be to help smokers quit tobacco to how they’re totally harmless or taste like candy. It may be that e-cigarettes prove beneficial to public health, or they may be shown to do more harm than good. In either case, it often takes many years for experts to fully recognize how products are being used and how they impact wellbeing, and even longer for regulation to catch up. This time, it’s going to be different. This time, we’re getting out ahead.”
Top photo of e-cigarette courtesy Shutterstock.