Behavioral Risk Factors: Health Related Quality of Life (1993 - 2010)
I've been thinking a lot about study data, and as some of you may know I'm working with a couple of researchers to study social determinants of health. According to the all-knowing Wikipedia genie:
Why are we doing this? Well, if some people are more or less healthy because of factors like poverty, shouldn't those factors be a consideration when determining health policy?
The CDC has been studying behavioral risk factors since 1993 (actually, they stopped in 2010 due to a lack of funding) and their results are open to the public. I found this data set on data.world, one of my new favorite places to look for data. I'm pretty new to data.world, but I love the idea. People who are interested in data take pains to share it with others - and then that group starts a conversation about that data. Sometimes it's to suggest improvements, sometimes to toss out new ideas, and sometimes - as in my case - to share ideas of how to tell data stories with it. There are a ton of categories of data - and something for everyone. Let's hope data.world doesn't get disappeared, but in fact becomes a sanctuary site.
Anyway, back to the data. I cracked it open the other night - all 126,465 rows of it - and started digging. Pen and paper handy, I started drawing the columns and the connections, color-coded the categories that jumped out, and pored over the figures carefully. I'd read the warning the CDC included:
Recently, I wrote a blog post for Tableau Public about the challenges I had visualizing survey data. Those problems were all about how I'd designed (or incorrectly designed) my survey. I was thinking about how to lower the bar for the respondent, and wasn't really thinking about how hard it would be to clean the data. I had one survey with 6 questions. I got 500 responses, and it took me 6 hours to clean. Had I known then what I know now, it would have taken me less than 10 minutes.
This data set was a completely different animal than my piddly 6-question Google form. On one hand, it was nicely designed and very clean. Relatively easy to visualize. On the other, there was that warning I read, and it gave me pause. Here's where I have to be honest about something. I am not an expert, nor do I purport to be. I have no illusions about how challenging it is to analyze statistical data, especially when tying it to human behavior. I wouldn't hand me the raw data from a scientific study and expect conclusions, because I have no qualifications to draw any. There's too much at risk when policy decisions affect people's lives, and I do not have the skills or the training (or the desire) to take that on.
So why visualize this? Because there's still something I can offer. I'm ethical and transparent in my approach. I set my boundaries carefully when looking at study data. I don't ever try to draw conclusions or correlations from the data. That's the role of the statistician or the biostatistician. Even in my little 6-question survey I was careful not to draw conclusions because I knew that I couldn't factor in bias and I couldn't calculate the impact of word choice, or of the time of day, or any other factors that could impact how people respond. Same goes for the study data from the CDC.
I could, however, offer a glimpse of who responded: how many from where, how old and what gender. A demographic breakdown of responses. No analysis, just display. That's the Study Demographics tab above. I could also help anyone interested explore the mean values and the high and low confidence limits, as well as the percent of population as calculated by the report authors. Again, no analysis, just display. That's the Population Response Dashboard tab above.
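If you'd rather poke at the numbers outside of Tableau, here's a rough sketch of what "no analysis, just display" can look like in Python. To be clear, this is not how I built the dashboards. The column names (state, age_group, gender) and the sample rows are made up for illustration, and the sketch assumes one row per response, which may not match how the real CDC file is laid out.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. The column names are assumptions for illustration,
# not the actual headers in the CDC file.
SAMPLE_CSV = """state,age_group,gender
Ohio,18-24,Female
Ohio,25-34,Male
Texas,18-24,Female
Texas,65+,Female
Ohio,18-24,Male
"""


def count_by(rows, column):
    """Tally how many rows fall under each value of `column`.

    This only counts and displays. It draws no conclusions.
    """
    return Counter(row[column] for row in rows)


# Read the sample into a list of dicts, one dict per row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

by_state = count_by(rows, "state")
by_gender = count_by(rows, "gender")
by_age = count_by(rows, "age_group")

# Print each breakdown, largest group first.
for label, counts in [("state", by_state), ("gender", by_gender), ("age_group", by_age)]:
    print(label)
    for value, n in counts.most_common():
        print(f"  {value}: {n}")
```

To try it on a real file, swap the io.StringIO(SAMPLE_CSV) part for an open file handle and change the column names to whatever the file actually uses.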
Why is this valuable? Two reasons. First, have you ever tried to get a sense of an Excel spreadsheet that's 126k rows long and 25 columns wide? It's not easy unless you start filtering and charting, and that takes time. Second, I'm fast. Study analysis takes months or years. Give me a data set that's pretty clean and I can turn something around in a few hours. That's valuable. Plus, a visual dashboard can be a lot more accessible to a lot more people of different skill levels. A well-designed dashboard can make your first steps into a big study a lot easier.