Daily Archives: 2016-06-03

Data APIs and Visualization: The Sum is Greater Than the Parts

Published by: Dan Rope

U.S. Government data has always been available for free to the public, but it can be tricky to find data curated to the point that it is easily consumed by all of our fancy visualization tools.  The folks over at datausa.io have done some interesting work building a data API that works across a variety of common and useful government statistics.  I was curious to see what the potential might be when we take a simple web data API and feed its content to a simple visualization language…

The mechanics turned out to be easy, since the datausa.io API can return results formatted as CSV, which is what Brunel Visualization consumes.  So the data query essentially boils down to a URL placed inside a Brunel data() statement.  Visualizations can even be created immediately by pasting these URLs into the “data” section of the Brunel Visualization online app.
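As a rough illustration of that idea, here is a small Python sketch that assembles a query URL and wraps it in a Brunel data() statement.  The endpoint (example.org) and the parameter names are placeholders I made up for the sketch, not the actual datausa.io query syntax, and the helper brunel_data_statement is hypothetical.

```python
from urllib.parse import urlencode


def brunel_data_statement(base_url, params):
    """Wrap a CSV-returning query URL in a Brunel data() statement."""
    query = urlencode(params)
    url = f"{base_url}?{query}" if query else base_url
    return f"data('{url}')"


# Placeholder endpoint and parameters, for illustration only.
statement = brunel_data_statement(
    "https://example.org/api/csv/",
    {"show": "geo", "sumlevel": "state"},
)

# Append any Brunel chart description after the data() statement.
print(statement + " bar x(adult_obesity) y(#count) bin(adult_obesity)")
```

The printed string is the kind of thing that could be pasted into a Brunel editor, assuming the URL really does return CSV with the named column.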

So, on to some examples.  This first one uses workforce data from the ACS PUMS data provided by the US Census Bureau.  The top graph shows a heatmap of wages by hours worked per week (binned) and colored by age for full-time employees.  The age value is the median of the average ages of the occupations in the bin.  Note: it would probably be better here to calculate a weighted average of age using the field containing the number of people within the occupation.  Click on a cell to see the occupations within it below on the bubble chart.  The size of the bubble represents the number of people in the occupation and the color corresponds to the Gini coefficient.  Higher (darker) Gini values indicate greater wage inequality within the occupation.

It’s interesting to poke around with some of the outlying cells to find the occupations with the highest wages and shortest hours or vice versa.  Also, the occupations in these outlying cells seem to have the most consistent wages.

The full code (including retrieving the data) for the above example is:

x(avg_wage_ft) y(avg_hrs_ft) color(avg_age_ft:red) median(avg_age_ft)  
    bin(avg_wage_ft, avg_hrs_ft)  interaction(select) | 
bubble x(soc_name) color(gini_ft:blue) size(num_ppl_ft) label(soc_name) 
    sum(num_ppl_ft) tooltip(#all) interaction(filter)
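The note earlier about weighting the age by the number of people in each occupation can be sketched in a few lines of Python.  This is only a sketch of the arithmetic, done outside Brunel; the field names come from the code above, but the two rows are toy values invented for the example, not real ACS PUMS figures.

```python
def weighted_mean(rows, value_field, weight_field):
    """Average value_field across rows, weighting each row by weight_field."""
    total_weight = sum(row[weight_field] for row in rows)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    weighted_sum = sum(row[value_field] * row[weight_field] for row in rows)
    return weighted_sum / total_weight


# Toy occupations falling in one heatmap cell (made-up numbers).
occupations = [
    {"soc_name": "Occupation A", "avg_age_ft": 30.0, "num_ppl_ft": 100},
    {"soc_name": "Occupation B", "avg_age_ft": 50.0, "num_ppl_ft": 300},
]

weighted = weighted_mean(occupations, "avg_age_ft", "num_ppl_ft")
unweighted = sum(row["avg_age_ft"] for row in occupations) / len(occupations)
print(weighted, unweighted)
```

With these toy rows the larger occupation pulls the weighted figure toward its own average age, which is the effect an unweighted summary of per-occupation averages would miss.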

As expected, a lot of government data is summarized geographically.  This next example uses health metrics aggregated at the state level from the University of Wisconsin’s County Health Rankings.  The histograms show the distributions of four of these metrics across all 50 states.  Roll the mouse over a histogram bar to highlight (brush) the states that correspond to those values, or click on your state (if you reside in the US) to see where its values land for each metric.

Again, the full source code is:

map key(geo_name) opacity(#selection) tooltip(geo_name,adult_obesity,health_care_costs,diabetes,
    excessive_drinking) at(0,0,100,50) interaction(select) |  
bar x(adult_obesity) axes(x) y(#count) bin(adult_obesity) opacity(#selection) stack 
    interaction(select:mouseover) at(0,50,50,75)  |  
bar x(health_care_costs) axes(x) y(#count) bin(health_care_costs) opacity(#selection) stack 
    interaction(select:mouseover) at(0,75,50,100)  |  
bar x(diabetes) axes(x) y(#count) bin(diabetes) opacity(#selection) stack  
    interaction(select:mouseover) at(50,50,100,75) | 
bar x(excessive_drinking) axes(x) y(#count) bin(excessive_drinking) opacity(#selection) stack 
    interaction(select:mouseover) at(50,75,100,100)
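For readers curious about what the brushing amounts to, here is a minimal Python sketch of the idea: bin a metric, then find which states fall into the bin under the mouse.  It is not how Brunel implements selection internally; the function names, the bin width, and the state values are all made up for illustration.

```python
import math


def bin_index(value, low, width):
    """Return the index of the fixed-width bin that contains value."""
    return math.floor((value - low) / width)


def states_in_bin(values_by_state, low, width, selected_bin):
    """List the states whose value falls in the selected histogram bin."""
    return sorted(
        state
        for state, value in values_by_state.items()
        if bin_index(value, low, width) == selected_bin
    )


# Toy adult obesity percentages (invented, not County Health Rankings data).
adult_obesity = {"State A": 22, "State B": 27, "State C": 31, "State D": 29}

# Hovering over the second bar (bin 1 covers 25 up to, but not including, 30).
selected = states_in_bin(adult_obesity, low=20, width=5, selected_bin=1)
print(selected)
```

The same selected set would then drive the opacity of the map and of the other three histograms, which is what the shared #selection field does in the Brunel code above.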

Having served in a government data agency in the past, I am well aware that a major concern with this type of flexibility is the potential for misuse when people skip the fine print about what the data is and what it represents.  Nonetheless, powerful data APIs combined with flexible, rapid visualization design provide significant and interesting learning opportunities.