Junk Charts
Opinion graph
Sharon P. pointed us to this graph, which can make one speechless.
Reference: "Bulls, Bears, Donkeys, Elephants", New York Times, Oct 14 2008.
A joke
For light entertainment, see this chart sent in by Chris P. Original here.
Chris said: "This graph breaks a few rules, but it has a clear message."
The shocking, out-of-the-box column certainly grabs attention, and it is probably true that football coaches earn too much money. But the chart really falls down on this one issue:
What's the median salary of these football coaches?
Reference: "Academic salaries", PHD Comics.
Dealing with skew
Bernard L pointed us to this income distribution chart printed in the Economist.
The accompanying paragraph points to the range of the bars, that is, the gap between the top decile average and the bottom decile average, as evidence of income disparity, concluding that the US and Britain are among the worst.
Bernard likes the use of vertical sections to represent the average incomes by decile and dislikes the USA-Today style background image. Agreed. But why plot the middle deciles at all when the only worthy data involve the endpoints of the bars?
A close examination of the spacing of the middle deciles leads to more befuddlement. There does not appear to be much difference between the countries.
The answer to this is that decile statistics are not appropriate for data as skewed as incomes. At the high end, the 10% intervals are too coarse.
One clue to this is that the top 10% in the US earn only $90,000 on average, yet we have all heard of the billion-dollar hedge fund managers, Wall Street bankers and $30-million-a-movie celebrities. The problem is that within the top decile, the income distribution is also tremendously skewed.
The neat idea of plotting the vertical sections indicates an awareness that the red dots (average income) are insufficient because of the skew. Alas, there remains a lot of skew above the top decile and the designer inadvertently falls back into the same trap by considering the average income within the top 10%. Thus, the amount of disparity on the right side of the chart is grossly underestimated. Roughly speaking, we are looking at 10 samples of the distribution, nine of which are at the low end of the range and only one at the top end (long tail). Here is the idea:
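The same point can be checked numerically. The sketch below is only an illustration: it assumes incomes follow a Pareto distribution with made-up parameters (`ALPHA`, `X_MIN`), not the Economist's data, and compares the average of the top 10% with the average of the top 1%.

```python
# Hypothetical illustration: incomes follow a Pareto distribution.
# ALPHA and X_MIN are assumed values, not taken from the Economist chart.
ALPHA = 1.5      # tail index; smaller means a heavier tail
X_MIN = 20_000   # lowest income in the synthetic population
N = 10_000       # number of synthetic earners


def pareto_quantile(u):
    """Inverse CDF of the Pareto distribution at probability u (0 <= u < 1)."""
    return X_MIN / (1.0 - u) ** (1.0 / ALPHA)


# Evenly spaced probabilities give a deterministic, sorted "population".
incomes = [pareto_quantile((i + 0.5) / N) for i in range(N)]


def mean(values):
    return sum(values) / len(values)


# Average income within each decile, from poorest to richest.
size = N // 10
decile_means = [mean(incomes[d * size:(d + 1) * size]) for d in range(10)]

top_decile_mean = decile_means[-1]
top_percent_mean = mean(incomes[-(N // 100):])

for d, m in enumerate(decile_means, start=1):
    print(f"decile {d:2d}: {m:12,.0f}")
print(f"top 1%   : {top_percent_mean:12,.0f}")
```

Under these assumptions the top 1% average is several times the top-decile average, which is the disparity a single top-decile number conceals.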
Reference: "Spreading the wealth", Economist, Oct 21 2008.
The matter of bad choice
Right on the heels of the disastrous bubble chart comes another, courtesy of the NYT Magazine. Bubble charts are okay for the conceptual ("this is really big, and that is really tiny"). This chart wants readers to compare the sizes of the bubbles, which highlights the worst part of such graphs.
Poor scaling is the huge issue with bubble charts. They are the prototype of what I call not "self-sufficient" charts. Without printing all the data, the chart is unscaled, and thus useless (see below middle). When all the data is printed (as in the original, below left), it is no better than a data table.
In the above right chart, we simulated the situation of a bar or column chart, i.e. we provide a scale. For this chart, the convenient "tick marks" are at 10, 20, 34, 41. Unfortunately, this scaled version also fails to amuse.
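Part of the scaling problem is geometric. If bubbles are sized so that area is proportional to the value, the visible width grows only with the square root of the data. A small sketch using the four tick values quoted above; area-proportional scaling is an assumption here, since the chart's actual scaling rule is not stated.

```python
import math

values = [10, 20, 34, 41]  # the "tick mark" values quoted in the post


def bubble_radius(value, scale=1.0):
    """Radius of a circle whose area equals scale * value."""
    return math.sqrt(scale * value / math.pi)


base = bubble_radius(values[0])
for v in values:
    ratio = bubble_radius(v) / base
    print(f"value {v:2d}: {v / values[0]:.1f}x the data, {ratio:.2f}x the radius")
```

A value four times as large gets a bubble only twice as wide, so readers who judge by width will understate the differences.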
Note further that the data should have been presented in two sections: the party affiliation analysis and the gender analysis. Also, it is customary to place "Independents" between "Republicans" and "Democrats" because they are middle-of-the-road.
A profile chart is an attractive way to show this data. Here, we quickly learn a couple of things obscured in the bubble chart.
On the issue of abortion, Independents are much closer to Democrats than to Republicans. Also, there is barely any difference between the genders, the only difference being the strength of support among those who want to legalize.
Reference: "A matter of Choice", New York Times Magazine, Oct 19 2008.
PS. Based on RichmondTom's suggestion, here are the cumulative profile charts.
Bernard L. suggested a "tornado" chart:
Mind the gap
When comparing two time series, one typically wants to discuss the size of the gap as it changes over time. This Business Week chart, for example, depicted for readers the expanding gap between intra-day high and low prices of the S&P 500 for 2008.
This chart construct is effective at pointing out large changes but lacks precision in conveying smaller differences, or trends. It is always a good idea to plot the gap directly, as we will show below.
More importantly, a better choice of scale can help a lot. By focusing exclusively on variability (extreme values), this chart hides the relevant information of the closing prices of the S&P. A point spread of 100 points means more when the index is at 800 than at 1200. In order to capture this, we can divide the point spread by the opening price of that day, so that we can say the gap is one-eighth or one-twelfth of the opening price.
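The rescaling is a one-line calculation. A minimal sketch, with a hypothetical function name and illustrative prices chosen to match the 800 and 1200 index levels mentioned above:

```python
def relative_spread(high, low, open_price):
    """Intra-day point spread as a fraction of the opening price."""
    if open_price <= 0:
        raise ValueError("open_price must be positive")
    return (high - low) / open_price


# The same 100-point spread at two index levels (illustrative numbers).
print(relative_spread(850, 750, 800))     # one-eighth of the open
print(relative_spread(1250, 1150, 1200))  # one-twelfth of the open
```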
The junkart version makes both changes. The top chart fixes the scale, plotting the point spread as a percentage of daily opening prices. Relative to the original chart, the variability in the front part of 2008 was muted because the index was at higher levels back then.
The bottom chart plots the gap sizes (lengths of the high-low lines). Without doubt, directly plotting the gaps showcases the key message: the current level of volatility is more than double what occurred at the beginning of the year.
If one wants to illuminate the trend as opposed to daily fluctuations, a further improvement would be to use moving averages.
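For readers who want to try the smoothing, here is a minimal sketch of a trailing simple moving average. The window length and the sample series are hypothetical; other smoothers would serve equally well.

```python
def moving_average(series, window):
    """Trailing simple moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    running = sum(series[:window])
    averages = [running / window]
    for i in range(window, len(series)):
        # Slide the window: add the newest value, drop the oldest.
        running += series[i] - series[i - window]
        averages.append(running / window)
    return averages


daily_gap_pct = [1.2, 3.4, 0.9, 2.8, 4.1, 1.5, 3.9]  # hypothetical daily gaps (%)
print(moving_average(daily_gap_pct, 5))
```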
For those interested, shown below is a scatter plot that compares the original point spread and the derived point spread, which shows that the change is not trivial.
Reference: "The Market: A Daily Roller Coaster", Business Week, Oct 27 2008.
How to read a graph
Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.
Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.
This post is about how to read a graph. Here are some things that come to mind looking at the map:
- Sampling bias: how does Amazon's customer base compare with the U.S. population, or rich voters? It would be prudent to check this before making generalizations. Gelman's point may be that Amazon customers behave like rich voters.
- Sampling period: is the period long enough to capture the average inclination of the book buyers? As is well known, book sales follow a long-tail distribution (Chris Anderson wrote an entire book based on this observation.) Best-sellers have a disproportionate influence on average values. If the time period is too short, the data may only represent the best-sellers. Consider the following two maps in successive periods in 2004:
- Classification: The long-tailed nature of book sales has wide-reaching implications on interpreting the data. The most essential feature is that single books (bestsellers) have a disproportionate impact on average sales. Since the key metric here is proportion of red (or blue) books, it follows that whether a best-seller is classified as red or blue makes a huge difference.
If the purple books include best-sellers, then the decision to call it purple rather than red or blue causes an influential book to be excluded from the calculation. We often forget that the decision to exclude is not a neutral decision; it is an active decision that says the excluded data contains no useful information.
This is not to say that excluding those books is the wrong decision. We must make these decisions with considerable care, and realize that excluding best-sellers when book sales have a long-tailed distribution must not be taken lightly.
- Causality: Let's say we are sufficiently satisfied that we can make a statement about book buying habits and voting behavior. Then we need to think about the direction of causality. Is the map saying that red book buyers are likely to vote red? Or that red voters are likely to buy red books? No prolonged staring at this data set will resolve this issue as other data would be needed to address it.
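The classification point above can be made concrete with a toy calculation. Everything in this sketch is hypothetical: sales are assumed to fall off as 1000 / rank, and the red/blue labels are assigned arbitrarily. Only the label of the single best-seller changes between runs.

```python
# Hypothetical long-tailed sales: the book at rank r sells 1000 / r copies.
sales = {rank: 1000 / rank for rank in range(1, 101)}

# Assumed labels for everything except the best-seller (rank 1):
# odd ranks are "red", even ranks are "blue".
labels = {rank: ("red" if rank % 2 else "blue") for rank in range(2, 101)}


def red_share(bestseller_label):
    """Share of red sales among red + blue sales, given the rank-1 label.

    A "purple" best-seller is excluded from both numerator and denominator.
    """
    all_labels = dict(labels)
    all_labels[1] = bestseller_label
    red = sum(sales[r] for r, lab in all_labels.items() if lab == "red")
    blue = sum(sales[r] for r, lab in all_labels.items() if lab == "blue")
    return red / (red + blue)


for label in ("red", "purple", "blue"):
    print(f"best-seller is {label:6s}: red share = {red_share(label):.1%}")
```

In this toy setup, relabeling one book out of a hundred is enough to move a state from majority red to majority blue.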
The more data is used to create a graph, the harder our task is to interpret it. But the pay-off for spending the time is all the sweeter. Happy graph-reading!
One final note: there is no doubt that this interactive map feature is a brilliant marketing move by Amazon. This is a great and fun way for readers to find interesting books.
Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.
Break it down, build it up
Thought of the day:
While commuting today, I wondered why we use the term "data analysis" or "data analyst". I recalled that in chemistry class, we learnt that analysis means breaking things down while synthesis means building things up.
With regard to data, typically we try to collect data at the most detailed level and we build up messages and stories from the little pieces. We don't break things down. We can't break things down, in fact, if the data come to us in aggregated form. (Think ecological fallacy.)
So why don't we say "data synthesis" rather than "data analysis"?
From bad to worse
Pie charts can range from bad to worse. Brent L. pointed us to a few on the right end of that spectrum.
Brent wrote: "The background image makes it almost totally unreadable. And what does the forest scene have to do with programming? *sigh*"
That's not to mention the oval rather than circle, the dizzying array of colors, the Excel-style legend that inverts the order of importance ("Other" at the top), etc.
Again, a column chart would have been much clearer. Since the total number of famous programmers is arbitrary, a chart of counts would work at least as well as one that plots proportions.
More here.
Reference: "Famous programmers from Adleman to Zimmerman", grokcode.
Vanishing act
This is a well-executed chart showing the complex dealings between Wall Street firms in the last 40 years.
They found a way to present all the information without criss-crossing lines. The right column is the clincher: it lists all the important recent events.
Reference: "Wall Street: RIP", New York Times, Sep 28 2008.
Political theater
Jens, a long-time reader, tried to re-make the boring data tables used to report poll data. Here is an example from USA Election Polls (left) and his enhanced version (right).
Like Jens, I find most tabular presentations of poll data underwhelming: too much data hides all the useful information. For example, the pollster and polling date data provide a context for super-serious poll watchers to interpret the data; however, they do not present themselves in a way that actually helps readers. Read further for versions that bring out this data much better.
Meanwhile, Jens' revision uses color and ordering to bring out the current state of affairs. The addition of electoral votes allows us to understand the relative weight of each row, countering the weakness of the tabular format, that each row has the same height, implying erroneously that they have the same importance.
There are a number of good web-sites where this type of data is presented in attractive ways.
I have been a fan of Political Arithmetik, which made great use of the pollster and polling date data mentioned above. Those data have been averaged to show the overall trend while the individual poll results are plotted as dots in the background. The polling date data is embedded in the horizontal positions of the dots. Even more impressively, the margins of error are presented. Remarkably, this race has been a statistical tie for all these months, the 95% lower limit never quite making it above the zero level.
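To get a feel for why a lead of a few points can still be a statistical tie, here is a rough sketch for a single poll. This is a simplification and not Political Arithmetik's method, which estimates a trend across many polls. The poll numbers are hypothetical, and the formula assumes simple random sampling.

```python
import math


def lead_interval(p1, p2, n, z=1.96):
    """Approximate 95% interval for the lead p1 - p2 in one poll of n people.

    Uses the multinomial variance of a difference of two shares:
    Var(p1 - p2) = (p1 + p2 - (p1 - p2) ** 2) / n.
    """
    lead = p1 - p2
    se = math.sqrt((p1 + p2 - lead ** 2) / n)
    return lead - z * se, lead + z * se


# Hypothetical poll: 48% vs 45% among 1,000 respondents.
low, high = lead_interval(0.48, 0.45, 1000)
print(f"lead interval: {low:+.3f} to {high:+.3f}")
```

With these made-up numbers the lower limit falls below zero, so the three-point lead cannot be distinguished from a tie.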
Another great site is fivethirtyeight.com. Below, they essentially turned Jens' enhanced table into a map. The legend on the right perhaps represents what they call "East Coast bias"? All of Nathan's graphs are very attractively produced; I just wish he'd put more labels on them (such as the differentials corresponding to shades of red and blue).
Bubbles of the same size
Frederic M. sent in this chart, together with his commentary.
Bubbles across rows have vastly different numbers but their circles are of identical size (or vice versa). It borders on the ridiculous that all bubbles of the US row have the same size... The question if teenage birth rates and teen sex are correlated cannot be eye-balled with this kind of display. The fact that you cannot compare across rows make this an instance of “chart junk”.
I add:
White spaces -- always dangerous. Does lack of bubble imply no data or no abortions/sex?
Sorting -- this is what Howard Wainer called "Arizona first" with a twist (United States)
Loss aversion -- would U.S. readers be resentful if countries like Iceland are excluded? A much reduced version comparing the U.S. to, say, Canada, the U.K., Japan and Germany may yield more information for the reader.
Sufficiency -- if all the data are printed as in a table, why do we need the bubbles?
Reference: "Let's Talk About Sex", New York Times, Sep 6 2008.
Reading
What I have been reading:
"Google Co-Founder Has Genetic Code Linked to Parkinson’s" (New York Times)
Studies show that his likelihood of contracting Parkinson’s disease in his lifetime may be 20 percent to 80 percent, Mr. Brin said.
Talk about useless statistics. A confidence interval that is utterly useless.
"How Wall Street Lied to Its Computers" (New York Times)
Risk manager must be the most miserable job ever. When traders were raking in the millions, quants didn't get the credit (or the pay), according to Taleb, etc. Now when the market is imploding, they get the blame?
"Competing Tax Plans: two perspectives" (Freakonomics blog)
Three ways to plot the distribution of tax cuts across income brackets. I don't see why the first, and simplest, chart has a problem. The two revisions use bar charts with varying-width bars which give excessive focus on the number of people, in one case, and the base income, in the other case. It is not easy to compare areas of a tall, thin bar and a narrow, flat bar. The income group labels also present a problem of "loss aversion": why not lose the precision? or just report the percentiles?
Loss aversion
Loss aversion manifests itself in chart-making, as it does in economics. In chart-making, loss aversion can be defined as the tendency to avoid losing data at any cost. Given a rich data set, designers often make the mistake of cramming as much data into the chart as possible. This is taking Tufte's concept of maximizing the data-ink ratio to the extreme, and it often leads to awkward, muddled charts.
Gelman provided a great example of this recently. See here.
Every piece of data is given equal footing, which results in nothing standing out. The reader gasps for air.
Here is a recent example from the New York Times, in which the designer showed admirable restraint.
The best evidence is the set of small multiples shown at the bottom. These give the amount of phosphorus flowing into the lake annually since 1973, as measured from four locations.
The point is that the pollution has been most serious on the northern shores, especially in recent years. Thus, the Florida plan focusing on the southern region is likely to have limited impact.
The choice of vertical lines is smart, as the typical time-series connected-line chart would jump up and down crazily. A simple vertical axis marks the amounts, avoiding the temptation to print all the data. The designer realizes it is the trend, rather than individual values, that is the issue.
Taken together, the three components tell a good story. This is a well-executed effort. The Times once again proves itself the leader in developing sophisticated graphics.
Reference: "Florida Deal for Everglades May Help Big Sugar", New York Times, Sep 13 2008.
As simple as possible but no simpler
In this political season, we are bombarded with soundbites. For example, we keep hearing about the Red States versus the Blue States. By coloring states, we endorse the notion that Red State people are conservatives and Blue State people are liberals. Using sophisticated analyses of real data, Prof. Gelman tells us why this notion is wrong in his recently published book "Red State, Blue State, Rich State, Poor State".
The key chart is below (courtesy of Gelman).
There is indeed a red-blue divide but the gap is much much wider among rich voters (right ends of lines) and the middle class (mid-points of lines) than among poor voters (left ends). Poor voters are almost everywhere liberals on economic issues, and on social issues, they are moderates leaning conservative. Rich voters, by contrast, are very polarized on both economic and social fronts.
This is one of those charts that express their messages without fuss. A lot of data was analyzed but only the statistical conclusion is portrayed, much of the hard work hidden. Contrast this with data-rich infographics. In my view, this chart fits the description: as simple as possible but no simpler.
It is not that the common notion of a Red-Blue divide is entirely wrong; rather, the chart suggests that such a view is too simple. This divide is strongly present among middle-class and rich voters but not so much among poor voters. To be even more nuanced, for middle-class voters, the Red-Blue divide is manifested almost exclusively along the social dimension; middle-class voters are economic moderates everywhere, leaning liberal.
More technically, the above is summarized by saying that the interaction effect (between state residence and wealth) is significant and cannot be ignored. Prof. Gelman is one of the strongest advocates of always including interaction terms in regression of social and economic data. And here is an example of why.
The concept of interaction is tough to explain to the business audience, I have found. In presenting one such chart recently, I found the audience confused by the lines. Indeed, the lines, their slopes, etc. do not contain any information. They merely serve as guides to how to read the chart. In order to see the interactions, our eyes need to trace a path from one dot to another dot, literally tracing the lines on the chart.
Another way to discuss interactions: the common notion of a Red-Blue divide masks the reality that this divide is of varying importance across income groups. A view that aggregates income groups is too simplistic.
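For those who prefer arithmetic to lines on a chart, an interaction can be expressed as a difference between gaps. The scores in this sketch are invented purely to illustrate the calculation; they are not Gelman's estimates.

```python
# Hypothetical average conservatism scores (higher = more conservative).
# These numbers are made up to illustrate the idea; they are not Gelman's estimates.
scores = {
    ("red state", "poor"): 0.2,
    ("blue state", "poor"): 0.1,
    ("red state", "rich"): 0.9,
    ("blue state", "rich"): -0.3,
}


def state_gap(income):
    """Red-minus-blue gap within one income group."""
    return scores[("red state", income)] - scores[("blue state", income)]


# The interaction is the difference between the gaps (a difference in differences).
interaction = state_gap("rich") - state_gap("poor")

print(f"gap among poor voters: {state_gap('poor'):.1f}")
print(f"gap among rich voters: {state_gap('rich'):.1f}")
print(f"interaction          : {interaction:.1f}")
```

If the interaction were zero, the Red-Blue gap would be the same at every income level and a single aggregate gap would tell the whole story.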
Reference: Red State, Blue State, Rich State, Poor State, by Andrew Gelman (2008).
Lining things up
Guess where I went for vacation (clue in the chart).
This long, narrow country is divided into 15 regions. In the chart below, an uneven parade of 13 bubbles was used to present some sort of economic projections. The capital of the country was singled out as the top of the table.
The unevenness has a side effect: the guiding lines are forced to have differing lengths and bewildering turns. Further, because bubbles have no intrinsic scale, the designer must put all the data onto the map as well, thus failing our self-sufficiency test.
The following bar chart version respects the wide, thin space and yet delivers the data more clearly. The top version displays all the data while the bottom one uses a simple axis. The bottom chart is my preference since most readers are probably interested in approximate and relative comparisons, rather than exact projections. (The map would be better off without colors.)
Reference: "Inversiones entre 2008 y 2012 llegaran a US$ 57 mil millones impulsadas por mineria y energia", El Mercurio, Aug 25 2008.
Sloppy statistics
As hinted in the previous post, there are rare situations in which pie charts are acceptable; typically, these charts must show proportions that add up to 100%. If column charts (or line charts) are used instead, readers who aren't careful may assume incorrectly that the columns add up to the whole.
Pie charts show distributions. How should one state the key message of the following pie chart?
I. Type A is the majority.
II. The most frequent type is Type A.
III. Type A is a minority.
IV. All types other than A together form the majority.
I would pick statement II, followed by statement I. Statement I is the only false statement out of the four if one uses a strict definition of "majority" (more than half). If one goes by the spirit rather than the word of the law, statement I does pick up the key message albeit imprecisely. Statement III is a true statement but particularly misleading in the context of this pie chart. For every type is a minority type if we define "minority" as less than half. Statement IV is a tortuous way to define a "majority" where there is none.
Neither III nor IV points to a key feature of the data. It seems ridiculous to even include them. Let's reveal the underlying data.
Last week, a story coursed through the mainstream media, relating to the above projections published by the Census Bureau. (Projections were created for 2050 but mention was made of the fact that the largest racial group would account for less than half the population by 2042.) Here were some of the headlines:
"2042 to see a white minority" (New York Post, 8/14/2008) -- III
"Minorities fixed to become new majority" (Daily Vidette, Illinois State University, 8/20/2008) -- IV
"US set for dramatic change as white America becomes minority by 2042" (Guardian, 8/15/2008) -- III
"...minorities collectively will make up the majority of people in America by 2042..." (Detroit Free Press, 8/21/2008) -- IV
Like I said, statement III is strictly speaking true but by 2042 every race is projected to be a minority. Statement IV is just odd: of course, if one adds up enough "minority" types, one will eventually attain a majority.
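The four statements from the pie-chart quiz can be checked mechanically. The shares below are hypothetical stand-ins (the actual Census projections are in the chart), and "majority" is taken strictly as more than half.

```python
# Hypothetical shares (percent); the post's actual data are in the chart.
shares = {"A": 46, "B": 30, "C": 15, "D": 9}


def evaluate(shares, focus="A"):
    """Truth values of statements I-IV for the type named in `focus`."""
    total = sum(shares.values())
    focus_share = shares[focus] / total
    others_share = 1 - focus_share
    return {
        "I: focus is the majority": focus_share > 0.5,
        "II: focus is the most frequent type": shares[focus] == max(shares.values()),
        "III: focus is a minority": focus_share < 0.5,
        "IV: all other types together are the majority": others_share > 0.5,
    }


for statement, truth in evaluate(shares).items():
    print(f"{statement}: {truth}")

# Statement III holds for every type here, which is why it says so little.
print(all(v / sum(shares.values()) < 0.5 for v in shares.values()))
```

Statements III and IV are true whenever statement I is false, so they add nothing beyond "no type exceeds half"; only statement II singles out the largest group.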
Not all is lost, however. The following headlines painted a more vivid image:
"Whites to lose majority status in US by 2042" (Wall Street Journal, 8/14/2008)
"White Americans no longer a majority by 2042" (Associated Press, 8/13/2008)
Elsewhere, a Boston Globe column makes an important observation: that Hispanic whites should probably be grouped with whites rather than Hispanics. Technically, he argued that Hispanic is not a race. From his point of view, the pie chart looks like this:
Off-sheet accounting
For mid-week entertainment, this full-page ad appeared in the Wall Street Journal recently:
The vertical axis says "% NYSE of All Market Share Volume". The time-line is from July 05 to beyond July 08. The text in the black box is "Matched Market Share: July 11, 2008".
When it's so obvious, it's probably not obvious. The big story is off the chart: what happened to the other 50% of the volume over the years?
Faced with this, one reaches for the pie chart (... almost).
Small add (8/21/2008):
Olympic tallies
Andrew N., a reader from Australia, wasn't too impressed with the way National Nine News presents the Olympic medal table on its home page. To the extent that we want to venture beyond the typical tabular presentation, this bar chart is in fact quite appropriate. Let me explain.
Let's take a tour around the world. It's the battle of the data tables.
The Boston Globe's is the cleanest of the bunch. I especially like the way they set up the USA count at the top; the use of country codes is inferior to spelling out country names, as done in all of the other examples. The New York Times is the only one to utilize colors to set aside gold, silver and bronze, which lets readers easily assess the two dominant metrics, total golds and total medals. A small touch but very nice.
The biggest design issue here is the existence of the two different metrics. In any tabular presentation, the countries can be ranked by only one metric so the designer must make a choice. The American papers present ranking by total medals; the French paper by total golds; the two Canadian ones shown here are split. The American papers also choose to carry the ranking implicitly while the others explicitly provide a numerical rank. Le Monde and Globe and Mail provide ranks that are consistent with ordering of countries, both by total golds. The Star, by contrast, wants it both ways: the order reflects total medals while the "POS" column shows total golds. This extra column does help the readers who prefer ranking by golds but the primacy of the other ranking has not been overcome.
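The tension between the two metrics is easy to reproduce. In this sketch the medal counts are hypothetical, not the real 2008 table, and the tie-breaking rules are assumptions.

```python
# Hypothetical medal counts: (gold, silver, bronze). Not the real 2008 table.
medals = {
    "Country A": (10, 5, 5),
    "Country B": (8, 10, 10),
    "Country C": (9, 2, 1),
}

# Ranking by total medals, ties broken by golds.
by_total = sorted(medals, key=lambda c: (sum(medals[c]), medals[c][0]), reverse=True)

# Ranking by golds, ties broken by silvers, then bronzes.
by_gold = sorted(medals, key=lambda c: medals[c], reverse=True)

print("by total medals:", by_total)
print("by golds       :", by_gold)
```

The two orderings disagree on every position here, which is why a table sorted one way but numbered the other, as in The Star, sends mixed signals.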
So what about National Nine News? I have not been a fan of stacked bar charts but surprisingly, this is a great application. Stacked bars have the disadvantage that the stacked segments don't share the same base and thus it is difficult to compare their lengths. Here, though, our two metrics are total medals and total golds so readers should be drawn to compare the total lengths, and the lengths of the first segments. Those wanting to compare silvers and bronzes must make a stronger effort but they will be in the minority.
What can be improved are the distracting data labels, especially the gold circles. Instead, one should provide a scale, or use symbols such as one circle per medal. (See this old post.) Here is a version with a scale:
One cannot end this post without mentioning the attempt by NYT editors to insert levity into these proceedings with first a cartogram and then a bubble chart.
The dog ate the margins
In his column on automated polls versus traditional telephone polls, the Numbers Guy at Wall Street Journal gave us a few entertaining quotes.
"The dog could be answering the questions," Ann Selzer, a traditional pollster, said of automated polling, which occurs through automated voice messages to voters who record responses. Also, WSJ cited a prominent textbook which labelled them as "Computerized Response Automated Polls -- insulting acronym intended."
Reader Mark A. brought this to our attention because of the following chart. He wondered what the point of the vertical axis was.
Aside from that cosmetic problem, the biggest issue is the lack of explanation. Predictive power, pollster-introduced error, methodological error: what are these? The article itself gives no clues. To make sense of the chart, readers need to consult Nathan Silver's (excellent) site, fivethirtyeight.com. (The gory details here.)
By the way, Nathan's site has a variety of nicely produced charts. (Like this one, readers will need to dig around to collect background information to interpret some of those charts.)
Another improvement is to provide some sense of the variance in the data, either by showing more than the top five pollsters or by showing the range of errors. Since the average pollster sits on the right edge, it is as if the right half of the chart was clipped. In the version below, we found most polls hovering around the average, with two egregiously bad.
If we know which polls are automated and which aren't, then color the dots accordingly.
There are bench players on every chart: these are the titles, axes, labels, text and so on. They provide background information required to interpret the chart. They may sit in the margins but their value is not to be underestimated.
Don't let the dog eat the marginal information.
Reference: "Press 1 for Obama, 2 for McCain", Wall Street Journal, Aug 1 2008.
A tale on two charts
By now, everyone knows subprime mortgage lenders in the U.S. are in a world of hurt. The following pair of charts illustrates how serious the problem is. Lenders track the proportion of borrowers who are "60-day" and "90-day" delinquent, meaning late or no payment in the last 60 or 90 days. Lenders count on a contained delinquency rate in order to run a viable business. These loans often stretch for 20-30 years so it is crucial to catch problems early.
This trend can be seen in the IMF chart (right). Take the 2000 vintage of subprime borrowers. The peak 60-day delinquency occurred around 45 months after loan origination, and then tailed off. (A 60-day delinquent borrower will eventually become 90-day delinquent, or less likely, non-delinquent, in either case causing the curve to tail off. The tailing off feature is, in fact, undesirable, and can be removed by plotting delinquency rates of 60 days or more, as opposed to just 60 days. This is what the NYT chart on the left did.)
What do these graphs say about the current malaise? The NYT chart (left) carries its information in the relative slopes of the lines. The steeper the line, the quicker borrowers are becoming 90+ days delinquent. Another piece of information is when each curve starts to take off: the 2007 curve lifts off much earlier in the year than the 2005 and 2006 curves. This last point is less secure because the graphic does not preclude the scenario in which loan origination is biased toward the latter part of the year. Similarly, the crossovers between one curve and another tell us the extent of the problem compared to the past but again, the reader has to do much work to learn this information, as shown.
The IMF chart took a different view. If we selected only the 2005-2007 curves and shifted the 2006 curve to the 12 month point, and the 2007 curve to the 24 month point, we would have recreated the NYT chart. (We gloss over the matter of counting 90+ days rather than 60-day delinquency, and the other matter concerning the recency of the data used.) Here, the key information is coded in the vertical distances between the curves. Taking the vertical as shown below, the reader can see that the 2006 and 2007 vintages have performed almost twice as badly right off the starting gates while the 2005 vintage looked normal but worsened significantly after the two-year mark.
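The shift between the two views is a simple re-indexing from loan age to calendar time. A minimal sketch, with hypothetical delinquency rates sampled once a year and 2005 as the assumed base year:

```python
# Hypothetical delinquency rates (%) by loan age, sampled every 12 months.
# Keys are vintages; list index k means 12 * k months after origination.
vintage_curves = {
    2005: [0.0, 4.0, 10.0],
    2006: [0.0, 8.0],
    2007: [0.0],
}
BASE_YEAR = 2005
STEP_MONTHS = 12


def to_calendar_time(curves):
    """Re-index each curve from loan age to months since January of BASE_YEAR."""
    shifted = {}
    for vintage, rates in curves.items():
        offset = (vintage - BASE_YEAR) * 12  # 2006 starts at 12, 2007 at 24
        shifted[vintage] = [(offset + k * STEP_MONTHS, r) for k, r in enumerate(rates)]
    return shifted


for vintage, points in to_calendar_time(vintage_curves).items():
    print(vintage, points)
```

Plotting the unshifted curves on a common loan-age axis gives the IMF-style view, where vertical distances compare vintages; plotting the shifted points gives the NYT-style view, where slopes and lift-off dates carry the message.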
For lenders monitoring performance, the IMF chart is much more useful. For someone wanting to know the current state of delinquency, the NYT chart is easier to work with. It must be said that it is always better to plot more history and longer time horizons (like the IMF did).
Finally, can someone please prove a four-color theorem for graphs? Spraying rainbow colors on charts is a bad habit (for example, also in the house price index charts).
Reference: "Housing Lenders Fear Bigger Wave of Loan Defaults", New York Times, Aug 4 2008; "IMF Sees World Growth Slowing, with U.S. Marked Down", IMF, Jan 29 2008.