Talk:Box plot

This is the talk page for discussing improvements to the Box plot article.

This article is within the scope of WikiProject Statistics, which collaborates to improve Wikipedia's coverage of statistics. If you would like to participate, please visit the project page.

WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.
Mathematics rating: Start Class, Mid Priority. Field: Probability and statistics

Question!

I have a question. In the data you use to make the diagram, if a number is repeated, how do you put that on the graph? I'm confused.

—The preceding unsigned comment was added by 74.244.254.26 (talk) 23:08, 6 December 2006 (UTC).

You don't plot the repeated value itself. You use it, along with the rest of the data, to compute the median, box edges, and whiskers. Those quantities are the ones plotted. Baccyak4H (Yak!) 03:13, 7 December 2006 (UTC)

Expansion, revision

Hey folks, when I came across this page it was in need of some serious revision. There were grammatical errors and repeated information. I took the liberty of removing some of the repeated information and providing a detailed step-by-step construction of the boxplot.

More suggestions

Something on parallel boxplots would be nice. I am very new here, still feeling my way. I can try to do something, but it might be messy; HTML is not my forte. Plf515 23:52, 26 November 2006 (UTC)

Something should be added about "notches". —Preceding unsigned comment added by CptNautilus (talk • contribs) 17:50, 7 November 2007 (UTC)

Error?

"Histograms and probability density functions require assumptions of the statistical distribution." What about empirical histograms and PDFs? As far as I can see, these do not require any a priori assumptions about distributions.

I don't know who wrote that but I agree, so I've altered that section accordingly. Qwfp (talk) 17:43, 22 February 2008 (UTC)

Boxplot Figure

The current figure, Michelsonmorley-boxplot.svg, does not render correctly and could thus confuse readers. I suggest that we revert to the non-svg file R-speed_of_light_boxplot.png. Innohead 13:06, 15 February 2007 (UTC)

How to make a box plot

Ex: Let's say the data is 39, 63, 92, 41, 99, and 55.

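1. Find the median. Ex: Cross off the values in pairs from the two ends (39 and 99, then 41 and 92), then find the mean of the two left in the middle, 55 and 63. (It's 59.)

2. Find the median of the lower numbers, from 39 to 59. Cross off 39 and 59, leaving 41 and 55. Median: 48.

3. Do the same with the higher numbers, from 59 to 99. Cross off 59 and 99, leaving 63 and 92. Median: 77.5.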

4. Plot on a number line. Ex:

35--40--45--50--55--60--65--70--75--80--85--90--95--100
  39------48-------59------------77.5--------------99 

5. Draw a box around the 3 quartiles (the medians).

6. Label. Sealpiano 01:06, 15 June 2007 (UTC)

Formatting error

The quartile description is wrongly formatted, with leading [[s but no closing ]]. I'd fix it, but I can't work out what the correct sentence should be. I don't understand it, and I think it may be the remains of a partial deletion that has left it nonsensical. Lilac Soul 08:32, 20 June 2007 (UTC)

Errors

Where did someone get the idea that Sheldon invented the box plot? Tukey invented the plot at least as early as 1970, when he was circulating the manuscript for Exploratory Data Analysis. He adapted the shape from a graphic invented by Mary Ellen Spear in 1952 (Tufte shows a picture of her "range chart" in VDQI). Tukey modified Spear's charts in several significant ways: he used letter values instead of quartiles, he devised distribution-free quantiles for representing outliers and extreme outliers, and (with McGill) he devised confidence intervals on the median and represented them with notches.

The algorithm for producing the box plot in this article page is not Tukey's. A discussant above describes the correct algorithm. Will someone please read Tukey's EDA and correct these serious errors? As it stands, this article only contributes to the confusion over boxplot definitions and makes it seem as if anything goes. As the Frigge, Hoaglin, and Iglewicz article clearly shows, only a few statistics packages (Minitab, SYSTAT, DataDesk) get it right. 202.62.81.253 (talk) 04:15, 23 January 2008 (UTC)

Feel free to change the article as you see fit, especially if you can include reliable sources. However, while there may be only one "right" way to draw these in the sense of following what (say) Tukey described, keep in mind that the method is an exploratory one, and its value lies not in getting it "right" but in making it useful. Alternative implementations are not a bad thing so long as they are not attributed falsely. Baccyak4H (Yak!) 14:34, 23 January 2008 (UTC)
For reasons I've explained in a footnote to the article, I don't think these can really be described as "serious errors". Sampling error is always much bigger than the difference between these definitions of quartiles/hinges. And Tukey himself proposed several variations so there's no single "right" version of a boxplot (see also my comments below). Qwfp (talk) 17:41, 22 February 2008 (UTC)

In need of attention

1) The external link http://www.physics.csbsju.edu/stats/box2.html contradicts the article with regard to how outliers and suspected outliers are marked on a plot.
2) I believe the "whisker" of a box plot is the line drawn from the box, not the tick mark at the end.
3) The whole article could use some cosmetic work and in-line citations. Ajonlime (talk) 01:35, 29 January 2008 (UTC)

1) Precise definitions of what constitutes an outlier and how to mark them do vary. This is not surprising as Tukey himself gave 3 variations on the box plot just in the one book (see mathworld [1] – if you follow the links from that you'll see why he was also famous for introducing a plethora of neologisms and technical meanings for existing words: hinge, fence, H-spread, step, adjacent value... you can't blame people for getting confused). I think a note that "precise definitions vary" is all that's needed, and a link to quantile to explain that there are different ways of calculating quartiles – however sampling variation is always much bigger than the variation between definitions so the latter is of very little consequence.
2) Agree. I'll go ahead and change it.
3) Maybe, but that's "clean-up" not "expert attention", so when I've done (1) and (2) I'll remove the "expert attention" template.
Qwfp (talk) 16:58, 22 February 2008 (UTC)

No, Tukey did not give 3 variations on the box plot. You are trusting a secondary source who has not read Tukey carefully and did not know Tukey when he was alive. Go ahead and "follow the links" from a source who never met Tukey, but it will take you to misinformation. Tukey actually cared about the differences you are dismissing with a shrug. Furthermore, the site is obviously being vandalized, since the Sheldon quote has turned up again. This is a minor issue and a small point, but I think it is representative of a Wikipedia problem. Wikipedians have excluded experts by requiring secondary-source attribution, and so the material that ends up in a listing like this is often wrong. You can argue all you want, but I knew Tukey and he would be amused by the content of this article. —Preceding unsigned comment added by 67.173.98.211 (talk) 00:18, 26 March 2008 (UTC)

67.173.98.211, it is true that Wikipedia focuses on claims that can be verified. If you can provide a reference other than your personal recollection, I think it would make sense to include it. Do you have one? Pdbailey (talk) 03:25, 26 March 2008 (UTC)
If you see something wrong, it's simpler if you edit the article and give your sources. Wikipedia does not require secondary sources for references, only to establish notability (which is not the issue here), so it's fine to cite Tukey's original works. I'd quite agree that it would be preferable, but unfortunately they aren't always to hand. I've removed the unsourced Sheldon claim. Thanks for pointing it out, but again, why didn't you just go ahead and fix it yourself? I missed that change because I haven't looked at Wikipedia for several days. You've succeeded in making me curious enough to want to dig Tukey's EDA book out of the library though. I'll admit it's several years since I last looked at it. Practices do evolve, and this article is primarily about the box plot as currently used rather than the history of the concept. The history is interesting, however, and if it's going to be included we should try to get it right. Qwfp (talk) 05:17, 26 March 2008 (UTC)

Thanks. I'm a bit gun-shy about editing and, as you can see, not too literate on the Wiki conventions. I thought it would be better to stay in the background and goad others to think more about the box plot entry. It's fascinating to me that anyone would care to vandalize a page with a "Sheldon" comment without providing a single piece of evidence. I can't imagine the social dynamic going on here. And yes, practices do evolve. One fairly widely used statistics package put the center on the mean, the hinges at one standard deviation, and the end of the whiskers at the extreme values. They called this EDA. It completely defeated the purpose of the box plot and, worse, it led many to think this is what a box plot is. So history does matter. We don't go changing the formula for least-squares regression and call it "least-squares," even though there are more useful regression methods available today. There are better contemporary alternatives to the box plot (for the purpose Tukey had in mind), but we shouldn't change the meaning of the display to defeat his original purpose in inventing it. Thanks again for your thoughtful reply. —Preceding unsigned comment added by 67.173.98.211 (talk) 14:05, 26 March 2008 (UTC)

Well, thanks for calling my reply thoughtful. I thought it was slightly curt by my usual WP standards (No angry mastodons had a big influence on me). I'd encourage you to be bold and edit the articles without worrying too much about conventions. If the substance looks right and is referenced somehow, someone else will (usually) fix any perceived problems with the style and format the refs (which I admit gets a bit technical).
I've no idea what the "Sheldon" business is about either; I've not come across anything quite like that on other pages. I don't know which stats package uses mean and SD for boxplots, but I'd agree that's just plain wrong. I can't get so bothered about minor differences in definitions of quartiles or the distinction between quartiles and hinges. Meant to get EDA out of the library today but forgot my library card... another day. I agree that history matters; I've added the seminal or eponymous ref to several stats articles. I've been wondering what you have in mind when you mention "better contemporary alternatives to the box plot". Violin plots perhaps? I used them quite a bit when I first discovered them, but I'd all but forgotten about them since, and it seems Wikipedia doesn't have any mention of them (as yet...) Qwfp (talk) 21:18, 26 March 2008 (UTC)

The violin plot is nice, but it's just a kernel density plot. Better to overlay it with a box plot. Then you get both types of information. Also, the violin plot suffers from the bandwidth estimation problem that someone mentioned in the box-plot article. A better alternative is the dot-box plot, found in Wilkinson, L (1999). Dot plots. The American Statistician, 53, 276-281. It overlays a box plot with a dot plot, so you can see all the data and also see the median, outliers, etc. It ameliorates the main deficiency of the box plot - that it can look identical for certain unimodal and multimodal datasets. I also like the Hofmann, Wickham, Kafadar letter-value box plots <http://www.stat.iastate.edu/preprint/articles/2006-10.pdf>. This paper covers a lot of the details I've been bringing up in this discussion.

I want to illustrate the specifics of our disagreement. The argument has nothing to do with the appearance of the box plot (that's the topic of the Mathworld article; Tukey drew several different kinds of box plots -- that's just a matter of surface appearance). It has to do with letter values vs. quantiles. Basically, your statement that "sampling variation is always much bigger than the variation between definitions so the latter is of very little consequence" is false. Take the simple example x = {1,5,6,7,9}. The first 3 Tukey letter values for this batch are 6 (median) and <5, 7> (hinges). The Tukey box plot for these data shows an outlier at 1, a box from 5 to 7, and one whisker from 7 to 9. Now, there are many ways to compute quartiles. Here are just a few results using different algorithms (try it with SAS or another comparable statistics package): <4, 6, 7.5>, <2, 5.5, 6.75>, <5, 6, 7>, <3, 6, 8>. Only one of these (the <5, 6, 7> based on the empirical cumulative distribution function) yields the Tukey letter values, and even that method doesn't always yield letter values. If you draw a box plot based on the other estimates, it will look quite different from Tukey's. Only one of the quartile methods yields an outlier. Now, you might say that larger datasets will show a less dramatic difference. That would be generally true, but it is easy to construct counterexamples. The kind of ill-behaved data Tukey anticipated are precisely the ones that are smoked out in his box plot (as opposed to histograms and other density estimators).
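The effect on outlier flagging can be checked with a short sketch. It assumes the usual Tukey rule of fences placed 1.5 spreads beyond the box edges, which is not spelled out in this discussion, and it uses the two interpolation rules offered by Python's statistics.quantiles ("exclusive" and "inclusive") as stand-ins for whatever a given package does. The helper name flagged is made up for illustration.

```python
from statistics import quantiles

def flagged(data, lower, upper, k=1.5):
    """Return the values outside the fences lower - k*spread and upper + k*spread."""
    spread = upper - lower
    low_fence = lower - k * spread
    high_fence = upper + k * spread
    return [x for x in sorted(data) if x < low_fence or x > high_fence]

x = [1, 5, 6, 7, 9]

# Same batch, two quartile definitions: compare box edges and flagged points.
results = {}
for method in ("exclusive", "inclusive"):
    q1, q2, q3 = quantiles(x, n=4, method=method)
    results[method] = ((q1, q2, q3), flagged(x, q1, q3))
    print(method, results[method])
```

On this batch the "exclusive" rule gives <3, 6, 8> and flags nothing, while the "inclusive" rule gives <5, 6, 7> and flags the value 1.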

Let me describe Tukey's letter-value algorithm, because the poster above didn't quite get it right:

1. Sort the data.

2. Label the sorted list W.

3. Compute the conventional median of W (pick the middle value if N is odd, or average the two middle values if N is even). Save this letter value.

4. Split W at the median into two lists, L and U. If N is odd, include the median at the end of L and at the beginning of U.

5. Recurse on steps 3-4 for L and U (labeling each as W) until there are no blocks left to split.

The algorithm is most easily programmed as a recursive function, but it is simple to do in a loop with several pointers. For the box plot, we need to recurse only once to get the hinges. Tukey computed more letter values to characterize distributions in more detail. The Hofmann, Wickham, Kafadar letter-value plot exploits this characterization.
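As a rough sketch of the steps described above, here is the single level of recursion needed for the hinges. The function names are made up for illustration, and this is one reading of the description in this thread rather than code taken from Tukey's book.

```python
def median(sorted_values):
    """Conventional median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def tukey_hinges(data):
    """Return (lower hinge, median, upper hinge) for a batch of numbers."""
    w = sorted(data)
    n = len(w)
    # Size of each half; when n is odd the median belongs to both halves.
    half = (n + 1) // 2
    lower = w[:half]
    upper = w[n - half:]
    return median(lower), median(w), median(upper)

print(tukey_hinges([1, 5, 6, 7, 9]))
```

For the batch x = {1,5,6,7,9} this returns hinges 5 and 7 around a median of 6. Deeper letter values would repeat the same split on the outer halves.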

Now, why did Tukey do such a peculiar thing?

1. It was simple to do with a paper and pencil. That's one of the main points of EDA. Although Tukey revolutionized statistical computing, he always chose the simpler course over the more complex when he could.

2. Tukey chose actual data values as descriptors (exemplars) instead of latent, hidden, hypothetical, population (pick your word) parameters. There are exceptions, of course, but note how often he used letter values in other, more complex, analytics such as smoothing. He didn't like getting too far away from data, and this was the source of many of the controversies Tukey got into with model-oriented statisticians.

3. Letter values have a precise definition in terms of the data batch. A high-school student can understand the algorithm. If you do some research on quantiles, by contrast, you will find it a morass of different approaches. The elementary statistics book algorithm, based on linear interpolation, barely scratches the surface.

4. Letter values are robust. See Understanding Robust and Exploratory Data Analysis by David C. Hoaglin, Frederick Mosteller, John W. Tukey, John Wiley & Sons, 2000. Many quantile methods depend on restrictive assumptions on the data.

So, this is of more than historical interest. The statistics packages really do differ, sometimes substantially. That's the point of the Frigge, Hoaglin, and Iglewicz article cited in the references. So, if I were writing this article, I'd devote a paragraph to letter values vs. quantiles. And I'd point out the difference -- feel free to use my example. It's not that one method's right and the other wrong. It's that the quantile/letter-value distinction can have a profound effect on the appearance of the box plot -- enough to influence what one considers an outlier. You would be surprised at how much box plots differ across statistics packages on the same data. That's because packages that draw quantile-based box plots don't always disclose the algorithm they use to estimate quantiles. With Tukey letter values, there's no ambiguity.

I'm going to frustrate you again by not touching the article. You are obviously an intelligent and curious editor experienced with the ways of Wiki and a good monitor of this area (I know that sounds patronizing, but I don't mean it to be). And I suspect you edit in other statistical areas as well. So, in the end, the more research you pick up on your own, the more likely the quality of these articles will improve. I am an expert in this area but I'm not likely to get involved much further. And I really should stop adding comments to this discussion, because it's taking more space than the topic deserves. Thanks for your understanding. 67.173.98.211 (talk) 16:25, 28 March 2008 (UTC)


New Question!

If the only data I have is a box plot, can I determine the SD and SE of the mean? Thanks —Preceding unsigned comment added by 128.118.7.83 (talk) 13:27, 28 March 2008 (UTC)

No. Qwfp (talk) 13:53, 28 March 2008 (UTC)

Actually, you can, but it's kind of pointless. The estimate isn't very good. Estimating the standard deviation from the range was used in the classic quality control literature because it was more difficult to accumulate sums of squares on a simple calculator. See Introduction to Statistical Quality Control, 5th ed., Douglas Montgomery, pages 95-6. If you use a box plot to do this, you will have to be sure there are no extreme outliers, because they will bias the estimate even more than usual. 67.173.98.211 (talk) 14:16, 28 March 2008 (UTC)
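A rough sketch of what such an estimate could look like, under a normality assumption that the comment above does not state. The range-based version divides the range by the control-chart constant d2 for the sample size. The box-based version, which is an added alternative and not something proposed in this thread, divides the interquartile range by about 1.349. All function names are made up for illustration, and the sample size n has to be known separately, since a box plot usually does not show it.

```python
import math

# d2 constants for normal samples of size n, as tabulated in quality control texts.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534,
      7: 2.704, 8: 2.847, 9: 2.970, 10: 3.078}

def sd_from_range(minimum, maximum, n):
    """Range-based SD estimate; only valid for the small n covered by the table."""
    return (maximum - minimum) / D2[n]

def sd_from_iqr(q1, q3):
    """IQR-based SD estimate; a normal IQR spans about 1.349 standard deviations."""
    return (q3 - q1) / 1.349

def se_of_mean(sd, n):
    """Standard error of the mean from an SD estimate and the sample size."""
    return sd / math.sqrt(n)

# Hypothetical readings taken off a box plot of n = 5 values with no outliers.
sd = sd_from_range(10, 20, n=5)
print(sd, se_of_mean(sd, 5), sd_from_iqr(13, 17))
```

Neither estimate is trustworthy for skewed or heavy-tailed data, which is the caveat about outliers raised above.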