Wikipedia:Reference desk/Archives/Mathematics/2007 October 29

From Wikipedia, the free encyclopedia


October 29

suggestion

It should have algebraic terms and expressions. --Preceding unsigned comment added by Glhenn08 (talk • contribs) 03:35, 29 October 2007 (UTC)

I'm sorry, a suggestion for what? If you're referring to something else on this page, then instead of clicking the "Ask a question" link at the top, click on the [edit] link to the right of the section header, so that you edit the specific question. Confusing Manifestation 06:08, 29 October 2007 (UTC)

The questioner is suggesting that the Mathematics Reference Desk should feature some algebraic terms and expressions. I disagree, but I believe this was the intended suggestion. HYENASTE 22:29, 30 October 2007 (UTC)

Histograms and probability distributions

What are the correct terms to distinguish between what you get when you smooth out a histogram of observed data, rescaling it so that the area under the curve is 1, and the underlying true probability distribution? --NorwegianBlue^talk 09:24, 29 October 2007 (UTC)

By smoothing you are trying to estimate the probability density function of the population, much like the arithmetic average of a sample is used to estimate the population average. So you could call it the "estimated density (function)". A common class of methods for estimating the p.d.f. from the observed data that bypasses the histogram is described under Kernel density estimation. --Lambiam 09:50, 29 October 2007 (UTC)
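As a rough illustration of the kernel density idea mentioned above, here is a minimal Python sketch of a Gaussian kernel estimate. The sample values, the bandwidth, and the helper names `gaussian_kde` and `integrate` are all made up for the example; it only shows that the smoothed curve is an estimate built from the observations and that its area comes out as 1.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density built from Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump per observation, averaged over all observations.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def integrate(f, lo, hi, steps=3000):
    """Trapezoidal rule, used here only to check the area under the curve."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for j in range(1, steps):
        total += f(lo + j * h)
    return total * h

# Hypothetical observations and bandwidth, chosen only for illustration.
samples = [1.2, 1.9, 2.1, 2.4, 3.3, 4.0, 5.5, 7.1]
estimated_density = gaussian_kde(samples, bandwidth=0.8)
area = integrate(estimated_density, -10.0, 20.0)
print(f"area under the estimated density: {area:.6f}")
```

A different bandwidth gives a different curve from the same data, which is one more reason to call the result an estimate rather than an observation.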

Thank you!

Can I extrapolate your answer into stating that the expression "observed probability distribution" (~550 google hits) should be avoided?

Is "estimated probability distribution" (~10,000 google hits) acceptable? --NorwegianBlue^talk 10:14, 29 October 2007 (UTC)

P.S. I fully appreciate that "estimated density (function)" is better than "estimated probability distribution". The reason I'm asking is that I'm preparing a talk for an audience which is not very mathematically sophisticated, and I'd prefer to avoid talking about densities. At the same time, I need to be precise about the distinction between our observations, and the underlying processes that we imagine are responsible for generating the data. I'm thinking of drawing a cartoon of Plato's allegory of the cave, with the histogram on the cave wall, and the density function behind the viewer. --NorwegianBlue^talk 10:54, 29 October 2007 (UTC)

Indeed, there is no such thing as "observed probability distribution". I have a coin — maybe fair, maybe not. I flip it once, and you record the outcome. Have you observed the probability distribution? Absurd, eh? Obviously not. I flip again. Now? No; but then when? Even a fair coin can produce ten heads in a row! From statistical observations we can only estimate properties of the generator. The more observations, the better the estimate. Plato's cave is a pretty good metaphor, but perhaps the coin drives the point home better. --KSmrq^T 14:27, 29 October 2007 (UTC)
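The coin argument can be put into numbers with a short Python sketch. The function names `prob_all_heads` and `standard_error` are invented for this example, and the coin is assumed fair; the point is only that a run of ten heads is rare but possible, and that the uncertainty of the estimated probability shrinks as the number of flips grows.

```python
import math

def prob_all_heads(flips, p_heads=0.5):
    """Probability that every one of `flips` independent flips shows heads."""
    return p_heads ** flips

def standard_error(p_heads, flips):
    """Standard deviation of the observed fraction of heads after `flips` flips."""
    return math.sqrt(p_heads * (1.0 - p_heads) / flips)

ten_heads = prob_all_heads(10)  # 1/1024 for a fair coin
errors = {n: standard_error(0.5, n) for n in (10, 100, 10000)}

print(f"P(ten heads in a row) = {ten_heads}")
for n, se in errors.items():
    print(f"{n:>6} flips: estimate of p is uncertain by about {se:.4f}")
```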

Thanks for the suggestion! Good point, I'll use it. --NorwegianBlue^talk 16:51, 29 October 2007 (UTC)

Exponentially distributed data

When counting residual (unwanted) cells in blood components, I find that such counts tend to be best described by the exponential distribution. The illustration is a histogram of platelet counts in centrifuged plasma. The exponential distribution (red) gives a decent fit, while a normal distribution (blue) obviously is way off.

The article exponential distribution gives some examples of real-world scenarios which tend to be exponentially distributed, such as the time or distance between poisson-distributed events, but none of these really resemble my example, as far as I can see. I suspect that the principal source of variation in my data is the handling skills of the operator who carried the centrifuged blood bag from the centrifuge and placed it in the separator device (the device then applies gentle pressure to the bag, and lets the plasma escape through a tube on top of the bag).

I would like to understand why the exponential distribution gives such a good fit with the cell count data. Thank you! --NorwegianBlue^talk 10:17, 29 October 2007 (UTC)

A possible explanation: assume that there are, say, N platelets and that each platelet has a small but finite chance, p, of escaping through the tube. You can use the binomial distribution to calculate the chance of n platelets escaping: $\binom{N}{n} p^n (1-p)^{N-n}$. For very large N and very small p, the binomial is approximated by the Poisson distribution with mean $pN$. This would also fit your data. Just a thought. --Salix alba (talk) 11:23, 29 October 2007 (UTC)
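The approximation described above can be checked numerically. This is a minimal sketch with illustrative values of N and p that are not taken from the platelet data; the function names are hypothetical.

```python
import math

def binomial_pmf(n, N, p):
    """Chance that exactly n of N platelets escape, each with probability p."""
    return math.comb(N, n) * p ** n * (1.0 - p) ** (N - n)

def poisson_pmf(n, lam):
    """Poisson probability of the count n when the mean is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

# Illustrative values only: many platelets, each with a tiny escape chance.
N, p = 10000, 0.0005
lam = N * p  # mean of the approximating Poisson distribution

max_gap = max(abs(binomial_pmf(n, N, p) - poisson_pmf(n, lam)) for n in range(31))
print(f"largest pointwise difference for n = 0..30: {max_gap:.2e}")
```

The gap shrinks further as N grows with the product N*p held fixed.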

Platelet conc (× 10⁹/L)	Occurrences
0	1
1	103
2	415
3	212
4	112
5	106
6	64
7	72
8	50
9	36
10	42
11	30
12	28
13	29
14	15
15	18
16	12
17	11
18	10
19	5
20	6
21	5
22	2
23	3
24	3
25	3
26	2
27	5
28	1
29	2
30	2
35	2
37	1
43	1
52	1
63	1

I don't think that's quite it. As mentioned above, I think the key lies in what happens when the bag is carried from the centrifuge and put in the separating device. After centrifugation, there is an interface (buffy coat) between the plasma and the red cells. Most of the platelets are there. Although the bags are handled very gently to preserve the buffy coat layer, occasionally a bag might be shaken slightly, causing some of the platelets to mix with the plasma. This would be a rare event, and it would have varying strength, i.e. disturb the interface to a varying extent. Would such a scenario be expected to lead to an exponential distribution? --NorwegianBlue^talk 12:11, 29 October 2007 (UTC)

The result of counting is a nonnegative integer, so you need a model whose outcomes are nonnegative integers. The normal distribution produces negative outcomes and noninteger outcomes. The exponential distribution produces noninteger outcomes. As Salix alba explained, the Poisson distribution is a suitable model, providing nonnegative integer outcomes. Knowing the parameter λ, the probability of observing the count i is $e^{-\lambda}\lambda^i/i!$. If you sum this expression over i from zero to infinity, you get 1, confirming that it is a discrete probability distribution. If you multiply this expression by i and sum over i from zero to infinity, you get λ, confirming that λ is the mean value of i. Similarly, you may compute the standard deviation of i as $\lambda^{1/2}$. Knowing the parameter λ, the Poisson model provides an estimate for the observation, $i \approx \lambda \pm \lambda^{1/2}$. However, your problem is the dual one: having made the observation i, you need to estimate the parameter λ. Keeping i constant and letting λ be a nonnegative real variable, the above expression $e^{-\lambda}\lambda^i/i!$ is a continuous probability density, that of the gamma distribution. Integrating over λ gives 1, confirming that it is a probability density. The special case i = 0 is the exponential distribution. The mean value of λ is i + 1 and the standard deviation of λ is $(i+1)^{1/2}$. So the gamma model provides an estimate for the parameter, $\lambda \approx (i+1) \pm (i+1)^{1/2}$. Bo Jacoby 12:43, 29 October 2007 (UTC).
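The two readings of the expression above (a distribution over the count i for fixed λ, and a density over λ for fixed i) can be checked numerically. This is a sketch with arbitrary illustrative values for λ and i; the helper names `poisson_term` and `integrate` are made up for the example.

```python
import math

def poisson_term(i, lam):
    """e^(-lam) * lam^i / i!, read either as a function of i or of lam."""
    return math.exp(-lam) * lam ** i / math.factorial(i)

# Reading 1: a distribution over the count i, for a fixed lam.
lam = 5.5  # arbitrary illustrative value
counts = range(100)  # the terms beyond this are negligible for this lam
total = sum(poisson_term(i, lam) for i in counts)
mean_count = sum(i * poisson_term(i, lam) for i in counts)
var_count = sum((i - mean_count) ** 2 * poisson_term(i, lam) for i in counts)

# Reading 2: a density over lam, for a fixed observed count i.
def integrate(f, lo, hi, steps=60000):
    """Trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * (0.5 * (f(lo) + f(hi)) + sum(f(lo + j * h) for j in range(1, steps)))

i_obs = 3  # arbitrary illustrative value
area = integrate(lambda x: poisson_term(i_obs, x), 0.0, 60.0)
mean_lam = integrate(lambda x: x * poisson_term(i_obs, x), 0.0, 60.0)
var_lam = integrate(lambda x: (x - mean_lam) ** 2 * poisson_term(i_obs, x), 0.0, 60.0)

print(total, mean_count, var_count)  # close to 1, lam, lam
print(area, mean_lam, var_lam)       # close to 1, i_obs + 1, i_obs + 1
```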

I don't know why people insist on mentioning the Poisson distribution, which quite obviously does not fit the data. The exponential distribution has both a continuous and a discrete version, the latter being called the Geometric distribution. I suspect that the cell count, although discrete, strongly depends on the (continuous) time period between two events, which for some reason is distributed exponentially. I do not understand biology well enough to speculate as to why that is so, but I do suggest that time is the key here. -- Meni Rosenfeld (talk) 13:18, 29 October 2007 (UTC)

It is not obvious that the Poisson distribution does not fit the data for some λ in the interval 0 < λ < 1. Bo Jacoby 14:39, 29 October 2007 (UTC).

Except for the fact that the expectation of the given distribution is roughly 6. -- Meni Rosenfeld (talk) 14:47, 29 October 2007 (UTC)

Thank you all for taking the time to respond to my question! I realise, of course, that just about any distribution representing measurements of physical quantities is in principle a discrete distribution. When weighing a sample of a pure substance, you are in essence counting the number of molecules in the sample. I should have pointed out, also, that the unit on the x-axis is platelets × 10⁹/L, so the number of platelets counted in each measurement is much larger than the numbers on the axis suggest. I did suspect, beforehand, that the gamma distribution might be appropriate. This is because it is unreasonable to assume that the mode of the distribution is zero. When reexamining the data, I see that the mode is in fact 2. The data was recorded without any decimals, and is detailed in the table on the right. I read the section about parameter estimation in the article, but alas, this was way above my head.

If a gamma distribution is indeed appropriate, is there a not-too-difficult way of calculating the shape and scale (or rate) parameters? Specifically, does anyone know whether R (programming language) is able to do this? To Meni, thanks for the suggestion about a time factor being responsible, I'll certainly look closer into that. --NorwegianBlue^talk 15:00, 29 October 2007 (UTC)

This changes everything. Since the problem is essentially continuous, the Poisson distribution is completely irrelevant. The data is much too skewed to be normal, and the exponential distribution seems less likely now that we see that the mode isn't 0 (note that my suggestion to look at time was based on the assumption that the distribution is exponential, though it could still be relevant). What you can do is to calculate the moments of the data (that is, the quantities derived from the moments: mean, variance, skewness and excess kurtosis), and compare them to those of different distributions (our articles have those). The first few can be used to find the parameters which give the best fit, and the rest can tell you how good the fit is. For the gamma distribution (note that Bo has suggested it, but not as the distribution of the data), for example, denoting the mean by $\mu$ and the variance by $\sigma^2$, you have $\theta=\frac{\sigma^2}{\mu}$ and $k=\frac{\mu^2}{\sigma^2}$, and if, for your data, the skewness is $\frac{2}{\sqrt{k}}$ and the excess kurtosis is $\frac{6}{k}$, you have a good match.

Note that the data also seems to be quite noisy, so no common distribution will be a great match. -- Meni Rosenfeld (talk) 16:28, 29 October 2007 (UTC)

Unless I have made a mistake, for our data we have $\mu = 5.55563$, $\sigma = 5.57955$, $\gamma_1 = 3.03122$, $\gamma_2 = 15.9955$. If you try to fit this as a gamma distribution, you'll get $k = 1$, so it would be exponential; but the skewness would be only 2, so I'd say that's off the table. You can double-check my calculations, and try your luck with any of these distributions. -- Meni Rosenfeld (talk) 16:52, 29 October 2007 (UTC)
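The moment-matching recipe from the previous two posts can be written as a short Python sketch. The inputs are the summary statistics quoted in the thread; the function name `gamma_from_moments` is invented for the example, and this is only one way of fitting (method of moments), not necessarily what any particular statistics package does.

```python
import math

def gamma_from_moments(mean, sd):
    """Method-of-moments gamma fit: returns (shape k, scale theta)."""
    variance = sd ** 2
    k = mean ** 2 / variance   # shape
    theta = variance / mean    # scale
    return k, theta

# Summary statistics as quoted in the thread.
mean, sd = 5.55563, 5.57955
observed_skewness, observed_excess_kurtosis = 3.03122, 15.9955

k, theta = gamma_from_moments(mean, sd)
predicted_skewness = 2.0 / math.sqrt(k)
predicted_excess_kurtosis = 6.0 / k

print(f"k = {k:.4f}, theta = {theta:.4f}")
print(f"skewness: gamma predicts {predicted_skewness:.3f}, data shows {observed_skewness}")
print(f"excess kurtosis: gamma predicts {predicted_excess_kurtosis:.3f}, "
      f"data shows {observed_excess_kurtosis}")
```

The first two moments fix the parameters, so the comparison that says anything about goodness of fit is the one on skewness and excess kurtosis.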

I think you do have a normal distribution in the chart. The graph doesn't show it because you lumped the 0-2 range together to get a value greater than that for the 3-4 range. However, if you plot each value separately, it looks like it will match a normal distribution fairly well. One non-math comment: someone should develop a machine to transport the bag of centrifuged blood so it won't be shaken up, or at least so the degree of shaking will be consistent. StuRat 17:23, 29 October 2007 (UTC)

Are you kidding? The normal distribution is symmetric; this one is anything but. It has a skewness of 3; it has a mode at 2 and a mean at 5.5; it has a long tail on the right and a ridiculously short one on the left. Not to mention that the fitted normal in the picture doesn't even remotely resemble the distribution, even if we split the first bin. -- Meni Rosenfeld (talk) 18:02, 29 October 2007 (UTC)

It looks to me like it might well be truncated, but symmetrical (or at least somewhat close), if you toss out the single value at 0. And no, the blue curve isn't correct; a different normal curve would be needed to fit the data. The maximum point might be around 2.3. StuRat 16:16, 30 October 2007 (UTC)

The Poisson distributions constitute a one-parameter family of distributions with unbounded nonnegative integer outcomes. That's why you would try it first. Calculation in J:

  concentrations  NB. the new table of data
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 35 37 43 52 63
  occurences
1 103 415 212 112 106 64 72 50 36 42 30 28 29 15 18 12 11 10 5 6 5 2 3 3 3 2 5 1 2 2 2 1 1 1 1
  ] N =: +/ occurences NB. compute and display the number of occurences
1411
  ] S =: +/ occurences * concentrations NB. sum of concentrations
7839
  ] L =: S % N NB. mean value of the concentrations
5.55563
  ] SS=:+/ occurences * *: concentrations NB. sum of squares
87477
  ] var=:(SS%N)-*:S%N NB. variance of concentrations
31.1314

As the variance is greater than the mean value (for a Poisson distribution the two are equal), the distribution is not Poisson. Sticking to discrete distributions, it might be a negative binomial distribution. (See also cumulant). Bo Jacoby 18:05, 29 October 2007 (UTC).
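For readers who do not use J, the same computation can be sketched in Python from the table in this thread. The variable names are my own (the J session spells one of them `occurences`), and the negative binomial parameters at the end are a hypothetical moment-matching step, assuming the parametrisation with mean r(1-p)/p and variance r(1-p)/p².

```python
# Data from the table in this thread (concentration, number of occurrences).
concentrations = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                  18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 37, 43,
                  52, 63]
occurrences = [1, 103, 415, 212, 112, 106, 64, 72, 50, 36, 42, 30, 28, 29, 15, 18,
               12, 11, 10, 5, 6, 5, 2, 3, 3, 3, 2, 5, 1, 2, 2, 2, 1, 1, 1, 1]

N = sum(occurrences)                                              # number of measurements
S = sum(o * c for o, c in zip(occurrences, concentrations))       # sum of concentrations
SS = sum(o * c * c for o, c in zip(occurrences, concentrations))  # sum of squares

mean = S / N
variance = SS / N - mean ** 2
dispersion = variance / mean  # would be close to 1 for Poisson data

# Hypothetical moment-matching for a negative binomial distribution.
p = mean / variance
r = mean ** 2 / (variance - mean)

print(N, S, SS)
print(f"mean = {mean:.5f}, variance = {variance:.4f}, variance/mean = {dispersion:.2f}")
print(f"negative binomial by moments: r = {r:.3f}, p = {p:.3f}")
```

The variance comes out several times larger than the mean, which is the overdispersion that rules out a Poisson model.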

Seriously, don't you read what the OP is saying? The data is not discrete. It is a continuous variable, which is presented on an arbitrary scale and rounded. -- Meni Rosenfeld (talk) 18:19, 29 October 2007 (UTC)

I'm sorry that, when posting the question, I didn't state clearly that the concentrations were based on much larger counts than the numbers on the x-axis suggested, and that the concentration is for practical purposes a continuous variable, although it unfortunately was recorded without decimals.

To StuRat: The cellular content is not a great concern, because (a) the plasma is pooled and processed further industrially, and (b) units which obviously have been shaken are re-centrifuged before further processing.

To Meni: Thanks again. I did some further experimentation in R. When examining the histogram with bin sizes of 1, I get the impression that the distribution ~~is bimodal~~ may be a mixture of two distributions, with ~~a second peak~~ the second distribution peaking somewhere in the vicinity of 8. Something like 0.7*gamma_density(shape=2, scale=1)+0.3*normal_density(mu=8, sd=4) fits reasonably well. For practical purposes (making control charts), however, I think I'll treat the data as exponentially distributed, and use the transformation y=x^0.2777, which according to textbooks on statistical process control should result in a Weibull random variable which is well approximated by the normal distribution. --NorwegianBlue^talk 18:37, 29 October 2007 (UTC)
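The mixture and the power transformation mentioned above can be written out explicitly. This is a sketch in Python rather than R, with hypothetical function names; the weights, shape, scale, mean, standard deviation and exponent are the ones quoted in the post.

```python
import math

def gamma_density(x, shape, scale):
    """Gamma probability density (this sketch assumes shape >= 1)."""
    if x < 0.0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def normal_density(x, mu, sd):
    """Normal probability density."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x):
    """0.7 * gamma(shape=2, scale=1) + 0.3 * normal(mu=8, sd=4), as quoted above."""
    return 0.7 * gamma_density(x, shape=2, scale=1) + 0.3 * normal_density(x, mu=8, sd=4)

def power_transform(x, exponent=0.2777):
    """The transformation y = x^0.2777 proposed for the control charts."""
    return x ** exponent

for x in (1, 2, 5, 8, 15):
    print(f"x = {x:>2}: mixture density = {mixture_density(x):.4f}, "
          f"transformed value = {power_transform(x):.4f}")
```

Note that the normal component puts a small amount of probability below zero, so the mixture is only an approximate description of a nonnegative quantity.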

Within a single bin, the number should be an observation of a random variable with a Poisson distribution. That means that in bins 6, 7 and 8 we have an uncertainty of about ±8 in the frequency of occurrence. The local hump is slight in comparison and insufficient to suggest bimodality. The data roughly fits a log-normal distribution. --Lambiam 21:12, 29 October 2007 (UTC)
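The "about ±8" figure is the square root of the bin count, which is the standard deviation of a Poisson variable with that mean. A minimal Python sketch using the three counts from the table (the dictionary names are invented for the example):

```python
import math

# Occurrences from the table for the bins at concentrations 6, 7 and 8.
bin_counts = {6: 64, 7: 72, 8: 50}

# For a Poisson-distributed count the standard deviation is the square root of the mean.
uncertainty = {c: math.sqrt(n) for c, n in bin_counts.items()}

hump = bin_counts[7] - bin_counts[6]  # how far bin 7 rises above bin 6

for c, n in bin_counts.items():
    print(f"bin {c}: {n} ± {uncertainty[c]:.1f}")
print(f"rise from bin 6 to bin 7: {hump}")
```

The rise from bin 6 to bin 7 is about one standard deviation, which is why it is weak evidence for a second mode.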

However, I cannot get a good fit with a log-normal distribution. For the maximum likelihood fit, the expected # of occurrences for the bin c = 2 is only 218.8, differing wildly from the observed value of 415. --Lambiam 19:34, 30 October 2007 (UTC)

Do all the bins have the same width, or has bin 0 half the width of the other bins because the variable is nonnegative? Bo Jacoby 23:01, 29 October 2007 (UTC)

Name of a Tiling?

I remember reading about a tiling that had sixteen types of tiles and they had to be connected. The tiles looked something like these:

*───
│┌┬┐
│├┼┤
│└┴┘

And I remember reading that they couldn't be tiled periodically. Does anyone know what these tiles are called? --Zemyla^t 17:36, 29 October 2007 (UTC)

I see an asterisk, two line segments, and a square divided into four smaller squares. Are these supposed to suggest several tiles from the set? Maybe our Aperiodic tiling article suggests something to you, but I see nothing there resonating with the image, and no 16-tile set is mentioned. --Lambiam 18:09, 29 October 2007 (UTC)

There are 16 characters. With spaces between them:

* ─ ─ ─ 

│ ┌ ┬ ┐

│ ├ ┼ ┤

│ └ ┴ ┘

PrimeHunter 23:36, 29 October 2007 (UTC)

I don't recognize these tiles, but see Wang tile. —Tamfang 00:45, 30 October 2007 (UTC)

Since they cannot be tiled periodically, are they a Penrose tiling?

Everything I ever knew about tiling and a few things I didn't. Artoftransformation 11:29, 3 November 2007 (UTC)

[Science U Titling]

┌ ─ ┬ ┐

│ * │ │

├ ─ ┼ ┤

└ ─ ┴ ┘

volume of a pyramid

Could I have a problem with some numbers filled in concerning the volume of a pyramid, so I can help my niece with her geometry? It has been 20 years since I saw any of this. Thank you. --Preceding unsigned comment added by 71.161.244.172 (talk) 21:58, 29 October 2007 (UTC)

A pyramid with a base of 1 square meter and a height of 1 meter has a volume of 1/3 cubic meter. You can see this by cutting a cube into 3 equal pyramids, each with one face of the cube as its base. Bo Jacoby 23:31, 29 October 2007 (UTC)

If the base isn't square, the volume is the area of the base times the height, divided by three. This works for any base polygon (square, rectangle, pentagon, etc.) and for a circle or an ellipse. risk 00:26, 30 October 2007 (UTC)

...and for any other measurable set, too. --CiaPan 10:32, 30 October 2007 (UTC)
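To answer the request for a problem with numbers filled in, here is a small worked sketch in Python. The dimensions are invented for the example, and `pyramid_volume` is just a name for the formula given above (base area times height, divided by three).

```python
import math

def pyramid_volume(base_area, height):
    """Volume of a pyramid or cone: base area times height, divided by three."""
    return base_area * height / 3.0

# Made-up practice problems (lengths in meters, volumes in cubic meters).
square = pyramid_volume(6 * 6, 10)           # square base with side 6, height 10
rectangle = pyramid_volume(4 * 9, 5)         # rectangular base 4 by 9, height 5
cone = pyramid_volume(math.pi * 3 ** 2, 4)   # circular base with radius 3, height 4

print(f"square pyramid:      {square} cubic meters")
print(f"rectangular pyramid: {rectangle} cubic meters")
print(f"cone:                {cone:.3f} cubic meters")
```

For the first problem by hand: the base area is 6 × 6 = 36 square meters, times the height 10 gives 360, and dividing by 3 gives 120 cubic meters.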
