Talk:Empirical Bayes method



As far as I understand, from the modern Bayesian perspective empirical Bayes is about hierarchical Bayesian models and about learning the parameters of a prior distribution by sharing that distribution across different data sets. In other words, it is an assumption that these data sets are conditionally independent given the prior parameters. I don't detect any of that in this article. Am I missing the point or something?
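(To sketch the setup being described, with notation that is illustrative rather than taken from the article: the hierarchical model has

:<math>y_i \mid \theta_i \sim p(y \mid \theta_i), \qquad \theta_i \mid \eta \sim p(\theta \mid \eta), \qquad i = 1, \dots, k,</math>

with the data sets conditionally independent given the shared hyperparameter η, which is then estimated from the pooled data.)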

I don't see how you're failing to detect it. Look at the quantity called Θ in the article. Michael Hardy 22:44, 30 July 2006 (UTC)


So the point is that there are several approaches to Empirical Bayes, and what is presented here is simply the "Robbins" method (see Carlin and Louis).

In a broader context, Empirical Bayes is about using empirical data to estimate and/or evaluate the marginal distribution that arises in the posterior, from Bayes' theorem. For simple models (with simple conjugate priors such as Beta-Binomial, Gaussian-Gaussian, Poisson-Gamma, Multinomial-Dirichlet, Uniform-Pareto, etc.), there are several simple and elegant results that basically estimate the marginal by maximum likelihood estimation (MLE), and then give a simple point estimate for the posterior (i.e. a point estimate for the prior). The basic results are quite easy to interpret (i.e. as a linear regression) and to implement. It would be nice to have discussion of these topics here, and some summary of the basic models.

A good example would be to work out the Beta-Binomial model, since this model is somewhat complicated and is a good starting point for modeling small, discrete data sets.
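As a rough illustration of what such a section could cover, here is a minimal sketch only, with made-up data and illustrative names (it assumes NumPy/SciPy and is not taken from any reference):

<syntaxhighlight lang="python">
# Sketch of parametric empirical Bayes for the Beta-Binomial model:
# fit the hyperparameters (alpha, beta) by maximum likelihood on the
# marginal (beta-binomial) distribution, then use the posterior mean
# under the fitted prior as a point estimate for each unit's rate.
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

successes = np.array([3, 0, 7, 2, 5])     # y_i: observed successes (made-up data)
trials    = np.array([10, 8, 12, 9, 11])  # n_i: number of trials per unit

def neg_marginal_loglik(log_params):
    """Negative log of the beta-binomial marginal, up to the binomial coefficient."""
    alpha, beta = np.exp(log_params)  # work on the log scale to keep alpha, beta > 0
    ll = betaln(alpha + successes, beta + trials - successes) - betaln(alpha, beta)
    return -ll.sum()

res = minimize(neg_marginal_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(res.x)

# Empirical-Bayes point estimates: each raw rate y_i/n_i is shrunk toward the
# pooled prior mean alpha_hat / (alpha_hat + beta_hat).
theta_hat = (successes + alpha_hat) / (trials + alpha_hat + beta_hat)
print(alpha_hat, beta_hat, theta_hat)
</syntaxhighlight>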

Once the basic ideas are laid out, it would be good to then add sections on computational methods for complex models, such as Expectation Maximization, Markov Chain Monte Carlo, etc.

Some applications would be nice too, such as modeling consumer marketing data (which is what I do with it). —Preceding unsigned comment added by Charlesmartin14 (talk • contribs)

Deletion of a misguided paragraph

This section was recently added to the article (this version is after some cleanups including misspellings, fixing links, some TeX corrections, etc.):

Bayes' theorem
In the Bayesian approach to statistics, we consider the problem of estimating some future outcome or event, based on measurements of our data, a model for these measurements, and some model for our prior beliefs about the system. Let us consider a standard two-stage model, where we write our data measurements as a vector y = (y_1, y_2, \dots, y_n), and our prior beliefs as some vector of random unknowns θ. We assume we can model our measurements with a conditional probability distribution (the likelihood) Pr(y | θ), and also the prior as Pr(θ | η), where η is some hyper-parameter. For example, we might choose Pr(y | θ) to be a binomial distribution, and Pr(θ | η) a Beta distribution (the conjugate prior). Empirical Bayes then employs empirical data to make inferences about the prior θ, and then plugs this into the likelihood Pr(y | θ) to make estimates for future outcomes.

This could leave the impression that empirical Bayes methods are an instance of the Bayesian approach to statistics. But that is incorrect: the Bayesian approach is about the degree-of-belief interpretation of probability.

This could also leave the impression that the Bayesian approach is about estimating FUTURE outcomes or events. It's not. (In some cases it may be about the future, but that's certainly nowhere near essential.)

This characterization of "likelihood" fails to make clear that the likelihood is a function of θ and not of y. It also works only for discrete distributions, whereas likelihood is more general than that.

It uses the word "hyperparameter" without explanation.

After the words "for example", the examples are far too terse to be comprehensible. The examples the article already gave are comprehensible. More could be useful if they were presented in the same way. Michael Hardy 21:23, 27 September 2006 (UTC)


This is an attempt to provide some additional information about how Empirical Bayes is done in practice, and the basic formulation from first principles, as opposed to just saying "use Bayes' theorem and the result pops out!" It is a first start... these things do take time.

Some comments:

(1) I am not sure what you mean by saying the likelihood is not a function of y. Perhaps you mean it is not a function of the data? The point is that Empirical Bayes will eventually plug this in.

(2) Add a section on the hyperparameter to define it... why did you delete it?

(3) If you don't like the word future, then change it. Why did you delete everything?

(4) You leave the impression that Empirical Bayes is nothing more than Robbins' method?! Under your logic, I would delete the entire page and just start from scratch.

(5) Empirical Bayes (EB) is an approach to Bayesian statistics which combines the Bayesian formalism and empirical data. Again, what's the problem? Here the issue is more between the Robbins-style non-parametric EB and the Carlin and Louis-style parametric EB.

(6) The examples ARE NOT comprehensible... you did not explain anything except give Robbins' formula and explain how to plug in the results... you need to explain WHY Robbins is doing what he is doing in the more general context, based on the rigor and presentation in other areas of probability theory on Wikipedia. For example, you could explain that Robbins' method is actually the Bayes estimate of the prior under squared error loss.
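(For the classic Poisson compound-sampling example, the sketch would be roughly: the Bayes estimate under squared error loss is the posterior mean,

:<math>\operatorname{E}[\theta \mid Y = y] = \frac{\int \theta \,\tfrac{\theta^{y} e^{-\theta}}{y!}\, dG(\theta)}{\int \tfrac{\theta^{y} e^{-\theta}}{y!}\, dG(\theta)} = \frac{(y+1)\, m_G(y+1)}{m_G(y)},</math>

where m_G is the marginal under the unspecified prior G; Robbins' rule then replaces m_G by the empirical frequencies of the observed counts. Notation here is a sketch, not the article's.)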

The point of the article should be to explain the primary models involved with some rigor and derivation, such as the Beta-Binomial, the Poisson-Gamma, etc., since these are commonly used and not explained elsewhere.

The likelihood function is L(θ) = f_θ(x), where f_θ is the probability density function given that θ is the value of the parameter. The argument to the function L(θ) is of course θ.
No, empirical Bayes is more than Robbins' examples. I have no problem with additional examples.
There's nothing inherently Bayesian about empirical Bayes methods. Empirical Bayes methods that are most often used are not Bayesian. The mere fact that Bayes' theorem is used does not make it any more Bayesian than it would otherwise be. Bayesianism is the degree-of-belief interpretation, as opposed to the frequency interpretation or some others, of probability.
The example I referred to is one that you very tersely mentioned but did not explain. The example that was already there was explained. In the examples section below the paragraph I criticized, you could add some fully explained examples. Michael Hardy 22:48, 27 September 2006 (UTC)


This issue about the functional dependence is merely notation... I am simply following the convention in the Wikipedia entry on Bayes' theorem and other well-known treatises on conditional distributions, such as Carlin and Louis and various papers. It would be good to have a consistent notation across the Wikipedia page after some other issues are cleaned up. I fail to see the need to use the term "bound variable", because that is really confusing to anyone who is not a programmer, and especially here, since the point of Empirical Bayes is to "unbind the variables" and approximate them with their empirical counterparts in the marginal.

I am not a programmer. The term "bound variable" is older than computers and older than computer software. Michael Hardy 17:33, 29 September 2006 (UTC)


The formulae presented are fine for discrete and continuous distributions? What specifically do you mean?

As for being inherently Bayesian, the point is that Empirical Bayes methods use empirical data to approximate the Bayesian marginal and/or posterior distribution under certain approximations (such as squared error loss, Stein estimation, maximum likelihood estimation (MLE), etc.), or they may use computational methods to approximate the marginal (Gaussian quadrature, Metropolis Monte Carlo, Markov chain Monte Carlo, etc.). This is true with Robbins' method, with the Beta-Binomial model, with Bayesian regression, etc. Each "example" uses some combination of these approximations (i.e. Robbins is a point estimate assuming a non-informative, unspecified prior and squared error loss).

The current explanation of Robbins' method for Empirical Bayes does not clearly explain how the marginal is being approximated... indeed, you just refer to it as the "normalizing constant", and while there is a Wikipedia entry on this, it is just not transparent and not the terminology used in some of the popular literature on Empirical Bayes (Carlin and Louis, Rossi, etc.).

It is also confusing since it is not actually constant (i.e. it will be a function of any hyperparameters as well as the data that appears in the likelihood).
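(Concretely, what is being estimated is the marginal

:<math>m(y \mid \eta) = \int p(y \mid \theta)\, p(\theta \mid \eta)\, d\theta,</math>

which depends on both the data y and any hyperparameter η; it is "constant" only in the sense of not depending on θ. The notation here is a sketch, not the article's.)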

You should take out the comments like "That is why the word Bayes appears" and "that is why the word empirical appears" and, instead, explain concisely but from general first principles what is going on. I have tried to add some of this in the introduction. One good formulation, at least for Robbins' method, is to show that Bayes' rule arises as a point estimate when you minimize the posterior squared error loss (i.e. risk minimization), and it takes about 3-4 lines of basic calculus.
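(A sketch of that calculation: for any point estimate a,

:<math>\operatorname{E}\!\left[(\theta - a)^2 \mid y\right] = a^2 - 2a \operatorname{E}[\theta \mid y] + \operatorname{E}[\theta^2 \mid y],</math>

and setting the derivative with respect to a to zero gives a = E[θ | y], so the posterior mean is the Bayes estimate under squared error loss.)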

I have included some of this and will need to fix up the notation to complete it... IMHO it would be important to present the basic derivations and their motivations.


Your example with the normal distribution is incomplete because you are describing, again, a very specific case of Bayesian regression, whereas a more complete discussion would at least include the parametric point estimates for the Gaussian and its conjugate priors (either for unknown mean and known variance, or for unknown mean and unknown variance).

Most importantly, the article should explain the difference between non-parametric and parametric EB, and also discuss the basic results of parametric EB and point estimation, which include the notions of information "borrowing," "shrinkage," and the trade-off between bias and variance.
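(For example, in the simplest parametric normal case, with illustrative notation: if y_i | θ_i ~ N(θ_i, σ²) and θ_i ~ N(μ, τ²), then

:<math>\operatorname{E}[\theta_i \mid y_i] = B\mu + (1 - B)\, y_i, \qquad B = \frac{\sigma^2}{\sigma^2 + \tau^2},</math>

and parametric EB estimates μ and τ² from the marginal y_i ~ N(μ, σ² + τ²); the shrinkage factor B is exactly where the "borrowing" and the trade-off between bias and variance appear. This is only a sketch, not a claim about what the article currently says.)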

I have cleaned up the proof considerably by explaining why we can use a point estimate for the prior and clarified that we are essentially estimating the marginals. This will also make it easier to add a section on parametric Empirical Bayes. Clearly much more work is needed!

The next planned step is to add a section on Parametric Empirical Bayes, to derive the Beta-Binomial model, and to provide a numerical example.

Writing Pr(y|θ) is at best valid only for discrete distributions. "Pr" means probability, and must be between 0 and 1 (inclusive). For continuous distributions, you need probability density, which, of course, need not be less than 1, and writing "Pr" for probability density seems rather strange to me. Michael Hardy 17:35, 29 September 2006 (UTC)
The current version of the page already uses a more common notation.

I have now sketched out mathematical details for the so-called "example for a normal distribution". There are some subtleties here I have avoided (such as a proper derivation of the conjugate priors), among other things.

It would be good to create an example which uses the results of the mathematical derivation. It would also be good to add specific sections for the Beta-Binomial model, which is non-trivial, and, perhaps, the multivariate linear regression model (which is commented on in Estimation of covariance matrices).

Some confusions

In the introduction, I understand most of it up to the sentence "Empirical Bayes then employs the complete set of empirical data to make inferences about the prior θ, and then plugs this into the likelihood ρ(y | θ) to make estimates for future outcomes of individual measurements." First of all, what kind of estimation won't use the "complete set of empirical data"? I am not an expert in statistics, so forgive me if I am ignorant. As far as I know, for estimation it is always better to use the entire dataset, unless you are talking about cross-validation. Secondly, what do you mean by "plugs this estimation into the likelihood"? It would be better to state this as a mathematical equation. I am guessing you are saying that one predicts new data y_new by p(y_new | y_old) = p(y_new | θ) p(θ | y_old). Is that correct? —Preceding unsigned comment added by 76.121.137.101 (talk) 10:23, 18 February 2008 (UTC)
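(For reference, a sketch of the two formulas usually meant here: the fully Bayesian posterior predictive is

:<math>p(y_{\text{new}} \mid y_{\text{old}}) = \int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y_{\text{old}})\, d\theta,</math>

while the empirical-Bayes "plug-in" described in the introduction replaces the integral with the likelihood evaluated at a point estimate of θ. This is a general sketch, not a quotation from the article.)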