Nested sampling algorithm
From Wikipedia, the free encyclopedia
The nested sampling algorithm is a computational approach to the problem of comparing models in Bayesian statistics.
Bayes' theorem can be applied to a pair of competing models M1 and M2 for data D, one of which must be true (though which one is unknown) but both of which cannot simultaneously be true, as follows:

P(M2 | D) / P(M1 | D) = [P(M2) / P(M1)] × [P(D | M2) / P(D | M1)].
Given no a priori information in favor of M1 or M2, it is reasonable to assign prior probabilities P(M1) = P(M2) = 1/2, so that P(M2) / P(M1) = 1. The remaining ratio P(D | M2) / P(D | M1) is not so easy to evaluate, since in general it requires marginalization of nuisance parameters. Generally, M1 has a collection of parameters that can be lumped together and called θ, and M2 has its own vector of parameters that may be of different dimensionality but is still referred to as θ. The marginalization for M1 is

P(D | M1) = ∫ P(D | θ, M1) P(θ | M1) dθ,
and likewise for M2. This integral is often analytically intractable, and in these cases it is necessary to employ a numerical algorithm to find an approximation. The nested sampling algorithm was developed by John Skilling specifically to approximate these marginalization integrals, and it has the added benefit of generating samples from the posterior distribution P(θ | D, M1) [1]. It is an alternative to methods from the Bayesian literature [2] such as bridge sampling and defensive importance sampling, though there is some controversy as to its soundness compared with these methods.
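Once the two evidences P(D | M1) and P(D | M2) have been estimated, the model comparison above reduces to a few lines of arithmetic. The following is a minimal sketch (not part of the original program listings below); logZ1 and logZ2 stand for the log-evidences returned by two nested-sampling runs, as in the logZ field of the R code further down, and the prior model probabilities are assumed equal by default.

## Sketch: turn two log-evidences into a posterior model probability.
## logZ1 and logZ2 are hypothetical outputs of two nested-sampling runs.
posterior.prob.M1 <- function(logZ1, logZ2, prior.M1 = 0.5)
{
    prior.M2 <- 1 - prior.M1
    log.B21 <- logZ2 - logZ1          # log Bayes factor for M2 over M1
    ## P(M1 | D) = 1 / (1 + B21 * P(M2) / P(M1))
    1 / (1 + exp(log.B21) * prior.M2 / prior.M1)
}

The same computation appears, specialized to two Gaussian hypotheses, at the end of the P.same.gaussian example below.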
Here is a simple version of the nested sampling algorithm, followed by a description of how it computes the marginal probability (the evidence) Z = P(D | M), where M is either M1 or M2:
Start with N points θ1, ..., θN sampled from the prior, and set Z := 0, X(0) := 1.
for i = 1 to j do               % the number of iterations j is chosen by guesswork
    Li := min(current likelihood values of the points)
    Xi := exp(−i / N)           % estimate of the prior mass still enclosed
    wi := X(i−1) − Xi           % prior mass of the shell just peeled off
    Z  := Z + Li * wi
    Save the point with least likelihood as a sample point with weight wi.
    Update the point with least likelihood with some Markov chain Monte Carlo steps
        according to the prior, accepting only steps that keep the likelihood above Li.
end
return Z
At each iteration, Xi is an estimate of the amount of prior mass covered by the hypervolume in parameter space of all points with likelihood greater than that of θi. The weight factor wi is an estimate of the amount of prior mass that lies between the two nested hypersurfaces {θ | P(D | θ, M) = P(D | θ(i−1), M)} and {θ | P(D | θ, M) = P(D | θi, M)}. The update step Z := Z + Li * wi accumulates the sum over i of Li * wi, which numerically approximates the integral

Z = P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ.
The idea is to chop up the range of f(θ) = P(D | θ, M) and to estimate, for each interval [f(θ(i−1)), f(θi)], how likely a randomly chosen θ is, a priori, to map into that interval. This can be thought of as a Bayesian's way to numerically implement Lebesgue integration.
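To make the weights wi and the running sum concrete, here is a toy sketch (not from the original article) for a one-dimensional problem in which the evidence is known: a Gaussian-shaped likelihood of width 0.1 centred at θ = 0.5 under a Uniform(0, 1) prior, so that Z ≈ 0.1 × sqrt(2π) ≈ 0.2507. Naive rejection sampling from the prior stands in for the MCMC step, which is only workable because the problem is this simple.

## Toy sketch of the pseudocode above on a 1-D problem with known evidence.
logL <- function(theta) -(theta - 0.5)^2 / (2 * 0.1^2)   # log-likelihood

N <- 100                         # number of "live" points
theta <- runif(N)                # points drawn from the Uniform(0,1) prior
Z <- 0                           # running evidence estimate
X.prev <- 1                      # remaining prior mass, X(0) = 1
for (i in 1:600) {
    worst <- which.min(logL(theta))     # index of the least likely point
    Li <- exp(logL(theta[worst]))       # its likelihood
    Xi <- exp(-i / N)                   # estimated remaining prior mass
    wi <- X.prev - Xi                   # width of the shell just peeled off
    Z <- Z + Li * wi
    X.prev <- Xi
    ## Replace the worst point by a fresh prior draw whose likelihood
    ## exceeds Li (rejection sampling in place of the MCMC step).
    repeat {
        candidate <- runif(1)
        if (logL(candidate) > logL(theta[worst])) break
    }
    theta[worst] <- candidate
}
Z   # typically within 10-20% of the true value 0.2507

Accumulating Z in linear scale is harmless here because the likelihood values are of order one; the full implementation below works in log space (see log.plus) to avoid numerical underflow.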
The pseudocode given here omits some important implementation details and will not produce acceptable results if translated directly into a language such as C; it is intended only to convey the idea of the algorithm. A practical implementation in C can be downloaded from http://www.inference.phy.cam.ac.uk/bayesys/sivia.
Here is a version in the R language for statistical computing:
# NESTED SAMPLING MAIN PROGRAM
# (GNU General Public License software, (C) Sivia and Skilling 2006)
# translated to R by Issac Trotts in 2007

uniform <- function() runif(1, 0.0, 1.0)

# given x = log(a) and y = log(b), return log(a + b)
log.plus <- function(x, y)
{
    if (x > y) x + log(1 + exp(y - x)) else y + log(1 + exp(x - y))
}

DBL.MAX <- 1.79769e+308

nested.sampling <- function(n, rprior, explore, max.iter = 100)
{
    Samples <- list()      # weighted posterior samples
    H <- 0.0               # information, initially 0
    logZ <- -DBL.MAX       # ln(evidence Z), initially ln(0)

    ## Set prior objects
    Obj <- list()          # collection of n objects
    for (i in 1:n) {
        Obj[[i]] <- rprior()
        stopifnot(!is.null(Obj[[i]]$logL))
    }

    ## Outermost interval of prior mass
    logwidth <- log(1.0 - exp(-1.0 / n))   # ln(width in prior mass)

    for (nest in 1:max.iter) {
        ## Find worst object in collection, with Weight = width * Likelihood
        worst <- which.min(sapply(Obj, function(o) o$logL))
        Obj[[worst]]$logWt <- logwidth + Obj[[worst]]$logL

        ## Update evidence Z and information H
        logZnew <- log.plus(logZ, Obj[[worst]]$logWt)
        H <- exp(Obj[[worst]]$logWt - logZnew) * Obj[[worst]]$logL +
             exp(logZ - logZnew) * (H + logZ) - logZnew
        logZ <- logZnew

        ## Posterior samples (optional)
        Samples[[nest]] <- Obj[[worst]]

        ## Kill worst object in favour of a copy of a different survivor
        logLstar <- Obj[[worst]]$logL     # new log-likelihood constraint
        if (n > 1) {                      # don't kill the worst object if it's the only one
            icopy <- (1:n)[-worst][sample.int(n - 1, 1)]
            Obj[[worst]] <- Obj[[icopy]]  # overwrite worst object (deep copy)
        }

        ## Evolve copied object within constraint
        Obj[[worst]] <- explore(Obj[[worst]], logLstar)
        stopifnot(!is.null(Obj[[worst]]$logL))

        ## Shrink interval
        logwidth <- logwidth - 1.0 / n
    }

    ## Exit with evidence Z, information H, and optional posterior samples
    list(logZ = logZ, logZ.sd = sqrt(H / n), Hnats = H, Hbits = H / log(2),
         samples = Samples)
}

# This function can be used to generate the explore function in simple
# situations.
#
# u.to.x is a function that maps [0,1]^dim to the parameter space.
make.explorer <- function(logLhood, u.to.x, step = 0.1, num.steps = 20)
{
    function(Obj, logLstar)
    {
        Ret <- Obj        # current state of the evolving object
        accept <- 0       # number of MCMC acceptances
        reject <- 0       # number of MCMC rejections
        Try <- list()     # trial object
        for (m in 1:num.steps) {   # pre-judged number of MCMC steps
            ## Trial object: step a random amount within a cube around the
            ## current point, wrapping around if we leave [0,1]^N.
            N <- length(Ret$u)
            Try$u <- (Ret$u + step * (2 * runif(N) - 1)) %% 1.0
            Try$x <- u.to.x(Try$u)
            Try$logL <- logLhood(Try)      # trial likelihood value

            ## Accept if and only if within the hard likelihood constraint
            if (Try$logL > logLstar) {
                Ret <- Try
                accept <- accept + 1
            } else {
                reject <- reject + 1
            }

            ## Refine the step size so the acceptance ratio converges near 50%
            if (accept > reject) step <- step * exp(1.0 / accept)
            if (accept < reject) step <- step / exp(1.0 / reject)
        }
        Ret
    }
}
An example of its use is the lighthouse problem:
# apply.c "LIGHTHOUSE" NESTED SAMPLING APPLICATION # (GNU General Public License software, (C) Sivia and Skilling 2006) # translated to R by Issac Trotts in 2007. # # u=0 u=1 # ------------------------------------- # y=2 |:::::::::::::::::::::::::::::::::::::| v=1 # |::::::::::::::::::::::LIGHT::::::::::| # north|::::::::::::::::::::::HOUSE::::::::::| # |:::::::::::::::::::::::::::::::::::::| # |:::::::::::::::::::::::::::::::::::::| # y=0 |:::::::::::::::::::::::::::::::::::::| v=0 # --*--------------*----*--------*-**--**--*-*-------------*-------- # x=-2 coastline -->east x=2 # Problem: # Lighthouse at (x,y) emitted n flashes observed at D[.] on coast. # Inputs: # Prior(u) is uniform (=1) over (0,1), mapped to x = 4*u - 2; and # Prior(v) is uniform (=1) over (0,1), mapped to y = 2*v; so that # Position is 2-dimensional -2 < x < 2, 0 < y < 2 with flat prior # Likelihood is L(x,y) = PRODUCT[k] (y/pi) / ((D[k] - x)^2 + y^2) # Outputs: # Evidence is Z = INTEGRAL L(x,y) Prior(x,y) dxdy # Posterior is P(x,y) = L(x,y) / Z estimating lighthouse position # Information is H = INTEGRAL P(x,y) log(P(x,y)/Prior(x,y)) dxdy ###################################################################### ###################################################################### ## object layout ## ------------- ## u Uniform-prior controlling parameter for position (2-vector) ## x Geographical position of lighthouse (2-vector) ## logL logLikelihood = ln Prob(data | position) ## logWt log(Weight), adding to SUM(Wt) = Evidence Z # Data set: arrival positions D <- c( 4.73, 0.45, -1.73, 1.09, 2.19, 0.12, 1.31, 1.00, 1.32, 1.07, 0.86, -0.49, -2.59, 1.73, 2.11, 1.61, 4.98, 1.71, 2.23,-57.20, 0.96, 1.25, -1.56, 2.45, 1.19, 2.17,-10.66, 1.91, -4.16, 1.92, 0.10, 1.98, -2.51, 5.55, -0.47, 1.91, 0.95, -0.78, -0.84, 1.72, -0.01, 1.48, 2.70, 1.21, 4.41, -4.79, 1.33, 0.81, 0.20, 1.58, 1.29, 16.19, 2.75, -2.38, -1.79, 6.50, -18.53, 0.72, 0.94, 3.64, 1.94, -0.11, 1.57, 0.57); xmin <- c(-2,0) xmax <- c(2,2) logLhood <- function(obj) { x <- obj$x[1] # Easterly position y <- obj$x[2] # Northerly position sum(log((y/pi) / ((D-x)*(D-x) + y*y))) } u.to.x <- function(u) { xmin+(xmax-xmin)*u } sample.from.prior <- function() { Obj <- list() Obj$u <- runif(2) Obj$x <- u.to.x(Obj$u) Obj$logL <- logLhood(Obj) Obj } ## Output posterior properties, here mean and stddev of x,y show.results <- function(results) { x <- c(0,0); xx <- c(0,0); # 1st and 2nd moments for(s in results$samples) { w <- exp(s$logWt - results$logZ); # proportional weight x <- x + w * s$x; xx <- xx + w * s$x * s$x; } cat(sprintf("Evidence: ln(Z) = %g +- %g\n",results$logZ, results$logZ.sd)) cat(sprintf("Information: H = %g nats = %g bits\n", results$Hnats, results$Hbits)) cat('mean(x) =',x[1], 'stddev(x) =',sqrt(xx-x*x)[1],'\n') cat('mean(y) =',x[2], 'stddev(x) =',sqrt(xx-x*x)[2],'\n') } t0 <- proc.time() results <- nested.sampling(n=100, max.iter=1000, rprior=sample.from.prior, explore=make.explorer(logLhood, u.to.x)) show.results(results) cat('Time elapsed:',(proc.time()-t0)[1],'seconds\n')
Another example shows a Bayesian alternative to the t-test:
## Given two vectors A and B, this function evaluates the
## probability of the hypothesis
##    H1 = "Their elements are from the same Gaussian distribution."
## versus
##    H2 = "They are from two different Gaussians.  Either the
##          mean is different, or the variance is different, or both."
## on the assumption that P(H1 or H2) = 1 (and P(H1 and H2) = 0).
P.same.gaussian <- function(A, B,
                            mumin = -10, mumax = 10,
                            sdmin = (mumax - mumin) / 300,
                            sdmax = (mumax - mumin) / 3,
                            n = 100, max.iter = 1000, P.H1 = 0.5)
{
    log.P.D.given.H1 <- {
        xmin <- c(mumin, sdmin)
        xmax <- c(mumax, sdmax)
        u.to.x <- function(u) xmin + (xmax - xmin) * u
        data <- c(A, B)
        logLhood <- function(obj) {
            sum(dnorm(data, mean = obj$x[1], sd = obj$x[2], log = TRUE))
        }
        rprior <- function() {
            Obj <- list()
            Obj$u <- runif(2)
            Obj$x <- u.to.x(Obj$u)
            Obj$logL <- logLhood(Obj)
            Obj
        }
        result <- nested.sampling(n = n, max.iter = max.iter,
                                  rprior = rprior,
                                  explore = make.explorer(logLhood, u.to.x,
                                                          step = 0.1,
                                                          num.steps = 20))
        result$logZ
    }

    log.P.D.given.H2 <- {
        xmin <- rep(c(mumin, sdmin), 2)
        xmax <- rep(c(mumax, sdmax), 2)
        u.to.x <- function(u) xmin + (xmax - xmin) * u
        logLhood <- function(obj) {
            sum(dnorm(A, mean = obj$x[1], sd = obj$x[2], log = TRUE)) +
            sum(dnorm(B, mean = obj$x[3], sd = obj$x[4], log = TRUE))
        }
        rprior <- function() {
            Obj <- list()
            Obj$u <- runif(4)
            Obj$x <- u.to.x(Obj$u)
            Obj$logL <- logLhood(Obj)
            Obj
        }
        result <- nested.sampling(n = n, max.iter = max.iter,
                                  rprior = rprior,
                                  explore = make.explorer(logLhood, u.to.x,
                                                          step = 0.1,
                                                          num.steps = 20))
        result$logZ
    }

    P.H2 <- 1 - P.H1
    1 / (1 + exp(log.P.D.given.H2 - log.P.D.given.H1) * P.H2 / P.H1)
}

test.P.same.gaussian <- function(N = 100, mean = 0)
{
    P.same.gaussian(rnorm(N, mean = mean), rnorm(N, mean = mean))
}

test.P.same.gaussian.2 <- function(N = 100, mean1 = -2, mean2 = 2)
{
    P.same.gaussian(rnorm(N, mean = mean1), rnorm(N, mean = mean2))
}
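A brief usage sketch (not part of the original listing); the sample sizes and means here are arbitrary illustrations:

## Hypothetical usage: well-separated means should give a probability
## near 0 that A and B come from the same Gaussian.
set.seed(1)
A <- rnorm(50, mean = 0)
B <- rnorm(50, mean = 2)
P.same.gaussian(A, B)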
References
[1] Skilling, John (2006). "Nested Sampling for General Bayesian Computation". International Society for Bayesian Analysis.
[2] Chen, M.-H., Shao, Q.-M., and Ibrahim, J. G. (2000). Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York.