Computer experiment


In the scientific context, a computer experiment typically involves two phases: a modelling phase and an experimentation phase. The modelling phase can be seen as a replacement for traditional mathematical modelling, and the experimentation phase as a replacement for traditional in vivo and in vitro experiments. It has become common to call such experiments in silico.


Computer simulation as a building block of a computer experiment

In a computer simulation, a "computer" model typically replaces a traditional mathematical model. Whereas a mathematical model is traditionally solved analytically, a computer model can be solved numerically: this is what a computer simulation of a system (typically a physical system) is about. Sometimes an analytical solution to a mathematical model is not known, but a computer simulation can find an approximate solution; this typically happens with differential equations.
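As a purely illustrative sketch of this idea (the model, step size and parameter values below are chosen for simplicity and are not drawn from any particular application), the following Python fragment integrates a pendulum equation, which has no elementary closed-form solution, with the forward Euler method to obtain an approximate trajectory:

```python
import numpy as np

def simulate_pendulum(theta0, omega0, g=9.81, length=1.0, dt=1e-3, t_end=10.0):
    """Numerically integrate theta'' = -(g/length) * sin(theta), a model with
    no elementary closed-form solution, using forward Euler (illustrative
    only; not a production solver)."""
    steps = int(t_end / dt)
    theta, omega = theta0, omega0
    trajectory = np.empty(steps)
    for i in range(steps):
        theta, omega = theta + dt * omega, omega - dt * (g / length) * np.sin(theta)
        trajectory[i] = theta
    return trajectory

# Example: release the pendulum from 1 radian at rest.
angles = simulate_pendulum(theta0=1.0, omega0=0.0)
```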

In a computer experiment a computer model is used to make inferences about some underlying system. The idea is that the computer model takes the place of an experiment we cannot do: the phrase in silico experiment is also used. At the moment, for example, the debate on climate change is being informed largely by evaluations of climate simulators running on some of the largest computers in the world, which are being used to investigate the impact of a substantial increase in the atmospheric concentration of greenhouse gases like carbon dioxide. In this case, the accumulation of many simulations with different initial conditions forms an experiment.

Computer experiments and statistics

Computer experiments can be seen as a branch of applied statistics, because the user must account for three sources of uncertainty. First, the models often contain parameters whose values are not certain; second, the models themselves are imperfect representations of the underlying system; and third, data collected from the system that might be used to calibrate the models are imperfectly measured. However, it is fair to say that most practitioners of computer experiments do not see themselves as statisticians.

History

The first computer experiments were probably conducted at Los Alamos National Laboratory to study the behaviour of nuclear weapons. Since then, the use of computer models has branched out into large parts of the physical and environmental sciences (where they are sometimes referred to as process models), and in medicine. Because computer experiments have developed in such a wide range of applications there is little standardisation of the terminology.

Preliminary remark

As a general guide, in this article learning about the model parameters using data from the system is referred to as (model) calibration, while learning about the system behaviour itself is referred to as (system) prediction. Combining both of these, e.g., using the model and system data to make predictions about the system, is referred to as calibrated prediction. Other terminology is discussed below, in the section on the "traditional" approach.

Constructing a simulator

The simulator is the computer code that we actually evaluate: the outputs of the simulator correspond, usually directly, to measurable aspects of the system. It is important to understand the process of creating a simulator, because this allows us to make judgements about how similar two or more simulators of the same system are. Without this information it is difficult to combine information from different simulators, because we do not know to what extent we can treat them as independent sources of information. See also Computer simulation.

In most applications a simulator has three parts: the model, the treatment and the solver. Thus we might write

Simulator = Model + Treatment + Solver

Differences in each of these three components give rise to different simulators.
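To make the decomposition concrete, a minimal Python sketch is given below. The class and field names are hypothetical and do not come from any particular modelling package; they simply mirror the three components named above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Model:
    state_variables: List[str]   # e.g. ["velocity", "temperature", "salinity"]
    equations: Callable          # right-hand side of the governing equations

@dataclass
class Treatment:
    initial_conditions: Dict     # state vector at the start time
    boundary_conditions: Dict    # e.g. ocean margins and topography
    forcing: Callable            # external influences as a function of time

def simulator(model: Model, treatment: Treatment, solver: Callable):
    """Compose the three components into a single runnable simulator:
    Simulator = Model + Treatment + Solver."""
    def run(t_end: float):
        return solver(model, treatment, t_end)
    return run
```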

The model

The subject of models is a large one: see Model (abstract) for an introduction and further links. Our starting point is a mathematical model for the system of interest. In the physical sciences a model typically describes the state variables, plus the fundamental laws and equations of state that govern how those variables exist and evolve in space and time.

For example, suppose you were interested in building an ocean model. Then you might proceed as follows:

  • State variables: Velocity (in each of three directions), pressure, temperature, salinity, density
  • Equations of state: Relationship of density to temperature, salinity and pressure, and perhaps also a model for the formation of sea-ice

The state variables for the ocean model are expressed as a continuum in space and time, and the fundamental laws as partial differential equations. Even at this stage, though, simplifications may be made. For example, it is common to treat seawater as incompressible. Furthermore, equations of state are often specified by empirical relationships based on laboratory experiments.
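As an illustration of such an empirical relationship (the symbols below are generic, not values from any particular ocean model), a commonly used simplification is a linear equation of state relating density to temperature and salinity about a reference state, ignoring the dependence on pressure:

\rho(T, S) = \rho_0 \left[ 1 - \alpha (T - T_0) + \beta (S - S_0) \right]

where \alpha is the thermal expansion coefficient and \beta the haline contraction coefficient.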

In the "less-physical" sciences the models tend to be constructed with respect to the key processes, using what are sometimes referred to as compartmental models. (Example to be supplied here.) This also happens with some physical models. For example, a simple model of an ocean such as the Atlantic might divide the ocean into four compartments: 'south', 'tropical', 'north' and 'deep'.

The treatment

The treatment is what makes the model applicable to a particular instance. For example, it is what makes our model of the ocean applicable to the earth during the period 1750-2100. The treatment in this case comprises boundary conditions that describe the ocean margins and topography, initial conditions that quantify the state vector (velocity, pressure, etc.) at every location at the start of 1750, and forcing functions that describe external influences on the oceans over the period 1750-2100. These forcings mainly describe events at the surface of the ocean, such as temperature, winds, and exchanges of freshwater through evaporation and precipitation. 'Historic' values of forcing can be inferred from data, while 'future' values, i.e. those from today to 2100, will be specified according to a particular scenario.
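A treatment can be thought of as a bundle of configuration data. The sketch below shows one hypothetical way of organising a treatment for the 1750-2100 ocean example; the field names, file names and labels are illustrative only and not taken from any real modelling system.

```python
# Hypothetical layout of a treatment for the 1750-2100 ocean example.
treatment = {
    "boundary_conditions": {
        "bathymetry_file": "ocean_topography.nc",  # margins and topography
    },
    "initial_conditions": {
        # state vector (velocity, pressure, temperature, salinity, density)
        # at every location at the start of 1750
        "state_file": "state_1750.nc",
    },
    "forcing": {
        # 'historic' surface forcing inferred from data ...
        "historic": {"period": "1750 to present", "source": "inferred from data"},
        # ... and 'future' forcing specified by a chosen scenario
        "future": {"period": "present to 2100", "source": "specified by a scenario"},
    },
}
```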

Note that in large-scale climate modelling we couple an ocean model with an atmosphere model, so that the forcing at the margin between the ocean and the atmosphere does not have to be prescribed, but can be inferred. At the moment this type of coupling is a bit of a black art. The forcing in these coupled models tends to be on the atmosphere: things like orbital effects, and atmospheric concentrations of greenhouse gases like CO2.

The solver

Finally, the solver turns the model and the treatment into a calculation that approximates the evolution of the state vector. At this point it is usually necessary to discretise the problem, which involves replacing the continuum with a lattice of discrete points. For an ocean simulator, the earth's surface might be divided into rectangles, and the ocean itself into a number of layers. This division is typically fixed for a given simulator, and the number of cells is referred to as the simulator's resolution. Time is also discretised, although it is often possible to treat the step-size between adjacent time-points in quite a sophisticated manner, so that it adapts to the needs of the calculation.

Discretisation allows us to approximate the solution of the model for our given treatment, but it introduces problems that can necessitate further adjustment. There may be processes with characteristic scales that are smaller than a grid cell, or a time-step. These don't get picked up by the simulator, which behaves as though the state vector is constant over each cell and time-step. These so-called sub-grid-scale processes need to be put back in if they are thought to be a large component of the model. So the solver also includes an approximation for these processes: the impact of this approximation should go to zero as the simulator resolution becomes large.
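The idea of replacing the continuum with a lattice and stepping forward in time can be illustrated with a deliberately simple example. The sketch below discretises a one-dimensional diffusion equation with an explicit finite-difference scheme; real ocean solvers are far more elaborate, and the grid size, boundary treatment and coefficient here are chosen only for illustration.

```python
import numpy as np

# Illustration of discretisation: du/dt = kappa * d^2u/dx^2 on a lattice
# of n points with periodic boundaries, stepped forward with forward Euler.
n, kappa = 100, 1.0
dx = 1.0 / n
dt = 0.4 * dx**2 / kappa          # satisfies the explicit stability condition

u = np.zeros(n)
u[n // 2] = 1.0                   # initial condition: a spike in the middle cell

for _ in range(1000):
    # second difference approximates d^2u/dx^2 on the lattice
    lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx**2
    u = u + dt * kappa * lap      # forward-Euler time step
```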

The simulator as a function

The simulator is implemented as a piece of computer code that can be evaluated to produce a collection of outputs, normally written to file. In the code itself, or in files that are read by the code, we find all of the numbers that are required before the code will run. Typically these comprise (a) coefficients in the underlying model; and possibly also (b) initial conditions and (c) forcing functions. It is natural to see the simulator as a deterministic function that maps these inputs into a collection of outputs. As an aside, this notion can be extended to stochastic simulators, if we think of one of the inputs as being the seed to a random number generator.

Seeing the simulator in this way, it is common to refer to the collection of inputs as x, the simulator itself as f, and the resulting output as f(x). Both x and f(x) are vector quantities, and they can be very large collections of values, often indexed by space, or by time, or by both space and time.
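In code, this amounts to wrapping the whole simulator behind a single function from an input vector to an output vector. The sketch below is purely schematic: the body of f is a stand-in for a real simulator, and the seed argument shows how a stochastic simulator can be made deterministic in the extended input (x, seed).

```python
import numpy as np

def f(x, seed=None):
    """Treat the simulator as a function: x is the full input vector
    (model coefficients, initial conditions, forcing parameters) and the
    return value is the output vector f(x). The body is a placeholder
    for a real simulator code."""
    rng = np.random.default_rng(seed)
    # ... the real model/treatment/solver would be run here ...
    return np.array([np.sum(x), np.prod(x)]) + rng.normal(0.0, 1e-6, size=2)

x = np.array([0.3, 1.2, 5.0])   # an input vector
y = f(x, seed=0)                # same x and seed -> same output
```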

It is worth noting here that the simulator itself is a worthy object of inference. Many simulators embody, both formally and informally, the expertise of large sections of the relevant scientific community and as such are interesting objects in their own right.

Although f(\cdot) is known in principle, in practice it is not: many simulators comprise tens of thousands of lines of high-level computer code, which is not accessible to intuition. As discussed in the next section, without actually running the code it is impossible to predict exactly what the outputs will be.

Such a view lends itself to a Bayesian analysis, in which f(\cdot) is treated as a random function, and the set of simulator runs as observations. The next section implicitly treats f(\cdot) as a random function about which inferences may be made using the Bayesian paradigm.
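One common way to treat f(\cdot) as a random function is to fit a Gaussian-process emulator to a small number of simulator runs, which then predicts (with uncertainty) the output at untried inputs. The sketch below assumes scikit-learn is available and uses a toy one-dimensional function in place of an expensive simulator; it is an illustration of the idea, not a full Bayesian analysis.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def f(x):
    return np.sin(3 * x) + 0.5 * x            # toy stand-in for a simulator

X_design = np.linspace(0.0, 2.0, 6).reshape(-1, 1)   # a small design of runs
y_runs = f(X_design).ravel()                          # the "observations"

# Fit a Gaussian process to the runs, treating f as a random function.
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=0.5),
                              normalize_y=True)
gp.fit(X_design, y_runs)

X_new = np.linspace(0.0, 2.0, 50).reshape(-1, 1)
mean, sd = gp.predict(X_new, return_std=True)  # posterior mean and uncertainty
```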

Sources of uncertainty

There are four different sources of uncertainty in a computer experiment, which are discussed in turn.

Uncertainty about the simulator behaviour

We are uncertain about what would happen if we evaluated the simulator at a particular input value x; that is, until we actually make this evaluation, the outcome f(x) is uncertain. This is because the simulator is usually sufficiently complicated that we cannot know f(x) in the same way that we know x1 + x2 for given values of x1 and x2. Where the simulator is cheap to evaluate and the number of components in x is quite small this is not usually a problem, as we can either evaluate the simulator at little cost, or find an input close by at which we already have an evaluation. In this case our uncertainty about f(x) for any given x is not really an issue. But where the simulator is expensive to evaluate or there are lots of components in x, our typical state is to be uncertain about f(x), and this uncertainty can be quite large.

Uncertainty about the 'correct' simulator input

There are two reasons why we might be uncertain about the 'correct' simulator input. First, we may not have a precise measurement: this applies to components of x which have an analogue in the system. These measurable inputs would typically include initial conditions, since these are starting values of the state vector, and the state vector is typically a quantity with an analogue in the system, like water temperature. Many forcing functions are also measurable. But just because an input is measurable does not mean that we actually have the measurements, or that any measurements we do have are free of error.

Second, some components of x may not correspond to any measurable quantity in the system. These tuning inputs are typically found in the model parameters, in places where the model has been simplified, or where processes that are not completely understood have been represented by flexible but non-physical sub-models. These tuning inputs tend to represent quite general concepts, often highly aggregated.

For example, in a hydrocarbon reservoir it is common to parameterise each fault with a "transmissibility". We know that the nature of a fault varies at the microscale, so a single number is obviously a gross simplification, but a reasonable one if the treatment has 100 faults. But what is the right value for "transmissibility" in this case? Exactly the same problem applies to rock permeability when averaged over regions to give an aggregated "permeability". These are two examples where simplifying the model can lead to problems with the definitions of some of the input components, resulting in uncertainty about the 'correct' value. We cannot even be sure that there is a correct value, hence the quotation marks around 'correct'.
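Uncertainty about such tuning inputs is often addressed by calibration: searching for input values that make the simulator output match data from the system. The sketch below shows a minimal least-squares version of this idea; the one-parameter "simulator" and the measurements are toy stand-ins, not a real reservoir model, and a single best-fitting value says nothing by itself about the remaining uncertainty.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def simulator(transmissibility, times):
    """Toy stand-in for a reservoir simulator with one tuning input."""
    return np.exp(-transmissibility * times)

times = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([0.62, 0.37, 0.22, 0.14])   # made-up system measurements

def mismatch(t):
    return np.sum((simulator(t, times) - observed) ** 2)

result = minimize_scalar(mismatch, bounds=(0.0, 5.0), method="bounded")
t_hat = result.x   # 'best-fitting' tuning value under this toy setup
```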


Uncertainty about the simulator output and the system


Uncertainty about measurements on the system

[edit] The "traditional" approach

The probabilistic or Bayesian approach

Challenges in large computer experiments

Further reading

Santner, Thomas J.; Williams, Brian J.; Notz, William I. (2003). The Design and Analysis of Computer Experiments. Springer. ISBN 0387954201.