Biostatistics

Biostatistics is the application of statistics to a wide range of topics in biology. The science of biostatistics encompasses the design of biological experiments, especially in medicine, pharmacy, agriculture and fishery; the collection, summarization, and analysis of data from those experiments; and the interpretation of, and inference from, the results. A major branch of this is medical biostatistics,^[1] which is exclusively concerned with medicine and health.

History

Biostatistical modeling forms an important part of numerous modern biological theories. In the early 1900s, after the rediscovery of Gregor Mendel's work on inheritance, the gaps in understanding between genetics and evolutionary Darwinism led to vigorous debate between biometricians, such as Walter Weldon and Karl Pearson, and Mendelians, such as Charles Davenport, William Bateson and Wilhelm Johannsen. By the 1930s, statisticians and models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.

The leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.

Ronald Fisher developed several basic statistical methods in support of his work on the field experiments at Rothamsted Research, and applied statistical reasoning to genetics in his 1930 book The Genetical Theory of Natural Selection.
Sewall G. Wright developed F-statistics and methods of computing them.
J. B. S. Haldane's book, The Causes of Evolution, reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics.

These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled.

In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study.

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."^[2]

Recent developments

Recent developments have had a large impact on biostatistics. Two important changes are the ability to collect data on a high-throughput scale and the ability to perform much more complex analyses using computational techniques.

Use in high-throughput data

New biomedical technologies like microarrays, next generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.^[3] Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.^[4]
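When thousands of genes are tested at once, some control for multiple comparisons is needed before any gene is declared differentially expressed. The sketch below uses the Benjamini–Hochberg step-up procedure, which the text above does not name; it is shown here as one common choice for controlling the false discovery rate. The function name, the significance level and the p-values are illustrative assumptions, not data from any experiment.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one True/False flag per test: True means 'reject the null'.

    Step-up rule: find the largest rank k (1-based, p-values sorted ascending)
    with p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values (made up for illustration).
gene_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                 0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(gene_p_values, alpha=0.05)
```

Several of these p-values fall below 0.05 when taken one at a time, yet the procedure flags fewer genes than that, because the threshold for each rank is scaled by the number of tests.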

Multicollinearity often occurs in high-throughput biostatistical settings. Because the predictors (such as gene expression levels) are highly intercorrelated, the information in one predictor may be contained in another. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply dimension reduction (for example via principal component analysis). Classical statistical techniques such as linear or logistic regression and linear discriminant analysis do not work well for high-dimensional data, i.e. when the number of observations n is smaller than the number of features or predictors p (n < p). In that setting one can obtain quite high R² values on the data used for fitting even though the model has very low predictive power. These classical techniques (especially least squares linear regression) were developed for low-dimensional data, i.e. where the number of observations is much larger than the number of predictors (n >> p). In cases of high dimensionality, one should therefore always evaluate the model on an independent validation set and report the residual sum of squares (RSS) and R² of that validation set, not those of the training set.
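The gap between training and validation R² when n < p can be illustrated with simulated data. In the sketch below, the response and all predictors are independent noise, so no predictor carries any real information. The sizes n = 20 and p = 50, the helper names and the choice to fit on only the first n predictors (which already yields a least squares solution with zero training RSS) are assumptions made for illustration.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def predict(X, beta):
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - rss / tss


rng = random.Random(0)
n, p, n_valid = 20, 50, 200

# Predictors and response are pure noise: there is nothing to learn.
X_train = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y_train = [rng.gauss(0, 1) for _ in range(n)]
X_valid = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n_valid)]
y_valid = [rng.gauss(0, 1) for _ in range(n_valid)]

# With p >= n, the first n predictors alone reproduce y_train exactly;
# the remaining coefficients are set to zero.
beta = solve([row[:n] for row in X_train], y_train) + [0.0] * (p - n)

r2_train = r_squared(y_train, predict(X_train, beta))
r2_valid = r_squared(y_valid, predict(X_valid, beta))
```

Here `r2_train` is essentially 1 because the fit interpolates the training data, while `r2_valid` is computed on observations the model has never seen and reflects the absence of any real relationship.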

Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes. These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: a single gene is more likely to be falsely identified as perturbed than a whole pathway is. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.
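The idea of scoring a whole gene set can be sketched with a simplified, unweighted running-sum statistic in the spirit of GSEA. This is not the full published method, which weights each step by the gene's correlation with the phenotype and normalizes scores across gene sets. The gene names, the set membership and the gene-label permutation scheme below are hypothetical and chosen only for illustration.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked gene list.

    Walk down the ranking, stepping up at members of the set and down
    otherwise; return the running sum's largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def permutation_p_value(ranked_genes, gene_set, n_perm=2000, seed=0):
    """Estimate how often a random ranking scores at least as extremely."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    shuffled = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= observed - 1e-12:
            count += 1
    return (count + 1) / (n_perm + 1)


# Hypothetical ranking (most to least differentially expressed) and pathway.
ranked = [f"gene{i}" for i in range(20)]
pathway = {"gene0", "gene1", "gene2", "gene3"}

es = enrichment_score(ranked, pathway)
p_value = permutation_p_value(ranked, pathway)
```

Because all four pathway members sit at the very top of the ranking, the running sum reaches its maximum possible value, and very few random rankings match it.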

Use of computationally intensive methods

The advent of modern computer technology and relatively cheap computing resources has also enabled computer-intensive biostatistical methods such as bootstrapping and other resampling methods.
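As a minimal sketch of bootstrapping, the example below computes a percentile bootstrap confidence interval for a sample mean. The measurements, the number of resamples and the function name are assumptions for illustration. The percentile interval is only one of several bootstrap interval constructions.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Resample the data with replacement, recompute the statistic each time,
    and read the interval off the sorted bootstrap estimates.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical measurements (made up for illustration).
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 6.3, 4.4, 5.7, 5.0, 5.2]
ci_low, ci_high = bootstrap_ci(sample)
```

The same function works for statistics with no convenient closed-form standard error, for example by passing `stat=statistics.median`.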

In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Individual decision trees have the advantage that they can be drawn and interpreted even with a very basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.
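The construction of a panel of trees can be sketched with a deliberately reduced forest in which every tree has depth one (a "stump"). Each stump is trained on a bootstrap sample using one randomly chosen feature, and the forest predicts by majority vote. Real random forests grow much deeper trees and draw a random subset of features at every split. The data, labels and function names below are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Best single-threshold split on one feature, or None if no split exists."""
    values = sorted({r[feature] for r in rows})
    best = None
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2  # candidate threshold: midpoint between neighbours
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, t, l_lab, r_lab)
    return best


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feature = rng.randrange(n_features)         # random feature for this tree
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        if stump is not None:
            forest.append(stump[1:])                # keep (feature, t, left, right)
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-marker measurements for eight patients.
rows = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [1.1, 6.0],
        [3.0, 9.5], [3.2, 8.5], [2.9, 9.0], [3.1, 10.2]]
labels = ["healthy"] * 4 + ["diseased"] * 4

forest = fit_forest(rows, labels)
```

Each element of `forest` is a readable rule of the form (feature index, threshold, label if below, label if above), which illustrates why single trees are easy to inspect even though the vote of the whole panel is less so.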

Applications

Public health, including epidemiology, health services research, nutrition, environmental health, and healthcare policy and management.
Design and analysis of clinical trials in medicine.
Assessment of the severity of a patient's condition, with prognosis of the outcome of a disease.
Population genetics and statistical genetics, in order to link variation in genotype with variation in phenotype. This has been used in agriculture to improve crops and farm animals (animal breeding). In biomedical research, this work can assist in finding candidate gene alleles that can cause or influence predisposition to disease in human genetics.
Analysis of genomics data, for example from microarray or proteomics experiments,^[5]^[6] often concerning diseases or disease stages.^[7]
Ecology and ecological forecasting.
Biological sequence analysis.^[8]
Systems biology, for gene network inference or pathway analysis.^[9]
Population dynamics, especially with regard to fisheries science.

Scope and training programs

Almost all educational programmes in biostatistics are at postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments often host theoretical or methodological research, which is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas, such as industry (quality control), business and economics, and biological areas other than medicine.

References

[1] Abhaya Indrayan (2012). Medical Biostatistics. CRC Press. ISBN 978-1-4398-8414-0.
[2] Charles T. Munger (2003-10-03). "Academic Economics: Strengths and Faults After Considering Interdisciplinary Needs" (PDF).
[3] Hayden, Erika Check (8 February 2012). "Biostatistics: Revealing analysis". Nature. 482 (7384): 263–265. doi:10.1038/nj7384-263a.
[4] Efron, Bradley (February 2008). "Microarrays, Empirical Bayes and the Two-Groups Model". Statistical Science. 23 (1): 1–22. arXiv:0808.0572. doi:10.1214/07-STS236.
[5] Helen Causton; John Quackenbush; Alvis Brazma (2003). Statistical Analysis of Gene Expression Microarray Data. Wiley-Blackwell.
[6] Terry Speed (2003). Microarray Gene Expression Data Analysis: A Beginner's Guide. Chapman & Hall/CRC.
[7] Frank Emmert-Streib; Matthias Dehmer (2010). Medical Biostatistics for Complex Diseases. Wiley-Blackwell. ISBN 3-527-32585-9.
[8] Warren J. Ewens; Gregory R. Grant (2004). Statistical Methods in Bioinformatics: An Introduction. Springer.
[9] Matthias Dehmer; Frank Emmert-Streib; Armin Graber; Armindo Salvador (2011). Applied Statistics for Network Biology: Methods in Systems Biology. Wiley-Blackwell. ISBN 3-527-32750-9.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
