Biological network inference

Biological network inference is the process of making inferences and predictions about biological networks.

Biological networks

In a topological sense, a network is a set of nodes and a set of directed or undirected edges between the nodes. Many types of biological networks exist, including transcriptional, signalling and metabolic networks. Few such networks are known in anything approaching their complete structure, even in the simplest bacteria. Still less is known about the parameters governing the behavior of such networks over time, how the networks at different levels in a cell interact, and how to predict the complete state description of a eukaryotic cell or bacterial organism at a given point in the future. Systems biology, in this sense, is still in its infancy.
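The topological definition above can be sketched as a small data structure. This is only an illustration: the class name `Network`, its methods, and the node labels (`geneA`, `protX`, and so on) are hypothetical and do not come from any particular library.

```python
from collections import defaultdict


class Network:
    """A set of nodes plus directed or undirected edges between them."""

    def __init__(self):
        self.nodes = set()
        self.successors = defaultdict(set)

    def add_edge(self, source, target, directed=True):
        self.nodes.update((source, target))
        self.successors[source].add(target)
        if not directed:
            # An undirected edge is stored as two opposite directed edges.
            self.successors[target].add(source)

    def neighbours(self, node):
        # Nodes reachable from `node` by following a single edge.
        return set(self.successors.get(node, set()))


net = Network()
net.add_edge("geneA", "geneB")                  # directed: geneA -> geneB
net.add_edge("protX", "protY", directed=False)  # undirected: protX - protY
```

Storing an undirected edge as a pair of directed edges keeps a single lookup path for both edge types, at the cost of no longer distinguishing an undirected edge from two reciprocal directed ones.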

There is great interest in network medicine for the modelling of biological systems. This article focuses on a necessary prerequisite to dynamic modeling of a network: inference of the topology, that is, prediction of the "wiring diagram" of the network. More specifically, we focus here on inference of biological network structure using the growing sets of high-throughput expression data for genes, proteins, and metabolites. Briefly, methods using high-throughput data for inference of regulatory networks rely on searching for patterns of partial correlation or conditional probabilities that indicate causal influence.^[1]^[2] Such patterns of partial correlation found in the high-throughput data, possibly combined with other supplemental data on the genes or proteins in the proposed networks, or with other information on the organism, form the basis upon which such algorithms work. Such algorithms can be of use in inferring the topology of any network where a change in the state of one node can affect the state of other nodes.
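The role of partial correlation can be illustrated with a minimal sketch on simulated data. It assumes three hypothetical variables in which `a` drives both `b` and `c`, and uses the standard first-order partial correlation formula; the helper names `pearson` and `partial_corr` are introduced here for illustration only. Real inference methods work on many genes at once and are considerably more involved.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Simulated data (an assumption for illustration): a influences b and c,
# and there is no direct link between b and c.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]
b = [v + rng.gauss(0, 0.5) for v in a]
c = [v + rng.gauss(0, 0.5) for v in a]

r_bc = pearson(b, c)                # large: b and c share the driver a
r_bc_given_a = partial_corr(b, c, a)  # near zero once a is accounted for
```

In this toy setting the marginal correlation between `b` and `c` would suggest an edge, whereas the partial correlation given `a` does not, which is the kind of pattern such algorithms use to separate direct from indirect influence.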

Transcriptional regulatory networks

In a transcriptional regulatory network, genes are the nodes and the edges are directed. A gene serves as the source of a direct regulatory edge to a target gene by producing an RNA or protein molecule that functions as a transcriptional activator or inhibitor of the target gene. If the gene is an activator, then it is the source of a positive regulatory connection; if an inhibitor, then it is the source of a negative regulatory connection. Computational algorithms take as primary input measurements of mRNA expression levels of the genes under consideration for inclusion in the network, and return an estimate of the network topology. Such algorithms are typically based on linearity, independence or normality assumptions, which must be verified on a case-by-case basis.^[3]

Clustering or some form of statistical classification is typically employed to perform an initial organization of the high-throughput mRNA expression values derived from microarray experiments, in particular to select sets of genes as candidates for network nodes.^[4] The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification, for example to classify subtypes of cancer or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, that is, to define more precisely the influence of each gene on the others, the scientist typically attempts to reconstruct the transcriptional regulatory network. This can be done by data integration in dynamic models supported by background literature or information in public databases, combined with the clustering results.^[5] The modelling can be done by a Boolean network, by ordinary differential equations or linear regression models (e.g. least-angle regression), by a Bayesian network, or by approaches based on information theory.^[6] For instance, it can be done by applying a correlation-based inference algorithm, as will be discussed below, an approach that has had increasing success as the size of the available microarray data sets has grown.^[1]^[7]^[8]
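Of the modelling formalisms listed above, the Boolean network is the simplest to sketch. The example below is a toy model under stated assumptions: the three genes `A`, `B` and `C`, their update rules, and the synchronous update scheme are all hypothetical choices made for illustration, not taken from the cited literature.

```python
def step(state, rules):
    """Synchronous update: every gene is recomputed from the previous state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory


# Hypothetical wiring: A is an external input that keeps its value,
# A activates B (positive edge), and B inhibits C (negative edge).
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: not s["B"],
}

trajectory = simulate({"A": True, "B": False, "C": False}, rules, 3)
```

Each gene is either on or off, and the sign of a regulatory edge appears as the presence or absence of negation in the target's rule. Inference in this formalism amounts to searching for rules whose simulated trajectories agree with discretized expression data.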

Signal transduction

Signal transduction networks are very important in the biology of cancer. Proteins are the nodes, and directed edges represent interactions in which the biochemical conformation of the child is modified by the action of the parent (e.g. mediated by phosphorylation, ubiquitylation or methylation). Primary input into the inference algorithm would be data from a set of experiments measuring protein activation and inactivation (e.g. phosphorylation and dephosphorylation) across a set of proteins. Inference for such signalling networks is complicated by the fact that total concentrations of signalling proteins fluctuate over time due to transcriptional and translational regulation. Such variation can lead to statistical confounding. Accordingly, more sophisticated statistical techniques must be applied to analyse such datasets.^[9]
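The confounding effect described above can be reproduced in a small simulation. The sketch assumes two hypothetical proteins, X and Y, whose total abundances follow a shared expression programme while their phosphorylated fractions are independent. Dividing the phosphorylated level by the total level is used here only as a naive illustration of the problem; it is not the technique of the cited work.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
n = 2000

# Assumption: both total abundances track one shared expression programme.
shared = [rng.uniform(0.5, 2.0) for _ in range(n)]
total_x = [s * rng.uniform(0.9, 1.1) for s in shared]
total_y = [s * rng.uniform(0.9, 1.1) for s in shared]

# Assumption: the phosphorylated fractions are independent (no true edge).
frac_x = [rng.uniform(0.2, 0.8) for _ in range(n)]
frac_y = [rng.uniform(0.2, 0.8) for _ in range(n)]

# Measured phosphoprotein levels are fraction times total abundance.
phospho_x = [t * f for t, f in zip(total_x, frac_x)]
phospho_y = [t * f for t, f in zip(total_y, frac_y)]

# Raw phospho levels correlate through the shared total abundance.
r_raw = pearson(phospho_x, phospho_y)

# Normalising by total abundance removes that spurious association.
norm_x = [p / t for p, t in zip(phospho_x, total_x)]
norm_y = [p / t for p, t in zip(phospho_y, total_y)]
r_norm = pearson(norm_x, norm_y)
```

An algorithm given only the raw phosphoprotein levels could propose an edge between X and Y even though, by construction, neither modifies the other.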

Metabolic

In metabolite networks, metabolites are the nodes and the edges are directed. Primary input into an inference algorithm would be data from a set of experiments measuring metabolite levels.

Protein-protein interaction

Protein-protein interaction networks are also under very active study. However, reconstruction of these networks does not use correlation-based inference in the sense discussed for the networks already described (interaction does not necessarily imply a change in protein state), and a description of such interaction network reconstruction is left to other articles.

References

1. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, The DREAM5 Consortium, Kellis M, Collins JJ, Stolovitzky G (2012). "Wisdom of crowds for robust gene network inference". Nature Methods 9 (8): 796–804. doi:10.1038/nmeth.2016. PMC 3512113. PMID 22796662.
2. Spirtes, P; Glymour, C; Scheines, R (2000). Causation, Prediction, and Search: Adaptive Computation and Machine Learning (2nd ed.). MIT Press.
3. Oates, CJ; Mukherjee, S (2012). "Network Inference and Biological Dynamics". To appear in Ann. Appl. Stat. arXiv:1112.1047. Bibcode:2011arXiv1112.1047O.
4. Guthke, R et al. (2005). "Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection". Bioinformatics 21 (8): 1626–34. doi:10.1093/bioinformatics/bti226. PMID 15613398.
5. Hecker, M et al. (2009). "Gene regulatory network inference: Data integration in dynamic models - A review". Biosystems 96 (1): 86–103. doi:10.1016/j.biosystems.2008.12.004. PMID 19150482.
6. van Someren, E et al. (2002). "Genetic network modeling". Pharmacogenomics 3 (4): 507–525. doi:10.1517/14622416.3.4.507. PMID 12164774.
7. Faith, JJ et al. (2007). "Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles". PLoS Biology 5 (1): 54–66. doi:10.1371/journal.pbio.0050008. PMC 1764438. PMID 17214507.
8. Hayete, B; Gardner, TS; Collins, JJ (2007). "Size matters: network inference tackles the genome scale". Molecular Systems Biology 3 (1): 77. doi:10.1038/msb4100118. PMC 1828748. PMID 17299414.
9. Oates, CJ; Mukherjee, S (2012). "Structural inference using nonlinear dynamics". CRiSM Working Paper 12 (7).