Microarray data analysis based on the Bayesian theorem and the entropy maximization principle
Sungchul Ji (Department of Pharmacology and Toxicology, Rutgers University); Ming Ouyang, Panos Georgopoulos (EOHSI, UMDNJ - R.W. Johnson Medical School and Rutgers University)
Most statistical approaches to analyzing DNA microarray data appear to assume that the primary data, once analyzed, will automatically yield biologically meaningful information. Although this may turn out to be true in some cases, it is more likely that, in the majority of cases, the complexity of microarray data will invalidate such a purely inductive method of analysis. A more realistic approach may combine the inductive and deductive approaches, thereby capitalizing not only on the raw microarray data (D) but also on a body of established prior information or knowledge in the field concerned (I) and a set of hypotheses to be tested (H). A mathematical rationale for combining these three elements in statistical analysis in general has been worked out by E. T. Jaynes [J. Skilling, ed., "Maximum Entropy and Bayesian Methods," Kluwer Academic Publishers, Dordrecht, 1988]. Bayes' theorem enables us to update the prior probability p(H|I) of the numerical representation of a hypothesis H to the posterior probability p(H|DI) by taking into account both the microarray data D and the established knowledge in the field I (e.g., known mechanisms of the cell cycle and apoptosis). By applying the maximum entropy principle, we can assign a probability distribution (p1, ..., pn) on the hypothesis space, populated by H, chosen so that it maximizes the Shannon entropy S "subject to constraints that express properties we wish the distribution to have, but are not sufficient to determine it" [E. T. Jaynes on p. 26 in Skilling, op. cit.].
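The two operations described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the discrete hypothesis space, the numerical likelihoods, and the choice of a single mean-value constraint are all assumptions made for illustration. The first function applies Bayes' theorem, p(H_i|DI) = p(H_i|I) p(D|H_i I) / p(D|I); the last finds the distribution that maximizes the Shannon entropy S = -sum_i p_i ln p_i subject to a prescribed mean, which has the exponential form p_i proportional to exp(lambda * x_i).

```python
import math


def bayes_update(prior, likelihood):
    """Posterior p(H_i|D,I) from prior p(H_i|I) and likelihood p(D|H_i,I)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # p(D|I), the normalizing constant
    return [j / evidence for j in joint]


def shannon_entropy(p):
    """Shannon entropy S = -sum_i p_i ln p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy distribution over `values` with a fixed mean.

    The solution has the form p_i ~ exp(lam * x_i); lam is found by
    bisection, since the mean increases monotonically with lam.
    Assumes min(values) < target_mean < max(values).
    """
    def dist(lam):
        shift = max(lam * v for v in values)  # avoid overflow in exp
        w = [math.exp(lam * v - shift) for v in values]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(p * v for p, v in zip(dist(lam), values))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


# Hypothetical numbers: two rival hypotheses, equal priors, and data
# that are four times more probable under the first hypothesis.
posterior = bayes_update([0.5, 0.5], [0.8, 0.2])

# Hypothetical constraint: three outcomes labelled 1, 2, 3 with mean 2.5.
p_maxent = maxent_with_mean([1, 2, 3], 2.5)
```

With no constraint beyond normalization, or with a mean equal to the midpoint of symmetric values, the maximum entropy assignment reduces to the uniform distribution; any additional constraint lowers the entropy below ln n.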
The maximum entropy and Bayesian methods may allow us to discriminate between two hypotheses, one generated on the basis of the conformon theory [S. Ji, BioSystems 54: 107-130 (2000)] and the other on the basis of the cell language theory [S. Ji, BioSystems 44: 17-39 (1997)], about the possible molecular mechanisms underlying a recently observed phenomenon: singular value decomposition (SVD) analyses of time-series microarray data measured in organisms from yeast to human cells reveal only two or three fundamental patterns of gene expression change, called "characteristic modes" [N. S. Holter et al., PNAS 97 (15): 8409-8414 (2000)].
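As a rough illustration of how characteristic modes can be extracted by SVD, the sketch below decomposes a genes-by-time-points expression matrix with NumPy and reports the fraction of the total squared signal carried by the leading modes. The data are synthetic and the function name `characteristic_modes` is hypothetical; any preprocessing applied in the cited study is not reproduced here.

```python
import numpy as np


def characteristic_modes(expr, n_modes=2):
    """Leading temporal patterns of a genes x time-points matrix.

    Returns the first `n_modes` right singular vectors (one row per
    mode, one column per time point) and the fraction of the total
    squared singular values that each mode accounts for.
    """
    u, s, vt = np.linalg.svd(expr, full_matrices=False)
    fraction = s**2 / np.sum(s**2)
    return vt[:n_modes], fraction[:n_modes]


# Synthetic example (an assumption, not real microarray data):
# 200 genes, 12 time points, built from two temporal patterns plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
patterns = np.vstack([np.sin(t), np.cos(t)])       # 2 x 12
weights = rng.normal(size=(200, 2))                # per-gene loadings
expr = weights @ patterns + 0.05 * rng.normal(size=(200, 12))

modes, fraction = characteristic_modes(expr, n_modes=2)
print("fraction of signal in the two leading modes:", fraction.sum())
```

In this synthetic case the two leading modes capture nearly all of the signal by construction; with real data, the number of modes worth retaining would have to be judged from the singular value spectrum.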