Motivation: Individual microarray studies searching for prognostic biomarkers often have few samples and low statistical power; however, publicly accessible data sets make it possible to combine data across studies.

Method: We present a novel approach for combining microarray data across institutions and platforms. We introduce a new algorithm, robust greedy feature selection (RGFS), to select predictive genes.

Results: We combined two prostate cancer microarray data sets, confirmed the appropriateness of the approach with the Kolmogorov-Smirnov goodness-of-fit test, and built several predictive models.
The best logistic regression model with stepwise forward selection used seven genes and had a misclassification rate of 31%. Models that combined LDA with different feature selection algorithms had misclassification rates between 19% and 33%, and the sets of genes in the models varied substantially during cross-validation. When we combined RGFS with LDA, the best model used two genes and had a misclassification rate of 15%.

Availability: Affymetrix U95Av2 array data are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.
The cDNA microarray data are available through the Stanford Microarray Database (http://cmgm.stanford.edu/pbrown/). GeneLink software is freely available at http://bioinformatics.mdanderson.org/GeneLink/. DNA-Chip Analyzer software is publicly available at http://biosun1.harvard.edu/complab/dchip/.