A hybrid support vector machine strategy for ranking SNPs in genome-wide association studies

Abstract: In genome-wide association studies we wish to rank SNPs so that truly associated ones are placed higher than false ones. The support vector machine (SVM) provides a discriminative alternative to the widely used chi-square statistic. We propose a hybrid strategy that combines the chi-square statistic with the SVM and study its performance on simulated data and on the Wellcome Trust Case Control Consortium (WTCCC) studies. We show that our strategy ranks causal SNPs in simulated data significantly higher than either the chi-square test or the SVM alone. We also show that our strategy ranks previously replicated SNPs and associated regions (where applicable) of type 1 diabetes, rheumatoid arthritis, and Crohn's disease higher than the chi-square, SVM, SVM-RFE, and HMM SNP rankings. In WTCCC studies with low signal strength, such as type 2 diabetes, our method offers no advantage. Finally, we show that for type 1 diabetes and rheumatoid arthritis our strategy yields an economical set of SNPs that predict disease risk more accurately than previously replicated SNPs and than the top-ranked SNPs in the chi-square and SVM rankings, as measured by the area under the curve of the widely used composite odds ratio score.

U. Roshan, S. Chikkagoudar, Z. Wei, K. Wang, H. Hakonarson, A hybrid support vector machine strategy for ranking SNPs in genome-wide association studies. Submitted.

Supplementary material: PDF