Shibin Qiu, Terran Lane, Ljubomir Buturovic
String kernels model sequence similarity directly, without first extracting numerical features in a vector space. Because they capture complex sequence traits more faithfully, string kernels often achieve better prediction performance. RNA interference is an important biological mechanism with many therapeutic applications. Strings can represent both the target messenger RNAs and the initiating short RNAs, so string kernels can be applied for learning and prediction. However, existing string kernels were not designed specifically for RNA applications. Moreover, most existing string kernels are n-gram based and suffer from high dimensionality and an inability to preserve subsequence orderings. We propose a randomized string kernel for use with support vector regression, with the aim of better predicting silencing efficacy scores for candidate sequences and ultimately improving the efficiency of biological experiments. We show that this kernel is positive definite and analyze its randomization error rates. Empirical results on biological data demonstrate that the proposed kernel performed better than existing string kernels and achieved significant improvements over kernels computed from numerical descriptors extracted according to structural and thermodynamic rules. In addition, it is computationally more efficient.
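To make the n-gram baseline mentioned above concrete, here is a minimal sketch of a plain n-gram (spectrum) kernel in Python. It is not the randomized kernel proposed in the paper, whose construction is not given in this abstract. The function names and the example sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count every contiguous length-n substring (n-gram) of seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def spectrum_kernel(s, t, n=3):
    """Inner product of the n-gram count vectors of s and t.

    This is a generic n-gram kernel used as an illustration,
    not the randomized string kernel described in the abstract.
    """
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    # Counter returns 0 for n-grams absent from t, so only shared n-grams contribute.
    return sum(count * ct[gram] for gram, count in cs.items())


# Hypothetical short RNA fragments, for illustration only.
a = "AUGGCUAUG"
b = "GCUAUGGCU"
print(spectrum_kernel(a, b, n=3))

# Size of the implicit feature space over the alphabet {A, C, G, U}.
print(4 ** 3)
```

The implicit feature space has 4^n dimensions for an RNA alphabet, which is the dimensionality problem the abstract refers to. The loss of ordering is easiest to see at n = 1: `spectrum_kernel("AUGC", "CGUA", n=1)` equals `spectrum_kernel("AUGC", "AUGC", n=1)`, since both sequences contain the same symbols and only their order differs.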
Subjects: 12. Machine Learning and Discovery
Submitted: Apr 23, 2007