Time series data abounds in real-world problems, and measuring the similarity of time series is key to solving them. One state-of-the-art measure is the longest common subsequence, which uses the length of the longest common subsequence as an indication of similarity between sequences but ignores the information contained in the second-longest, third-longest, and all shorter common subsequences. To capture the common information in sequences maximally, we propose a novel measure of sequence similarity: the number of all common subsequences. We show that this measure satisfies the common properties of similarity functions. Calculating it is not trivial, as a brute-force approach takes exponential time. We present a novel dynamic programming algorithm that calculates this number in polynomial time. We also suggest a different way of extending a class of such measures to multidimensional, real-valued time series, in the spirit of probabilistic metric spaces. We conducted an experimental study of the new similarity measure and the extension method for classification, and found that both are consistently competitive.
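To make the contrast between brute-force and polynomial-time counting concrete, the following is a minimal sketch of one textbook dynamic program for counting common subsequences of two discrete sequences. It is an illustration under stated assumptions, not necessarily the algorithm presented in the paper: it counts matched index embeddings (so a subsequence that can be aligned in several ways is counted once per alignment, and the empty subsequence is included), whereas the paper's measure may count distinct subsequences or normalize the result. The function names `count_common_embeddings` and `brute_force_count` are hypothetical.

```python
from itertools import combinations


def count_common_embeddings(a, b):
    """Count pairs of equal subsequences of a and b, by index embedding.

    Assumption: multiplicity is kept (each alignment counts once) and the
    empty subsequence is included. Runs in O(len(a) * len(b)) time.
    """
    n, m = len(a), len(b)
    # dp[i][j] = number of matched embeddings between a[:i] and b[:j].
    # With an empty prefix only the empty subsequence matches, hence 1.
    dp = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over embeddings that skip a[i-1] or b[j-1].
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1] - dp[i - 1][j - 1]
            if a[i - 1] == b[j - 1]:
                # Embeddings that pair a[i-1] with b[j-1] as the last match.
                dp[i][j] += dp[i - 1][j - 1]
    return dp[n][m]


def brute_force_count(a, b):
    """Exponential-time reference: enumerate every pair of index subsets."""
    total = 0
    for k in range(min(len(a), len(b)) + 1):
        for ia in combinations(range(len(a)), k):
            for ib in combinations(range(len(b)), k):
                if all(a[x] == b[y] for x, y in zip(ia, ib)):
                    total += 1
    return total
```

For example, `count_common_embeddings("ab", "ab")` returns 4 (the empty subsequence, "a", "b", and "ab"). The brute-force version enumerates every index subset and is only practical for very short inputs, which is the cost the dynamic program avoids.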
Subjects: 12. Machine Learning and Discovery; 11. Knowledge Representation
Submitted: Oct 10, 2006