# Learn to Compress and Restore Sequential Data

Yi Wang, Jianhua Feng, Shixia Liu

Data compression methods can be classified into two groups: lossless and lossy. The latter usually achieves a higher compression ratio than the former. However, to develop a lossy compression method, we have to know, for a given type of data, what information can be discarded without significantly degrading data quality. Such knowledge is usually obtained by experiment. For example, from user statistics, we know that human eyes are insensitive to some frequency channels of the light signal. Thus we can compress image data by decomposing them into frequency channels using a discrete cosine transform (DCT) and discarding the coefficients of the channels to which human eyes are insensitive. However, it is complex and expensive for human analysts to conduct and study so many experiments. Alternatively, we propose to learn this knowledge automatically using machine learning techniques. Under the framework of Bayesian learning, general prior knowledge is expressed through the design of the statistical model, and the refined posterior knowledge is learned automatically from the data to be compressed. More specifically, we treat the compression of input data as learning a statistical model from the data, and the restoration of data as sampling from the learned model. Therefore, only the estimated model parameters are saved as the compressed version. The key to this idea is to design a statistical model that describes the data accurately (so that the data can be recovered precisely) and is defined by a compact set of parameters (so as to achieve a high compression ratio). For the general application of compressing sequential data, we designed the Variable-length Hidden Markov Model (VLHMM), whose learning algorithm automatically learns a minimal set of parameters (by optimizing a Minimum-Entropy criterion) that accurately models the sequential data (by optimizing a Maximum-Likelihood criterion). The self-adaptation ability of the learning algorithm enables VLHMM to accurately model highly varied sequential data. Moreover, as a hidden Markov model, VLHMM is generally applicable to all kinds of sequences, whether discrete or continuous, univariate or multivariate.
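
The idea of compressing by learning a model and restoring by sampling from it can be illustrated with a minimal sketch. This is not the VLHMM: it substitutes a plain first-order Markov chain over discrete symbols, fitted by maximum likelihood (normalized transition counts), and it has no Minimum-Entropy criterion. The function names `compress` and `restore` and the layout of the stored model are assumptions made for this illustration only.

```python
import random
from collections import Counter, defaultdict


def compress(sequence):
    """Fit a first-order Markov chain (a stand-in for VLHMM) by maximum likelihood.

    Only the returned parameters would be stored as the compressed version.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    # Normalize each row of counts into transition probabilities.
    transitions = {
        prev: {sym: n / sum(row.values()) for sym, n in row.items()}
        for prev, row in counts.items()
    }
    return {"start": sequence[0], "length": len(sequence), "transitions": transitions}


def restore(model, seed=0):
    """Restore a sequence by sampling from the learned model."""
    rng = random.Random(seed)
    out = [model["start"]]
    while len(out) < model["length"]:
        row = model["transitions"].get(out[-1])
        if row is None:  # this symbol was never followed by another one
            break
        symbols, probs = zip(*row.items())
        out.append(rng.choices(symbols, weights=probs, k=1)[0])
    return out


if __name__ == "__main__":
    data = list("abababab")
    model = compress(data)
    print(model["transitions"])
    print("".join(restore(model)))
```

For a perfectly regular input such as `abababab`, every transition has probability 1, so sampling reproduces the input exactly. For less regular data, a sampled sequence only shares the statistics of the original, which is what makes the scheme lossy and why the choice of model matters.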

Subjects: 12. Machine Learning and Discovery; 1.10 Information Retrieval

Submitted: Apr 9, 2007