Lyle H. Ungar and Dean P. Foster
Grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and I have liked many of the same movies, then I will probably enjoy other movies that you like. Recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, CDs, movies, etc., but does not always work well. Because data are always sparse - any given person has seen only a small fraction of all movies - much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. Finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and visa versa. We present a formal statistical model of collaborative filtering, and compare different algorithms for estimating the model parameters including variations of K-means clustering and Gibbs Sampling. This formal model is easily extended to handle clustering of objects with multiple attributes.