Ross Messing, Christopher Pal
Investigations of human perception have shown that non-local spatio-temporal information is critical and often sufficient for activity recognition. However, many recent activity recognition systems have been largely based on local space-time features and statistical techniques inspired by object recognition research. We develop a new set of statistical models for feature velocity dynamics capable of representing the long-term motion of features. We show that these models can be used to effectively disambiguate behaviors in video, particularly when extended to include information not captured by motion, such as position and appearance. We demonstrate performance surpassing, and in some cases doubling, the accuracy of a state-of-the-art approach based on local features. We expect that long-range temporal information will become more important as technology makes longer, higher-resolution videos commonplace.