Darnell Moore, Texas Instruments; Irfan Essa, Georgia Institute of Technology
In this paper, we present techniques for recognizing complex, multitasked activities from video. Visual information like image features and motion appearances, combined with domain-specific information, like object context is used initially to label events. Each action event is represented with a unique symbol, allowing for a sequence of interactions to be described as an ordered symbolic string. Then, a model of stochastic context-free grammar (SCFG), which is developed using underlying rules of an activity, is used to provide the structure for recognizing semantically meaningful behavior over extended periods. Symbolic strings are parsed using the Earley-Stolcke algorithm to determine the most likely semantic derivation for recognition. Parsing substrings allows us to recognize patterns that describe high-level, complex events taking place over segments of the video sequence. We introduce new parsing strategies to enable error detection and recovery in stochastic context-free grammar and methods of quantifying group and individual behavior in activities with separable roles. We show through experiments, with a popular card game, the recognition of high-level narratives of multi-player games and the identification of player strategies and behavior using computer vision.