Asaad Hakeem, Yaser Sheikh, and Mubarak Shah
A representational gap exists between low-level measurements (segmentation, object classification, tracking) and high-level understanding of video sequences. In this paper, we propose a novel representation of events in videos to bridge this gap, based on the CASE representation of natural languages. The proposed representation makes three contributions over existing frameworks. First, we recognize the importance of causal and temporal relationships between sub-events and extend CASE to represent temporal structure and causality between sub-events. Second, to capture both multi-agent and multi-threaded events, we introduce a hierarchical CASE representation of events in terms of sub-events and case-lists. Third, for implementation purposes, we present the concept of a temporal event-tree and pose the problem of event detection as subtree pattern matching. Because CASE is a natural language representation, extending it to events also provides a plausible means of interface between users and the computer. We demonstrate two applications of the proposed event representation: automated annotation of standard meeting video sequences, and event detection in extended videos of railroad crossings.
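To make the idea of event detection as subtree pattern matching concrete, the following is a minimal sketch and not the implementation described in this paper. The `EventNode` class, its `predicate` and `agent` fields, the ordered-children matching rule, and the sub-event names in the example are all hypothetical choices made for illustration; the sketch assumes only that a video is summarized as a tree of sub-events and that a query event is a smaller tree to be located inside it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventNode:
    """Hypothetical tree node: one sub-event with an optional agent."""
    predicate: str
    agent: Optional[str] = None  # in a pattern, None acts as a wildcard
    children: List["EventNode"] = field(default_factory=list)


def _match_children(kids: List[EventNode], pats: List[EventNode]) -> bool:
    """True if the pattern children match an ordered subsequence of kids."""
    if not pats:
        return True
    if not kids:
        return False
    # Either pair the first pattern child with the first kid, or skip that kid.
    if matches_at(kids[0], pats[0]) and _match_children(kids[1:], pats[1:]):
        return True
    return _match_children(kids[1:], pats)


def matches_at(node: EventNode, pattern: EventNode) -> bool:
    """True if the pattern tree matches when rooted at this node."""
    if node.predicate != pattern.predicate:
        return False
    if pattern.agent is not None and node.agent != pattern.agent:
        return False
    return _match_children(node.children, pattern.children)


def find_matches(root: EventNode, pattern: EventNode) -> List[EventNode]:
    """Return every node (in pre-order) at which the pattern matches."""
    hits = []
    stack = [root]
    while stack:
        node = stack.pop()
        if matches_at(node, pattern):
            hits.append(node)
        stack.extend(reversed(node.children))  # keep left-to-right order
    return hits


# Hypothetical event tree for a railroad-crossing video.
video = EventNode("sequence", children=[
    EventNode("approaches", "train", [EventNode("lowers", "gate")]),
    EventNode("approaches", "car", [EventNode("stops", "car")]),
    EventNode("approaches", "car", [EventNode("enters", "car"),
                                    EventNode("exits", "car")]),
])

# Query: a car approaches and then something enters (agent left as wildcard).
pattern = EventNode("approaches", "car", [EventNode("enters")])
hits = find_matches(video, pattern)
```

In this sketch a detection is simply a node of the video tree at which the query tree matches, so `hits` holds the single "approaches" sub-event whose children include "enters". A fuller treatment would attach temporal and causal relations to the edges instead of relying on child order alone.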