Randal Nelson and Yiannis Aloimonos
problem, both from the standpoint of practical applications and as a central issue in attempting to describe the phenomenon of intelligence. On the practical side, a large number of applications would benefit from improved machine ability to analyze activity. The most prominent are various surveillance scenarios. The current emphasis on homeland security has brought this issue to the forefront and has resulted in considerable work, mostly on low-level detection schemes. There are also applications in medical diagnosis and household assistants that, in the long run, may be even more important. In addition, numerous scientific projects, ranging from monitoring of weather conditions to observation of animal behavior, would be facilitated by automatic understanding of activity.
From a scientific standpoint, understanding activity is central to understanding intelligence. Analyzing what is happening in the environment, and acting on the results of that analysis, is to a large extent what natural intelligent systems do, whether they are human or animal. Artificial intelligences, if we want them to work with people in the natural world, will need commensurate abilities.
The importance of the problem has not gone unrecognized. There is a substantial body of work on various components of the problem, most especially on change detection, motion analysis, and tracking. More recently, in the context of surveillance applications, there have been some preliminary efforts to come up with a general ontology of human activity. These efforts have largely been top-down in the classic AI tradition and, as with earlier analogous efforts in areas such as object recognition and scene understanding, have seen limited practical application because of the difficulty of robustly extracting the putative primitives on which the top-down formalism is based.
We propose a novel alternative approach, in which understanding activity is centered on perception and the abstraction of compact representations from that perception. Specifically, a system receives raw sensory input and must base its understanding on information that is actually extractable from these data streams. We will concentrate on video streams, but we presume that auditory, tactile, or proprioceptive streams might be used as well.
There has been significant recent progress in what has been loosely termed image-based object and action recognition. The relevant aspect of the image-based approach is that the primitives that are assembled to produce a percept of an object or an action are extracted from statistical analysis of the perceptual data. We see this feature learning and first-level recognition as representative of the sort of abstraction that is necessary for understanding at all levels. We think that the statistical feature abstraction processes, and the structural grammars that permit them to be assembled in space (for object recognition) and time (for action recognition), can be extended to (1) reduce the human input required and (2) generate a higher level of abstraction. The resulting concept extraction process will produce a compact, extensible representation that enables event-based organization and recall, predictive reasoning, and natural language communication in a system observing activity in natural environments. This process of repeated abstraction and organization will naturally induce a symbolic structure on the observed world.
This approach contrasts with certain classical approaches in which understanding is based on analysis of input that is already in symbolic or linguistic form. Challenging problems can be found in those classical approaches; however, we feel that the state of the art in extracting symbolic descriptions from real-world data remains so primitive that little can be assumed about the information that might be available.
In fact, the extraction of the symbolic description is the primary problem, and the focus should be on making constructive use of what can be extracted, rather than on artificially formal problems constructed about what we hope might be extractable.