Rohini K. Srihari, Debra T. Burhans
This research explores the interaction of textual and photographic information in document understanding. The problem of performing general-purpose vision without a priori knowledge is difficult at best. The use of collateral information in scene understanding has been explored in computer vision systems that use scene context in the task of object identification. The work described here extends this notion by defining visual semantics, a theory of systematically extracting picture-specific information from text accompanying a photograph. Specifically, this paper discusses the multi-stage processing of textual captions with the following objectives: (i) predicting which objects (implicitly or explicitly mentioned in the caption) are present in the picture and (ii) generating constraints useful in locating/identifying these objects. The implementation and use of a lexicon specifically designed for the integration of linguistic and visual information is discussed. Finally, the research described here has been successfully incorporated into PICTION, a caption-based face identification system.