We address the problem of visually detecting causal events and fitting them together into a coherent story of the action witnessed by the camera. We show that this can be done by reasoning about the motions and collisions of surfaces, using high-level causal constraints derived from psychological studies of infant visual behavior. These constraints are naive forms of basic physical laws governing substantiality, contiguity, momentum, and acceleration. We describe two implementations. One system parses instructional videos, extracting plans of action and key frames suitable for storyboarding. Since learning will play a role in making such systems robust, we introduce a new framework for higher-order hidden Markov models and demonstrate its use in a second system that segments stereo video into actions in near real-time. Rather than attempt accurate low-level vision, both systems use high-level causal analysis to integrate fast but sloppy pixel-based representations over time. The output is suitable for summary, indexing, and automated editing.