Diane J. Litman and Rebecca J. Passonneau
The structuring of discourse into multi-utterance segments has been claimed to correlate with linguistic phenomena such as reference, prosody, and the distribution of pauses and cue words. We discuss two methods for developing segmentation algorithms that take advantage of such correlations, both based on analysis of a coded corpus of spoken narratives. The coding includes a linear segmentation derived from an empirical study we conducted previously. In the first method, hand tuning based on analysis of errors guides the development of input features. In the second, we use machine learning techniques to automatically derive algorithms from the same input. The relative performance of the hand-tuned and automatically derived algorithms depends in part on how segment boundaries are defined. Both methods come much closer to human performance than our initial, untuned algorithms.