Applying Statistical Methods to Small Corpora: Benefiting from a Limited Domain

Authors

David Fisher and Ellen Riloff

Abstract:

The application of statistical approaches to problems in natural language processing generally requires large (1,000,000+ words) corpora to produce useful results. In this paper we show that a well-known statistical technique, the t test, can be applied to smaller corpora than was previously thought possible, by relying on semantic features rather than lexical items in a corpus of limited domain. We apply the t test to the problem of resolving relative pronoun antecedents, using collocation frequency data collected from the 500,000 word MUC-4 corpus. We conduct two experiments where t is calculated with lexical items and with semantic feature representations. We show that the test cases that are relevant to the MUC-4 domain produce more significant values of t than the ones that are irrelevant. We also show that the t test correctly resolves the relative pronoun in 91.07% of the relevant test cases where the value of t is significant.
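As a rough illustration of the kind of statistic involved, the sketch below computes a t score for a single collocation from frequency counts. The abstract does not give the formula the authors used, so this follows the textbook formulation for collocation testing (observed co-occurrence rate compared against the rate expected under independence). The function name, parameter names, and counts are hypothetical and are not taken from the paper.

```python
import math


def collocation_t_score(pair_count, count_a, count_b, total):
    """Textbook t score for a co-occurring pair (a, b).

    Assumption: this is the standard one-sample formulation, not
    necessarily the exact variant used in the paper.
    """
    observed = pair_count / total                      # sample mean: observed pair rate
    expected = (count_a / total) * (count_b / total)   # rate expected if a and b are independent
    variance = observed * (1.0 - observed)             # Bernoulli variance of the pair indicator
    if variance == 0.0:
        return 0.0                                     # pair never (or always) observed: no usable signal
    return (observed - expected) / math.sqrt(variance / total)


# Hypothetical counts, for illustration only: a pair seen 30 times
# in 500,000 observations, where a occurs 400 times and b 900 times.
t = collocation_t_score(pair_count=30, count_a=400, count_b=900, total=500_000)
print(f"t = {t:.3f}")
```

The same function could be applied to counts over semantic features instead of lexical items. Pooling words into features raises the individual counts, which is one plausible reading of why a smaller corpus can still yield significant values of t.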
