AAAI Publications, Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence

Font Size: 
Normalizing Microtext
Zhenzhen Xue, Dawei Yin, Brian D. Davison

Last modified: 2011-08-24


The use of computer mediated communication has resulted in a new form of written text--Microtext--which is very different from well-written text. Tweets and SMS messages, which have limited length and may contain misspellings, slang, or abbreviations, are two typical examples of microtext. Microtext poses new challenges to standard natural language processing tools which are usually designed for well-written text. The objective of this work is to normalize microtext, in order to produce text that could be suitable for further treatment.

We propose a normalization approach based on the source channel model, which incorporates four factors, namely an orthographic factor, a phonetic factor, a contextual factor and acronym expansion. Experiments show that our approach can normalize Twitter messages reasonably well, and it outperforms existing algorithms on a public SMS data set.

Full Text: PDF