AAAI Publications, 2013 AAAI Spring Symposium Series

How Much Is Said in a Tweet? A Multilingual, Information-theoretic Perspective
Graham Neubig, Kevin Duh

Last modified: 2013-03-15

Abstract


This paper describes a multilingual study of how much information is contained in a single microblog post from Twitter, covering 26 different languages. To answer this question quantitatively, we take an information-theoretic approach, using entropy as our criterion for quantifying “how much is said” in a tweet. We find that, as expected, languages with larger character sets, such as Chinese and Japanese, contain more information per character than other languages. Somewhat surprisingly, however, information per character is not strongly correlated with information per post, because authors writing in languages with more information per character do not necessarily use all of the space allotted to them. Finally, we examine the relative importance of several factors that contribute to whether a language has more or less information content in each character or post, and we compare the information content of microblog text with that of more traditional text from Wikipedia.
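To make the distinction between information per character and information per post concrete, here is a minimal Python sketch. It is an illustration only and rests on assumptions not stated in the abstract: it uses the empirical unigram character distribution as a stand-in for a language model, and the function names `char_entropy_bits` and `bits_per_post` and the toy corpora are hypothetical. It should not be read as the authors' method.

```python
import math
from collections import Counter


def char_entropy_bits(posts):
    """Empirical unigram entropy, in bits per character, over all posts.

    Assumption: a unigram character model stands in for whatever
    language model a real study would use.
    """
    counts = Counter(ch for post in posts for ch in post)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("posts contain no characters")
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bits_per_post(posts):
    """Average bits per post: bits per character times mean post length."""
    mean_len = sum(len(p) for p in posts) / len(posts)
    return char_entropy_bits(posts) * mean_len


# Hypothetical toy corpora, not real tweets.
small_alphabet = ["abab", "baba"]  # 2 symbols, longer posts
large_alphabet = ["ab", "cd"]      # 4 symbols, shorter posts

for name, corpus in [("small", small_alphabet), ("large", large_alphabet)]:
    print(name, char_entropy_bits(corpus), bits_per_post(corpus))
```

In this toy case the larger character set carries twice the bits per character, yet both corpora carry the same number of bits per post because the posts in the larger character set are half as long. This mirrors the kind of decoupling the abstract describes, under the simplifying assumptions above.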

Keywords


social media; information theory; language model; multilingual
