Brian Lent, Rakesh Agrawal, Ramakrishnan Srikant
We address the problem of discovering trends in text databases. Trends can be used, for example, to discover that a company is shifting its interests from one domain to another. We are given a database D of documents. Each document consists of one or more text fields and a timestamp. The unit of text is a word, and a phrase is a list of words. (We defer the discussion of more complex structures until the "Methodology" section.) Associated with each phrase is a history of the frequency of occurrence of the phrase, obtained by partitioning the documents based on their timestamps. The frequency of occurrence in a particular time period is the number of documents in that period that contain the phrase. (Other measures of frequency are possible, e.g. counting each occurrence of the phrase in a document.) A trend is a specific subsequence of the history of a phrase that satisfies the user's query over the histories. For example, the user may specify a "spike" query to find those phrases whose frequency of occurrence increased and then decreased.
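To make the definitions concrete, the following is a minimal sketch in Python, not the method developed in this paper. It assumes each document is a (period, text) pair whose timestamp has already been mapped to a time period, that words are separated by whitespace, and that a "spike" means a strict rise to a single peak followed by a strict fall over the whole history (a simplification, since a trend is defined above as a subsequence of the history). The names `contains_phrase`, `phrase_history`, and `is_spike` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def contains_phrase(words, phrase):
    """Return True if the word list contains the phrase as consecutive words."""
    n = len(phrase)
    return any(tuple(words[i:i + n]) == phrase
               for i in range(len(words) - n + 1))


def phrase_history(docs, phrase, periods):
    """Count, for each period, the documents that contain the phrase.

    docs    : iterable of (period, text) pairs (assumed input format)
    phrase  : whitespace-separated string, e.g. "data mining"
    periods : ordered list of period labels that defines the history
    """
    target = tuple(phrase.lower().split())
    counts = Counter()
    for period, text in docs:
        # A document counts once, however many times the phrase occurs in it.
        if contains_phrase(text.lower().split(), target):
            counts[period] += 1
    return [counts[p] for p in periods]


def is_spike(history):
    """Assumed 'spike' shape: strictly increasing, then strictly decreasing."""
    if len(history) < 3:
        return False
    peak = history.index(max(history))
    if peak == 0 or peak == len(history) - 1:
        return False
    rising = all(a < b for a, b in zip(history[:peak + 1], history[1:peak + 1]))
    falling = all(a > b for a, b in zip(history[peak:], history[peak + 1:]))
    return rising and falling


# Toy data, invented purely for illustration.
docs = [
    (1990, "data mining is new"),
    (1990, "relational databases"),
    (1991, "data mining grows"),
    (1991, "more data mining and data mining"),
    (1992, "data mining"),
]
history = phrase_history(docs, "data mining", [1990, 1991, 1992])
print(history, is_spike(history))
```

Counting documents rather than occurrences follows the frequency measure defined above; switching to per-occurrence counts would only change the increment inside `phrase_history`.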