Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Mikio Yamamoto∗ Kenneth W. Church† University of Tsukuba AT&T Labs - Research
Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer ngrams. Suffix arrays were first introduced to compute the frequency and location of a substring (ngram) in a sequence (corpus) of length N. To compute frequencies over all N(N +1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all ngrams in two large corpora, an English corpus of 50 million words of Wall Street Journal and a Japanese corpus of 216 million characters of Mainichi Shimbun. The second half of the paper uses these frequencies to find “interesting” substrings. Lexicographers have been interested in ngrams with high Mutual Information (MI) where the joint term frequency is higher than what would be expected by chance (composition- ality). Residual Inverse Document Frequency (RIDF) compares document frequency to a different notion of chance, highlighting technical terminology, names and good keywords for information retrieval. The combination of both MI and RIDF is better than either by itself in a Japanese word identification task.
1 Introduction
Suffix arrays will be used to compute a number of statistics of interest, including term fre- quency and document frequency, for all ngrams in large corpora. Term frequency (tf ) and document frequency (df ), and functions of these quantities such as mutual information
(MI) and inverse document frequency (IDF) have received considerable attention in the corpus-based and information retrieval (IR) literatures (Charniak, 1993; Jelinek, 1997;
Sparck Jones, 1972). Term frequency is the standard notion of frequency in corpus-based
∗ Institute of Information Sciences and Electronics, 1-1-1 Tennodai, Tsukuba 305-8573, JAPAN † 180 Park Avenue, Florham Park, NJ 07932, U.S.A