We consider the problem of estimating the total probability of all symbols that appear with a given frequency in a string of i.i.d. random variables with unknown distribution. We focus on the regime in which the block length is large yet no symbol appears
a r X i v :c s /0607014v 1 [c s .I T ] 5 J u l 2006
Strong Consistency of the Good-Turing Estimator
Coordinated Science Laboratory Univ.of Illinois at Urbana-Champaign and School of ECE,Cornell University
Coordinated Science Laboratory
and ECE Dept.
Univ.of Illinois at Urbana-Champaign
Princeton University firstname.lastname@example.org
Abstract —We consider the problem of estimating the total probability of all symbols that appear with a given frequency in a string of i.i.d.random variables with unknown distribution.We focus on the regime in which the block length is large yet no symbol appears frequently in the string.This is accomplished by allowing the distribution to change with the block length.Under a natural convergence assumption on the sequence of underlying distributions,we show that the total probabilities converge to a deterministic limit,which we characterize.We then show that the Good-Turing total probability estimator is strongly consistent.
The problem of estimating the underlying probability dis-tribution from an observed data sequence arises in a variety of ﬁelds such as compression,adaptive control,and linguistics.The most familiar technique is to use the empirical distribution of the data,also known as the type.This approach has a number of virtues.It is the maximum likelihood (ML)distribution,and if each symbol appears frequently in the string,then the law of large numbers guarantees that the estimate will be close to the true distribution.
In some situations,however,not all symbols will appear frequently in the observed data.One example is a digital image with the pixels themselves,rather than bits,viewed as the symbols .Here the size of the alphabet can meet or exceed the total number of observed symbols,i.e.,the number of pixels in the image.Another example is English text.Even in large corpora,many words will appear once or twice or not at all .This makes estimating the distribution of English words using the type ineffective.This problem is particularly pronounced when one attempts to estimate the distribution of bigrams,or pairs of words,since the number of bigrams is evidently the square of the number of words.
To see that the empirical distribution is lacking as an estimator for the probabilities of uncommon symbols,con-sider the extreme situation in which the alphabet is inﬁnite and we observe a length-n sequence containing n distinct symbols .The ML estimator will assign probability 1/n to the n symbols that appear in the string and zero probability to the rest.But common sense suggests that the (n +1)st symbol in the sequence is very likely to be one that has not yet appeared.It seems that the ML estimator is overﬁtting the data.Modiﬁcations to the ML estimator such as the Laplace “add one”and the Krichevsky-Troﬁmov “add half”have been proposed as remedies,but these only alleviate the problem .
In collaboration with Turing,Good proposed an esti-mator for the probabilities of rare symbols that differs con-siderably from the ML estimator.The Good-Turing estimator has been shown to work well in practice ,and it is now used in several application areas .Early theoretical work on the estimator focused on its bias ,,.Recent work has been directed toward developing conﬁdence intervals for the estimates using central limit theorems ,or concentration inequalities ,.Orlitsky,Santhanam,and Zhang showed that the estimator has a pattern redundancy that is small but not optimal.None of these works,however,have shown that the estimator is strongly consistent.
We show that the Good-Turing estimator is strongly consis-tent under a natural formulation of the problem.We consider the problem of estimating the total probability of all symbols that appear k times in the observed string for each nonnegative integer k .For k =0,this is the total probability of the unseen symbols,a quantity that has received particular attention ,.Estimating the total probability of all symbols with the same empirical frequency is a natural approach when the symbols appear infrequently so that there is insufﬁcient data to accurately estimate the probabilities of the individual symbols.Although the total probabilities are themselves random,we show that under our model they converge to a deterministic limit,which we characterize.Note that if the alphabet is small and the block length is large,then the problem effectively reduces to the usual probability estimation problem since it is unlikely that multiple symbols will have the same empirical frequency.
It is known that the Good-Turing estimator performs poorly for high-probability symbols ,but this is not a problem since the ML estimator can be employed to estimate the probabilities of symbols that appear frequently in the observed string.We therefore focus on the situation in which the symbols are unlikely,meaning that they have probability O (1/n ).We allow the underlying distributions to vary with the block length n in order to maintain this condition,and we assume that,properly scaled,these distributions converge.This model is discussed in detail in the next section,where we also describe the Good-Turing estimator.In Section III,we establish the convergence of the total probabilities.Section IV uses this convergence result to show strong consistency of the Good-Turing estimator.Some comments regarding how to estimate other quantities of interest are made in the ﬁnal section.