Uniform convergence (combinatorics)
For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_1, x_2, \dots, x_m)$, where $x_i \in X$, the empirical frequency of $h \in H$ on $x$ is $\widehat{Q}_x(h) = \frac{1}{m}|\{i : 1 \le i \le m, h(x_i) = 1\}|$. The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to a distribution $P$, then with high probability all the empirical frequencies will be close to their expectations, where the expectation of $h$ is given by $Q_P(h) = P\{y \in X : h(y) = 1\}$. Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
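As a concrete (and purely illustrative) example of this phenomenon, the following sketch compares empirical frequencies with true expectations for a small class of threshold predicates $h_t(y) = \mathbf{1}[y \le t]$ on $X = [0, 1]$ under the uniform distribution; the class, the distribution, and all parameter values are assumptions made for the demonstration, not part of the theorem.

```python
import random

# A minimal sketch: threshold predicates h_t(y) = 1 iff y <= t on X = [0, 1],
# with P the uniform distribution, so Q_P(h_t) = t.
thresholds = [0.1, 0.25, 0.5, 0.9]
m = 10_000                                     # sample size

random.seed(0)
sample = [random.random() for _ in range(m)]   # m i.i.d. draws from P

for t in thresholds:
    # Empirical frequency: fraction of sample points the predicate labels 1.
    emp = sum(1 for xi in sample if xi <= t) / m
    print(f"t = {t:4}: empirical = {emp:.4f}, expectation = {t:.4f}")
```

For this simple class all four empirical frequencies land close to their expectations simultaneously, which is exactly the uniform (over $h \in H$) behavior the theorem quantifies.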
Uniform convergence theorem statement[1]
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon > 0$ and $m$ a positive integer, we have:
- $P^m\{|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\} \le 4\Pi_H(2m)e^{-\varepsilon^2 m/8}$
where, for any $x \in X^m$,
- $Q_P(h) = P\{y \in X : h(y) = 1\}$,
- $\widehat{Q}_x(h) = \frac{1}{m}|\{i : 1 \le i \le m, h(x_i) = 1\}|$,
and $|x| = m$. $P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
$\Pi_H$ is defined as follows: for any $\{0,1\}$-valued functions $H$ over $X$ and any $D \subseteq X$, identifying each $h \in H$ with the subset $\{y \in X : h(y) = 1\}$,
- $\Pi_H(D) = \{h \cap D : h \in H\}$.
And for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as
- $\Pi_H(m) = \max\{|\Pi_H(D)| : D \subseteq X, |D| = m\}$.
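To make these definitions concrete, here is a brute-force sketch (the threshold class and the small domain are illustrative assumptions) that computes $\Pi_H(D)$ as the set of distinct labelings of $D$ and $\Pi_H(m)$ as the maximum over all $D$ of size $m$:

```python
from itertools import combinations

# Illustrative class: threshold predicates h_t(y) = 1 iff y <= t,
# over a small finite instance set X = {0, 1, ..., 9}.
domain = list(range(10))
H = [lambda y, t=t: int(y <= t) for t in domain]

def restrictions(H, D):
    """Pi_H(D): the set of distinct labelings of D realized by H."""
    return {tuple(h(y) for y in D) for h in H}

def shattering_number(H, domain, m):
    """Pi_H(m): the maximum of |Pi_H(D)| over all D of size m."""
    return max(len(restrictions(H, D)) for D in combinations(domain, m))

for m in range(1, 5):
    print(m, shattering_number(H, domain, m))   # prints m and m + 1
```

For thresholds the shattering number grows only linearly ($m + 1$ labelings out of the $2^m$ possible), which is the kind of "simplicity" the theorem exploits.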
From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem we state the Sauer–Shelah lemma, which we will need in our proof.
Sauer–Shelah lemma
The Sauer–Shelah lemma[2] relates the shattering number $\Pi_H(m)$ to the VC dimension.
Lemma: $\Pi_H(m) \le \sum_{i=0}^{d}\binom{m}{i}$, where $d$ is the VC dimension of the concept class $H$.
Corollary: $\Pi_H(m) \le m^{d} + 1$, so the shattering number grows only polynomially in $m$ when $d$ is finite.
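As a quick numeric check (a self-contained sketch; the threshold class is again an illustrative assumption): thresholds have $d = 1$, since any single point is shattered but for two points $a < b$ the labeling $(0, 1)$ is impossible, and the lemma's bound $\binom{m}{0} + \binom{m}{1} = m + 1$ matches the shattering numbers computed above exactly.

```python
from math import comb

def sauer_bound(m, d):
    """The Sauer-Shelah bound: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

# Threshold predicates have VC dimension d = 1 (see remark above).
for m in range(1, 6):
    # Lemma bound and corollary bound m^d + 1; both equal m + 1 when d = 1.
    print(m, sauer_bound(m, 1), m**1 + 1)
```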
Proof of uniform convergence theorem[1]
Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview.
- Symmetrization: We transform the problem of analyzing $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$.
- Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So we will try to bound the probability of $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x = r || s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ for $i$ in some subset of $\{1, 2, \dots, m\}$. The symbol $r || s$ means the concatenation of $r$ and $s$.
- Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
We present the technical details of the proof.
Symmetrization
Lemma: Let $V = \{x \in X^m : |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\}$ and
- $R = \{(x, s) \in X^m \times X^m : |\widehat{Q}_x(h) - \widehat{Q}_s(h)| \ge \varepsilon/2 \text{ for some } h \in H\}$.
Then for $m \ge \frac{2}{\varepsilon^2}$, $P^m(V) \le 2P^{2m}(R)$.
Proof:
By the triangle inequality, if $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$ and $|Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2$, then $|\widehat{Q}_x(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$.
Therefore,
- $P^{2m}(R) \ge P^{2m}\{\exists h \in H : |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2\}$
- $= \int_V P^m\{s : \exists h \in H,\ |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2\}\,dP^m(x) = A$ [since $x$ and $s$ are independent].
Now for $x \in V$, fix an $h \in H$ such that $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$. For this $h$ we shall show that
- $P^m\{s : |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}\} \ge \frac{1}{2}$.
Thus for any $x \in V$ the inner probability is at least $\frac{1}{2}$, so $A \ge \frac{1}{2}P^m(V)$ and hence $P^{2m}(R) \ge \frac{1}{2}P^m(V)$. This accomplishes the first step of the high-level plan.
Notice that $m \cdot \widehat{Q}_s(h)$ is a binomial random variable with expectation $m \cdot Q_P(h)$ and variance $m \cdot Q_P(h)(1 - Q_P(h))$. By Chebyshev's inequality we get
- $P^m\{|Q_P(h) - \widehat{Q}_s(h)| > \frac{\varepsilon}{2}\} \le \frac{m \cdot Q_P(h)(1 - Q_P(h))}{(\varepsilon m/2)^2} \le \frac{1}{\varepsilon^2 m} \le \frac{1}{2}$
for the assumed bound $m \ge \frac{2}{\varepsilon^2}$. Here we use the fact that $z(1 - z) \le \frac{1}{4}$ for $z \in [0, 1]$.
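As a numeric sanity check of this Chebyshev step (an illustrative simulation with assumed parameters, not part of the proof), one can estimate the binomial deviation probability directly and compare it with the bound $\frac{1}{\varepsilon^2 m}$:

```python
import random

# Estimate P{|q - S/m| > eps/2} for S ~ Binomial(m, q), and compare with
# the Chebyshev bound 1/(eps^2 * m). All parameter values are illustrative.
random.seed(1)
q, eps = 0.3, 0.2
m = 4 * int(2 / eps**2)          # comfortably above the 2/eps^2 threshold
trials = 20_000

hits = 0
for _ in range(trials):
    s = sum(random.random() < q for _ in range(m))   # one Binomial(m, q) draw
    hits += abs(q - s / m) > eps / 2

print(f"simulated deviation probability = {hits / trials:.4f}")
print(f"Chebyshev bound 1/(eps^2 * m)   = {1 / (eps**2 * m):.4f}")
```

The simulated probability is far below the bound, as expected: Chebyshev is loose here, but $\frac{1}{2}$ is all the proof needs.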
Permutations
Let $\Gamma_m$ be the set of all permutations of $\{1, 2, 3, \dots, 2m\}$ that swap $i$ and $m + i$ for $i$ in some subset of $\{1, 2, 3, \dots, m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,
- $P^{2m}(R) = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}} \Pr[\sigma(x) \in R]$,
where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof: For any $\sigma \in \Gamma_m$,
- $P^{2m}(R) = P^{2m}\{x : \sigma(x) \in R\}$ [since coordinate permutations preserve the product distribution $P^{2m}$].
Therefore,
- $P^{2m}(R) = \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} P^{2m}\{x : \sigma(x) \in R\}$
- $= \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} \int_{X^{2m}} \mathbf{1}_R(\sigma(x))\,dP^{2m}(x)$ [because $\Gamma_m$ is finite]
- $= \int_{X^{2m}} \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} \mathbf{1}_R(\sigma(x))\,dP^{2m}(x)$ [the expectation over $\sigma$]
- $\le \max_{x \in X^{2m}}\left(\frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} \mathbf{1}_R(\sigma(x))\right)$.
The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
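The swap group $\Gamma_m$ is easy to realize concretely; this short sketch (variable names are assumptions for illustration) enumerates the $2^m$ permutations as bit masks and applies one to a joint sample $x = r || s$:

```python
from itertools import product

def apply_swap(x, mask):
    """Apply the permutation swapping x[i] and x[m+i] exactly where mask[i] == 1."""
    m = len(x) // 2
    y = list(x)
    for i, bit in enumerate(mask):
        if bit:
            y[i], y[m + i] = y[m + i], y[i]
    return y

m = 3
x = ['r1', 'r2', 'r3', 's1', 's2', 's3']      # joint sample r || s
Gamma_m = list(product([0, 1], repeat=m))     # one mask per subset: |Gamma_m| = 2^m
print(len(Gamma_m))                           # 8
print(apply_swap(x, (1, 0, 1)))               # ['s1', 'r2', 's3', 'r1', 's2', 'r3']
```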
Reduction to a finite class
Lemma: By the previous lemma it suffices to bound $\Pr[\sigma(x) \in R]$ for the worst $x$; we show
- $\max_{x \in X^{2m}} \Pr[\sigma(x) \in R] \le 2\Pi_H(2m)e^{-\varepsilon^2 m/8}$.
Proof:
Let us define $x = (x_1, x_2, \dots, x_{2m})$ and let $t = |\Pi_H(x)|$, which is at most $\Pi_H(2m)$. This means there are functions $h_1, h_2, \dots, h_t \in H$ such that for any $h \in H$ there is an $i$ between $1$ and $t$ with $h_i(x_k) = h(x_k)$ for all $1 \le k \le 2m$.
We see that $\sigma(x) \in R$ iff some $h$ in $H$ satisfies
- $\left|\frac{1}{m}\left(|\{1 \le i \le m : h(x_{\sigma(i)}) = 1\}| - |\{m + 1 \le i \le 2m : h(x_{\sigma(i)}) = 1\}|\right)\right| \ge \frac{\varepsilon}{2}$.
Hence we define $w_i^j = 1$ if $h_j(x_i) = 1$ and $w_i^j = 0$ otherwise, for $1 \le i \le 2m$ and $1 \le j \le t$. Then $\sigma(x) \in R$ iff some $j$ in $\{1, \dots, t\}$ satisfies $\left|\frac{1}{m}\left(\sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=m+1}^{2m} w_{\sigma(i)}^j\right)\right| \ge \frac{\varepsilon}{2}$. By the union bound we get
- $\Pr[\sigma(x) \in R] \le t \cdot \max_{1 \le j \le t} \Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=m+1}^{2m} w_{\sigma(i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right]$
- $\le \Pi_H(2m) \cdot \max_{1 \le j \le t} \Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=m+1}^{2m} w_{\sigma(i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right]$.
Since the distribution over the permutations $\sigma$ is uniform, for each $i$ the difference $w_{\sigma(i)}^j - w_{\sigma(m+i)}^j$ equals $\pm|w_i^j - w_{m+i}^j|$ with equal probability.
Thus,
- $\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}\left(w_{\sigma(i)}^j - w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right] = \Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}\left|w_i^j - w_{m+i}^j\right|\beta_i\right| \ge \frac{\varepsilon}{2}\right]$,
where the probability on the right is over the signs $\beta_i$, each of which is $+1$ or $-1$ independently with equal probability. By Hoeffding's inequality, this is at most $2e^{-m\varepsilon^2/8}$.
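This Rademacher step can also be checked numerically; the sketch below (parameters are illustrative assumptions) draws random signs $\beta_i$ and compares the deviation frequency with the Hoeffding bound $2e^{-m\varepsilon^2/8}$:

```python
import random
from math import exp

random.seed(2)
m, eps = 500, 0.2
a = [random.randint(0, 1) for _ in range(m)]   # stands in for |w_i - w_{m+i}|
trials = 10_000

hits = 0
for _ in range(trials):
    total = sum(ai * random.choice((-1, 1)) for ai in a)   # sum of a_i * beta_i
    hits += abs(total / m) >= eps / 2

print(f"simulated probability             = {hits / trials:.5f}")
print(f"Hoeffding bound 2*exp(-m*eps^2/8) = {2 * exp(-m * eps**2 / 8):.5f}")
```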
Finally, combining the three parts of the proof, we get the Uniform Convergence Theorem:
- $P^m(V) \le 2P^{2m}(R) \le 2\max_{x \in X^{2m}}\Pr[\sigma(x) \in R] \le 4\Pi_H(2m)e^{-\varepsilon^2 m/8}$.
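To see what the bound gives in practice, this closing sketch (all parameter values, and the use of the Sauer corollary $\Pi_H(2m) \le (2m)^d + 1$, are illustrative assumptions) searches for a sample size at which the failure probability drops below a target $\delta$:

```python
from math import exp

def uc_bound(m, d, eps):
    """The theorem's bound 4 * Pi_H(2m) * exp(-eps^2 * m / 8),
    with Pi_H(2m) replaced by the Sauer corollary value (2m)^d + 1."""
    return 4 * ((2 * m) ** d + 1) * exp(-(eps**2) * m / 8)

d, eps, delta = 3, 0.1, 0.05   # illustrative: VC dimension 3, accuracy 0.1
m = 1
while uc_bound(m, d, eps) > delta:
    m *= 2
print(m, uc_bound(m, d, eps))  # a sample size making the bound fall below delta
```

Because the shattering number grows only polynomially in $m$ while the exponential term decays, the bound eventually falls below any fixed $\delta$ whenever $d$ is finite.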