Uniform convergence (combinatorics)

For a class of predicates H defined on a set X and a set of samples x=(x_{1},x_{2},\dots,x_{m}), where x_{i}\in X, the empirical frequency of h\in H on x is \widehat{Q_{x}}(h)=\frac{1}{m}|\{i:1\leq i\leq m,h(x_{i})=1\}|. The Uniform Convergence Theorem states, roughly, that if H is "simple" and we draw samples independently (with replacement) from X according to a distribution P, then with high probability the empirical frequency of every h\in H will be close to its expectation, where the expectation is given by Q_{P}(h)=P\{y\in X:h(y)=1\}. Here "simple" means that the Vapnik–Chervonenkis dimension of the class H is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
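As a concrete illustration (a minimal sketch, not from the source, assuming the class of threshold predicates h_{t}(y)=1 iff y\leq t on X=[0,1] with P uniform, so that Q_{P}(h_{t})=t exactly), the following Python snippet estimates the worst empirical deviation over the class:

    import random

    # Threshold predicates h_t(y) = 1 iff y <= t on X = [0, 1], with P uniform,
    # so the true frequency is Q_P(h_t) = t exactly.
    def worst_deviation(m, rng):
        sample = [rng.random() for _ in range(m)]
        thresholds = [i / 100 for i in range(101)]
        # sup over the class of |Q_P(h_t) - empirical frequency of h_t|
        return max(abs(t - sum(y <= t for y in sample) / m) for t in thresholds)

    rng = random.Random(0)
    for m in (100, 1000, 10000):
        print(m, worst_deviation(m, rng))
    # The worst deviation over the whole class shrinks as m grows,
    # as the theorem predicts for a class of small VC dimension.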

Uniform convergence theorem statement[1]

If H is a set of \{0,1\}-valued functions defined on a set X and P is a probability distribution on X, then for \epsilon>0 and m a positive integer, we have

P^{m}\{|Q_{P}(h)-\widehat{Q_{x}}(h)|\geq\epsilon\text{ for some }h\in H\}\leq 4\Pi_{H}(2m)e^{-\frac{\epsilon^{2}m}{8}},

where, for any x\in X^{m},

Q_{P}(h)=P\{y\in X:h(y)=1\},
\widehat{Q_{x}}(h)=\frac{1}{m}|\{i:1\leq i\leq m,h(x_{i})=1\}| and |x|=m. P^{m} indicates that the probability is taken over x consisting of m i.i.d. draws from the distribution P.

\Pi_{H} is defined as follows: for any set H of \{0,1\}-valued functions over X and any D\subseteq X,

\Pi_{H}(D)=\{h\cap D:h\in H\},

where each h is identified with the subset \{y\in X:h(y)=1\} of X on which it takes the value 1.

For any natural number m, the shattering number \Pi_{H}(m) is defined as:

\Pi_{H}(m)=\max_{D\subseteq X,|D|=m}|\{h\cap D:h\in H\}|.
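For small finite classes, the shattering number can be computed directly from this definition. Below is a minimal brute-force sketch in Python; the discrete interval class is only an assumed example, and each h is represented by the subset of X on which it equals 1, matching the h\cap D notation above.

    from itertools import combinations

    def shatter_number(H, X, m):
        # Pi_H(m): maximize over D with |D| = m the number of distinct h "cap" D
        return max(len({h & frozenset(D) for h in H})
                   for D in combinations(X, m))

    # Example class: discrete intervals {a, a+1, ..., b-1} on X = {0, ..., 5}.
    X = range(6)
    H = {frozenset(range(a, b)) for a in range(7) for b in range(a, 7)}
    print([shatter_number(H, X, m) for m in (1, 2, 3)])
    # Prints [2, 4, 7]: two points can be shattered (2^2 = 4) but three cannot
    # (7 < 2^3), so the VC dimension of this class is 2.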

From the point of view of learning theory, one can consider H to be the concept/hypothesis class defined over the instance set X. Before getting into the details of the proof of the theorem, we state the Sauer–Shelah lemma, which we will need in the proof.

Sauer–Shelah lemma

The Sauer–Shelah lemma[2] relates the shattering number \Pi_{H}(m) to the VC dimension.

Lemma: \Pi_{H}(m)\leq\left(\frac{em}{d}\right)^{d} for m\geq d\geq 1, where d is the VC dimension of the concept class H.

Corollary: \Pi_{H}(m)\leq m^{d}+1.
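As a worked check, consider for example the class H of intervals on the real line (each interval identified with its indicator function), which has VC dimension d=2: any m distinct points admit exactly 1+\frac{m(m+1)}{2} distinct intersections with intervals, so \Pi_{H}(8)=37, consistent with both bounds, since \left(\frac{8e}{2}\right)^{2}\approx 118 and 8^{2}+1=65.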

Proof of uniform convergence theorem[1]

Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview of the proof.

  1. Symmetrization: We transform the problem of analyzing |Q_{P}(h)-\widehat{Q_{x}}(h)|\geq\epsilon into the problem of analyzing |\widehat{Q_{r}}(h)-\widehat{Q_{s}}(h)|\geq\epsilon/2, where r and s are i.i.d. samples of size m drawn according to the distribution P. One can view r as the original randomly drawn sample of length m, while s may be thought of as a testing sample which is used to estimate Q_{P}(h). (A small numerical sketch of this step follows the outline.)
  2. Permutation: Since r and s are picked identically and independently, swapping elements between them does not change the probability distribution on r and s. So, we bound the probability of |\widehat{Q_{r}}(h)-\widehat{Q_{s}}(h)|\geq\epsilon/2 for some h\in H by considering the effect of a specific collection of permutations of the joint sample x=r||s, where r||s denotes the concatenation of r and s. Specifically, we consider permutations \sigma(x) which swap x_{i} and x_{m+i} for i in some subset of \{1,2,\dots,m\}.
  3. Reduction to a finite class: We can now restrict the function class H to a fixed joint sample, and hence, if H has finite VC dimension, the problem reduces to one involving a finite function class.
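The following Python sketch illustrates the symmetrization step numerically (again a hypothetical setup, reusing the threshold class on [0,1] with P uniform): it compares the one-sample deviation from the true expectation with the deviation between the sample and an independent ghost sample.

    import random

    # Thresholds h_t on [0, 1] with P uniform, so Q_P(h_t) = t.
    def emp(sample, t):
        # empirical frequency of h_t on the sample
        return sum(y <= t for y in sample) / len(sample)

    rng = random.Random(1)
    m = 500
    thresholds = [i / 200 for i in range(201)]
    r = [rng.random() for _ in range(m)]  # original sample
    s = [rng.random() for _ in range(m)]  # independent "ghost" sample
    true_dev = max(abs(t - emp(r, t)) for t in thresholds)
    ghost_dev = max(abs(emp(r, t) - emp(s, t)) for t in thresholds)
    print(true_dev, ghost_dev)  # the two deviations are of comparable size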

We present the technical details of the proof.

Symmetrization

Lemma: Let V=\{x\in X^{m}:|Q_{P}(h)-\widehat{Q_{x}}(h)|\geq\epsilon\text{ for some }h\in H\} and

R=\{(r,s)\in X^{m}\times X^{m}:|\widehat{Q_{r}}(h)-\widehat{Q_{s}}(h)|\geq\epsilon/2\text{ for some }h\in H\}.

Then for m\geq\frac{2}{\epsilon^{2}}, P^{m}(V)\leq 2P^{2m}(R).

Proof: By the triangle inequality,
if |Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon and |Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\epsilon/2 then |\widehat{Q_{r}}(h)-\widehat{Q_{s}}(h)|\geq\epsilon/2.
Therefore,

P^{2m}(R)\geq P^{2m}\{\exists h\in H:|Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon\text{ and }|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\epsilon/2\}
=\int_{V}P^{m}\{s:\exists h\in H,|Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon\text{ and }|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\epsilon/2\}\,dP^{m}(r)=A [since r and s are independent].

Now for r\in V fix an h\in H such that |Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon. For this h, we shall show that

P^{m}\{|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\frac{\epsilon}{2}\}\geq\frac{1}{2}.

Thus the integrand above is at least \frac{1}{2} for every r\in V, so A\geq\frac{P^{m}(V)}{2}, and hence P^{2m}(R)\geq\frac{P^{m}(V)}{2}. This establishes the first step of the high-level outline.
Notice that m\cdot\widehat{Q_{s}}(h) is a binomial random variable with expectation m\cdot Q_{P}(h) and variance m\cdot Q_{P}(h)(1-Q_{P}(h)). By Chebyshev's inequality we get

P^{m}\{|Q_{P}(h)-\widehat{Q_{s}}(h)|>\frac{\epsilon}{2}\}\leq\frac{m\cdot Q_{P}(h)(1-Q_{P}(h))}{(\epsilon m/2)^{2}}\leq\frac{1}{\epsilon^{2}m}\leq\frac{1}{2}

for the mentioned bound on m. Here we use the fact that x(1-x)\leq\frac{1}{4} for all x.

Permutations

Let \Gamma_{m} be the set of all permutations of \{1,2,3,\dots,2m\} that swap i and m+i for all i in some subset of \{1,2,\dots,m\}.

Lemma: Let R be any subset of X^{2m} and P any probability distribution on X. Then,

P^{2m}(R)=E[\Pr[\sigma(x)\in R]]\leq\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]),

where the expectation is over x chosen according to P^{2m}, and the probability is over \sigma chosen uniformly from \Gamma_{m}.

Proof: For any \sigma\in\Gamma_{m},

P^{2m}(R)=P^{2m}\{x:\sigma(x)\in R\} [since coordinate permutations preserve the product distribution P^{2m}].
\therefore P^{2m}(R)=\int_{X^{2m}}1_{R}(x)\,dP^{2m}(x)
=\frac{1}{|\Gamma_{m}|}\sum_{\sigma\in\Gamma_{m}}\int_{X^{2m}}1_{R}(\sigma(x))\,dP^{2m}(x)
=\int_{X^{2m}}\frac{1}{|\Gamma_{m}|}\sum_{\sigma\in\Gamma_{m}}1_{R}(\sigma(x))\,dP^{2m}(x) [because |\Gamma_{m}| is finite]
=\int_{X^{2m}}\Pr[\sigma(x)\in R]\,dP^{2m}(x) [the inner average is exactly this probability]
\leq\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]).

The maximum is guaranteed to exist, since for each x the probability \Pr[\sigma(x)\in R] is a multiple of 1/|\Gamma_{m}| and hence ranges over only a finite set of values.
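The equality in this lemma can be checked exactly on a tiny discrete case. The sketch below is an illustrative setup (not from the source): X=\{0,1\} with P(1)=0.3, m=2, and an arbitrary event R; it enumerates all of X^{2m} and all 2^{m} swap patterns.

    from itertools import product

    p, m = 0.3, 2
    tuples = list(product((0, 1), repeat=2 * m))  # all of X^{2m}

    def pm(x):
        # probability of the tuple x under i.i.d. draws with P(1) = p
        q = 1.0
        for v in x:
            q *= p if v == 1 else 1 - p
        return q

    def swaps(x):
        # sigma(x) for every sigma in Gamma_m: 2^m choices of which i to swap
        for mask in range(2 ** m):
            y = list(x)
            for i in range(m):
                if mask >> i & 1:
                    y[i], y[m + i] = y[m + i], y[i]
            yield tuple(y)

    R = {x for x in tuples if sum(x[:m]) != sum(x[m:])}   # an arbitrary event
    lhs = sum(pm(x) for x in R)                           # P^{2m}(R)
    avg = {x: sum(s in R for s in swaps(x)) / 2 ** m for x in tuples}
    rhs = sum(pm(x) * avg[x] for x in tuples)             # E[Pr[sigma(x) in R]]
    print(lhs, rhs, max(avg.values()))
    # lhs and rhs agree (up to rounding), and both are at most the maximum.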

Reduction to a finite class

Lemma: In the setting of the previous lemma,

\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R])\leq 2\Pi_{H}(2m)e^{-\frac{\epsilon^{2}m}{8}}.

Proof: Let us define x=(x_{1},x_{2},\dots,x_{2m}) and t=|H|_{x}|, the number of distinct restrictions of functions in H to x, which is at most \Pi_{H}(2m). This means there are functions h_{1},h_{2},\dots,h_{t}\in H such that for any h\in H there exists i between 1 and t with h_{i}(x_{k})=h(x_{k}) for 1\leq k\leq 2m.
We see that \sigma(x)\in R iff some h\in H satisfies |\frac{1}{m}|\{1\leq i\leq m:h(x_{\sigma(i)})=1\}|-\frac{1}{m}|\{m+1\leq i\leq 2m:h(x_{\sigma(i)})=1\}||\geq\frac{\epsilon}{2}. Hence let us define w^{j}_{i}=1 if h_{j}(x_{i})=1 and w^{j}_{i}=0 otherwise, for 1\leq i\leq 2m and 1\leq j\leq t. Then \sigma(x)\in R iff some j in \{1,\dots,t\} satisfies |\frac{1}{m}\left(\sum_{i=1}^{m}w^{j}_{\sigma(i)}-\sum_{i=1}^{m}w^{j}_{\sigma(m+i)}\right)|\geq\frac{\epsilon}{2}. By the union bound we get,

\Pr[\sigma(x)\in R]\leq t\cdot\max_{1\leq j\leq t}\left(\Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m}w^{j}_{\sigma(i)}-\sum_{i=1}^{m}w^{j}_{\sigma(m+i)}\right)\right|\geq\frac{\epsilon}{2}\right]\right)
\leq\Pi_{H}(2m)\cdot\max_{1\leq j\leq t}\left(\Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m}w^{j}_{\sigma(i)}-\sum_{i=1}^{m}w^{j}_{\sigma(m+i)}\right)\right|\geq\frac{\epsilon}{2}\right]\right).

Since the distribution over the permutations \sigma is uniform, for each i the difference w^{j}_{\sigma(i)}-w^{j}_{\sigma(m+i)} equals \pm|w^{j}_{i}-w^{j}_{m+i}| with equal probability, independently over i.
Thus,

\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}\left(w^{j}_{\sigma(i)}-w^{j}_{\sigma(m+i)}\right)\right|\geq\frac{\epsilon}{2}\right]=\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}|w^{j}_{i}-w^{j}_{m+i}|\beta_{i}\right|\geq\frac{\epsilon}{2}\right],

where the probability on the right is over the \beta_{i}, each of which is +1 or -1 with equal probability, independently. By Hoeffding's inequality, this is at most 2e^{-\frac{m\epsilon^{2}}{8}}.
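In detail: each summand |w^{j}_{i}-w^{j}_{m+i}|\beta_{i} is an independent, mean-zero random variable taking values in [-1,1]. Hoeffding's inequality for a sum of m such bounded terms gives

\Pr\left[\left|\sum_{i=1}^{m}|w^{j}_{i}-w^{j}_{m+i}|\beta_{i}\right|\geq\frac{\epsilon m}{2}\right]\leq 2e^{-\frac{2(\epsilon m/2)^{2}}{m\cdot 2^{2}}}=2e^{-\frac{m\epsilon^{2}}{8}}.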

Finally, combining all three parts of the proof, we get the Uniform Convergence Theorem.
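Explicitly, for m\geq\frac{2}{\epsilon^{2}}:

P^{m}(V)\leq 2P^{2m}(R)\leq 2\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R])\leq 2\cdot 2\Pi_{H}(2m)e^{-\frac{\epsilon^{2}m}{8}}=4\Pi_{H}(2m)e^{-\frac{\epsilon^{2}m}{8}}.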

References