$$ \def\*#1{\mathbf{#1}} \def\+#1{\mathcal{#1}} \def\-#1{\mathrm{#1}} \def\!#1{\mathsf{#1}} \def\@#1{\mathscr{#1}} \newcommand{\mr}[1]{\mbox{\scriptsize \color{RedViolet}$\triangleright\;$#1}\quad\quad} $$

Lecture 13: VC dimension, Empirical Risk Minimization

Author

Instructed by Chihao Zhang, scribed by Weixian Xu and Yuchen He.

Most expositions in this lecture follow (Vershynin 2018) and (Van Handel 2014).

Vershynin, Roman. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science. Vol. 47. Cambridge University Press.
Van Handel, Ramon. 2014. “Probability in High Dimension.”

VC dimension

Recall that in the previous lecture, we established a uniform Law of Large Numbers (LLN) for Lipschitz functions. Specifically, we considered the class: \[ \+F=\set{f:[0,1]\to [0,1], f \mbox{ is }1\mbox{-Lipschitz}}. \] Let \(X_1,\dots,X_n\) be \(n\) independent samples drawn uniformly from \([0,1]\). We analyzed the empirical process centered at the mean: \[ X_f=\frac{1}{n} \sum_{i=1}^n \tp{f(X_i)-\E{f(X)}}. \] Since \(X_f-X_g\) is \(\frac{\norm{f-g}^2_{\infty}}{n}\)-sub-Gaussian and the covering number \(N(\+F,\norm{\cdot}_{\infty},\eps)\) is finite, we applied the chaining argument (Dudley’s theorem) to bound the supremum.

However, the Lipschitz assumption is restrictive. Many fundamental problems in statistics involve discontinuous functions. A classic example is the Glivenko-Cantelli Theorem.

To estimate the cumulative distribution function (CDF) \(F(x)\) of a distribution \(\mu\), we use the empirical CDF: \[ F_n(x) = \frac{1}{n}\sum_{i=1}^n \*1[X_i\leq x]. \] The Glivenko-Cantelli theorem states that \(F_n\) converges to \(F\) uniformly almost surely: \[ \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \xrightarrow{a.s.} 0. \]
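As a quick numerical illustration (a minimal sketch assuming NumPy; not a proof), we can draw uniform samples on \([0,1]\), where \(F(x)=x\), and compute \(\sup_x\abs{F_n(x)-F(x)}\); the averages shrink at roughly the \(1/\sqrt{n}\) rate predicted by the VC bound developed below.

```python
import numpy as np

def ks_distance_uniform(n, rng):
    """sup_x |F_n(x) - F(x)| for n i.i.d. Uniform[0,1] samples, where F(x) = x.

    F_n jumps from (i-1)/n to i/n at the i-th order statistic, so the supremum
    is attained at (or just before) one of the sample points.
    """
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

rng = np.random.default_rng(0)
for n in [100, 1_000, 10_000]:
    d = np.mean([ks_distance_uniform(n, rng) for _ in range(50)])
    print(f"n = {n:6d}   average sup_x |F_n(x) - F(x)| ≈ {d:.4f}")
```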

This problem can be framed as uniform convergence over the class of indicator functions \[ \mathcal{F} = \set{f: x\mapsto \*1[x\leq a]\ |\ a\in \bb R}. \]

The functions in \(\mathcal{F}\) are step functions and are manifestly not Lipschitz. More critically, if we attempt to use the \(L_\infty\) norm (sup-norm), the covering number \(N(\+F,\norm{\cdot}_{\infty},\eps)\) is infinite for any \(\eps<1\). The failure of the \(L_\infty\) norm suggests we are measuring distance too strictly. For statistical convergence, two functions \(f\) and \(g\) don’t need to be close everywhere.

This motivates us to measure the complexity of \(\mathcal{F}\) using the \(L^2(\mu)\) norm \[ \|f - g\|_{L^2(\mu)} = \E{(f(X) - g(X))^2}^{1/2}. \] This shift allows us to bound the covering number of discontinuous classes.

VC dimension

To characterize the complexity of a class of Boolean functions \(\mathcal{F}\) where each \(f: \Omega \to \{0, 1\}\), we introduce the concept of Vapnik-Chervonenkis (VC) dimension.

The central idea is to check how “rich” the function class is by seeing if it can reproduce all possible binary patterns on a finite set of points.

Definition 1 (Shattering) A subset of points \(\Lambda = \{x_1, \dots, x_k\} \subseteq \Omega\) is said to be shattered by \(\mathcal{F}\) if for any binary labeling \(g: \Lambda \to \{0, 1\}\), \(\exists f \in \mathcal{F}\) such that \(f|_\Lambda = g\).

Definition 2 (VC dimension) The VC dimension of \(\mathcal{F}\), denoted \(\!{VC}(\mathcal{F})\), is the largest cardinality of a subset \(\Lambda \subseteq \Omega\) that is shattered by \(\mathcal{F}\).

To build intuition, let’s calculate the VC dimensions for some typical function classes.

Example 1 (Interval indicators on \(\bb R\)) Consider the class of indicators of intervals on the real line: \[ \+F=\set{f: x\mapsto \*1[x\in [a,b]]\ |\ a, b\in \bb R, a\leq b}. \] We claim that \(\!{VC}(\mathcal{F})=2\). For two points \(x_1< x_2\), to realize the labelings \((0,0), (1,0), (0,1)\) and \((1,1)\), we can choose an interval disjoint from both points, the interval \([x_1,x_1]\), the interval \([x_2,x_2]\), and the interval \([x_1,x_2]\), respectively.

If we have \(3\) points \(x_1<x_2<x_3\), consider the label pattern \((1, 0, 1)\). To realize it, an interval \([a, b]\) must contain \(x_1\) and \(x_3\); by convexity it then also contains \(x_2\), so the pattern cannot be realized. Hence no set of three points is shattered and \(\!{VC}(\mathcal{F})=2\).
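The claim can also be checked by brute force. The following sketch (illustrative only; it relies on the fact that an interval always selects a contiguous block of the sorted points) enumerates the labelings that interval indicators realize on a given point set.

```python
def interval_patterns(points):
    """All labelings of `points` realizable by indicators 1[x in [a, b]].

    An interval selects a contiguous block of the sorted points, so it suffices
    to try the intervals [points[i], points[j]] plus an interval missing all points.
    """
    pts = sorted(points)
    k = len(pts)
    patterns = {(0,) * k}                           # interval disjoint from every point
    for i in range(k):
        for j in range(i, k):
            patterns.add(tuple(1 if i <= t <= j else 0 for t in range(k)))
    return patterns

def shattered_by_intervals(points):
    return len(interval_patterns(points)) == 2 ** len(points)

print(shattered_by_intervals([0.2, 0.7]))          # True: any two points are shattered
print(shattered_by_intervals([0.2, 0.5, 0.7]))     # False: (1, 0, 1) is unrealizable
```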

Example 2 (Half-planes in \(\bb R^2\)) Consider linear threshold functions in the plane: \[ \mathcal{F} = \set{ f:(x,y) \mapsto \*1[ax + by + c \ge 0] \mid a,b,c \in \mathbb{R} }. \] We claim that \(\!{VC}(\mathcal{F})=3\). Any three non-collinear points (forming a triangle) can be shattered. Any subset of the vertices can be separated from the rest by a straight line.

When there are \(4\) points, at least one labeling cannot be realized, so no set of four points is shattered:

  • If one point is inside the triangle formed by the other three and we label the outer points \(1\) and the inner point \(0\), no line can “cut out” the triangle without also including the interior point.
  • If the points form a convex quadrilateral and we label diagonal opposites as \((1, 1)\) and the other pair as \((0, 0)\), no single line can separate the 1s from the 0s.

Example 3 (Sphere indicators in \(\bb R^d\)) Consider the class of indicators of spheres in \(\mathbb{R}^d\): \[ \mathcal{F} = \set{ f: x \mapsto \*1[\|x - c\|_2 \le r] \mid c \in \mathbb{R}^d, r \ge 0 }. \] We claim that \(\!{VC}(\mathcal{F}) = d + 1\). Any \(d+1\) points in general position can be shattered by spheres. However, no set of \(d+2\) points can be shattered.

Example 4 (Boolean functions on a finite domain) Assume \(\Omega=\set{1,2,3}\) and consider the class defined by the binary strings \(\mathcal{F} = \{001, 010, 100, 111\}\), where the \(i\)-th bit of each string gives the value \(f(i)\). We claim that \(\!{VC}(\+F)=2\).

It is easy to verify that the subset \(\{1, 2\}\) is shattered. However, \(\Omega\) itself is not shattered because \(|\mathcal{F}| = 4 < 2^3 = 8\).
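When the class is given explicitly as bit strings, as in this example, the shattering check is again a few lines of code; here is an illustrative sketch.

```python
from itertools import combinations

F = ["001", "010", "100", "111"]          # the i-th bit of each string is f(i)

def shatters(F, positions):
    """Does F realize all 2^|positions| labelings on the given coordinates?"""
    patterns = {tuple(f[p] for p in positions) for f in F}
    return len(patterns) == 2 ** len(positions)

print(shatters(F, (0, 1)))                # True:  the subset {1, 2} is shattered
print(shatters(F, (0, 1, 2)))             # False: |F| = 4 < 2^3, so Omega is not
vc = max(k for k in range(4)
         if any(shatters(F, S) for S in combinations(range(3), k)))
print("VC dimension:", vc)                # 2
```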

Combinatorial bounds

We now connect the cardinality of the function class \(|\mathcal{F}|\) to its VC dimension.

First, observe the trivial bounds: \[ 2^{\!{VC}(\+F)} \leq \abs{\+F}\leq 2^{\abs{\Omega}}. \]

We can derive a much tighter upper bound based on the number of shattered sets.

Lemma 1 (Pajor’s Lemma) Let \(\+F\) be a class of Boolean functions on a finite set \(\Omega\). Then
\[ \abs{\+F}\leq \abs{\set{\Lambda\subseteq \Omega: \Lambda \mbox{ is shattered by }\+F}}. \]

Proof. We proceed by induction on the size of the domain \(|\Omega|\).

If \(\Omega = \emptyset\), then \(\mathcal{F}\) contains at most the empty function, so \(\abs{\mathcal{F}}\leq 1\). The empty set is (vacuously) shattered, so the right-hand side is at least \(1\).

Assume the lemma holds for \(|\Omega| = n\). Consider a domain of size \(n+1\). Pick an arbitrary element \(x_0 \in \Omega\) and let \(\Omega' = \Omega \setminus \{x_0\}\). We partition \(\mathcal{F}\) into two subclasses based on their value at \(x_0\): \[ \mathcal{F}_0 = \set{ f \in \mathcal{F} : f(x_0) = 0 }, \quad \mathcal{F}_1 = \set{ f \in \mathcal{F} : f(x_0) = 1 } \]

Clearly \(|\mathcal{F}| = |\mathcal{F}_0| + |\mathcal{F}_1|\).

We define the projection of these classes onto \(\Omega'\) as \(\mathcal{F}_0|_{\Omega'}\) and \(\mathcal{F}_1|_{\Omega'}\). By the induction hypothesis applied to \(\Omega'\): \[ |\mathcal{F}_0|_{\Omega'}| \le |\mathcal{S}(\mathcal{F}_0|_{\Omega'})| \quad \text{and} \quad |\mathcal{F}_1|_{\Omega'}| \le |\mathcal{S}(\mathcal{F}_1|_{\Omega'})|, \] where \(\mathcal{S}(\mathcal{G})\) denotes the collection of subsets of \(\Omega'\) shattered by \(\mathcal{G}\). Thus, \[ |\mathcal{F}| \le |\mathcal{S}(\mathcal{F}_0|_{\Omega'})| + |\mathcal{S}(\mathcal{F}_1|_{\Omega'})|. \]

Now, we relate these projected shattered sets back to the full set \(\Omega\). Consider a set \(\Lambda \subseteq \Omega'\):

  • Case A: \(\Lambda \in \mathcal{S}(\mathcal{F}_0|_{\Omega'})\). This means \(\mathcal{F}_0\) can produce all patterns on \(\Lambda\) (while keeping \(f(x_0)=0\)). Thus \(\Lambda\) is shattered by \(\mathcal{F}\).
  • Case B: \(\Lambda \in \mathcal{S}(\mathcal{F}_1|_{\Omega'})\). This means \(\mathcal{F}_1\) can produce all patterns on \(\Lambda\) (while keeping \(f(x_0)=1\)). Thus \(\Lambda\) is shattered by \(\mathcal{F}\).
  • The Overlap: If \(\Lambda\) is in both \(\mathcal{S}(\mathcal{F}_0|_{\Omega'})\) and \(\mathcal{S}(\mathcal{F}_1|_{\Omega'})\), then \(\mathcal{F}\) can produce all patterns on \(\Lambda\) with \(f(x_0)=0\) and all patterns on \(\Lambda\) with \(f(x_0)=1\). This implies that \(\mathcal{F}\) shatters the larger set \(\Lambda \cup \{x_0\}\).

Every \(\Lambda\) arising in Case A or Case B is a shattered subset of \(\Omega'\), and every \(\Lambda\) in the overlap contributes the additional shattered set \(\Lambda \cup \{x_0\}\), which is distinct from any subset of \(\Omega'\). Writing \(\mathcal{S}(\mathcal{F})\) for the collection of subsets of \(\Omega\) shattered by \(\mathcal{F}\), this counting gives \(\abs{\mathcal{S}(\mathcal{F})} \ge \abs{\mathcal{S}(\mathcal{F}_0|_{\Omega'})} + \abs{\mathcal{S}(\mathcal{F}_1|_{\Omega'})} \ge \abs{\mathcal{F}}\), which completes the induction.

The following Sauer-Shelah Lemma is a direct consequence of Pajor’s Lemma: every shattered set has cardinality at most \(d=\!{VC}(\+F)\), so Pajor’s Lemma bounds \(\abs{\+F}\) by the number of subsets of \(\Omega\) of size at most \(d\).

Lemma 2 (Sauer-Shelah Lemma) If \(\mathcal{F}\) is a class of functions on \(\Omega\) (where \(|\Omega|=n\)) with \(\text{VC}(\mathcal{F}) = d\), then \[ |\mathcal{F}| \le \sum_{k=0}^d \binom{n}{k} \le \left( \frac{en}{d} \right)^d \]
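As a concrete check (a small illustration, with the interval class of Example 1 in mind): an interval picks out a contiguous block of the sorted points, so interval indicators realize exactly \(n(n+1)/2+1\) labelings on \(n\) distinct points, and this matches the Sauer-Shelah bound for \(d=2\) exactly.

```python
from math import comb

# Interval indicators have VC dimension d = 2.  On n distinct points they realize
# one labeling per contiguous block of the sorted points, plus the empty labeling:
# n(n+1)/2 + 1 in total.  Compare with the Sauer-Shelah bound sum_{k<=2} C(n, k).
for n in [5, 10, 30, 100]:
    realized = n * (n + 1) // 2 + 1
    bound = sum(comb(n, k) for k in range(3))
    print(f"n = {n:3d}: realized labelings = {realized:5d}, Sauer-Shelah bound = {bound:5d}")
```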

Covering number and VC dimension

To apply Dudley’s theorem, we need to bound the covering number of the function class. Since the \(L_\infty\) covering number is often infinite for Boolean classes, we work with the \(L^2(\mu)\) norm.

Let \(\mu\) be a probability distribution on \(\Omega\). For two boolean functions \(f\) and \(g\), we define their distance as \[ d(f,g) = \norm{f-g}_{L^2(\mu)}. \] Note that for Boolean functions, \((f(x)-g(x))^2 = \*1[f(x) \neq g(x)]\). Thus, the squared distance is simply the probability of disagreement: \[ \|f-g\|_{L^2(\mu)}^2 = \Pr[\mu]{f(X) \neq g(X)}. \]

We now prove that the covering number with regard to this distance can be controlled by the VC dimension.

Theorem 1 (Covering number via VC dimension) There exist universal constants \(C, c > 0\) such that for any Boolean function class \(\mathcal{F}\) with \(\!{VC}(\mathcal{F}) = d\), any probability measure \(\mu\) on \(\Omega\), and any \(\eps\in(0,1)\), \[ {N}(\mathcal{F}, \norm{\cdot}_{L^2(\mu)}, \epsilon) \leq \tp{\frac{C}{\eps}}^{cd}. \]

If the domain \(\Omega\) is finite with size \(M\), Sauer-Shelah immediately bounds \(\abs{\+F}\), and hence the covering number, by roughly \((eM/d)^d\). However, \(M\) can be infinite.

We cannot cover \(\mathcal{F}\) on the whole infinite domain directly. Instead, we show that we can randomly sample a small number of points \(n\) such that the complexity of \(\mathcal{F}\) on these points faithfully represents the complexity of \(\mathcal{F}\) on the whole space. This is a dimension reduction argument.

Dimension reduction

Let \(\mathcal{G}\) be a maximal \(\varepsilon\)-packing of \(\mathcal{F}\) in the \(L^2(\mu)\) norm: for any distinct \(g_i, g_j \in \mathcal{G}\), \(\|g_i - g_j\|_{L^2(\mu)} > \varepsilon\). By maximality, \(\+G\) is also an \(\eps\)-net, so \(\abs{\+G}\geq N(\mathcal{F}, \norm{\cdot}_{L^2(\mu)}, \eps)\).

We sample \(n\) points \(X_1, \dots, X_n\) independently from \(\mu\). We define the empirical \(L^2\) distance on this sample \(\Omega_n = \{X_1, \dots, X_n\}\) as \[ \norm{f-g}_{L^2(\mu_n)} = \tp{\frac{1}{n}\sum_{i=1}^n \tp{f(X_i)-g(X_i)}^2}^{\frac{1}{2}}. \] We want to choose \(n\) large enough so that the separation between functions in \(\mathcal{G}\) is preserved. Standard concentration inequalities show that if we choose \(n \approx \varepsilon^{-4} \log |\mathcal{G}|\), then with high probability (say 0.99): \[ \forall g_i, g_j \in \mathcal{G}, g_i \neq g_j \implies \|g_i - g_j\|_{L^2(\mu_n)} > \varepsilon/2. \]

Fix a realization of \(\Omega_n\) where the separation property holds. Consider the projection of our packing set onto these points: \[ \mathcal{G}|_{\Omega_n} = \{ (g(X_1), \dots, g(X_n)) : g \in \mathcal{G} \}. \]

Let \(N = N(\mathcal{F}, \norm{\cdot}_{L^2(\mu)}, \eps)\). On the one hand, the functions in \(\mathcal{G}\) remain \(\eps/2\)-separated, and in particular distinct, on \(\Omega_n\), so \[ N\leq \abs{\+G} = \abs{\mathcal{G}|_{\Omega_n}} \leq \abs{\mathcal{F}|_{\Omega_n}}. \] On the other hand, let \(d_n=\!{VC}\tp{\mathcal{F}|_{\Omega_n}}\) and note that \(d_n\leq d\), since restricting a class cannot increase its VC dimension. Applying the Sauer-Shelah lemma, \[ \abs{\mathcal{F}|_{\Omega_n}} \leq \tp{\frac{en}{d_n}}^{d_n}. \] Combining the two bounds with \(n \approx \eps^{-4}\log\abs{\+G}\) and solving for \(N\) gives \(\log N\leq \+O(d_n\log \eps^{-1}) \leq \+O(d\log \eps^{-1})\), which completes the proof.
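The separation step can be illustrated with the threshold class \(\*1[x\leq a]\) under the uniform distribution on \([0,1]\), for which \(\norm{\*1[x\leq a]-\*1[x\leq b]}^2_{L^2(\mu)}=\abs{a-b}\). The sketch below (illustrative; it assumes NumPy, and the sample size is chosen to be of order \(\eps^{-4}\log\abs{\+G}\)) checks that an \(\eps\)-separated family of thresholds remains \(\eps/2\)-separated in the empirical norm.

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.2
# Thresholds spaced by eps^2: under mu = Uniform[0, 1], distinct indicators
# 1[x <= a] in this family are at L^2(mu) distance >= eps from each other.
thresholds = np.arange(0.0, 1.0 + 1e-12, eps ** 2)

n = 2_000                     # roughly eps^{-4} * log |G| for this toy example
X = rng.uniform(size=n)
vals = (X[None, :] <= thresholds[:, None]).astype(float)   # vals[k, i] = 1[X_i <= a_k]

# Pairwise empirical L^2(mu_n) distances between the packed functions.
sq = ((vals[:, None, :] - vals[None, :, :]) ** 2).mean(axis=2)
offdiag = np.sqrt(sq)[~np.eye(len(thresholds), dtype=bool)]
print(f"smallest empirical distance ≈ {offdiag.min():.3f}  (want > eps/2 = {eps / 2:.2f})")
```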

The VC law of large numbers

We are now ready to prove the main theorem, which establishes the uniform convergence rate for classes with finite VC dimension.

Theorem 2 (The VC law of large numbers) Let \(\mathcal{F}\) be a class of boolean functions with finite VC dimension, and let \(X_1,\dots,X_n\) be independent samples from a distribution \(\mu\) on \(\Omega\). Then: \[ \E{\sup_{f\in \+F} \abs{\frac{1}{n}\sum_{i=1}^n \tp{f(X_i) -\E{f}}}}\leq \+O\tp{\sqrt{\frac{\!{VC}(\+F)}{n}}}. \]

To prove this, let’s scale the quantity of interest by \(\sqrt{n}\) to match the standard Central Limit Theorem scaling. Define the empirical process \[ X_{f}^n = \sqrt{n}\cdot X_f = \frac{1}{\sqrt{n}}\sum_{i=1}^n \tp{f(X_i) -\E{f}}. \] Our goal is to bound \(\E{\sup_{f\in \+F} \abs{X_f^n}}\).

Note that if we simply sum the absolute bounds of the terms \(f(X_i) -\E{f}=\+O(1)\), the total sum would be \(\+O(n)\). However, this worst-case view ignores the stochastic cancellations where positive fluctuations are offset by negative ones. Indeed, the Central Limit Theorem tells us that this sum scales as \(\sqrt{n}\) and behaves like a Gaussian.

These cancellations between terms of opposite signs are the reason for the \(\sqrt{n}\) scaling and for the much tighter bound. In the following, we introduce a general technique to capture this cancellation effect.
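A quick simulation (illustrative; it fixes the single function \(f(x)=\*1[x\leq 1/2]\) with \(X\) uniform on \([0,1]\), and assumes NumPy) confirms this scaling: the centered sum grows like \(\sqrt{n}\) rather than \(n\).

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 200
for n in [100, 1_000, 10_000]:
    X = rng.uniform(size=(reps, n))
    f = (X <= 0.5).astype(float)                  # f(x) = 1[x <= 1/2], so E f = 1/2
    centered = np.abs(f.sum(axis=1) - n * 0.5)    # |sum_i (f(X_i) - E f)| per experiment
    print(f"n = {n:6d}:  E|sum_i (f(X_i) - E f)| ≈ {centered.mean():7.1f}"
          f"   vs  sqrt(n) = {np.sqrt(n):6.1f}")
```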

Symmetrization

Note that we are working with random variables of the form \(f(X_i) - \E{f}\). The random variable \(f(X_i)\) fluctuates around its mean \(\E{f}\). As discussed before, we want to isolate this fluctuation and get rid of the mean, so we will approximate \(f(X_i) - \E{f}\) with a symmetric random variable.

A random variable \(Z\) is symmetric if \(Z \overset{D}{=} -Z\), i.e., \(Z\) and \(-Z\) have the same distribution.

We replace \(\E{f}\) by a fresh copy of the empirical mean. Let \(Y_1, \dots, Y_n\) be an independent “ghost sample” drawn from the same distribution \(\mu\). Since \(\E{f(Y_i)} =\E{f}\), we can approximate the centered process by the difference between two empirical sums: \[ \sum_{i=1}^n f(X_i) -n\E{f} \approx \sum_{i=1}^n f(X_i) -\sum_{i=1}^n f(Y_i). \] Define \(Z_i = f(X_i) - f(Y_i)\). Crucially, the distribution of \(Z_i\) is symmetric around 0, so multiplying \(Z_i\) by a random sign \(\varepsilon_i \in \{-1, 1\}\) does not change its distribution. Therefore, \[ \sum_{i=1}^n \tp{f(X_i) - f(Y_i)} \overset{D}{=} \sum_{i=1}^n \eps_i\tp{f(X_i) - f(Y_i)}. \]

This leads to the symmetrization Lemma, which bounds the empirical process by a Rademacher process.

Lemma 3 (Symmetrization Lemma) \[ \E{\sup_{f\in \+F} \abs{X_f^n}}\leq \E{\sup_{f\in \+F} \frac{1}{\sqrt{n}}\abs{\sum_{i=1}^n \eps_i\tp{f(X_i) - f(Y_i)}}} \leq 2\E{\sup_{f\in \+F} \abs{X_f^n}}. \]

Proof. Let \(\E[X_i]{\cdot}\) and \(\E[Y_i]{\cdot}\) denote the expectations with respect to the samples \(\set{X_i}\) and \(\set{Y_i}\) respectively. For the first inequality, \[ \begin{align*} \E{\sup_{f\in \+F} \abs{X_f^n}} &= \E{\sup_{f\in \+F} \frac{1}{\sqrt{n}}\abs{\sum_{i=1}^n \tp{f(X_i) - \E{f}}}} \\ &= \E[X_i]{\sup_{f\in \+F} \frac{1}{\sqrt{n}}\abs{\sum_{i=1}^n \tp{f(X_i) - \E[Y_i]{f(Y_i)}}}}\\ &=\E[X_i]{\sup_{f\in \+F} \frac{1}{\sqrt{n}}\abs{\sum_{i=1}^n \E[Y_i]{f(X_i) - f(Y_i)}}}\\ \mr{Jensen's inequality} &\leq \E[X_i]{\E[Y_i]{\sup_{f\in \+F} \frac{1}{\sqrt{n}}\abs{\sum_{i=1}^n \tp{f(X_i) - f(Y_i)}}}}\\ \mr{symmetry of the differences} &= \E{\sup_{f\in \+F} \frac{1}{\sqrt{n}}\abs{\sum_{i=1}^n \eps_i\tp{f(X_i) - f(Y_i)}}}. \end{align*} \] For the second inequality, swapping \(X_i\) and \(Y_i\) does not change the joint distribution of the sample, so \(\sum_{i=1}^n \eps_i\tp{f(X_i)-f(Y_i)}\) has the same distribution, as a process indexed by \(f\), as \(\sum_{i=1}^n \tp{f(X_i)-f(Y_i)} = \sum_{i=1}^n\tp{f(X_i)-\E{f}} - \sum_{i=1}^n\tp{f(Y_i)-\E{f}}\); the triangle inequality then bounds the expected supremum by \(2\E{\sup_{f\in \+F}\abs{X_f^n}}\).

Via a simple application of the triangle inequality, we have the following corollary.

Corollary 1 \[ \E{\sup_{f\in \+F}\abs{\sum_{i=1}^n \tp{f(X_i) - \E{f}} }}\leq 2\E{\sup_{f\in \+F} \abs{\sum_{i=1}^n \eps_i\cdot f(X_i)}}. \]
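Corollary 1 can be probed numerically with the threshold class \(f_a(x)=\*1[x\leq a]\) and \(X_i\) uniform on \([0,1]\), so that \(\E{f_a}=a\). The sketch below (illustrative; the supremum over \(a\in\bb R\) is approximated by a grid, and NumPy is assumed) estimates both sides of the inequality.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 300
grid = np.linspace(0.0, 1.0, 201)      # thresholds a; sup over all a restricted to a grid

lhs, rhs = [], []
for _ in range(reps):
    X = rng.uniform(size=n)
    signs = rng.choice([-1.0, 1.0], size=n)                 # Rademacher variables eps_i
    vals = (X[None, :] <= grid[:, None]).astype(float)      # vals[k, i] = f_{a_k}(X_i)
    lhs.append(np.abs(vals.sum(axis=1) - n * grid).max())   # sup_a |sum_i (f_a(X_i) - a)|
    rhs.append(np.abs(vals @ signs).max())                  # sup_a |sum_i eps_i f_a(X_i)|

print(f"E sup_f |sum_i (f(X_i) - E f)|  ≈ {np.mean(lhs):6.2f}")
print(f"2 E sup_f |sum_i eps_i f(X_i)|  ≈ {2 * np.mean(rhs):6.2f}")
```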

Proof of the VC law of large numbers

Now we apply Dudley’s theorem to the symmetrized process. By symmetrization (Corollary 1), \[ \E{\sup_{f\in \+F} \abs{\frac{1}{n}\sum_{i=1}^n \tp{f(X_i) -\E{f}}}} \leq \frac{2}{\sqrt{n}} \E{\sup_{f\in \+F}\frac{1}{\sqrt{n}} \abs{\sum_{i=1}^n \eps_i\cdot f(X_i)}}. \tag 1 \]

Condition on the samples \(X_1, \dots, X_n\). Consider the process indexed by \(\mathcal{F}\) defined by \[ G_f = \frac{1}{\sqrt{n}} \abs{\sum_{i=1}^n \varepsilon_i f(X_i)}. \] With \(X_i\) fixed, the only randomness comes from the Rademacher variables \(\varepsilon_i\). The increment satisfies \[ G_f - G_g \leq \frac{1}{\sqrt{n}} \abs{\sum_{i=1}^n \varepsilon_i (f(X_i) - g(X_i))}. \] Since the \(\varepsilon_i\) are independent 1-sub-Gaussian variables, \(\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i (f(X_i) - g(X_i))\) is sub-Gaussian with variance parameter \[ \sigma^2 = \sum_{i=1}^n \left( \frac{f(X_i) - g(X_i)}{\sqrt{n}} \right)^2 = \frac{1}{n} \sum_{i=1}^n (f(X_i) - g(X_i))^2 = \|f-g\|_{L^2(\mu_n)}^2, \] so the process has sub-Gaussian increments with respect to the empirical \(L^2(\mu_n)\) metric.

We can now apply Dudley’s integral bound. Note that since the functions are boolean, the diameter of the space is at most 1. Therefore, using Theorem 1 with the empirical measure \(\mu_n\), \[ \begin{align*} \E{\sup_{f\in \+F}\frac{1}{\sqrt{n}} \abs{\sum_{i=1}^n \eps_i\cdot f(X_i)} \mid X_1,\dots,X_n} &\lesssim \int_0^1 \sqrt{\log N(\+F,\norm{\cdot}_{L^2(\mu_n)},\eps)} \ \dd \eps \\ &\lesssim \int_0^1 \sqrt{\!{VC}(\+F)\log \frac{2}{\eps}}\ \dd \eps\\ &\lesssim \sqrt{\!{VC}(\+F)}, \end{align*} \] where the last step uses that \(\int_0^1 \sqrt{\log (2/\eps)}\ \dd \eps\) is a finite constant. Taking the expectation over \(X_1,\dots,X_n\) and plugging the bound into \((1)\) completes the proof of the VC law of large numbers.

Applications

This VC LLN can be applied to prove the Glivenko-Cantelli theorem directly, since the VC dimension of the function class \[ \mathcal{F} = \set{f: x\mapsto \*1[x\leq a]\ |\ a\in \bb R} \] is 1.

Another application is the geometric discrepancy. Consider throwing \(n\) random points uniformly into the unit square \([0, 1]^2\). We want to know how well these discrete points represent the continuous uniform distribution. Specifically, if we draw any circle \(\mathcal{C}\) inside the square, is the fraction of points falling inside \(\mathcal{C}\) close to the actual area of \(\mathcal{C}\)?

We can prove that the class of all indicator functions of circles in \(\mathbb{R}^2\) has a VC dimension of 3 (you can try to prove this). Therefore, applying the VC LLN, the empirical fraction is within \(\+O(1/\sqrt{n})\) of the true area simultaneously for all circles, in expectation and hence with high probability by Markov’s inequality. That is, \[ \frac{\text{number of points in } \mathcal{C}}{n} \approx \text{Area}(\mathcal{C}) \pm \frac{1}{\sqrt{n}}. \]
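This can be illustrated with a small Monte Carlo experiment (a rough sketch assuming NumPy; the supremum over all circles is approximated by a finite random family of circles inside the square, so the reported discrepancy is a slight underestimate).

```python
import numpy as np

rng = np.random.default_rng(5)

def max_circle_discrepancy(points, n_circles=500, rng=rng):
    """Approximate sup over circles of |empirical fraction - area|, using a
    random family of circles contained in the unit square."""
    worst = 0.0
    for _ in range(n_circles):
        c = rng.uniform(0.2, 0.8, size=2)   # center, kept away from the boundary
        r = rng.uniform(0.05, 0.2)          # radius, so the circle fits in the square
        frac = (((points - c) ** 2).sum(axis=1) <= r ** 2).mean()
        worst = max(worst, abs(frac - np.pi * r ** 2))
    return worst

for n in [200, 2_000, 20_000]:
    discs = [max_circle_discrepancy(rng.uniform(size=(n, 2))) for _ in range(5)]
    print(f"n = {n:6d}:  discrepancy ≈ {np.mean(discs):.4f}   (1/sqrt(n) = {n ** -0.5:.4f})")
```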

Empirical risk minimization

We now turn to a central problem in machine learning: using finite data to learn a general rule.

Consider a map \(T:\Omega\to \bb R\). We do not know \(T\), but we observe its values on a training set of \(n\) points \(X_1,\dots,X_n\) drawn independently from some distribution \(\mu\) on \(\Omega\). So, our training data is the set of labeled pairs \(\set{(X_i,T(X_i))}_{i\in[n]}\).

Our goal is to find a function \(f\) from a candidate set \(\mathcal{F}\) (the hypothesis class) that approximates \(T\) well. We measure the quality of \(f\) using the true risk \[ R(f)\defeq \E[X\sim \mu]{\tp{f(X)-T(X)}^2}. \]

Furthermore, if \(T\) and \(f\) are boolean functions, we have \(R(f)=\Pr{f(X)\neq T(X)}\). Ideally, we want to find the best candidate in our class: \[ f^* =\arg\min_{f\in \+F} R(f). \]

However, we cannot compute \(R(f)\) because we do not know \(\mu\) or \(T\). Instead, we minimize the empirical risk \[ R_n(f)\defeq \frac{1}{n}\sum_{i=1}^n \tp{f(X_i)-T(X_i)}^2. \] The strategy of picking \(f_n^* = \arg\min_{f \in \mathcal{F}} R_n(f)\) is called empirical risk minimization (ERM).

A key question arises: Does having low training error \(R_n(f_n^*)\) guarantee low true error \(R(f_n^*)\)? The answer depends on the complexity of \(\mathcal{F}\).

Theorem 3 (The VC generalization bound) If \(T\) and the functions in \(\+F\) are boolean, then there exists a universal constant \(C\) such that \[ \E{R(f^*_n)}\leq R(f^*) + C\sqrt{\frac{\!{VC}\tp{\+F}}{n}}. \]

Note that this bound reflects a trade-off:

  • If \(\mathcal{F}\) is too simple (low VC dim), it might not contain any function that approximates \(T\) well. \(R(f^*)\) will be large (Underfitting).
  • If \(\mathcal{F}\) is too complex (high VC dim), we can fit the training data easily, but the generalization gap grows. We might pick a function that memorizes noise rather than learning the structure (Overfitting).

Optimal performance requires choosing an \(\mathcal{F}\) with moderate complexity.

Now we prove this theorem. Let \(\eps=\sup_{f\in \+F}\abs{R_n(f) - R(f)}\). We can bound the risk of our learned estimator \(f_n^*\) as follows: \[ R(f^*_n) \leq R_n(f_n^*) + \eps \leq R_n(f^*) + \eps \leq R(f^*) + \eps +\eps = R(f^*) + 2\eps, \] where the middle step uses that \(f_n^*\) minimizes the empirical risk. Rearranging gives the “excess risk” bound \[ R(f^*_n) - R(f^*) \leq 2\eps. \] Now we must bound the expectation of this supremum. Let \(\+L=\set{\ell_f(x)=\tp{f(x)-T(x)}^2\mid f\in \+F}\) be the class of loss functions. Since \(f\) and \(T\) are boolean, \(\ell_f(x) = \*1[f(x)\neq T(x)]\), i.e. \(\ell_f = f\oplus T\) pointwise; flipping labels by the fixed function \(T\) preserves shattering, so \(\!{VC}(\+L)=\!{VC}(\+F)\). Applying the VC LLN, we have \[ \begin{align*} \E{\eps} &= \E{\sup_{f\in \+F}\abs{R_n(f) - R(f)}}\\ &= \E{\sup_{\ell\in \+L} \frac{1}{n} \abs{\sum_{i=1}^n \tp{\ell(X_i) - \E{\ell(X)}}}}\\ &\lesssim \sqrt{\frac{\!{VC}(\+L)}{n}}\\ &= \sqrt{\frac{\!{VC}(\+F)}{n}}. \end{align*} \] This completes the proof.

Example 5 (Learning circles) Suppose we want to classify points in the unit square \([0,1]^2\). The points are labeled \(1\) if they lie inside an unknown fixed circle \(\mathcal{C}_{\!{true}}\), and \(0\) otherwise.

We receive \(n\) labeled samples \((X_i, Y_i)\), where the \(X_i\) are uniform on the square and \(Y_i = \*1[X_i \in \+C_{\!{true}}]\), and then search for the circle \(\mathcal{C}\) that minimizes the number of misclassifications on the training set.

Since the true circle \(\+C_{\!{true}}\) gives zero error, and the VC dimension of circles is at most \(3\), the VC generalization bound tells us that the expected risk of our learned circle is at most \(\+O\tp{\frac{1}{\sqrt{n}}}\). This means that simply minimizing training error allows us to learn the concept with a misclassification probability that vanishes at rate \(\+O\tp{\frac{1}{\sqrt{n}}}\).
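To make the example concrete, here is a small experiment (purely illustrative: the “unknown” circle is made up, and a crude random search over candidate circles stands in for exact empirical risk minimization; NumPy is assumed).

```python
import numpy as np

rng = np.random.default_rng(6)
true_c, true_r = np.array([0.45, 0.55]), 0.25        # hypothetical "unknown" circle

def labels(points, c, r):
    return (((points - c) ** 2).sum(axis=1) <= r ** 2).astype(int)

def erm_circle(X, Y, n_candidates=2_000, rng=rng):
    """Approximate ERM: among random candidate circles, keep the one with the
    fewest misclassifications on the training set."""
    best, best_err = (None, None), np.inf
    for _ in range(n_candidates):
        c, r = rng.uniform(0.0, 1.0, size=2), rng.uniform(0.0, 0.5)
        err = np.mean(labels(X, c, r) != Y)
        if err < best_err:
            best, best_err = (c, r), err
    return best, best_err

X_test = rng.uniform(size=(50_000, 2))
Y_test = labels(X_test, true_c, true_r)
for n in [100, 1_000, 10_000]:
    X = rng.uniform(size=(n, 2))
    (c_hat, r_hat), train_err = erm_circle(X, labels(X, true_c, true_r))
    risk = np.mean(labels(X_test, c_hat, r_hat) != Y_test)
    print(f"n = {n:5d}:  training error = {train_err:.4f},  estimated risk = {risk:.4f}")
```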