Lecture 11: Supremum of random variables, Covering and Packing
Supremum of random variables
In previous lectures, we focused on the concentration of the sum of a collection of random variables. In this lecture, we shift our focus to the supremum of a collection. Let \(T\) be an index set, and consider a collection of random variables \(\{X_t\}_{t\in T}\) defined on the same probability space. Our goal is to bound the quantity \(\E{\sup_{t\in T} X_t}\).
The exposition in this lecture is mainly based on (Van Handel 2014).
Finite index sets
We begin with the case where the index set \(T\) is finite.
A trivial bound is \(\sup_{t\in T} X_t \leq \sum_{t\in T} |X_t|\). Taking expectations, this yields \[ \E{\sup_{t\in T} X_t} \leq \abs{T}\cdot \sup_{t\in T} \E{\abs{X_t}}. \] This linear dependence on \(|T|\) is usually too loose. We can improve it by considering higher-order moments. Assume that for each \(t\in T\), \(\E{\abs{X_t}^p}<\infty\) for some constant \(p\geq 1\). Using Jensen’s inequality, \[ \begin{align*} \E{\sup_{t\in T} X_t}&\leq \E{\tp{\sup_{t\in T} \abs{X_t}^p}^{\frac{1}{p}}}\\ &\leq \E{\sup_{t\in T } \abs{X_t}^p}^{\frac{1}{p}} \\ &\leq \E{\sum_{t\in T } \abs{X_t}^p}^{\frac{1}{p}} \\ &\leq \abs{T}^{\frac{1}{p}}\cdot \sup_{t\in T} \E{\abs{X_t}^p}^{\frac{1}{p}}. \end{align*} \] This calculation reveals that by choosing a larger \(p\), the dependence on \(|T|\) in the upper bound becomes milder. This suggests considering the exponential function. Assume that for all \(\lambda > 0\) and all \(t\in T\), the variables satisfy the exponential bound \(\E{e^{\lambda X_t}}\leq e^{\psi(\lambda)}\) for some function \(\psi\). We can bound the expectation of the supremum as follows: \[ \begin{align*} \E{\sup_{t\in T} X_t} &= \lambda^{-1}\cdot \E{\log e^{\lambda \cdot \sup_{t\in T} X_t}}\\ \mr{Jensen's inequality} &\leq \lambda^{-1}\cdot \log \E{e^{\lambda \cdot \sup_{t\in T} X_t}}\\ \mr{$e^{\lambda x}$ is increasing} &= \lambda^{-1}\cdot \log \E{\sup_{t\in T} e^{\lambda X_t}}\\ &\leq \lambda^{-1}\cdot \log \tp{ \abs{T}\cdot \sup_{t\in T} \E{e^{\lambda X_t}}}\\ &\leq \lambda^{-1}\cdot \tp{\log \abs{T} + \psi(\lambda)}. \end{align*} \]
Suppose each \(X_t\) is centered \(\sigma^2\)-sub-Gaussian, so we may take \(\psi(\lambda) = \frac{\sigma^2 \lambda^2}{2}\). The bound becomes \[ \begin{align*} \E{\sup_{t\in T} X_t} &\leq \frac{\log \abs{T}}{\lambda} + \frac{\sigma^2}{2}\lambda. \end{align*} \] To obtain the tightest bound, we minimize the right-hand side with respect to \(\lambda\). The optimal choice is \(\lambda =\sqrt{\frac{2\log \abs{T}}{\sigma^2}}\), which yields the inequality \[ \begin{align*} \E{\sup_{t\in T} X_t} &\leq \sigma\sqrt{2\log \abs{T}}.\tag 1 \end{align*} \]
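As a quick numerical sanity check, the following Python sketch estimates \(\E{\sup_{t\in T} X_t}\) for independent \(\+N(0,\sigma^2)\) variables by Monte Carlo and compares it with the bound in Equation (1); the sample sizes and number of trials are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of E[max of |T| i.i.d. N(0, sigma^2)] against sigma * sqrt(2 log |T|).
# The parameters below (sigma, sizes, number of trials) are arbitrary illustrative choices.
rng = np.random.default_rng(0)
sigma, trials = 1.0, 1_000

for size in (10, 100, 1_000, 10_000):
    samples = rng.normal(0.0, sigma, size=(trials, size))
    empirical = samples.max(axis=1).mean()        # estimate of E[sup_t X_t]
    bound = sigma * np.sqrt(2 * np.log(size))     # right-hand side of Equation (1)
    print(f"|T| = {size:6d}   E[max] ~ {empirical:.3f}   bound = {bound:.3f}")
```

In such experiments the empirical value stays below \(\sigma\sqrt{2\log\abs{T}}\) and gets relatively closer to it as \(\abs{T}\) grows.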
Tail probabilities via the union bound
Alternatively, we can derive the expectation bound by integrating the tail probabilities. Using a simple union bound, for any \(a>0\), \[ \Pr{\sup_{t\in T} X_t \geq a}\leq \Pr{\exists t\in T,\ X_t \geq a}\leq \sum_{t\in T}\Pr{X_t \geq a}. \] From the sub-Gaussian property, \(\Pr{X_t \geq a}\leq e^{-\frac{a^2}{2\sigma^2}}\). Therefore, \[ \Pr{\sup_{t\in T} X_t \geq a}\leq \abs{T}\cdot e^{-\frac{a^2}{2\sigma^2}}. \] Let \(Z=\sup_{t\in T} X_t\) and \(Z_{+} = 0\vee Z\). We use the integral identity for the expectation and split the integral at a threshold \(A>0\): \[ \begin{align*} \E{Z}&\leq \E{Z_+} =\int_0^{\infty} \Pr{Z_+\ge x}\dd x \\ &\leq A + \int_A^{\infty} \abs{T}\cdot e^{-\frac{x^2}{2\sigma^2}}\dd x \\ \mr{Gaussian tail bound} &\leq A + \abs{T}\cdot \frac{\sigma^2}{A}\cdot e^{-\frac{A^2}{2\sigma^2}} \\ \mr{choose $A=\sqrt{2\sigma^2 \log \abs{T}}$}&= \sqrt{2\sigma^2 \log \abs{T}} + \frac{\sigma}{\sqrt{2\log \abs{T}}}. \end{align*} \] This matches Equation (1) up to a lower-order additive term. Note that we derived this bound using the union bound, which is tight exactly in the worst-case scenario where the events are disjoint. Recall that for events \(A\) and \(B\), \(\Pr{A \cup B} = \Pr{A} + \Pr{B} - \Pr{A \cap B}\); the union bound simply drops the intersection term. If the \(\set{X_t}\) are highly correlated, the union bound is very loose. But if the \(X_t\) are independent, the intersection terms are small, so the union bound is relatively tight.
This leads to a crucial intuition: independence is almost the worst case for the supremum. This is in contrast with sums of random variables, where independence is precisely what makes concentration possible.
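To illustrate this intuition, the following sketch compares the empirical \(\E{\max_{t} X_t}\) for independent standard Gaussians against strongly correlated ones built from a shared component; the correlation level \(\rho\), the size, and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

# Compare E[max] of |T| standard Gaussians: independent vs. strongly correlated.
# Correlated case: X_t = sqrt(rho) * Z + sqrt(1 - rho) * G_t, so Corr(X_s, X_t) = rho
# while each X_t is still N(0, 1).
rng = np.random.default_rng(1)
size, trials, rho = 1_000, 5_000, 0.99

indep = rng.normal(size=(trials, size)).max(axis=1).mean()

common = rng.normal(size=(trials, 1))       # shared component Z
private = rng.normal(size=(trials, size))   # individual components G_t
corr = (np.sqrt(rho) * common + np.sqrt(1 - rho) * private).max(axis=1).mean()

print(f"independent:  E[max] ~ {indep:.3f}")   # below the bound sqrt(2 log 1000) ~ 3.72
print(f"rho = {rho}:   E[max] ~ {corr:.3f}")   # much smaller despite identical marginals
```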
Infinite index sets
As our analysis of the finite case demonstrated, in the worst-case scenario, \(\E{\sup X_t}\) grows to infinity as \(|T|\) increases. This leads to a natural question: under what conditions is \(\E{\sup X_t}\) bounded for an infinite index set \(T\)?
When correlations exist, the “effective” number of variables is much smaller than the cardinality \(|T|\). To handle infinite sets (e.g., \(T\) is a continuous interval or a subset of \(\mathbb{R}^n\)), we cannot simply count points. Instead, we must quantify the “effective size” of \(T\) based on the geometry induced by the variables.
We regard the index set \(T\) as a metric space \((T, d)\). Our philosophy is as follows: if the variables \(\{X_t\}_{t\in T}\) are “sufficiently continuous” with respect to the metric \(d\), then the supremum should be determined by the “complexity” of \(T\).
If this is true, we can select a small set of representative points \(\+N \subset T\), bound the supremum over \(\+N\), and separately bound the approximation error incurred by replacing each remaining point of \(T\) with its nearest neighbor in \(\+N\).
First, we formalize “sufficiently continuous.”
Definition 1 Let \((T, d)\) be a metric space. We say the random variables \(\{X_t\}_{t\in T}\) are \(C\)-Lipschitz with respect to \(d\) if for all \(t,s\in T\), \[ \abs{X_s-X_t}\leq C\cdot d(s,t). \] Here, \(C\) can be a non-negative random variable.
Next, we define the “representative” points using the notion of \(\varepsilon\)-nets.
Definition 2 (\(\eps\)-net and covering number) A subset \(\mathcal{N} \subseteq T\) is called an \(\varepsilon\)-net for \((T,d)\) if every point in \(T\) is within distance \(\varepsilon\) of some point in \(\mathcal{N}\): \[ \forall t\in T,\ \exists \pi(t)\in \+N,\ \mbox{s.t. } d(t,\pi(t)) \leq \eps. \] The covering number, denoted as \(N(T,d,\varepsilon)\), is the smallest possible cardinality of an \(\varepsilon\)-net.
Geometrically, saying \(\mathcal{N}\) is an \(\varepsilon\)-net is equivalent to saying that \(T\) is covered by the union of balls of radius \(\varepsilon\) centered at the points of \(\mathcal{N}\), i.e., \(T\subseteq \bigcup_{a\in \+N}B_a(\eps)\). In what follows, we assume \(N(T,d,\varepsilon)\) is finite for the cases we care about.
We can now prove that if the random variables are sub-Gaussian and are Lipschitz w.r.t. \(d\), the expected supremum is bounded by a trade-off between the scale \(\varepsilon\) and the complexity \(N(T, d, \varepsilon)\).
Lemma 1 Suppose the \(\{X_t\}_{t\in T}\) are \(C\)-Lipschitz and that for every fixed \(t \in T\), the variable \(X_t\) is \(\sigma^2\)-sub-Gaussian. Then \[ \E{\sup_{t\in T} X_t}\leq \inf_{\eps>0} \tp{\eps \E{C} + \sqrt{2\sigma^2\cdot \log N(T,d,\eps)}}. \]
Proof. Let \(\mathcal{N}\) be a minimal \(\varepsilon\)-net of \(T\), so \(|\mathcal{N}| = N(T, d, \varepsilon)\). For any \(t \in T\), let \(\pi(t)\) be the element in \(\mathcal{N}\) closest to \(t\). Then we have \[ \sup_{t\in T} X_t \leq \sup_{t\in T} \tp{X_t-X_{\pi(t)}} + \sup_{t\in \+N} X_t \leq C\cdot \eps + \sup_{t\in \+N} X_t. \] Since \(\abs{\+N} = N(T,d,\eps)\), Equation (1) from the previous section gives \(\E{\sup_{t\in \+N} X_t}\leq \sqrt{2\sigma^2\cdot \log N(T,d,\eps)}\). Therefore, \[ \E{\sup_{t\in T} X_t} \leq \eps \E{C}+ \sqrt{2\sigma^2\cdot \log N(T,d,\eps)}. \]
Since this holds for any \(\varepsilon > 0\), we can take the infimum over \(\varepsilon\) to get the tightest bound.
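As a small illustration of how the trade-off in Lemma 1 is optimized, the sketch below evaluates the bound on a grid of \(\eps\) values for a toy instance: \(T=[0,1]\) with the usual metric, where \(N(T,d,\eps)=\lceil 1/(2\eps)\rceil\); the values \(\E{C}=10\) and \(\sigma=1\) are made up for illustration.

```python
import numpy as np

# Evaluate the bound of Lemma 1 for a toy instance: T = [0, 1] with the usual
# metric, where N(T, d, eps) = ceil(1 / (2 * eps)). The values E[C] = 10 and
# sigma = 1 are made up for illustration.
expected_C, sigma = 10.0, 1.0

def lemma1_bound(eps: float) -> float:
    covering = np.ceil(1.0 / (2.0 * eps))      # covering number of [0, 1] at scale eps
    return eps * expected_C + np.sqrt(2 * sigma**2 * np.log(covering))

eps_grid = np.logspace(-4, 0, 2000)            # grid approximating the infimum over eps > 0
values = [lemma1_bound(e) for e in eps_grid]
best = int(np.argmin(values))
print(f"eps* ~ {eps_grid[best]:.4f},  bound ~ {values[best]:.3f}")
```

The first term grows linearly in \(\eps\) while the second shrinks only logarithmically, so the infimum is attained at a moderately small \(\eps\).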
Application to the norm of random matrices
We now apply the above techniques to estimate the expected operator norm of a random matrix.
Consider an \(n\times m\) random matrix \(M\) where each entry is independent and identically distributed as \(M_{i,j}\sim \+N(0,\sigma^2)\). We consider its operator norm \[ \norm{M}_{\!{op}} \defeq \sup_{v\in \bb R^n,w\in \bb R^m\atop \norm{v}\leq 1, \norm{w}\leq 1} \inner{v}{Mw}. \] Our goal is to bound \(\E{\norm{M}_{\!{op}}}\). For simplicity, we will denote the operator norm as \(\norm{M}\) when the context is clear.
We define the index set as the product of the unit balls: \(T = B_0^n(1)\times B^m_0(1)\), where \(B_0^d(1)\) denotes the unit Euclidean ball centered at \(0\) in \(\mathbb{R}^d\). For each pair of vectors \((v,w)\in T\), we define the random variable: \[ X_{v,w}=v^{\top} M w = \sum_{i=1}^n \sum_{j=1}^m v_i w_j M_{i,j}. \] The expected operator norm is exactly the expectation of the supremum of \(\set{X_{v,w}}_{v,w\in T}\): \[ \E{\norm{M}} = \E{\sup_{v,w\in T} X_{v,w}}. \]
Since \(X_{v,w}\) is a linear combination of independent Gaussian variables, it is itself Gaussian with mean 0 and variance \[ \Var{X_{v,w}} = \sum_{i,j} (v_i w_j)^2 \sigma^2 = \sigma^2 \left(\sum_i v_i^2\right) \left(\sum_j w_j^2\right) = \sigma^2 \|v\|^2 \|w\|^2. \] For \((v,w) \in T\), we have \(\|v\|\le 1\) and \(\|w\|\le 1\), so the variance is at most \(\sigma^2\). Thus, \(X_{v,w}\) is \(\sigma^2\)-sub-Gaussian.
Next, we establish the Lipschitz continuity. We define the metric \(d\) on the product space \(T\) as the sum of the Euclidean distances: \[ d\tp{(v,w) , (v',w')} = \norm{v-v'} + \norm{w-w'}. \] Then we have \[ \begin{align*} \abs{X_{v,w} - X_{v',w'} } &= \abs{\inner{v}{Mw} - \inner{v'}{Mw'}}\\ &=\abs{\inner{v-v'}{Mw} - \inner{v'}{M(w'-w)}}\\ &\leq \abs{\inner{v-v'}{Mw}} + \abs{\inner{v'}{M(w'-w)}}\\ &\leq \norm{M}\cdot \norm{v-v'}\cdot \norm{w} + \norm{M}\cdot \norm{w-w'}\cdot \norm{v'}\\ &\leq \norm{M}\cdot \tp{\norm{v-v'} + \norm{w-w'}}. \end{align*} \] This shows that the random variables \(\{X_{v,w}\}\) are \(\norm{M}\)-Lipschitz with respect to \((T,d)\).
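The following sketch numerically checks this Lipschitz bound on randomly sampled pairs from the unit balls; the dimensions \(n,m\), the noise level \(\sigma\), and the number of samples are arbitrary illustrative choices.

```python
import numpy as np

# Verify |X_{v,w} - X_{v',w'}| <= ||M||_op * (||v - v'|| + ||w - w'||) on random
# points of the unit balls. The sizes n, m, sigma, and sample count are arbitrary.
rng = np.random.default_rng(2)
n, m, sigma = 30, 20, 1.0
M = rng.normal(0.0, sigma, size=(n, m))
op_norm = np.linalg.norm(M, ord=2)             # spectral (operator) norm of M

def random_unit_ball(dim: int) -> np.ndarray:
    # uniform point in the unit Euclidean ball: uniform direction, radius ~ U^(1/dim)
    x = rng.normal(size=dim)
    return x / np.linalg.norm(x) * rng.uniform() ** (1.0 / dim)

for _ in range(10_000):
    v, vp = random_unit_ball(n), random_unit_ball(n)
    w, wp = random_unit_ball(m), random_unit_ball(m)
    lhs = abs(v @ M @ w - vp @ M @ wp)
    rhs = op_norm * (np.linalg.norm(v - vp) + np.linalg.norm(w - wp))
    assert lhs <= rhs + 1e-9
print("Lipschitz inequality held on all sampled pairs.")
```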
We now apply the lemma from the previous sub-section. For any \(\eps>0\), \[ \E{\norm{M}} \leq \eps\cdot \E{\norm{M}} + \sqrt{2\sigma^2\cdot \log N(T,d,\eps)}. \] Rearranging the inequality, we get \[ \E{\norm{M}} \leq \inf_{\eps\in (0,1)} \frac{1}{1-\eps}\cdot \sqrt{2\sigma^2\cdot \log N(T,d,\eps)}. \] It remains to calculate the covering number \(N(T,d,\varepsilon)\). Since \(T\) is a product space \(B_0^n(1) \times B_0^m(1)\) equipped with the additive metric, an \(\varepsilon\)-net for \(T\) can be constructed from the product of \(\frac{\eps}{2}\)-nets for the individual balls. Therefore, it suffices to estimate the covering numbers of the Euclidean unit balls.
Covering and packing
To calculate the covering number \(N(T, d, \varepsilon)\), it is often easier to work with its dual concept, packing.
Definition 3 (\(\eps\)-packing and packing number) A subset \(\+P\subseteq T\) is called an \(\eps\)-packing of \((T,d)\) if for every \(t,t'\in \+P\) and \(t\neq t'\), \(d(t,t')>\eps\). The packing number is defined as \[ P(T,d,\eps) = \sup\set{\abs{\+P}: \+P\mbox{ is an }\eps\mbox{-packing of }(T,d)}. \]
Intuitively, covering asks “how many balls do I need to cover the set?”, while packing asks “how many disjoint balls can I fit inside the set?”. These two quantities are closely related by the following inequalities.
Lemma 2 \[ P(T,d,2\eps)\leq N(T,d,\eps) \leq P(T,d,\eps). \]
Proof. For the first inequality, consider an arbitrary \(2\eps\)-packing \(\+P\) and an arbitrary \(\eps\)-net \(\+N\). From the definition of the \(\eps\)-net, for any \(p\in \+P\), there exists at least one \(t\in \+N\) such that \(t\in B_p(\eps)\). Moreover, for any distinct \(p,p'\in\+P\) we have \(d(p,p')>2\eps\), so \(B_p(\eps)\cap B_{p'}(\eps)=\emptyset\). Hence distinct points of \(\+P\) are assigned distinct points of \(\+N\), which gives \(\abs{\+P}\leq \abs{\+N}\) and therefore \(P(T,d,2\eps)\leq N(T,d,\eps)\).
For the second inequality, let \(\mathcal{P}\) be a maximal \(\varepsilon\)-packing (i.e., no more points can be added to \(\mathcal{P}\) while maintaining the separation property). We claim that \(\mathcal{P}\) is also an \(\varepsilon\)-net. Suppose this is not the case. Then there exists some \(x \in T\) such that \(d(x, p) > \varepsilon\) for all \(p \in \mathcal{P}\). But this implies we could add \(x\) to \(\mathcal{P}\) to form a larger packing \(\mathcal{P} \cup \{x\}\), which contradicts the maximality of \(\mathcal{P}\).
We can now estimate the covering number of the unit Euclidean ball \(B_0^n(1)\) by simply comparing the volumes.
Lemma 3 For \(\eps\geq 1\), \[ N\tp{B_0^n(1),\norm{\cdot},\eps}=1 \] and for \(0<\eps<1\), \[ \tp{\frac{1}{\eps}}^{n}\leq N\tp{B_0^n(1),\norm{\cdot},\eps} \leq \tp{\frac{3}{\eps}}^n. \]
Proof. The case \(\varepsilon \geq 1\) is trivial because a single ball of radius \(\varepsilon\) centered at the origin covers \(B_0^n(1)\). We focus on \(\varepsilon < 1\).
For the upper bound, using the previous lemma, it suffices to bound the packing number \(P\tp{B_0^n(1),\norm{\cdot},\eps}\). Let \(\mathcal{P}\) be an \(\varepsilon\)-packing of the unit ball. Consider the open balls of radius \(\varepsilon/2\) centered at points in \(\mathcal{P}\). These balls are disjoint and contained in \(B_0^n(1+\eps/2)\). Therefore, \[ P\tp{B_0^n(1),\norm{\cdot},\eps}\leq \frac{\!{vol}\tp{B_0^n(1+\eps/2)}}{\!{vol}\tp{B_0^n(\eps/2)}} = \tp{\frac{1+\eps/2}{\eps/2}}^n \leq \tp{\frac{3}{\eps}}^n. \] For the lower bound, let \(\mathcal{N}\) be an \(\varepsilon\)-net of \(B_0^n(1)\). By definition, the unit ball is covered by the union of balls centered at \(\mathcal{N}\): \[ B_0^n(1) \subseteq \bigcup_{y \in \mathcal{N}} B(y, \varepsilon). \] Taking volumes on both sides gives \[ \!{vol}\tp{B_0^n(1)} \leq N\tp{B_0^n(1),\norm{\cdot},\eps} \cdot \!{vol}\tp{B_0^n(\eps)}. \] This indicates that \(N\tp{B_0^n(1),\norm{\cdot},\eps} \geq \tp{\frac{1}{\eps}}^n\).
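The next sketch makes Lemmas 2 and 3 concrete in dimension \(n=2\): it greedily builds a maximal \(\eps\)-packing of a discretized unit disk, checks that the result is also an \(\eps\)-net (as in the proof of Lemma 2), and compares its size with the volume bounds of Lemma 3; the grid resolution and \(\eps\) are arbitrary illustrative choices.

```python
import numpy as np

# Greedy maximal eps-packing of a discretized unit disk in R^2. By the proof of
# Lemma 2 a maximal packing is also an eps-net, and by Lemmas 2 and 3 its size
# lies between (1/eps)^2 and (3/eps)^2. Grid resolution and eps are arbitrary.
eps, step = 0.2, 0.02
grid = np.mgrid[-1:1:step, -1:1:step].reshape(2, -1).T
ball = grid[np.linalg.norm(grid, axis=1) <= 1.0]   # discretized unit ball

packing = []
for point in ball:
    # keep the point only if it is at distance > eps from every point chosen so far
    if not packing or np.linalg.norm(np.array(packing) - point, axis=1).min() > eps:
        packing.append(point)
packing = np.array(packing)

# maximality => eps-net: every (discretized) point is within eps of the packing
dists = np.linalg.norm(ball[:, None, :] - packing[None, :, :], axis=2)
print("maximal packing is an eps-net:", bool(dists.min(axis=1).max() <= eps))
print(f"packing size = {len(packing)},  "
      f"(1/eps)^2 = {(1 / eps) ** 2:.0f},  (3/eps)^2 = {(3 / eps) ** 2:.0f}")
```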
We now present two applications of covering and packing numbers.
Bounding \(\E{\norm{M}}\)
Now we return to bound \(\E{\norm{M}}\). Recall our previous bound: \[ \E{\norm{M}} \leq \inf_{\eps\in(0,1)} \frac{1}{1-\eps}\cdot \sqrt{2\sigma^2\cdot \log N(T,d,\eps)}. \] To calculate \(N(T, d, \varepsilon)\), we construct an \(\eps\)-net for the product space \(T = B_0^n(1) \times B_0^m(1)\). Let \(\mathcal{N}_1\) be an \(\frac{\eps}{2}\)-net for \(B_0^n(1)\) and \(\mathcal{N}_2\) be an \(\frac{\eps}{2}\)-net for \(B_0^m(1)\). For any point \((v,w) \in T\), there exist approximations \(\pi(v) \in \mathcal{N}_1\) and \(\pi(w) \in \mathcal{N}_2\) such that \(\|v-\pi(v)\| \leq \frac{\eps}{2}\) and \(\|w-\pi(w)\| \leq \frac{\eps}{2}\). Consequently, \[ d\tp{(v,w),\pi((v,w))} = \norm{v-\pi(v)} + \norm{w-\pi(w)} \leq \eps. \] Thus, the product set \(\mathcal{N} = \mathcal{N}_1 \times \mathcal{N}_2\) is an \(\varepsilon\)-net for \((T,d)\). The size of this net is bounded by the product of the individual covering numbers \[ N(T,d,\eps)\leq \left(\frac{3}{\varepsilon/2}\right)^n \cdot \left(\frac{3}{\varepsilon/2}\right)^m = \tp{\frac{6}{\eps}}^{n+m}. \] Choosing a concrete value for \(\varepsilon\), say \(\varepsilon = 1/2\), this gives the upper bound \(\E{\norm{M}}\leq \+O\tp{\sqrt{\sigma^2(n+m)}}\).
On the other hand, we can show that this upper bound is asymptotically tight. Let \(e_1 = (1, 0, \dots, 0) \in \mathbb{R}^m\). By the definition of the operator norm: \[ \norm{M} \geq \frac{\norm{M e_1}}{\norm{e_1}} = \norm{M e_1}. \] Thus, \[ \frac{\E{\norm{M}}}{\sqrt{n}}\geq \E{\frac{\norm{M e_1}}{\sqrt{n}}} = \E{\tp{\frac{\sum_{i=1}^n M_{i,1}^2}{n}}^{\frac{1}{2}}}. \] By the law of large numbers, \[ \frac{1}{n}\sum_{i=1}^n M_{i,1}^2 \xrightarrow{a.s.} \E{M_{1,1}^2} = \sigma^2. \] Since the square root function is continuous, we also have \[ \tp{\frac{\sum_{i=1}^n M_{i,1}^2}{n}}^{\frac{1}{2}} \xrightarrow{a.s.} \sigma. \] By Jensen’s inequality, \[ \E{\tp{\frac{\sum_{i=1}^n M_{i,1}^2}{n}}^{\frac{1}{2}}} \leq \tp{\E{\frac{\sum_{i=1}^n M_{i,1}^2}{n}}}^{\frac{1}{2}} = \sigma. \] Moreover, the second moment of \(\tp{\frac{1}{n}\sum_{i=1}^n M_{i,1}^2}^{\frac{1}{2}}\) equals \(\sigma^2\) for every \(n\), so the sequence is bounded in \(L^2\) and hence uniformly integrable. Together with the almost sure convergence, this yields \[ \E{\tp{\frac{\sum_{i=1}^n M_{i,1}^2}{n}}^{\frac{1}{2}}} \xrightarrow{n\to \infty} \sigma. \] Consequently, for fixed \(m\) and large \(n\), \(\E{\norm{M}}\) is at least \((1-o(1))\,\sigma\sqrt{n}\); applying the same argument to \(M^{\top}\), it is also at least \((1-o(1))\,\sigma\sqrt{m}\) for fixed \(n\) and large \(m\). Since \(\max(\sqrt{n},\sqrt{m})\geq \sqrt{(n+m)/2}\), the upper bound \(\+O(\sigma\sqrt{n+m})\) is asymptotically tight up to constant factors.
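The following sketch estimates \(\E{\norm{M}}\) by Monte Carlo for a few matrix shapes and compares it with \(\sigma\sqrt{n+m}\) (the scale of our upper bound) and with \(\sigma(\sqrt{n}+\sqrt{m})\), a natural reference scale suggested by the lower-bound discussion; the dimensions and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo estimate of E[||M||_op] for M with i.i.d. N(0, sigma^2) entries,
# compared with sigma*sqrt(n+m) and sigma*(sqrt(n)+sqrt(m)).
# Dimensions and trial counts are arbitrary illustrative choices.
rng = np.random.default_rng(3)
sigma, trials = 1.0, 100

for n, m in [(50, 50), (200, 50), (1000, 50), (500, 500)]:
    norms = [np.linalg.norm(rng.normal(0.0, sigma, size=(n, m)), ord=2)
             for _ in range(trials)]
    estimate = float(np.mean(norms))
    print(f"n={n:4d} m={m:3d}  E||M|| ~ {estimate:7.2f}  "
          f"sigma*sqrt(n+m) = {sigma * np.sqrt(n + m):6.2f}  "
          f"sigma*(sqrt(n)+sqrt(m)) = {sigma * (np.sqrt(n) + np.sqrt(m)):6.2f}")
```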
Lower bound for Johnson-Lindenstrauss lemma
Finally, we use the concept of packing numbers to prove that the dimension reduction guarantee provided by the Johnson-Lindenstrauss (JL) lemma is asymptotically optimal.
Recall the result from the previous class: Given \(n\) vectors \(x_1,\dots,x_n\in \mathbb{R}^n\), there exists a linear mapping \(T:\mathbb{R}^n \to \mathbb{R}^k\) with \(k = \+O\tp{\frac{\log n}{\varepsilon^2}}\) that preserves pairwise distances up to a factor of \((1 \pm \epsilon)\): \[ (1-\+O(\eps))\norm{x_i-x_j} \leq \norm{Tx_i - Tx_j} \leq (1+\eps)\norm{x_i-x_j}. \]
A natural question is: Can we do better? Can we embed \(n\) points into a dimension \(k\) significantly smaller than \(\log n\)?
We will show that for a constant distortion \(\varepsilon = \Theta(1)\), the target dimension \(k\) must be at least \(\Omega(\log n)\).
Consider the set of standard basis vectors in \(\mathbb{R}^n\): \(X = \{e_1, \dots, e_n\}\). Note that for any distinct \(i,j\), the distance is constant: \(\norm{e_i - e_j} = \sqrt{2}\). Let \(T: \mathbb{R}^n \to \mathbb{R}^k\) be any mapping that satisfies the JL property with a constant distortion (say, \(\varepsilon = 1/3\)). Let \(y_i = T(e_i)\) be the image vectors in \(\mathbb{R}^k\). We require \(\norm{y_i-y_j}=\Theta(1)\). We can rescale the mapping \(T\) by a constant without loss of generality so that the minimum distance is at least \(1\). The condition then becomes \[ 1\leq \norm{y_i-y_j}\leq c \] for some constant \(c\).
This implies that \(\set{y_i}_{i\in[n]}\) is a \(1\)-packing of a ball of radius \(c\) in \(\bb R^k\): the open balls of radius \(1/2\) centered at the \(y_i\) are pairwise disjoint and all contained in a ball of radius \(c+1/2\). Comparing volumes, \(k\) must satisfy \[ n\leq \frac{\!{vol}\tp{B^k(c+1/2)}}{\!{vol}\tp{B^k(1/2)}} = \tp{2c+1}^k. \] Taking logarithms on both sides yields \(k\geq \frac{\log n}{\log (2c+1)}=\Omega(\log n)\).