Lecture 2: Sub-Gaussian and Sub-Exponential Random Variables
The Upper Confidence Bounds (UCB) algorithm
In our discussion of the multi-armed bandit problem in the last class, we introduced the Explore-Then-Commit (ETC) algorithm. This algorithm operates in two phases: first, an exploration phase where each of the \(k\) arms is pulled \(L\) times. Following this, in the commitment phase, the algorithm exclusively pulls the arm with the highest empirical mean for the remainder of the game.
According to Hoeffding’s inequality, a larger value of \(L\) increases the probability of correctly identifying the optimal arm. However, this comes at the cost of incurring substantial regret during the initial exploration phase, which spans \(kL\) rounds. Therefore, the key to minimizing total regret with the ETC algorithm lies in effectively managing this exploration-exploitation trade-off by selecting an appropriate value for \(L\). According to the analysis in the last class, the optimal choice for each arm is the one minimizing the function \[ g(L,\Delta_i) = L + T\Delta_i\exp\tp{-\frac{L\Delta_i^2}{2}}, \] which should be on the order of \(\wt\Theta\tp{\frac{1}{\Delta_i^2}}\). However, in the ETC algorithm, we pick a uniform \(L\) for all arms, which results in suboptimal regret. Ideally, we would like to choose \(L_i\) for each arm \(i\) separately, but this seems to be impossible since we do not know \(\Delta_i\) in advance.
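To make the order of the optimal exploration length concrete, here is a minimal numerical sketch that minimizes \(g(L,\Delta_i)\) over \(L\) and compares the result with the closed-form minimizer obtained by setting \(g'(L)=0\); the horizon \(T\) and the gap values below are illustrative choices.

```python
import numpy as np

def g(L, Delta, T):
    """Exploration cost plus expected commitment-phase regret, as in the text."""
    return L + T * Delta * np.exp(-L * Delta**2 / 2)

T = 10_000
for Delta in [0.5, 0.2, 0.1]:
    Ls = np.arange(1, T // 2)                      # candidate exploration lengths
    L_numeric = Ls[np.argmin(g(Ls, Delta, T))]
    # Setting dg/dL = 0 gives L* = (2 / Delta^2) * log(T * Delta^3 / 2),
    # valid when T * Delta^3 > 2; this is ~1/Delta^2 up to logarithmic factors.
    L_closed = 2 / Delta**2 * np.log(T * Delta**3 / 2)
    print(f"Delta={Delta}: numeric minimizer {L_numeric}, closed form {L_closed:.0f}")
```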
Remarkably, the Upper Confidence Bounds (UCB) algorithm achieves this goal adaptively, without knowing \(\Delta_i\) in advance. It maintains an interval \([a_i(t),b_i(t)]\) for each arm \(i\) at time \(t\) so that the mean \(\mu_i\) lies within the interval with high probability, based on the current knowledge of \(\mu_i\). Now suppose you already know such intervals \([a_i(t), b_i(t)]\); which arm would you pull next? The name upper confidence bound suggests that we always choose the one with the highest upper bound \(b_i(t)\) in the \((t+1)\)-th round.
It is important to understand intuitively why this simple strategy works (and why each sub-optimal arm \(i\) is played at most \(\wt\Theta(\frac{1}{\Delta_i^2})\) times). The reason is that, once a sub-optimal arm \(i\) has been played \(\wt\Theta(\frac{1}{\Delta_i^2})\) times, our estimate of its reward is well-concentrated around the true value (within a \(\pm \Theta(\Delta_i)\) window). It then becomes very unlikely that its upper bound \(b_i(t)\) exceeds the optimal mean \(\mu^*\). Therefore, after \(\wt\Theta(\frac{1}{\Delta_i^2})\) plays of arm \(i\), it is very unlikely to be chosen again. This is exactly what we want.
In order to implement the idea, we have to specify how to maintain the interval for each arm. Formally, for \(i \in [k]\), at round \(t\) we not only track the empirical mean \(\hat{\mu}_i(t)\), but also maintain an interval \([a_i(t), b_i(t)]\) so that \(\mu_i \in [a_i(t), b_i(t)]\) with probability no less than \(1- 1/T^2\). Let \(a_i(t) \defeq \hat{\mu}_i(t) - c_i(t)\) and \(b_i(t)\defeq \hat{\mu}_i(t) + c_i(t)\). Let us see how to pick \(c_i(t)\).
Let \(n_i(t)\) be the number of times that arm \(i\) has been pulled up to time \(t\). By Hoeffding’s inequality, it is enough to choose \(c_i(t)\) that satisfies \[ \begin{align*} \Pr{|\mu_i - \hat{\mu}_i(t)| > c_i(t)} \le 2 \exp (-2n_i(t) c_i^2(t)) \le 1/T^2 , \end{align*} \] so we choose \(c_i(t) = \sqrt{\frac{\log(2T^2)}{2n_i(t)}}\). Then, by the union bound over the \(k\) arms and the \(T\) rounds, the probability that some \(\mu_i\) falls outside its confidence interval \([a_i(t), b_i(t)]\) at some round \(t\) is at most \(kT\cdot \frac{1}{T^2} = \frac{k}{T}\).
Note that the upper bound \(b_i(t)=\hat\mu_i(t)+c_i(t)\) can be large (which means that we are more likely to explore arm \(i\)) if either \(\hat \mu_i\) is large or \(n_i(t)\) is small. This means the UCB algorithm naturally balances choosing arms that are promising (high \(\hat \mu_i\)) with those that are not yet well understood (low \(n_i(t)\)). This dynamic trade-off allows UCB to achieve a smaller regret than the more rigid ETC algorithm. Specifically, the total regret of the UCB algorithm is of order \(\sqrt{kT \log T}\) (for a detailed proof, see Section 3 of this note).
We remark that this bound is still not optimal. We will see in the homework how to remove the \(\log T\) term and achieve the optimal bound of \(\sqrt{kT}\) via a variation of the UCB algorithm.
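Before moving on, here is a minimal sketch of the UCB rule described above, assuming Bernoulli rewards in \([0,1]\); the arm means `mus`, the horizon `T`, and the random seed are illustrative choices, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
mus = [0.9, 0.8, 0.5]              # illustrative Bernoulli arm means (unknown to the algorithm)
k, T = len(mus), 10_000

counts = np.zeros(k)                # n_i(t): number of pulls of arm i so far
sums = np.zeros(k)                  # cumulative reward collected from arm i
regret = 0.0

for t in range(T):
    if t < k:                       # pull every arm once so that each n_i(t) > 0
        i = t
    else:
        mu_hat = sums / counts                                  # empirical means
        c = np.sqrt(np.log(2 * T**2) / (2 * counts))            # confidence radius c_i(t)
        i = int(np.argmax(mu_hat + c))                          # arm with the largest b_i(t)
    counts[i] += 1
    sums[i] += rng.binomial(1, mus[i])
    regret += max(mus) - mus[i]

print(f"pseudo-regret after T={T} rounds: {regret:.1f}")
```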
Sub-Gaussian distributions
Recall from our last class that a key technique for proving concentration bounds, like the Chernoff bound and Hoeffding’s inequality, involves using the exponential function. For a random variable \(X\), for any \(t>0\) and \(\beta>0\), \[ \begin{align*} \Pr{X-\E{X} \geq t} &= \Pr{e^{\beta(X-\E{X})} \geq e^{\beta t}} \\ \mr{$\+M_X(\beta)\defeq e^{\beta(X-\E{X})}$}&= \Pr{\+M_X(\beta) \geq e^{\beta t}}\\ \mr{Markov's inequality}&\leq \frac{\E{\+M_X(\beta)}}{e^{\beta t}}\\ \mr{$\psi_X(\beta)\defeq \log \E{\+M_X(\beta)}$}&= e^{-\beta t + \psi_X(\beta)}. \end{align*} \] This shows that we can establish a concentration inequality by finding a good upper bound on \(\psi_X(\beta)\). This observation motivates the following definition.
Definition 1 (Sub-Gaussian distributions) A random variable \(X\) is called \(\sigma^2\)-sub-Gaussian if \(\psi_X(\beta) \leq \frac{\beta^2\sigma^2}{2}\) for all \(\beta\in \bb R\).
This definition immediately gives us a powerful tail bound. For a \(\sigma^2\)-sub-Gaussian random variable, we have \[ \forall \beta >0,\ \Pr{X-\E{X} \geq t}\leq e^{-\beta t + \frac{\beta^2\sigma^2}{2}}. \] To get the tightest bound, we can minimize the exponent with respect to \(\beta\). Choosing the optimal \(\beta = \frac{t}{\sigma^2}\) yields \(\Pr{X-\E{X} \geq t}\leq e^{-\frac{t^2}{2\sigma^2}}\). For the other side, we can apply the same logic to the random variable \(-X\). This gives a symmetric bound, \[ \Pr{X-\E{X} \leq -t} = \Pr{-X - \E{-X} \geq t} \leq e^{-\frac{t^2}{2\sigma^2}}. \]
Let’s look at two typical examples of sub-Gaussian random variables.
Example 1 (Gaussian distributions) Gaussians are the quintessential sub-Gaussian variables. Let \(X\sim \+N(\mu, \sigma^2)\). It is well known that the moment generating function of \(X-\mu\) is \(\E{\+M_X(\beta)} = e^{\frac{\beta^2\sigma^2}{2}}\). This means a Gaussian is, by definition, \(\sigma^2\)-sub-Gaussian. Plugging this into our general inequality yields the familiar Gaussian tail bound: \[ \Pr{X-\E{X}\geq t} \leq e^{-\frac{t^2}{2\sigma^2}}. \] Furthermore, this bound is asymptotically tight. A well-known asymptotic result for the Gaussian integral (the Mills ratio) shows that for large \(t\): \[ \Pr{X-\E{X}\geq t} = \int_t^{\infty} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{s^2}{2\sigma^2}} \dd s \sim \frac{\sigma}{t\sqrt{2\pi}}e^{-\frac{t^2}{2\sigma^2}}. \]
Example 2 (Bounded distributions) Consider any random variable \(X\in [a,b]\). As we proved in the last class, Hoeffding’s Lemma states that: \[ \E{e^{\beta(X-\E{X})}} \leq e^{\frac{\beta^2(b-a)^2}{8}}, \] which indicates that \(X\) is \(\frac{(b-a)^2}{4}\)-sub-Gaussian.
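As a quick sanity check of Hoeffding’s Lemma in a special case, the sketch below compares the exact \(\psi_X(\beta)\) of a centered Bernoulli\((1/2)\) variable (which takes values in \([0,1]\)) with the bound \(\frac{\beta^2(b-a)^2}{8}\); the grid of \(\beta\) values is arbitrary.

```python
import numpy as np

# For X ~ Bernoulli(1/2) on [0, 1], the centered cumulant generating function is
# psi_X(beta) = log E[e^{beta(X - 1/2)}] = log cosh(beta / 2).
betas = np.linspace(-5, 5, 11)
psi = np.log(np.cosh(betas / 2))
hoeffding = betas**2 * (1 - 0)**2 / 8        # Hoeffding's Lemma with a = 0, b = 1

for b, p, ub in zip(betas, psi, hoeffding):
    print(f"beta={b:+.1f}   psi={p:.4f}   bound={ub:.4f}")
```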
The sub-Gaussian property is fundamentally about the tails of a distribution decaying at least as fast as a Gaussian. In fact, there are several equivalent ways to define a sub-Gaussian variable. For simplicity, let’s assume \(X\) is centered (\(\E{X}=0\)). A random variable \(X\) is sub-Gaussian if any of the following equivalent conditions hold: * There exists \(\sigma\) such that for all \(\beta\in \bb R\), \(\psi_X(\beta)\leq \frac{\beta^2\sigma^2}{2}\). * Tail bound: there exists a universal constant \(c_1\) such that \(\forall t\geq 0\), \(\Pr{|X|\geq t}\leq 2e^{-\frac{c_1 t^2}{\sigma^2}}\). * Moment bound: there exists a universal constant \(c_2\) such that for all integers \(p\geq 1\), \(\left(\E{|X|^p}\right)^{1/p} \leq c_2\sigma\sqrt{p}\).
A crucial property of sub-Gaussian random variables is that they are closed under addition.
Lemma 1 Suppose \(X\) and \(Y\) are independent random variables that are \(\sigma_X^2\)-sub-Gaussian and \(\sigma_Y^2\)-sub-Gaussian respectively. Then \(X+Y\) is \((\sigma_X^2 + \sigma_Y^2)\)-sub-Gaussian.
The proof of this lemma is straightforward, and the additivity property can be naturally extended to a sum of \(n\) mutually independent random variables via induction. Hoeffding’s inequality is then a direct corollary of this property, combined with the result from Example 2 that bounded random variables are sub-Gaussian.
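Concretely, if \(X_1,\dots,X_n\) are mutually independent with \(X_i\in[a_i,b_i]\), then each \(X_i\) is \(\frac{(b_i-a_i)^2}{4}\)-sub-Gaussian by Example 2, so by Lemma 1 the sum \(S_n=\sum_{i=1}^n X_i\) is \(\sum_{i=1}^n\frac{(b_i-a_i)^2}{4}\)-sub-Gaussian, and the tail bound above gives \[ \Pr{S_n-\E{S_n}\geq t} \leq \exp\tp{-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}}, \] which is exactly Hoeffding’s inequality.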
A similar result holds for dependent random variables as well: in this case, we can still prove that \(X+Y\) is \(2(\sigma_X^2 + \sigma_Y^2)\)-sub-Gaussian (see here for the proof).
Sub-exponential distributions
The sub-Gaussian property requires the tails to decay at least as fast as \(e^{-ct^2}\), which is a strong condition, and many common random variables do not satisfy it. A typical example is the \(\chi^2\)-distribution. Let \(X\sim \+N(0,1)\) and \(Y=X^2\). We have \[\begin{align*} \E{e^{\beta (Y - \E{Y})}} &= \frac{e^{-\beta \E{X^2}}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{\beta x^2}\cdot e^{-\frac{x^2}{2}} \dd x\\ \mr{$y=\sqrt{1-2\beta}x$} &= \frac{e^{-\beta}}{\sqrt{2\pi}}\cdot \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\cdot \frac{1}{\sqrt{1-2\beta}} \dd y\\ &=\begin{cases} \frac{e^{-\beta}}{\sqrt{1-2\beta}}, &\beta<\frac{1}{2}\\ \infty, &\beta\geq \frac{1}{2} \end{cases}. \end{align*}\] Since \(\E{e^{\beta (Y - \E{Y})}}\) is infinite for \(\beta\geq \frac{1}{2}\), no bound of the form \(\psi_Y(\beta)\leq \frac{\beta^2\sigma^2}{2}\) can hold for all \(\beta\in\bb R\), so the \(\chi^2\)-distribution is not sub-Gaussian. This motivates the definition of sub-exponential distributions.
Definition 2 (Sub-exponential distributions) A random variable \(X\) is called \((\nu,\alpha)\)-sub-exponential if \(\psi_X(\beta) \leq \frac{\beta^2\nu}{2}\) for any \(\abs{\beta}\leq \frac{1}{\alpha}\). When \(\alpha = \frac{1}{\sqrt{\nu}}\), we also call it \(\nu\)-sub-exponential.
Let’s continue with the \(\chi^2\) distribution.
Example 3 (\(\chi^2\) distribution) Consider the \(\chi^2\) distribution mentioned above. For any \(\abs{\beta}\leq \frac{1}{4}\), \[ \begin{align*} \psi_Y(\beta) &= -\beta - \frac{1}{2}\log (1-2\beta)\\ \mr{Taylor's expansion} &=-\beta + \frac{1}{2} \sum_{k\geq 1} \frac{(2\beta)^k}{k}\\ &= \frac{\beta^2}{2}\cdot 4\sum_{k\geq 2} \frac{(2\beta)^{k-2}}{k}\\ &\leq \frac{\beta^2}{2}\cdot 4\sum_{k\geq 2} \frac{(2\beta)^{k-2}}{2}\\ &= \frac{\beta^2}{2}\cdot \frac{2}{1-2\beta} \\ \mr{$\abs{\beta}\leq \frac{1}{4}$}&\leq \frac{\beta^2}{2}\cdot 4. \end{align*} \] This calculation shows that \(Y\) is \((4,4)\)-sub-exponential.
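Both the closed form of \(\E{e^{\beta(Y-\E{Y})}}\) derived above and the bound \(\psi_Y(\beta)\leq 2\beta^2\) on \(\abs{\beta}\leq \frac{1}{4}\) are easy to check numerically; the sketch below does both (the sample size, the particular values of \(\beta\), and the grid resolution are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal(2_000_000) ** 2          # chi^2 with one degree of freedom, E[Y] = 1

# Check E[e^{beta(Y-1)}] = e^{-beta} / sqrt(1 - 2*beta) by Monte Carlo
# (beta is kept well below 1/2 so that the estimator has finite variance).
for beta in [0.1, 0.2, -0.5, -2.0]:
    mc = np.mean(np.exp(beta * (Y - 1)))
    exact = np.exp(-beta) / np.sqrt(1 - 2 * beta)
    print(f"beta={beta:+.2f}: Monte Carlo {mc:.4f} vs closed form {exact:.4f}")

# Check psi_Y(beta) = -beta - 0.5*log(1 - 2*beta) <= 2*beta^2 on |beta| <= 1/4.
betas = np.linspace(-0.25, 0.25, 101)
psi = -betas - 0.5 * np.log(1 - 2 * betas)
assert np.all(psi <= 2 * betas**2 + 1e-12)
print("psi_Y(beta) <= 2 * beta^2 holds at every grid point with |beta| <= 1/4")
```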
The following theorem formalizes the tail behavior of sub-exponential variables. For small values of \(t\), the tail exhibits a Gaussian-like decay (i.e., quadratic in the exponent). For large \(t\), it transitions to a heavier, purely exponential decay (i.e., linear in the exponent). Notably, the decay rate in this second regime is determined solely by the parameter \(\alpha\) and is independent of \(\nu\).
Theorem 1 Suppose \(X\) is \((\nu,\alpha)\)-sub-exponential. Then for any \(t\geq 0\) \[ \Pr{X-\E{X}\geq t}\leq \begin{cases} \exp\set{-\frac{t^2}{2\nu}}, &\mbox{ if }t\in\left[0, \frac{\nu}{\alpha}\right]\\ \exp\set{-\frac{t}{2\alpha}}, &\mbox{ if }t\in\left(\frac{\nu}{\alpha}, \infty\right) \end{cases}. \]
Proof. For any \(\beta\in (0, 1/\alpha]\), \[ \begin{align*} \Pr{X-\E{X}\geq t} &= \Pr{e^{\beta(X-\E{X})} \geq e^{\beta t}}\\ &\leq e^{-\beta t + \frac{\beta^2 \nu}{2}}. \end{align*} \] We optimize \(\beta\) over this interval: * If \(t\in\left[0, \frac{\nu}{\alpha}\right]\), choosing \(\beta = \frac{t}{\nu}\), we have \(\Pr{X-\E{X}\geq t} \leq \exp\set{-\frac{t^2}{2\nu}}\). * If \(t\in\left(\frac{\nu}{\alpha}, \infty\right)\), choosing \(\beta = \frac{1}{\alpha}\), the exponent becomes \(-\frac{t}{\alpha} + \frac{\nu}{2\alpha^2} \leq -\frac{t}{\alpha} + \frac{t}{2\alpha} = -\frac{t}{2\alpha}\) since \(\frac{\nu}{\alpha} < t\), so \(\Pr{X-\E{X}\geq t} \leq \exp\set{-\frac{t}{2\alpha}}\).
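For the \(\chi^2\) example, where the exact tail is available in closed form, one can compare it against the two-regime bound of Theorem 1 with \((\nu,\alpha)=(4,4)\); the particular values of \(t\) below are arbitrary.

```python
import math

# Exact chi^2(1) upper tail: P[Y - E[Y] >= t] = P[X^2 >= 1 + t] = erfc(sqrt((1 + t) / 2)),
# compared with the two-regime sub-exponential tail bound for (nu, alpha) = (4, 4).
nu, alpha = 4.0, 4.0
for t in [0.5, 1.0, 2.0, 5.0, 10.0]:
    tail = math.erfc(math.sqrt((1 + t) / 2))
    bound = math.exp(-t**2 / (2 * nu)) if t <= nu / alpha else math.exp(-t / (2 * alpha))
    print(f"t={t:5.1f}: exact tail {tail:.2e} <= bound {bound:.2e}")
```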
Like the sub-Gaussian property, the sub-exponential property is preserved under summation. Notably, in the following theorem, the second parameter is determined by the maximum term, \(\max_{i\in[N]} \abs{w_i}\alpha_i\), rather than the sum, \(\sum_{i\in[N]} \abs{w_i}\alpha_i\). This distinction is advantageous when \(N\) is large.
Theorem 2 Suppose \(X_1,\dots, X_N\) are mutually independent and each \(X_i\) is \((\nu_i,\alpha_i)\)-sub-exponential. Then for any constants \(w_1,\dots, w_N\in \bb R\), the weighted sum \(\sum_{i=1}^N w_i X_i\) is \(\tp{\sum_{i=1}^N w_i^2 \nu_i, \max_{i\in[N]} \abs{w_i}\alpha_i}\)-sub-exponential.
Proof. Since \(X_i\)’s are mutually independent, for any \(\beta\) such that \(\abs{\beta} \leq \frac{1}{\max_{i\in[N]} \abs{w_i}\alpha_i}\), we have \[ \begin{align*} \psi_{\sum_{i=1}^N w_i X_i}(\beta) &= \sum_{i=1}^N \psi_{w_iX_i}(\beta)\\ &= \sum_{i=1}^N \psi_{X_i}(w_i \beta)\\ &\leq \sum_{i=1}^N \frac{w_i^2 \beta^2}{2} \nu_i. \end{align*} \]
Bernstein inequality
Combining the additivity property with the tail bound gives a powerful concentration result known as Bernstein’s inequality.
Theorem 3 (Generalized Bernstein inequality) Suppose \(X_1,\dots, X_N\) are mutually independent and each \(X_i\) is \((\nu_i,\alpha_i)\)-sub-exponential. For any constants \(w_1,\dots, w_N\in \bb R\), letting \(S_N=\sum_{i=1}^N w_i X_i\), we have \[ \Pr{S_N-\E{S_N}\geq t} \leq \begin{cases} \exp\set{-\frac{t^2}{2\sum_{i=1}^N w_i^2 \nu_i}}, &\mbox{ if }t\in\left[0, \frac{\sum_{i=1}^N w_i^2 \nu_i}{\max_{i\in[N]} \abs{w_i}\alpha_i}\right]\\ \exp\set{-\frac{t}{2\max_{i\in[N]} \abs{w_i}\alpha_i}}, &\mbox{ otherwise } \end{cases}. \]
A particularly useful case is for bounded random variables.
Corollary 1 (Bernstein inequality for bounded variables) Suppose \(X_1,\dots,X_N\) are mutually independent random variables with each \(\E{X_i}=\mu_i\), \(\Var{X_i}=\sigma_i^2\) and \(\abs{X_i-\mu_i}\leq c\). Let \(S_N=\sum_{i=1}^N X_i\). Then \(S_N\) is \(\tp{2\sum_{i=1}^N\sigma_i^2, 2c}\)-sub-exponential and consequently, \[ \Pr{S_N-\E{S_N}\geq t} \leq \begin{cases} \exp\set{-\frac{t^2}{4\sum_{i=1}^N \sigma_i^2}}, &\mbox{ if }t\in\left[0, \frac{\sum_{i=1}^N \sigma_i^2}{c} \right]\\ \exp\set{-\frac{t}{4c}}, &\mbox{ otherwise } \end{cases}. \]
Proof. We only need to prove the sub-exponential property of \(S_N\). For each \(X_i\), for any \(\beta\) such that \(\abs{\beta}<\frac{1}{2c}\), \[ \begin{align*} \E{e^{\beta(X_i-\mu_i)}} &= \sum_{k\geq 0} \frac{\beta^k}{k!}\cdot \E{(X_i-\mu_i)^k}\\ &\leq 1+0 + \frac{\beta^2\sigma_i^2}{2} + \sum_{k\geq 3} \frac{\beta^k}{k!}\cdot \E{(X_i-\mu_i)^k}\\ &\leq 1+ \frac{\beta^2\sigma_i^2}{2} + \sum_{k\geq 3} \frac{\beta^k}{3!}\cdot \sigma_i^2 \cdot c^{k-2}\\ \mr{$\abs{\beta}<1/(2c)$}&\leq 1 + \beta^2\sigma_i^2 \leq e^{\beta^2\sigma_i^2}. \end{align*} \] Therefore, each \(X_i\) is \((2\sigma_i^2, 2c)\)-sub-exponential and the summation \(S_N\) is \(\tp{2\sum_{i=1}^N\sigma_i^2, 2c}\)-sub-exponential.
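To see the sub-exponential condition of the corollary in a concrete case, the sketch below evaluates \(\psi_{X_i}(\beta)\) exactly for centered Bernoulli\((p)\) variables (so \(c=1\) and \(\sigma_i^2=p(1-p)\)) and checks that \(\psi_{X_i}(\beta)\leq \beta^2\sigma_i^2\) on \(\abs{\beta}\leq \frac{1}{2c}\); the values of \(p\) and the grid are arbitrary.

```python
import numpy as np

# Centered Bernoulli(p): X - p equals 1 - p with prob. p and -p with prob. 1 - p,
# so |X - p| <= c with c = 1 and Var = p(1 - p).
c = 1.0
betas = np.linspace(-1 / (2 * c), 1 / (2 * c), 201)
for p in [0.5, 0.1, 0.01]:
    sigma2 = p * (1 - p)
    psi = np.log(p * np.exp(betas * (1 - p)) + (1 - p) * np.exp(-betas * p))
    assert np.all(psi <= betas**2 * sigma2 + 1e-12)
    print(f"p={p}: psi(beta) <= beta^2 * sigma^2 holds on |beta| <= 1/(2c)")
```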
This corollary for bounded variables is a common form of Bernstein’s inequality. Compared to Hoeffding’s inequality, Bernstein’s inequality incorporates variance information. This gives it a significant advantage when the variances are small relative to the bound \(c\). Here is an example.
Consider the Erdős-Rényi graph \(G(n, p_n)\). In this model, a graph with \(n\) vertices is generated by connecting each pair of vertices independently with probability \(p_n\). The degree of a fixed vertex \(v\), denoted by \(d_v\), follows a Binomial distribution, \(d_v\sim \!{Bin}(n-1,p_n)\), with expected value \(\E{d_v}=(n-1)p_n\). Denote the vertex set by \(V\). Our goal is to find a high-probability range for the maximum degree of the graph \(d_n=\max_{v\in V} d_v\). Specifically, we prove that for large \(n\) and any \(\eps>0\), \[ \Pr{\abs{d_n-(n-1)p_n} \geq 2\sqrt{(1+\eps)np_n\log n}} \leq \frac{1}{n^{\eps}} = o(1). \] To prove this, we can use the union bound to relate the maximum degree to the degrees of individual vertices: \[ \begin{align*} &\phantom{{}={}}\Pr{\abs{d_n-(n-1)p_n} \geq 2\sqrt{(1+\eps)np_n\log n}} \\ &\leq \Pr{\exists v\in V, \mbox{ s.t. } \abs{d_v-(n-1)p_n} \geq 2\sqrt{(1+\eps)np_n\log n}}\\ &\leq \sum_{v\in V} \Pr{\abs{d_v-(n-1)p_n} \geq 2\sqrt{(1+\eps)np_n\log n}}\\ &\leq \sum_{v\in V} \Pr{\abs{d_v-(n-1)p_n} \geq 2\sqrt{(1+\eps)(n-1)p_n(1-p_n)\log n}}. \end{align*} \]
Let \(X_1,\dots, X_{n-1}\) be \(n-1\) independent Bernoulli random variables with mean \(p_n\). Then \(d_v\) has the same distribution as \(\sum_{i=1}^{n-1} X_i\), so we can apply Bernstein’s inequality for bounded variables to \(d_v\). For each \(X_i\), we have \(\E{X_i}=p_n\), \(\Var{X_i} = p_n(1-p_n)\) and \(\abs{X_i-\E{X_i}}\leq 1\). By Bernstein’s inequality, \[ \Pr{\abs{d_v-(n-1)p_n} \geq 2\sqrt{(1+\eps)(n-1)p_n(1-p_n)\log n}} \leq \frac{1}{n^{1+\eps}} \] for large \(n\).
It is insightful to compare this result with Hoeffding’s inequality. Applying Hoeffding’s inequality would yield: \[ \Pr{\abs{d_v-(n-1)p_n} \geq \sqrt{\frac{1}{2}\cdot (1+\eps)(n-1)\log n}}\leq \frac{1}{n^{1+\eps}}. \] This shows the power of Bernstein’s inequality: when the variance is small (i.e., \(p_n\to 0\)), it provides a much tighter concentration bound, correctly capturing that the summation is less volatile than the worst-case scenario assumed by Hoeffding’s inequality.
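The gap between the two radii is easy to see numerically. Below is a small simulation sketch with illustrative values of \(n\), \(p_n\), and \(\eps\); for simplicity it samples the degrees as independent \(\!{Bin}(n-1,p_n)\) variables, which ignores the mild dependence between degrees in the actual graph but matches the per-vertex quantity that the union-bound argument controls.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, eps = 5000, 0.01, 0.5                      # illustrative parameters

# Per-vertex degrees d_v ~ Bin(n - 1, p), sampled independently for simplicity.
degrees = rng.binomial(n - 1, p, size=n)
max_dev = np.max(np.abs(degrees - (n - 1) * p))

bernstein_radius = 2 * np.sqrt((1 + eps) * n * p * np.log(n))
hoeffding_radius = np.sqrt(0.5 * (1 + eps) * (n - 1) * np.log(n))
print(f"observed max deviation: {max_dev:.1f}")
print(f"Bernstein radius      : {bernstein_radius:.1f}")
print(f"Hoeffding radius      : {hoeffding_radius:.1f}")
```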