Lecture 1: Basic Concentration Inequalities
We first recall Markov’s inequality and Chebyshev’s inequality from the probability theory course.
Proposition 1 (Markov’s Inequality) For any non-negative random variable \(X\) and for any \(a>0\), \[ \Pr{X\ge a}\le \frac{\E{X}}{a}. \]
Proposition 2 (Chebyshev’s inequality) For any random variable \(X\) with bounded variance and for any \(a>0\), it holds that \[ \Pr{\abs{X-\E{X}}\ge a} \le \frac{\Var{X}}{a^2}. \]
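Chebyshev’s inequality is in fact a direct consequence of Markov’s inequality: applying Proposition 1 to the non-negative random variable \((X-\E{X})^2\) yields \[ \Pr{\abs{X-\E{X}}\ge a}=\Pr{(X-\E{X})^2\ge a^2}\le \frac{\E{(X-\E{X})^2}}{a^2}=\frac{\Var{X}}{a^2}. \]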
In order to motivate the study of concentration inequalities, let’s first look at the streaming model.
The streaming model
Suppose we have a router with limited memory that needs to solve computational tasks over large input data, such as monitoring the IDs of devices visiting it. The following questions are usually asked.
- How many numbers are in a given data stream?
- How many distinct numbers?
- What is the most frequent number?
In order to study these problems systematically, we need to formally define the streaming model. In the streaming model, the input is a sequence \(\sigma=\left\langle a_1, a_2, \cdots, a_m\right\rangle\) where each \(a_i \in[n]\). Note that the data arrive one by one, as suggested by the word “streaming”. We now focus on the most basic problem: how many numbers are in the stream (that is, what is \(m\))?
Clearly we can maintain a counter \(k\), and whenever a number \(a_i\) arrives, increase \(k\) by one. It is not hard to see that we need \(\left\lceil\log _2 m\right\rceil\) bits of memory.
Can we design a cleverer algorithm with only \(o(\log m)\) bits of memory? It turns out that computing the exact answer is impossible even with \(\left\lceil\log _2 m\right\rceil-1\) bits of memory. The reason is as follows: suppose we had an algorithm \(\mathcal{M}\) using only \(\left\lceil\log _2 m\right\rceil-1\) bits, so it has fewer than \(m\) distinct memory states. Denote by \(\mathcal{M}(i)\) the output of the algorithm on an input \(\sigma\) of length \(i\). By the pigeonhole principle, there exist \(i, j \in[m]\) with \(i \neq j\) such that \(\mathcal{M}\) reaches the same memory state on both inputs, and hence \(\mathcal{M}(i)=\mathcal{M}(j)\); one of the two answers must be wrong.
Even though we cannot do better for the exact answer, it is possible to save a lot of memory if approximation is allowed. That is, for every \(\varepsilon>0\), the algorithm computes a number \(\widehat{m}\) such that \[ 1-\varepsilon \leq \frac{\widehat{m}}{m} \leq 1+\varepsilon \] with high probability.
Morris’ algorithm
Morris’ algorithm maintains a counter \(X\), initialized to \(0\). Whenever a number arrives, it increments \(X\) with probability \(2^{-X}\). After the stream ends, it outputs \(\widehat{m}=2^{X}-1\).
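To make the procedure concrete, here is a minimal Python sketch (the function name and interface are our own, for illustration only):

```python
import random

def morris(stream):
    """Estimate the number of elements in `stream` with Morris' algorithm."""
    x = 0  # the counter X; storing it takes only O(log log m) bits
    for _ in stream:
        # increment X with probability 2^{-X}
        if random.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1  # output the estimate m-hat = 2^X - 1
```

For instance, `morris(range(10**6))` returns an estimate of \(10^6\), although, as we will see, a single run can deviate substantially from the true count.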
This is a randomized algorithm using roughly \(O(\log \log m)\) bits of memory, since the counter \(X\) is typically of order \(\log m\) and hence can be stored with \(O(\log \log m)\) bits. We first look at the expectation of its output.
Theorem 1 The output of Morris’ algorithm \(\widehat{m}\) satisfies \(\E{\widehat{m}}=m\).
Proof. We prove it by induction on \(m\). Since \(X=1\) when \(m=1\) (the counter is incremented with probability \(2^{0}=1\)), we have \(\E{\widehat{m}}=1\). Assume the claim holds for smaller values of \(m\), and let \(X_i\) denote the value of \(X\) after processing the \(i\)-th input. We have the following fact, \[ \begin{aligned} \E{\wh{m}} &=\E{2^{X_m}}-1 \\ &=\sum_{i=0}^m \Pr{X_m=i} \cdot 2^i-1 \\ &=\sum_{i=0}^m\left(\Pr{X_{m-1}=i} \cdot\left(1-2^{-i}\right)+\Pr{X_{m-1}=i-1} \cdot 2^{1-i}\right) \cdot 2^i-1 \\ &=\sum_{i=0}^{m-1} \Pr{X_{m-1}=i} \cdot\left(2^i+1\right)-1 \\ &=\E{2^{X_{m-1}}} \\ &=m, \end{aligned} \] where the last equation holds due to the induction hypothesis, which gives \(\E{2^{X_{m-1}}}=(m-1)+1=m\).
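As a quick numerical sanity check of Theorem 1 (reusing the `morris` sketch above), the empirical mean of many independent runs should be close to \(m\):

```python
import statistics

m = 1000
estimates = [morris(range(m)) for _ in range(10_000)]
print(statistics.mean(estimates))  # typically close to m = 1000
```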
It is now clear that Morris’ algorithm gives an unbiased estimator of \(m\). However, for a practical randomized algorithm, we further require its output to concentrate around the expectation. That is, we want to establish a concentration inequality of the form \[\begin{align*} \Pr{|\widehat{m}-m|>\varepsilon m}\leq \delta \end{align*}\] for \(\varepsilon, \delta>0\). Naturally, for fixed \(\varepsilon\), the smaller \(\delta\) is, the better the algorithm.
To use Chebyshev’s inequality to analyse the concentration of Morris’ algorithm, we have to compute the variance of \(\widehat{m}\).
Lemma 1 \[\begin{align*} \E{\left(2^{X_m}\right)^2}=\frac{3}{2} m^2+\frac{3}{2} m+1 \end{align*}\]
Proof. We can prove the claim by an induction argument similar to our proof for the expectation. When \(m=1\), \(\E{\left(2^{X_m}\right)^2}=4\). We assume the claim is true for smaller \(m\) and use the same notation \(X_i\). We have that \[\begin{align*} \E{\left(2^{X_m}\right)^2} &=\sum_{i=0}^m \Pr{X_m=i} \cdot 2^{2 i} \\ &=\sum_{i=0}^m\left(\Pr{X_{m-1}=i} \cdot\left(1-2^{-i}\right)+\Pr{X_{m-1}=i-1} \cdot 2^{1-i}\right) \cdot 2^{2 i} \\ &=\sum_{i=0}^m\left(\Pr{X_{m-1}=i} \cdot\left(2^{2 i}-2^i\right)+\Pr{X_{m-1}=i-1} \cdot 2^{i+1}\right) \\ &=\sum_{i=0}^{m-1} \Pr{X_{m-1}=i} \cdot\left(2^{2 i}+3 \cdot 2^i\right) \\ &=\E{\left(2^{X_{m-1}}\right)^2}+3\, \E{2^{X_{m-1}}} \\ &=\frac{3}{2}(m-1)^2+\frac{3}{2}(m-1)+1+3m \\ &=\frac{3}{2} m^2+\frac{3}{2} m+1, \end{align*}\] where the second-to-last equality uses the induction hypothesis together with \(\E{2^{X_{m-1}}}=m\) from the proof of Theorem 1.
With the above lemma, we can compute the variance as follows, \[\begin{align*} \Var{\widehat{m}}=\E{\widehat{m}^2}-\E{\widehat{m}}^2=\E{\left(2^{X_m}-1\right)^2}-m^2 \leq \frac{m^2}{2}. \end{align*}\] Applying Chebyshev’s inequality, we obtain that for every \(\varepsilon>0\), \[\begin{align*} \Pr{|\widehat{m}-m| \geq \varepsilon m} \leq \frac{1}{2 \varepsilon^2}. \end{align*}\] However, this bound exceeds \(1\) whenever \(\varepsilon \leq \frac{1}{\sqrt{2}}\), so it is vacuous for the small values of \(\varepsilon\) we care about. Thus, it is necessary to improve the concentration of the algorithm. We now introduce two common tricks to achieve this.
The averaging trick
Chebyshev’s inequality tells us that we can improve the concentration by reducing the variance. Let’s first review some properties of the variance. For any random variable \(X\) and any constant \(a\), we have \[ \Var{a \cdot X}=a^2 \cdot \Var{X}. \] For any two independent random variables \(X\) and \(Y\), we have \[ \Var{X+Y}=\Var{X}+\Var{Y}. \]
We can design a new algorithm by independently running Morris’ algorithm \(t\) times in parallel. Denote the corresponding outputs by \(\widehat{m}_1, \cdots, \widehat{m}_t\). The final output is \[\widehat{m}^*:=\frac{\sum_{i=1}^t \widehat{m}_i}{t}.\]
By the above two properties, we have \(\Var{\wh{m}^*}=\frac{\Var{\wh m_1}}{t}\). We can apply Chebyshev’s inequality to \(\widehat{m}^*\) and obtain that \[\begin{align*} \Pr {\abs{\widehat{m}^{*}-m} \geq \eps m}\leq \frac{1}{t\cdot 2\eps^2}. \end{align*}\] For \(t\geq \frac{1}{2\eps^2\delta}\), we have \[\begin{align*} \Pr {\abs{\widehat{m}^{*}-m}\geq \eps m}\leq \delta. \end{align*}\]
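Here is a sketch of the averaged estimator, continuing the `morris` sketch above; note that the \(t\) counters are maintained in parallel during a single pass over the stream:

```python
import random

def morris_avg(stream, t):
    """Run t independent Morris counters in one pass and average them."""
    xs = [0] * t
    for _ in stream:
        for i in range(t):
            # each counter is incremented independently with prob. 2^{-X_i}
            if random.random() < 2.0 ** (-xs[i]):
                xs[i] += 1
    return sum(2 ** x - 1 for x in xs) / t
```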
The new algorithm uses \(\+O\tp{\frac{\log \log m}{\eps^2\delta}}\) bits of memory. This exhibits a trade-off between the accuracy of the randomized algorithm and its memory consumption. We can further improve the dependence on \(\delta\) using the Chernoff bound below.
Chernoff Bound
Like in the proof of Chebyshev’s inequality, if we choose \(f(x)=e^{\alpha x}\) for \(\alpha>0\) and apply Markov’s inequality to \(f(X)\), the bound amounts to bounding \(\E{e^{\alpha X}}\), which is the moment generating function of \(X\). In case \(\E{e^{\alpha X}}\) can be well bounded, we obtain sharp concentration bounds.
Proposition 3 (Chernoff bound) Let \(X_1,\dots,X_n\) be independent random variables such that \(X_i \sim \-{Ber}(p_i)\) for each \(i=1,2,\dots,n\). Let \(X = \sum_{i = 1}^n X_i\) and denote \(\mu \defeq \E X = \sum_{i = 1}^n p_i\). For any \(\delta>0\), we have \[\begin{align*} \Pr{X \geq (1 + \delta)\mu} \leq \left(\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right)^\mu. \end{align*}\] If \(0<\delta<1\), then we have \[\begin{align*} \Pr{X \le (1 - \delta)\mu} \leq \left(\frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}}\right)^\mu. \end{align*}\]
Proof. We only prove the upper tail bound; the proof of the lower tail bound is similar. For every \(\alpha>0\), we have \[\begin{align*}
\Pr{X \geq (1 + \delta)\mu} = \Pr{e^{\alpha X} \geq e^{\alpha (1 + \delta)\mu}} \leq \frac{\E{e^{\alpha X}}}{e^{\alpha (1 + \delta)\mu}}.
\end{align*}\] Therefore, we need to estimate the moment generating function \(\E{e^{\alpha X}}\). Since \(X=\sum_{i=1}^n X_i\) is the sum of independent Bernoulli variables, we have \[\begin{align*}
\E{e^{\alpha X}} = \E{e^{\alpha \sum_{i = 1}^n X_i}} = \E{\prod_{i = 1}^n e^{\alpha X_i}} = \prod_{i = 1}^n \E{ e^{\alpha X_i}}.
\end{align*}\] Since \(X_i\sim\-{Ber}(p_i)\), we can compute \(\E{e^{\alpha X_i}}\) directly: \[\begin{align*}
\E{ e^{\alpha X_i}} = p_i e^{\alpha} +(1 - p_i) = 1 + (e^\alpha - 1)p_i \le \exp\left((e^\alpha - 1)p_i\right).
\end{align*}\] Therefore, \[\begin{align*}
\E{e^{\alpha X}} \le\prod_{i=1}^n\exp\tp{(e^\alpha-1)p_i}= \exp{\left((e^\alpha - 1)\sum_{i = 1}^n p_i \right)} = \exp{\left( (e^\alpha - 1) \mu \right)}.
\end{align*}\] Combining the bounds above, \[\begin{align*}
\Pr{X \geq (1 + \delta)\mu} \leq \frac{\E{e^{\alpha X}}}{e^{\alpha (1 + \delta)\mu}} \leq \left( \frac{\exp{(e^\alpha - 1)}}{\exp{(\alpha(1 + \delta))}} \right)^\mu.
\end{align*}\] Note that the above holds for any \(\alpha>0\). Therefore, we can choose \(\alpha\) so as to minimize \(\frac{\exp{(e^\alpha - 1)}}{\exp{(\alpha(1 + \delta))}}\). To this end, we set \[\begin{align*}
\left( \frac{\exp{(e^\alpha - 1)}}{\exp{(\alpha(1 + \delta))}} \right)' &= \exp\tp{e^{\alpha} - 1 - \alpha - \alpha \delta} \cdot (e^\alpha - 1 - \delta) = 0.
\end{align*}\] This gives \(\alpha = \log(1 + \delta)\). Therefore
\[\begin{align*}
\Pr{X \geq (1 + \delta)\mu} \leq \left( \frac{\exp{(e^\alpha - 1)}}{\exp{(\alpha(1 + \delta))}} \right)^\mu = \left( \frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}} \right)^\mu .
\end{align*}\]
The following form of Chernoff bound is more convenient to use (but weaker):
Corollary 1 For any \(0<\delta<1\), \[\begin{align*} \Pr{X \geq (1 + \delta)\mu} &\leq \exp{\left(-\frac{\delta^2}{3}\mu\right)};\\ \Pr{X \leq (1 - \delta)\mu} &\leq \exp{\left(-\frac{\delta^2}{2}\mu\right)}. \end{align*}\]
Proof. We only prove the upper tail. It suffices to verify that for \(0 < \delta < 1\), we have \[\begin{align*} \frac{e^{\delta}}{(1 + \delta)^{(1 + \delta)}} \leq \exp{\left(-\frac{\delta^2}{3}\right)}. \end{align*}\] Taking the logarithm of both sides, this is equivalent to \[\begin{align*} \delta - (1 + \delta)\ln(1 + \delta) \leq -\frac{\delta^2}{3}. \end{align*}\] Let \(f(\delta) = \delta - (1 + \delta)\ln(1 + \delta) + \frac{\delta^2}{3}\) and note that \[\begin{align*} f'(\delta) = - \ln(1 + \delta) + \frac{2}{3} \delta, \quad f''(\delta) = - \frac{1}{1 + \delta} + \frac{2}{3}. \end{align*}\] For \(0 < \delta < 1/2\) we have \(f''(\delta) < 0\), and for \(1/2 < \delta < 1\) we have \(f''(\delta) > 0\). Therefore, \(f'(\delta)\) first decreases and then increases on \([0,1]\). Since \(f'(0) = 0\) and \(f'(1) < 0\), it follows that \(f'(\delta) \leq 0\) for all \(0 \leq \delta \leq 1\). Hence \(f\) is non-increasing on \([0,1]\) and \(f(\delta) \leq f(0) = 0\).
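As a quick illustration of the strength of this bound, let \(X\) be the number of heads in \(n\) independent fair coin flips, so \(\mu = n/2\). Taking \(\delta = 1/2\) in the corollary gives \[\begin{align*} \Pr{X \geq \tfrac{3}{4}n} \leq \exp\tp{-\frac{(1/2)^2}{3} \cdot \frac{n}{2}} = \exp\tp{-\frac{n}{24}}, \end{align*}\] which is exponentially small in \(n\), whereas Chebyshev’s inequality only gives a bound of order \(1/n\).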
The median trick
We can further boost the performance of Morris’ algorithm using the median trick. We choose \(t=\frac{3}{2\eps^2}\) in the algorithm introduced in the averaging trick and independently run it \(s\) times in parallel. Denote the outputs by \(\widehat{m}_1^{*},\widehat{m}_2^{*},\cdots,\widehat{m}_s^{*}\) respectively. It holds that for every \(i=1,\cdots,s\), \[\begin{align*} \Pr{\abs{\widehat{m}_i^*-m}\geq \eps m}\leq \frac{1}{3}. \end{align*}\] At last, we output the median \(\widehat{m}^{**}\) of \(\widehat{m}_1^{*},\widehat{m}_2^{*},\cdots,\widehat{m}_s^{*}\).
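The combined algorithm, continuing the sketches above (the function name and parameters are ours):

```python
import random
import statistics

def morris_median(stream, t, s):
    """Median of s independent averages, each over t Morris counters,
    all maintained in parallel during a single pass."""
    xs = [[0] * t for _ in range(s)]
    for _ in stream:
        for row in xs:
            for i in range(t):
                if random.random() < 2.0 ** (-row[i]):
                    row[i] += 1
    averages = [sum(2 ** x - 1 for x in row) / t for row in xs]
    return statistics.median(averages)
```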
Then we can apply the Chernoff bound to analyze the result obtained by the median trick. For every \(i=1,\cdots, s\), we let \(Y_i\) be the indicator of the (good) event that \[\begin{align*} \abs{\widehat{m}_i^{*}-m}<\eps \cdot m. \end{align*}\]
Then \(Y\defeq \sum_{i=1}^s Y_i\) satisfies \(\E{Y}\geq \frac{2}{3}s\). If the median \(\widehat{m}^{**}\) is bad (namely \(\abs{\widehat{m}^{**}-m}\geq \eps \cdot m\)), then at least half of the \(\widehat{m}_i^{*}\)’s are bad; equivalently, \(Y\leq \frac{1}{2}s\leq \E{Y}-\frac{1}{6}s\). By the lower tail of the Chernoff bound (with \(\delta\E{Y}=\frac{s}{6}\) and \(\E{Y}\le s\)), \[\begin{align*} \Pr{Y\leq \E{Y}-\frac{1}{6}s}\leq \exp\tp{-\frac{s^2}{72\,\E{Y}}}\leq \exp\tp{-\frac{s}{72}}. \end{align*}\]
Therefore, for \(t=\+O\tp{\frac{1}{\eps^2}}\) and \(s=O\tp{\log \frac{1}{\delta}}\), we have \[\begin{align*} \Pr {\abs{\widehat{m}^{**}-m}\geq \eps m}\leq \delta. \end{align*}\]
This new algorithm uses \(\+O(\frac{1}{\eps^2}\cdot \log \frac{1}{\delta} \cdot \log \log m)\) bits of memory.
Hoeffding’s inequality
One annoying restriction of the Chernoff bound is that each \(X_i\) needs to be a Bernoulli random variable. Hoeffding’s inequality generalizes the Chernoff bound by allowing each \(X_i\) to follow an arbitrary distribution, provided its value is almost surely bounded.
Proposition 4 (Hoeffding’s inequality) Let \(X_1, \ldots, X_n\) be independent random variables where each \(X_i\in [a_i, b_i]\) for certain \(a_i\le b_i\) with probability \(1\). Assume \(\E {X_i} = p_i\) for every \(1\le i\le n\). Let \(X = \sum_{i=1}^{n} X_i\) and \(\mu\defeq \E{X}=\sum_{i=1}^n p_i\), then \[ \Pr {| X - \mu | \geq t} \leq 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right) \] for all \(t\geq 0\).
We learnt from the proof of the Chernoff bound that the key to establishing concentration inequalities of this form is a good upper bound on the moment generating function. Therefore, the following Hoeffding’s lemma is the main technical ingredient in the proof of the inequality.
Lemma 2 (Hoeffding’s lemma) Let \(X\) be a random variable with \(\E{X}=0\) and \(X \in [a,b]\). Then it holds that \[ \E{e^{\lambda X}} \leq \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right) \text{ for all } \lambda \in \mathbb{R}. \]
Proof. Let \(\psi(\lambda) = \log \E{e^{\lambda X}}\). We first compute its derivatives: \[ \psi'(\lambda) = \frac{\E{Xe^{\lambda X}}}{\E{e^{\lambda X}}} \] and \[ \psi''(\lambda) = \frac{\E{X^2e^{\lambda X}}}{\E{e^{\lambda X}}} - \frac{\E{Xe^{\lambda X}}^2}{\E{e^{\lambda X}}^2}. \] When \(\lambda=0\), using the condition \(\E{X}=0\), we have \(\psi(0) =\psi'(0)=0\). Let \(P\) be the distribution of \(X\). Note that we can interpret \(\psi''\) as the variance of a tilted random variable, that is, \(\psi''(\lambda)=\Var[Q]{X}\) where \(\frac{\dd Q}{\dd P}(x) = \frac{e^{\lambda x}}{\E{e^{\lambda X}}}\). Since \(X\) is supported on \([a,b]\), for any \(\lambda\), \[ \Var[Q]{X}\leq \E[Q]{\tp{X-\frac{a+b}{2}}^2} \leq \frac{(b-a)^2}{4}, \] where the first inequality uses \(\Var{Z}\le \E{(Z-c)^2}\) for any constant \(c\). Therefore, \[\begin{align*} \psi(\lambda) &= \int_0^{\lambda} \psi'(t)\dd t \\ &= \int_0^{\lambda}\int_0^t \psi''(s)\dd s\dd t \\ &\leq \int_0^{\lambda}\int_0^t \frac{(b-a)^2}{4} \dd s\dd t \\ &= \frac{\lambda^2(b-a)^2}{8}. \end{align*}\]
Armed with Hoeffding’s lemma, it is routine to prove Hoeffding’s inequality.
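Indeed, for any \(\lambda > 0\), applying Markov’s inequality to \(e^{\lambda(X-\mu)}\) and then Hoeffding’s lemma to each centered variable \(X_i - p_i \in [a_i - p_i, b_i - p_i]\) gives \[\begin{align*} \Pr{X - \mu \geq t} \leq e^{-\lambda t}\,\E{e^{\lambda (X-\mu)}} = e^{-\lambda t} \prod_{i=1}^n \E{e^{\lambda (X_i - p_i)}} \leq \exp\tp{-\lambda t + \frac{\lambda^2}{8}\sum_{i=1}^n (b_i-a_i)^2}. \end{align*}\] Choosing \(\lambda = \frac{4t}{\sum_{i=1}^n (b_i-a_i)^2}\) minimizes the exponent and yields \(\Pr{X-\mu\geq t}\leq \exp\tp{-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}}\). The bound for \(\Pr{X-\mu\leq -t}\) follows by applying the same argument to \(-X\), and a union bound over the two tails gives the factor \(2\).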
Multi-Armed Bandit
Suppose there is a \(k\)-armed bandit, where the reward of arm \(i\) follows some distribution \(f_i\) supported on \([0, 1]\) with mean \(\mu_i\). We assume without loss of generality that \(\mu_1 \ge \mu_2 \ge \cdots \ge \mu_k\). Now suppose we can pull the bandit for \(T\) rounds, and the goal is to maximize the expected total reward. If we knew \(\mu_1, \dots, \mu_k\), the optimal strategy would be to pull arm \(1\) in every round, and the expected reward would be \(T\mu_1\). However, when we do not know the distributions, we have to design a strategy that explores the bandit first.
Denote by \(a_t\) the arm pulled in round \(t\), so the reward in the \(t\)-th round is \(X_t \sim f_{a_t}\). The regret of a strategy is defined as the gap between \(T\mu_1\) and the expected reward of the strategy over \(T\) rounds, namely the regret of not always choosing the best arm: \[\begin{align*} R(T) &\defeq T \mu _1 - \E{\sum_{t=1}^T X_t} \ge 0. \end{align*}\]
For every \(i\in [k]\), we denote by \(\Delta_i\defeq \mu_1-\mu_i\ge 0\) the gap between the mean reward of the \(i\)-th arm and that of the optimal arm. The naive strategy that pulls each arm an equal number of times is bad, since its regret is \(R(T) = \frac{\sum _{i=1}^k \Delta_i}{k}\cdot T\), which is linear in \(T\). We consider a strategy/algorithm good if \(\lim_{T\to\infty}R(T)/T=0\), or equivalently \(R(T)=o(T)\).
Proposition 5 For every \(t \in [T]\), let \(n_i(t) \defeq \sum _{s=1}^t \*1[a_s = i]\) denote the number of times that the arm \(i\) is pulled in the first \(t\) rounds. Then \[\begin{align*} R(T) = \sum _{i=2}^k \Delta _i \cdot \E{n_i(T)}. \end{align*}\]
Proof. \[\begin{align*} R(T)&= T\mu_1 - \E{\sum _{t=1}^T X_t}\\ &= T\mu_1 - \sum _{t=1}^T \E[a_t]{\mu_{a_t}}\\ &= \sum _{t=1}^T \sum _{i=1}^k \Delta_i \cdot \E{\*1[a_t = i]}\\ &= \sum _{i=1}^k \Delta_i \cdot \E{\sum _{t=1}^T \*1[a_t = i]}\\ &= \sum _{i=1}^k \Delta_i \cdot \E{n_i(T)}. \end{align*}\]
We also write \(R_i(T)\defeq \Delta_i\cdot\E{n_i(T)}\) for every \(i\in[k]\), and then \(R(T)=\sum_{i=1}^k R_i(T)\).
The Explore-then-Commit (ETC) Algorithm
To get small regret, our strategy should identify the best arm as soon as possible. The most straightforward way to find the best arm is to try each arm a few times and pick the one with the best empirical reward. The Explore-then-Commit algorithm implements this idea: pull every arm \(L\) times (so \(kL\) pulls in total for exploration), and let \(\hat{\mu}_i\) be the average reward of arm \(i\) over those \(L\) pulls. After this, always pull the arm with the greatest \(\hat{\mu}_i\). Since the algorithm commits to arm \(i\) only if \(\hat{\mu}_i \ge \max_{j\neq i}\hat{\mu}_j\), we can bound its regret as \[\begin{align*} R(T)&\le L \sum_{i=1}^{k} \Delta _i + \sum _{i=2}^k \Delta _i \cdot \sum _{t = kL+1}^T \Pr{\hat{\mu}_i \ge \max _{j \neq i} \hat{\mu}_j} \notag \\ &= L \sum_{i=1}^{k} \Delta _i + \sum _{i=2}^k \Delta _i \cdot (T - kL) \Pr{\hat{\mu}_i \ge \max _{j \neq i} \hat{\mu}_j}. \end{align*}\]
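Before bounding the probability above, here is a minimal sketch of the strategy itself (the interface, in which each arm is a callable returning one random reward in \([0,1]\), is our own):

```python
def explore_then_commit(arms, L, T):
    """arms: a list of k callables, each returning one random reward in [0, 1].
    Returns the total reward collected over T rounds."""
    k = len(arms)
    total_reward = 0.0
    mu_hat = []
    # Exploration: pull each arm L times and record its empirical mean.
    for pull in arms:
        rewards = [pull() for _ in range(L)]
        total_reward += sum(rewards)
        mu_hat.append(sum(rewards) / L)
    # Commitment: pull the empirically best arm for the remaining rounds.
    best = max(range(k), key=lambda i: mu_hat[i])
    total_reward += sum(arms[best]() for _ in range(T - k * L))
    return total_reward
```

For instance, Bernoulli arms can be simulated via `arms = [lambda p=p: float(random.random() < p) for p in (0.9, 0.5, 0.3)]` after `import random`.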
When \(i\neq 1\), \[\begin{align*} \Pr{\hat{\mu}_i \ge \max _{j \neq i} \hat{\mu}_j} \le \Pr{\hat{\mu}_i \ge \hat{\mu}_1}. \end{align*}\]
We bound the above probability by concentration inequalities. To this end, let \(X_j\) be the \(j\)-th reward drawn from \(f_i\) and \(Y_j\) the \(j\)-th reward drawn from \(f_1\). Let \(Z_j = X_j - Y_j \in [-1, 1]\), so \(\E{Z_j} = - \Delta _i \le 0\). Let \(Z = \sum _{j=1}^L Z_j\), so \(\E Z = - L \Delta _i\). By Hoeffding’s inequality, \[\begin{align*} \Pr{\hat{\mu}_i \ge \hat{\mu}_1} &= \Pr{Z \ge 0} = \Pr{Z - \E Z \ge L \Delta _i} \le \exp\tp{-\frac{2(L \Delta _i)^2}{\sum _{j=1}^L 2^2}}= \exp\tp{-\frac{L \Delta _i^2}{2}}. \end{align*}\]
Therefore we have \[\begin{align*} R(T) &\le L \sum _{i=1}^k \Delta _i + (T - kL) \sum _{i=2}^k \Delta _i \exp\tp{-\frac{L \Delta _i^2}{2}}\\ &\le \sum _{i=1}^k \tp{L \Delta _i + T \Delta _i \exp\tp{-\frac{L\Delta _i^2}{2}}} \le \sum _{i=1}^k \tp{L + T \Delta _i \exp\tp{-\frac{L\Delta _i^2}{2}}} .\end{align*}\]
To further upper bound \(R(T)\), we define \[\begin{align*} g(L, \Delta _i)\defeq L + T \Delta _i \exp(-\frac{L\Delta _i^2}{2}). \end{align*}\]
We would like to determine \(L\) minimizing the upper bound of \(R(T)\) among all possible \(\Delta _i\), i.e., \(\min_L \max_{\Delta _i} R(T)\). First we calculate \(\max_{\Delta _i} g(L, \Delta _i)\): \[\begin{align*} \frac{\partial g(L, \Delta _i)}{\partial \Delta _i} = T(1 - L \Delta _i^2) \exp\tp{-\frac{L \Delta_i^2}{2}}. \end{align*}\]
We have \(\frac{\partial g(L, \Delta _i)}{\partial \Delta _i} > 0\) when \(0 \le \Delta _i < \frac{1}{\sqrt{L} }\), and \(\frac{\partial g(L, \Delta _i)}{\partial \Delta _i} < 0\) when \(1 \ge \Delta _i > \frac{1}{\sqrt{L} }\). Thus, for all \(L > 1\), \[\begin{align*} g(L, \Delta _i) \le g(L, \frac{1}{\sqrt{L} }) = L + \frac{T e^{-1 / 2}}{\sqrt{L} }. \end{align*}\]
Finally, \[\begin{align*} R(T) \le \sum _{i=1}^k (L + \frac{T e^{-1 / 2}}{\sqrt{L} }) = \Theta(k T^{\frac{2}{3}}) ,\end{align*}\] by setting \(L = \Theta(T^{\frac{2}{3}})\).
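To see why this choice of \(L\) balances the two terms, write \(c = e^{-1/2}\) and minimize \(h(L) = L + \frac{cT}{\sqrt{L}}\): \[\begin{align*} h'(L) = 1 - \frac{cT}{2}L^{-3/2} = 0 \quad\Longleftrightarrow\quad L = \tp{\frac{cT}{2}}^{2/3} = \Theta(T^{2/3}), \end{align*}\] at which point both terms of \(h(L)\) are \(\Theta(T^{2/3})\).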
The Explore-then-Commit algorithm enjoys sublinear regret, which is good but still suboptimal. Its main disadvantage is that it treats all arms equally in the exploration step, pulling each of them a fixed number \(L\) of times regardless of the rewards already obtained.