$$ \def\*#1{\mathbf{#1}} \def\+#1{\mathcal{#1}} \def\-#1{\mathrm{#1}} \def\!#1{\mathsf{#1}} \def\@#1{\mathscr{#1}} \newcommand{\mr}[1]{\mbox{\scriptsize \color{RedViolet}$\triangleright\;$#1}\quad\quad} $$

Lecture 14: Sparsification

Authors

instructed by Chihao Zhang

scribed by Yuchen He

Part of this note follows the course material of James R. Lee.

The motivation for sparsification

Consider a matrix \(A \in \bb R^{m \times n}\) where \(m \gg n\). We can view \(A\) as a collection of \(m\) row vectors: \[ A=\left[\begin{matrix}a_1^{\top}\\\vdots \\ a_m^{\top} \end{matrix}\right], \] where each \(a_i\in \bb R^n\). Let \(b \in \mathbb{R}^m\) be a target vector. A fundamental problem is to solve the least squares regression: \[ \min_{x\in \bb R^n} \norm{Ax-b}_2. \] When \(m\) is extremely large, computing the exact solution becomes computationally prohibitive. This motivates the idea of sparsification.

We want to select a small subset of \(M\) rows from \(A\) (where \(n \le M \ll m\)) to form a smaller matrix \(\tilde{A} \in \mathbb{R}^{M \times n}\) that preserves the geometric structure of \(A\). Specifically, we desire that for all \(x \in \mathbb{R}^n\), \(\norm{Ax}_2\approx \norm{\tilde A x}_2\).

If such an \(\tilde{A}\) exists, we can solve the reduced problem: \[ \min_{x \in \bb{R}^n} \|\tilde{A}x - \tilde{b}\|_2, \] where \(\tilde{b} \in \mathbb{R}^M\) consists of the entries of \(b\) corresponding to the selected rows.
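To make the reduction concrete, here is a minimal NumPy sketch (sizes are illustrative; the rows are chosen uniformly at random purely as a placeholder, since how to choose them well is the subject of the rest of this lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M = 10_000, 20, 200
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Full problem: uses all m rows.
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

# Reduced problem: keep only M selected rows of A and the matching entries of b.
# (Uniform selection is only a placeholder here.)
S = rng.choice(m, size=M, replace=False)
x_red, *_ = np.linalg.lstsq(A[S], b[S], rcond=None)

print(np.linalg.norm(A @ x_full - b), np.linalg.norm(A @ x_red - b))
```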

We can abstract the linear regression example into a broader function approximation problem. Assume we have a global objective function \(F: \mathbb{R}^n \to \mathbb{R}\) that decomposes into a sum of \(m\) simpler functions: \[ F(x)=\sum_{i=1}^m f_i(x). \] Here \(m\) is very large, making the evaluation of \(F(x)\) expensive.

Our goal is to find a sparse approximation \(\tilde{F}\). Instead of summing all \(m\) terms, we assign a non-negative weight \(c_i \geq 0\) to each function \(f_i\) to construct: \[ \tilde F(x)=\sum_{i=1}^m c_if_i(x). \] We say the weight vector \(c = (c_1, \dots, c_m)\) is \(M\)-sparse if the number of non-zero entries is small, i.e., \(\|c\|_0 \le M\).

This framework unifies many important problems. For example, consider an undirected graph \(G=(V,E)\) with non-negative edge weights \(w_{i,j}\) for each edge \((i,j) \in E\). Without loss of generality, assume \(V=[n]\). Define the component function for each edge as: \[ f_{i,j}(x)=w_{i,j}\abs{x_i-x_j},\quad \mbox{for } x\in \set{0,1}^n \] Then, the global function \(F(x)\) represents the cut function: \[ F(x) = \sum_{(i,j) \in E} w_{i,j} |x_i - x_j| = \text{weight of the cut defined by } x. \] Minimizing \(F(x)\) over non-trivial binary vectors corresponds to finding the global minimum cut of the graph. If we can sparsify \(F(x)\), we effectively construct a sparse graph \(H\) (with fewer edges \(M \ll |E|\)) such that the value of every cut in \(H\) is approximately the same as in \(G\). This allows us to run expensive graph algorithms (like max-flow or min-cut) on the sparse graph \(H\), significantly reducing computational complexity from depending on \(|E|\) to depending on \(M\).
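As a tiny illustration of this identity, the following sketch evaluates the cut function on a made-up weighted graph (the edges and weights are hypothetical):

```python
# A toy check that F(x) = sum_{(i,j) in E} w_ij * |x_i - x_j| is the weight of the
# cut induced by the indicator vector x (the graph and weights below are made up).
edges = {(0, 1): 2.0, (1, 2): 1.5, (0, 3): 0.5, (2, 3): 3.0}   # w_{i,j} for (i,j) in E

def cut_weight(x):
    return sum(w * abs(x[i] - x[j]) for (i, j), w in edges.items())

x = [1, 1, 0, 0]          # S = {0, 1}: edges (1,2) and (0,3) cross the cut
print(cut_weight(x))      # 1.5 + 0.5 = 2.0
```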

In this lecture, we consider the linear model with \(\ell^2\)-loss. That is, we assume each \(f_i(x) = \inner{a_i}{x}^2\) for some vector \(a_i\in \bb R^n\).

Importance sampling

A natural first attempt at sparsification is to select \(M\) rows of \(A\) uniformly at random. However, this naive approach often fails.

Consider a matrix \(A\) where one row \(a_1\) has a very large norm, while all other rows \(a_i\) are negligible. The function \(F(x) = \|Ax\|^2\) is dominated by the term \(\langle a_1, x \rangle^2\). If we sample uniformly, the probability of picking \(a_1\) is small. Most of the time, we will miss \(a_1\) entirely, resulting in an approximation \(\tilde{F}(x) \approx 0\), which is a huge relative error. To capture the geometry of \(A\), we must bias our sampling towards important rows.

We assign a probability \(p_i\) to each index \(i \in [m]\) such that \(\sum p_i = 1\). Denote this distribution as \(\mu\). We select indices \(i_1, \dots, i_M\) independently according to \(\mu\). To ensure the approximation is correct in expectation, we must reweight the selected terms.

Define the single-sample estimator \[ \hat{F}_I(x) = \frac{f_I(x)}{p_I}, \quad \text{where } I \sim \mu. \]

It is easy to verify this is unbiased: \[ \E{\hat{F}_I(x)} = \sum_{i=1}^m p_i \frac{f_i(x)}{p_i} = \sum_{i=1}^m f_i(x) = F(x). \]

To reduce variance, we average \(M\) independent copies. Our final estimator is \[ \hat F(x) = \frac{1}{M}\sum_{k=1}^M \frac{f_{i_k}(x)}{p_{i_k}}. \]
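The following toy sketch illustrates the estimator for \(f_i(x)=\inner{a_i}{x}^2\). The matrix with one heavy row mirrors the example above, and the norm-proportional distribution is only one illustrative way to bias the sampling; the principled choice is derived in the next section:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, M = 5_000, 10, 400

A = 0.01 * rng.standard_normal((m, n))
A[0] = 10.0 * rng.standard_normal(n)     # one dominant row, as in the example above

x = rng.standard_normal(n)
f = (A @ x) ** 2                         # f_i(x) = <a_i, x>^2,  i = 1..m
F = f.sum()                              # F(x)

def estimate(p):
    """(1/M) * sum_k f_{i_k}(x) / p_{i_k}  with  i_1, ..., i_M ~ p  i.i.d."""
    idx = rng.choice(m, size=M, p=p)
    return np.mean(f[idx] / p[idx])

p_unif = np.full(m, 1.0 / m)                  # uniform sampling
p_heavy = np.linalg.norm(A, axis=1) ** 2      # bias towards rows with large norm
p_heavy /= p_heavy.sum()                      # (an illustrative heuristic only)

print(F, estimate(p_unif), estimate(p_heavy))  # both unbiased; the second has far smaller variance
```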

Statistical leverage scores

We choose \(p_i\propto \alpha_i\), where \[ \alpha_i\defeq \sup_{x: F(x)\neq 0} \frac{f_i(x)}{F(x)} = \sup_{x:F(x)\leq 1} f_i(x); \] the second equality holds in our \(\ell^2\)-loss case because both \(f_i\) and \(F\) are homogeneous of degree two. Intuitively, \(\alpha_i\) (often called the sensitivity of the \(i\)-th term) measures the worst-case importance of row \(i\). If \(\alpha_i\) is close to 1, there exists a direction \(x\) in which row \(i\) is the dominant contributor to the norm, so we must sample it.

For the matrix sparsification case where \(f_i(x) = \langle a_i, x \rangle^2\), these sensitivities have a closed-form solution known as leverage scores.

Assume without loss of generality that \(A\) has full rank \(n\) (otherwise, replace the inverse with the pseudoinverse in the following calculations). We have \[ \begin{align*} \alpha_i &=\sup_{x: \norm{Ax}^2\leq 1} \inner{a_i}{x}^2\\ &= \sup_{x: \inner{(A^{\top}A)^{\frac{1}{2}}x}{(A^{\top}A)^{\frac{1}{2}}x}\leq 1} \inner{a_i}{x}^2\\ \mr{$y=(A^{\top}A)^{\frac{1}{2}}x$} &= \sup_{y:\norm{y}\leq 1} \inner{a_i}{(A^{\top}A)^{-\frac{1}{2}}y}^2 \\ &=\norm{(A^{\top}A)^{-\frac{1}{2}} a_i}_2^2. \end{align*} \] This quantity, denoted as \(\sigma_i(A)\), is the \(i\)-th statistical leverage score of \(A\).

To determine the exact value of \(p_i\), we need to calculate the normalizing factor. We have \[ \begin{align*} \sum_{i=1}^m \sigma_i(A) &= \sum_{i=1}^m \inner{a_i}{(A^{\top}A)^{-1} a_i} \\ &=\!{Tr}\tp{(A^{\top}A)^{-1}\cdot \sum_{i=1}^m a_i a_i^{\top}}\\ \mr{$\sum_{i=1}^m a_ia_i^{\top}=A^{\top}A$} &= \!{Tr}\tp{(A^{\top}A)^{-1} \cdot A^{\top}A}\\ &=\!{Tr}(I_n)\\ &=n. \end{align*} \] This suggests \(p_i=\frac{a_i^{\top} \tp{A^{\top}A}^{-1}a_i}{n}=\frac{\sigma_i(A)}{n}\).
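As a quick numerical sanity check of this calculation (sizes are illustrative), one can compute the leverage scores directly and verify that they sum to \(n\):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2_000, 8
A = rng.standard_normal((m, n))

# Leverage scores sigma_i(A) = a_i^T (A^T A)^{-1} a_i for every row a_i.
G_inv = np.linalg.inv(A.T @ A)
sigma = np.einsum('ij,jk,ik->i', A, G_inv, A)

print(sigma.sum())    # = n (here A has full column rank), matching the calculation above
p = sigma / n         # sampling probabilities p_i
```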

Using these probabilities, our sparse approximation \(\hat{F}(x)\) corresponds to a reweighted random matrix. Recall that
\[ F(x) = \sum_{i=1}^m \inner{a_i}{x}^2 = x^{\top} \tp{\sum_{i=1}^m a_i a_i^{\top}} x = x^\top (A^\top A) x. \] The estimator is: \[ \hat F(x) = \frac{1}{M}\sum_{k=1}^M \frac{\inner{a_{i_k}}{x}^2}{p_{i_k}} = x^{\top} \tp{\frac{1}{M}\sum_{k=1}^M \frac{a_{i_k} a_{i_k}^{\top}}{p_{i_k}}} x. \] Thus, constructing the sparse function \(\hat{F}\) is equivalent to constructing the random matrix \[ \hat{A}^\top \hat{A} \defeq \frac{1}{M} \sum_{k=1}^M \frac{a_{i_k} a_{i_k}^\top}{p_{i_k}}. \]
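A minimal NumPy sketch of this construction, assuming the leverage-score probabilities derived above (sizes and the number of test points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, M = 5_000, 10, 600
A = rng.standard_normal((m, n))

# Leverage-score sampling probabilities p_i = sigma_i(A) / n, as derived above.
G = A.T @ A
p = np.einsum('ij,jk,ik->i', A, np.linalg.inv(G), A) / n

# hat{A}^T hat{A} = (1/M) sum_k a_{i_k} a_{i_k}^T / p_{i_k}.
idx = rng.choice(m, size=M, p=p)
gram_hat = (A[idx].T * (1.0 / (M * p[idx]))) @ A[idx]

# Compare the quadratic forms F(x) = x^T (A^T A) x and hat{F}(x) = x^T (hat{A}^T hat{A}) x.
for _ in range(3):
    x = rng.standard_normal(n)
    print(x @ G @ x, x @ gram_hat @ x)   # typically close, up to a small relative error
```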

There are two main approaches to analyse the approximation error:

  • Using matrix concentration bounds to analyse \(\hat{A}^\top \hat{A}\),
  • Applying Dudley’s theorem to bound \(\E{\sup_{x: F(x)\leq 1} |\hat{F}(x) - F(x)|}\).

To align with the topic of this course, we focus on the second approach.

Setting up Dudley’s theorem

We now use Dudley’s theorem to bound the expected supremum of the error.

Let \(\eps_1, \dots, \eps_M\) be independent Rademacher variables, uniformly distributed in \(\{\pm 1\}\). By the symmetrization result from the previous lecture, we can bound the deviation by the magnitude of the symmetrized (Rademacher) process: \[ \E{\sup_{x\in \Omega} \abs{F(x) - \hat F(x)}} \leq 2\E{\sup_{x\in \Omega} \abs{ x^{\top}\tp{\frac{1}{M}\sum_{k=1}^M \eps_k \frac{a_{i_k} a_{i_k}^{\top}}{p_{i_k}}} x }}, \] where \(\Omega=\set{x\in \bb R^n: F(x)\leq 1}\) is the unit ellipsoid defined by \(A^{\top}A\).

We fix the sampled indices \(\vec{i} = (i_1, \dots, i_M)\) and consider the randomness only over the signs \(\eps_1,\dots,\eps_M\). Define the process indexed by \(x\): \[ Z_x = \abs{ x^{\top}\tp{\frac{1}{M}\sum_{k=1}^M \eps_k \frac{a_{i_k} a_{i_k}^{\top}}{p_{i_k}}} x }. \] Consider the increment between two points \(x,y\): \[\begin{align*} Z_x-Z_y &\leq \abs{\frac{1}{M}\sum_{k=1}^M \eps_k \tp{\frac{\inner{a_{i_k}}{x}^2 - \inner{a_{i_k}}{y}^2}{p_{i_k}}}}\\ &= \abs{\frac{1}{M}\sum_{k=1}^M \eps_k \cdot \frac{\inner{a_{i_k}}{x+y}\cdot\inner{a_{i_k}}{x-y}}{p_{i_k}}}. \end{align*}\] Conditioned on \(\vec{i}\), this is the absolute value of a Rademacher sum with fixed coefficients, so by Hoeffding's lemma the increment \(Z_x - Z_y\) is sub-Gaussian with variance proxy \[ \sigma_{x,y}^2\leq \frac{1}{M^2}\sum_{k=1}^M \frac{\inner{a_{i_k}}{x+y}^2\inner{a_{i_k}}{x-y}^2}{p_{i_k}^2}. \]

Furthermore, \[ \begin{align*} \sigma^2_{x,y} &\leq \max_{k\in[M]} \frac{\inner{a_{i_k}}{x-y}^2}{p_{i_k}} \cdot \frac{1}{M^2}\cdot \sum_{k=1}^M \frac{\inner{a_{i_k}}{x+y}^2}{p_{i_k}} \\ \mr{$\frac{x+y}{2}\in \Omega$}&\leq \max_{k\in[M]} \frac{\inner{a_{i_k}}{x-y}^2}{p_{i_k}} \cdot \frac{4}{M^2}\cdot \sup_{z\in \Omega} \sum_{k=1}^M \frac{\inner{a_{i_k}}{z}^2}{p_{i_k}}\\ &= \max_{k\in[M]} \frac{\inner{a_{i_k}}{x-y}^2}{p_{i_k}} \cdot \frac{4}{M}\cdot \sup_{z\in \Omega} \hat F(z). \end{align*} \]

The bound on \(\sigma^2_{x,y}\) naturally suggests a specific distance metric on \(\Omega\). For any \(x,y\in \Omega\), define the distance \[ d_{\infty}^{\vec{i}} (x,y) = \max_{k\in[M]} \abs{\frac{\inner{a_{i_k}}{x-y}}{\sqrt{p_{i_k}}}}. \] Similarly, we define the scaling factor \(\sigma\) corresponding to the “radius” of the process: \[ \sigma^2 = \frac{4}{M}\cdot \sup_{x\in \Omega} \hat F(x). \] With these definitions, we have shown that \(\sigma^2_{x,y}\leq \sigma^2\cdot d_{\infty}^{\vec{i}}(x,y)^2\), i.e., conditioned on \(\vec i\), the process \((Z_x)\) has sub-Gaussian increments with respect to the metric \(\sigma\cdot d_{\infty}^{\vec{i}}\). Therefore, by Dudley’s theorem, the expected supremum (over the signs, for fixed \(\vec i\)) is bounded by: \[ \E{\sup_{x \in \Omega} Z_x} \lesssim \sigma \int_0^{\infty} \sqrt{\log N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)} \, \mathrm{d}\varepsilon. \]

The problem is now reduced to calculating the covering number \(N\tp{\Omega, d_{\infty}^{\vec{i}}, \eps}\).

The covering number

We rely on the following lemma, which bounds the complexity of covering a Euclidean ball with polytopes defined by linear constraints. We will prove this lemma in the next lecture.

Lemma 1 (Covering lemma) Let \(U \in \mathbb{R}^{M \times n}\) be a matrix with rows \(u_1^\top, \dots, u_M^\top\). Define the distance \[ d_U(x,y) = \norm{U(x-y)}_{\infty} = \max_{k \in [M]} |\langle u_k, x-y \rangle|. \] Then the covering number of the Euclidean unit ball \(B_2^n\) satisfies \[ \log N\tp{B_2^n, d_U, \eps}\leq \frac{\log M}{\eps^2}\cdot \max_{k\in[M]}\norm{u_k}^2_2. \]

Our domain \(\Omega = \{x : \|Ax\|_2 \le 1\}\) is an ellipsoid. To apply the lemma, we apply a change of variables to map \(\Omega\) to the standard unit ball \(B_2^n\). Let \(y = (A^{\top}A)^{1/2} x\). Then \[ F(x) = x^\top A^\top A x = y^\top y = \|y\|_2^2. \] Thus, \(x \in \Omega \iff y \in B_2^n\). Substituting \(x = (A^{\top}A)^{-1/2} y_x\) and \(z = (A^{\top}A)^{-1/2} y_z\) into our distance metric \(d_{\infty}^{\vec{i}}\), \[ \begin{align*} d_{\infty}^{\vec{i}}(x,z) &= \max_{k\in[M]} \left| \frac{\langle a_{i_k}, x-z \rangle}{\sqrt{p_{i_k}}} \right| \\ &= \max_{k\in[M]} \left| \left\langle \frac{a_{i_k}}{\sqrt{p_{i_k}}}, (A^{\top}A)^{-1/2} (y_x - y_z) \right\rangle \right| \\ &= \max_{k\in[M]} \left| \left\langle (A^{\top}A)^{-1/2} \frac{a_{i_k}}{\sqrt{p_{i_k}}}, y_x - y_z \right\rangle \right|. \end{align*} \] This is exactly the form \(d_U(y_x, y_z)\) if we define the matrix \(U\) with rows \[ u_k = (A^{\top}A)^{-1/2} \frac{a_{i_k}}{\sqrt{p_{i_k}}}. \] Crucially, the squared norm of these rows is determined by our leverage score sampling probabilities. Recall that \(p_{i_k} = \frac{\|(A^\top A)^{-1/2} a_{i_k}\|^2}{n}\). Therefore, \[ \|u_k\|_2^2 = \frac{1}{p_{i_k}} \|(A^{\top}A)^{-1/2} a_{i_k}\|_2^2 = n. \]

Applying the covering lemma with \(\max \|u_k\|^2 = n\), we obtain our first bound \[ \log N\tp{\Omega, d_{\infty}^{\vec{i}}, \eps} = \log N(B_2^n, d_U,\eps)\lesssim \frac{n\log M}{\eps^2}. \]
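As a numerical sanity check of the identity \(\norm{u_k}_2^2=n\) used above (sizes are illustrative; \((A^{\top}A)^{-1/2}\) is computed via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 1_000, 6
A = rng.standard_normal((m, n))

# (A^T A)^{-1/2} via an eigendecomposition of the Gram matrix.
w, V = np.linalg.eigh(A.T @ A)
G_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

sigma = np.einsum('ij,jk,ik->i', A, V @ np.diag(1.0 / w) @ V.T, A)   # leverage scores
p = sigma / n

# Rows u_k = (A^T A)^{-1/2} a_{i_k} / sqrt(p_{i_k}) all have squared norm n.
idx = rng.choice(m, size=50, p=p)
U = (G_inv_sqrt @ A[idx].T).T / np.sqrt(p[idx])[:, None]
print(np.allclose(np.sum(U ** 2, axis=1), n))   # True
```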

A bound for small \(\eps\)

While the bound above is effective at large scales, substituting it into Dudley’s integral \(\int \sqrt{\log N}\ \d\eps\) yields an integrand of order \(\frac{\sqrt{n\log M}}{\eps}\), which diverges near zero. We need a tighter bound for very small \(\eps\).

For small scales, we use a simple volumetric argument. We observe that our random metric is dominated by \(\sqrt{n}\cdot\norm{A(x-y)}_2\). For any \(x, y \in \Omega\): \[ \begin{align*} d_{\infty}^{\vec{i}}(x,y) &= \max_{k\in[M]} \frac{\abs{\inner{a_{i_k}}{x-y}}}{\sqrt{p_{i_k}}}\\ &= \max_{k\in[M]} \frac{\abs{\inner{\tp{A^{\top}A}^{\frac{1}{2}}(x-y)}{\tp{A^{\top}A}^{-\frac{1}{2}}a_{i_k}}}}{\sqrt{p_{i_k}}} \\ \mr{Cauchy-Schwarz}&\leq \max_{k\in[M]} \frac{\sqrt{\inner{x-y}{A^{\top}A(x-y)}}\cdot \sqrt{\inner{a_{i_k}}{\tp{A^{\top}A}^{-1}a_{i_k}}}}{\sqrt{p_{i_k}}} \\ \mr{$p_{i_k} = \frac{\|(A^\top A)^{-1/2} a_{i_k}\|^2}{n}$}&= \sqrt{n}\cdot \norm{A(x-y)}_2 \\ &\leq 2\sqrt{n}. \end{align*} \] Thus, the diameter of \(\Omega\) in this metric is at most \(2\sqrt{n}\). Therefore, \[ \Omega \subseteq 2\sqrt{n}\cdot B_{\infty}^{\vec i}, \] where \(B_{\infty}^{\vec i}=\set{x: d_{\infty}^{\vec{i}}(x,0)\leq 1}\).

In the coordinates \(y=(A^{\top}A)^{1/2}x\), the bound above reads \(d_{\infty}^{\vec{i}}(x,z)\leq \sqrt{n}\cdot\norm{y_x-y_z}_2\). Hence, to cover \(\Omega\) with \(d_{\infty}^{\vec{i}}\)-balls of radius \(\eps\), it suffices to cover the Euclidean ball \(\set{y: \norm{y}_2\leq \sqrt{n}}\) with Euclidean balls of radius \(\eps\) (equivalently, \(B_2^n\) with balls of radius \(\eps/\sqrt{n}\)).

Therefore, \[ N\tp{\Omega, d_{\infty}^{\vec{i}}, \eps} \leq \tp{\frac{c\sqrt{n}}{\eps}}^n \] for some universal constant \(c\).

Applying Dudley’s theorem

We are now ready to combine our covering number bounds and apply Dudley’s theorem to derive the final error bound. Recall Dudley’s bound: \[ \begin{align*} \E{\sup_{x \in \Omega} Z_x} &\lesssim \E{\sigma} \int_0^{\infty} \sqrt{\log N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)} \ \dd \varepsilon\\ &\lesssim \frac{1}{\sqrt{M}} \cdot \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \cdot \int_0^{\infty} \sqrt{\log N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)} \ \dd \varepsilon. \end{align*} \] Note that the diameter of \(\Omega\) is at most \(2\sqrt{n}\); therefore, for \(\eps>2\sqrt{n}\), \(N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)=1\). Without loss of generality, assume the choice of \(M\) satisfies \(\log M<2n\), so that the split point \(r=\frac{\log M}{\sqrt{n}}\) used below lies inside the integration range. Therefore, \[ \begin{align*} \E{\sup_{x\in \Omega} Z_x} &\lesssim \frac{\E{\sqrt{\sup_{x\in \Omega} \hat F(x)}}}{\sqrt{M}} \cdot \sqrt{n}\int_{0}^{2\sqrt{n}} \sqrt{\log \frac{\sqrt{n}}{\eps} \wedge \frac{\log M}{\eps^2}}\ \dd \eps\\ \mr{$\log t\leq t$}&\lesssim \frac{\E{\sqrt{\sup_{x\in \Omega} \hat F(x)}}}{\sqrt{M}} \cdot \sqrt{n}\int_{0}^{2\sqrt{n}} \sqrt{\frac{\sqrt{n}}{\eps} \wedge \frac{\log M}{\eps^2}}\ \dd \eps\\ \mr{$r=\frac{\log M}{\sqrt{n}}$}&= \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \cdot \sqrt{\frac{n}{M}}\cdot \tp{ \int_0^{r}\frac{n^{\frac{1}{4}}}{\sqrt{\eps}}\ \dd \eps + \int_r^{2\sqrt{n}} \frac{\sqrt{\log M}}{\eps} \ \dd \eps}\\ &\lesssim \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \cdot \underbrace{\sqrt{\frac{n}{M}}\cdot \sqrt{\log M}\cdot \log n}_{\theta}. \end{align*} \] Let \(S=\E{\sup_{x\in \Omega} \abs{F(x)-\hat F(x)}}\). We have established that \[ \begin{align*} S&\lesssim \theta \cdot \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \\ \mr{$\sup_{\Omega}\hat F\leq 1+\sup_{\Omega}\abs{F-\hat F}$}&\leq \theta \cdot\E{\sqrt{1+ \sup_{x\in \Omega} \abs{F(x)-\hat F(x)}}}\\ \mr{Jensen's inequality} &\leq \theta \cdot \tp{1+S}^{\frac{1}{2}}. \end{align*} \] Solving this inequality (spelled out below), we conclude that, provided \(\theta\lesssim 1\), \[ S= \E{\sup_{x\in \Omega} \abs{F(x)-\hat F(x)}} \lesssim \theta =\sqrt{\frac{n}{M}}\cdot \sqrt{\log M}\cdot \log n. \]
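For completeness, the quadratic inequality in \(S\) can be solved explicitly; this is a routine computation, valid under the assumption \(\theta\lesssim 1\) (the regime in which the bound is meaningful): \[ S \le \theta\tp{1+S}^{\frac{1}{2}} \;\Longrightarrow\; S^2-\theta^2 S-\theta^2\le 0 \;\Longrightarrow\; S \le \frac{\theta^2+\theta\sqrt{\theta^2+4}}{2} \lesssim \theta \quad \text{whenever } \theta\lesssim 1. \]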