The Streaming PCA Problem and Oja’s Algorithm
The Streaming PCA Problem
There are many situations in which one would like to compute the dominant eigenspace of a positive semidefinite matrix \Sigma \in \mathbb{R}^{d \times d}, that is, to perform principal component analysis. I personally became interested in this topic because of its close connection to factor modelling in quantitative finance. I won't dwell on the finance motivations for now (I hope to write more about this in the future), but will skip to the problem setup I have in mind.
In practical applications, one typically does not have access to the matrix \Sigma, but rather some samples X_i, i \in [n] for which \Sigma is the covariance (or more precisely second moment) matrix: \mathbf{E} X_i X_i^\top = \Sigma \qquad i \in [n]. To make things tractable, I will assume that the data is drawn iid. Let us further assume that we are only interested in the dominant eigenvector u_1 of \Sigma (the 1-PCA problem) associated to the leading eigenvalue \lambda_1 (a future post might touch on the k-PCA problem). For future reference, I will denote the eigenvalues of \Sigma by \lambda_{1:d} (in descending order) and the associated eigenvectors by u_{1:d}.
In principle, a solution to this problem is given by computing the SVD of \begin{bmatrix} X_1, X_2, \dots, X_n \end{bmatrix}. It is not particularly hard to prove (although I must confess to not having worked out the details) that this approach is reasonably statistically efficient, using standard tools from high-dimensional probability and the Davis-Kahan Theorem.
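As a concrete point of reference, here is a minimal sketch of the batch approach on synthetic data; the spiked covariance model, the sample size, and all variable names are illustrative choices of mine rather than anything prescribed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 5000

# A simple spiked covariance: one dominant direction on top of isotropic noise,
# so lambda_1 = 5 and lambda_2 = ... = lambda_d = 1.
u_true = rng.standard_normal(d)
u_true /= np.linalg.norm(u_true)
Sigma = np.eye(d) + 4.0 * np.outer(u_true, u_true)

# iid samples whose second-moment matrix is Sigma.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)  # shape (n, d)

# Batch estimate of u_1: the leading right singular vector of the data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
u_hat = Vt[0]

# Squared sine distance 1 - (u_1^T u_hat)^2 between the estimate and the truth.
print(1.0 - (u_true @ u_hat) ** 2)
```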
However, the computational complexity of this approach scales unfavorably with d (something like O(nd^2)), and moreover, it requires us to keep the entire batch in memory. In this post, I'll take a look at an alternative approach which promises the same statistical efficiency, has lower computational complexity (O(nd)), and moreover is streaming, i.e., only uses O(d) memory, independent of the number of samples n, which is of course highly desirable if we are working with large datasets. The algorithm I will review is due to Oja and to Oja and Karhunen.
The optimization problem and the continuous flow
My starting point is standard, and begins with the observation that the principal component u_1 of \Sigma is characterized by the following optimization problem: u_1 \in \argmax_{w \in \mathbb{S}^{d-1}} \underbrace{\frac{1}{2}w^\top \Sigma w}_{\triangleq J(w)}. The optimization problem naturally lends itself to being solved iteratively by a gradient algorithm such as gradient descent (ascent). At first glance, one might think that the highly non-convex sphere constraint poses some problems for us, but this turns out not to be the case. In fact, another motivation I have for studying Oja's algorithm is that it is an example of gradient descent converging on a non-convex problem.
Before we proceed with the discrete stochastic gradient rule that is Oja's algorithm, it is instructive to first consider the gradient flow. This allows us to gather some intuition before we turn to stochastic approximation. In the case of the continuous flow, the projection operator ensuring that our updates remain on the sphere is absorbed into the Riemannian gradient: \nabla_{\mathbb{S}^{d-1}} J(w) = (I_{d} - w w^{\top})\,\Sigma\, w.
Note that (I_{d} - w w^{\top}) is simply the projection onto the orthogonal complement of the span of the current iterate. In any case, it is straightforward to derive the solution of the corresponding gradient flow.
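Before doing so, a quick sanity check (a small aside of mine, not needed for what follows): the stationary points of the flow are exactly the unit eigenvectors of \Sigma, since for w \in \mathbb{S}^{d-1}, (I_d - w w^\top)\,\Sigma w = 0 \iff \Sigma w = (w^\top \Sigma w)\, w, i.e., w is an eigenvector of \Sigma with eigenvalue w^\top \Sigma w; among these stationary points, u_1 is the maximizer of J.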
Proposition.
Let w_0 \in \mathbb{S}^{d-1} and consider the ODE for t \ge 0:
\frac{d}{dt} w(t) = (I_d - w(t) w(t)^\top)\Sigma w(t), \qquad w(0)=w_0. \tag{1}
Then w(t) \in \mathbb{S}^{d-1} for all t \geq 0, and moreover
w(t) = \frac{\exp(\Sigma t)\, w_0} {\left\| \exp(\Sigma t)\, w_0 \right\|}.
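A quick verification, in lieu of a full proof (my own sketch): writing z(t) = \exp(\Sigma t)\, w_0, we have \dot z = \Sigma z, and the candidate is w = z/\|z\|. Differentiating, \frac{d}{dt} \frac{z}{\|z\|} = \frac{\dot z}{\|z\|} - \frac{z\,(z^\top \dot z)}{\|z\|^3} = \Sigma w - (w^\top \Sigma w)\, w = (I_d - w w^\top)\,\Sigma w, so the candidate satisfies Equation 1 with w(0) = w_0 and lies on the sphere by construction. That the flow itself preserves the norm follows from \frac{d}{dt}\|w\|^2 = 2\, w^\top (I_d - w w^\top)\,\Sigma w = 2(1-\|w\|^2)\, w^\top \Sigma w, which vanishes whenever \|w\|=1.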
This proposition is instructive as it is easy to see from the above that the flow (Equation 1) converges to the principal component u_1. Namely, using the spectral decomposition of \Sigma and representing w_0 in the corresponding eigenbasis, we can write: \begin{aligned} \exp(\Sigma t) w_0 &= \sum_{i=1}^d \exp(\lambda_i t)\, u_i u_i^\top \sum_{j=1}^d (u_j^\top w_0)\, u_j \\ &= \sum_{i=1}^d \exp(\lambda_i t)\, (u_i^\top w_0)\, u_i \\ &= \exp(\lambda_1 t)\!\left( (u_1^\top w_0)\, u_1 + \sum_{i=2}^d \exp((\lambda_i-\lambda_1)t)\, (u_i^\top w_0)\, u_i \right) \end{aligned} \tag{2}
from which it is easy to see that the dominant timescale is aligned with u_1 as long as the spectral gap \lambda_1 - \lambda_2 is strictly positive. Our first sanity check has passed and the continuous version of Oja’s algorithm indeed converges to the desired quantity.
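To see this numerically (a small illustration of my own, with an arbitrary spiked \Sigma), one can evaluate the closed-form flow at a few times and watch the squared sine distance to u_1 shrink; scipy.linalg.expm computes the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 30

# Spiked covariance with spectral gap lambda_1 - lambda_2 = 4.
u1 = rng.standard_normal(d)
u1 /= np.linalg.norm(u1)
Sigma = np.eye(d) + 4.0 * np.outer(u1, u1)

w0 = rng.standard_normal(d)
w0 /= np.linalg.norm(w0)

for t in [0.0, 0.5, 1.0, 2.0, 4.0]:
    z = expm(Sigma * t) @ w0       # exp(Sigma t) w_0
    w = z / np.linalg.norm(z)      # the flow at time t
    print(t, 1.0 - (u1 @ w) ** 2)  # squared sine distance to u_1
```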
Oja’s Algorithm
Oja’s algorithm is the projected stochastic gradient algorithm corresponding to the flow in Equation 1 in which we only have sample access to X_{1:n}, as opposed to knowing \Sigma. Namely, for a sequence of step-sizes \alpha_{0:\infty} \subset \mathbb{R} we write w_{k+1} = \Pi_{\mathbb{S}^{d-1}}\left( w_k +\alpha_k \left( X_{k+1}X_{k+1}^\top w_k \right) \right), \qquad w_0 = w_{\mathrm{init}} \in \mathbb{S}^{d-1}, \tag{3} where we recognize \nabla \hat J_{k+1}(w) = X_{k+1}X_{k+1}^\top w as the stochastic gradient of J(w) = \frac{1}{2} w^\top\mathbf{E}_X [X X^\top ]w. Moreover, it is worth pointing out that for every nonzero x \in \mathbb{R}^d, \Pi_{\mathbb{S}^{d-1}} (x) = x / \|x\|. Note that in contrast to the continuous-time algorithm, the discrete algorithm has an explicit projection step. A trivial but key observation is that Equation 3 can be written in terms of the un-normalized dynamics. In other words, we only need to compute the projection at the final step (although for reasons of numerical precision one may want to do this more frequently in practice: the iterates v_k below grow on a potentially exponential time-scale depending on the step-sizes \alpha_k).
Proposition.
As long as v_0 = w_{\mathrm{init}}, the Oja update is equivalent to:
\begin{aligned}
v_{k+1} &= v_k + \alpha_k\, X_{k+1}X_{k+1}^\top v_k \\
w_k &= v_k / \|v_k\|
\end{aligned}
\tag{4}
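To make Equations 3 and 4 concrete, here is a minimal sketch on synthetic data; the spiked covariance, the step-size schedule \alpha_k = \tfrac{1}{5(k+1)}, and the variable names are illustrative choices of mine. Running both forms on the same sample stream, the normalized iterates of Equation 4 coincide with the projected iterates of Equation 3 up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 20000

u1 = rng.standard_normal(d)
u1 /= np.linalg.norm(u1)
Sigma = np.eye(d) + 4.0 * np.outer(u1, u1)               # lambda_1 = 5, gap = 4
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
v = w.copy()                                              # v_0 = w_init

for k in range(n):
    alpha = 1.0 / (5.0 * (k + 1))                         # one simple 1/k-type schedule
    w = w + alpha * X[k] * (X[k] @ w)                     # stochastic gradient step
    w /= np.linalg.norm(w)                                # projection onto the sphere (Eq. 3)
    v = v + alpha * X[k] * (X[k] @ v)                     # un-normalized dynamics (Eq. 4)

print(1.0 - (w @ (v / np.linalg.norm(v))) ** 2)           # ~0: the two forms agree
print(1.0 - (u1 @ w) ** 2)                                # squared sine distance to u_1
```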

A Roadmap to Controlling the Squared Sine Distance to the Principal Component
In the sequel we seek to control the squared sine distance \mathrm{d}^2_{\sin} between the iterates w_k and the principal component u_1. For two unit vectors u, w \in \mathbb{S}^{d-1}, it is defined as \mathrm{d}^2_{\sin} (u,w) = 1- (u^\top w)^2.
Using our observation in Equation 4, we see that we can write out the squared sine distance between our iterates w_k and the principal component as \mathrm{d}^2_{\sin} (u_1,w_k) = 1- (u_1^\top w_k)^2 =1 - \frac{(u_1^\top v_k)^2}{\|v_k\|^2} = \frac{\|v_k\|^2 - (u_1^\top v_k)^2}{\|v_k\|^2} = \frac{\sum_{i=2}^d(u_i^\top v_k)^2}{\sum_{i=1}^d(u_i^\top v_k)^2}.
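The identity is also easy to check numerically; here is a tiny sketch (the random \Sigma and the vector v, standing in for an un-normalized iterate, are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10

# A random PSD matrix and its eigenbasis, ordered so that u[:, 0] is u_1.
A = rng.standard_normal((d, d))
Sigma = A @ A.T
lam, U = np.linalg.eigh(Sigma)      # eigh returns ascending eigenvalues
u = U[:, ::-1]                      # columns u_1, ..., u_d in descending order

v = rng.standard_normal(d)          # stands in for an un-normalized iterate v_k
w = v / np.linalg.norm(v)

lhs = 1.0 - (u[:, 0] @ w) ** 2
rhs = np.sum((u[:, 1:].T @ v) ** 2) / np.sum((u.T @ v) ** 2)
print(np.isclose(lhs, rhs))         # True
```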
I would like to first prove in-expectation guarantees on the squared distance, as this is typically the easier route.
The approach that I will pursue here is quite simple. Let me first rewrite \mathrm{d}^2_{\sin} (u_1,w_k) = \frac{\sum_{i=2}^d(u_i^\top v_k)^2}{\sum_{i=1}^d(u_i^\top v_k)^2}=\frac{\frac{1}{\gamma_{0:k}(1)}\sum_{i=2}^d(u_i^\top v_k)^2}{\frac{1}{\gamma_{0:k}(1)}\sum_{i=1}^d(u_i^\top v_k)^2}, \qquad \gamma_{0:k}(1) = \prod_{s=0}^{k-1}(1+\alpha_s \lambda_1).
I divide by \gamma_{0:k}(1) to isolate the dominant timescale. I will argue that the numerator decays at a statistical rate with carefully chosen step-sizes, and that the denominator is of constant order of magnitude with high probability provided that we have a good initial guess, say \mathrm{d}^2_{\sin} (u_1,w_0) \approx 1/2. We thus have our work cut out for us:
- Control the expectation of the numerator $\frac{1}{\gamma_{0:k}(1)}\sum_{i=2}^d (u_i^\top v_k)^2$.
- Argue that with sufficiently high probability, the denominator $\frac{1}{\gamma_{0:k}(1)}\sum_{i=1}^d (u_i^\top v_k)^2$ is lower bounded by a constant given a good initial guess.
- Figure out how to make a good initial guess — this will involve running Oja with a different step-size than during the main phase.
I will walk through all of this in the next post (possibly spread over several posts), but before we get there I feel a few remarks are in order. First, let me note that the appendix below clarifies how one might divine that dividing by \gamma_{0:k}(1) is the correct thing to do; it is again a matter of isolating the dominant time-scale. Second, we can get away with arguing that the denominator is of constant order with high probability and bounding the expectation of the numerator on the complement of this event, because the squared sine distance is always trivially bounded by 1. Finally, the need for a good initial guess comes from the observation that a random guess, say uniform over the sphere, has squared alignment (u_1^\top w_0)^2 of order 1/d, which scales unfavorably in high dimensions and would yield a poor dimensional dependence in our analysis.
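The last point is easy to see empirically; a small illustration of my own (the dimension and trial count are arbitrary): for w_0 uniform on the sphere, the squared alignment (u_1^\top w_0)^2 has mean exactly 1/d, so the initial squared sine distance is typically about 1 - 1/d.

```python
import numpy as np

rng = np.random.default_rng(4)
d, trials = 200, 10000

# Uniform directions on the sphere: normalized standard Gaussians.
W = rng.standard_normal((trials, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

u1 = np.zeros(d)
u1[0] = 1.0                              # any fixed unit vector works by symmetry

print(np.mean((W @ u1) ** 2), 1.0 / d)   # both are roughly 1/d
```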
Appendix: The Noisy Linear Systems View
I haven't quite decided whether to use the material below in my future posts on this topic, but I find it a helpful crutch for thinking about the problem from a different angle.
Recall that in Equation 2 we were able to expand the solution of the continuous algorithm in the eigenbasis of \Sigma. This seems like a good first step, but how should we mimic it here? There is clearly no guarantee that the random operators \prod_{s=0}^{k-1} (I+\alpha_{s} X_{s+1}X_{s+1}^\top) are diagonalizable in the basis of \Sigma. A fairly standard trick from optimization and control to get around this issue is to rewrite the update in Equation 4 as:
v_{k+1} = \underbrace{\left(I_d+\alpha_k \Sigma\right)}_{=A_k} v_k + \underbrace{\alpha_k \underbrace{\left(X_{k+1}X_{k+1}^\top-\Sigma \right)}_{=\Delta_{k+1}}v_k}_{=\epsilon_{k+1}} \tag{5} where for convenience I have introduced A_k= I_d+\alpha_k \Sigma, \Delta_k = X_{k}X_{k}^\top-\Sigma and \epsilon_k= \alpha_{k-1} \Delta_k v_{k-1}. The point of the above is that we have isolated a deterministic drift term and a signal-dependent diffusion term. In particular, the products \prod_{s=0}^{k-1} (I+\alpha_{s} \Sigma) are all simultaneously diagonalizable with \Sigma. The fact that the noise term \left(X_{k+1}X_{k+1}^\top-\Sigma \right) hits the state multiplicatively means that this is what we in linear systems theory call a multiplicative noise model. Anyway, we can now leverage the closed-form solution to Equation 5:
v_{k} = \Phi_{0:k} v_0 + \sum_{t=0}^{k-1} \Phi_{t+1:k}\, \epsilon_{t+1}
where \Phi_{a:b} := A_{b-1} \times \dots \times A_a for integers b\geq a and \Phi_{a:a} := I_d. I have kind of lied to you here in calling the above a solution, since the right-hand side still depends on v_{1:k-1} through \epsilon. I do however find the display above useful, as it allows us to isolate the dominant time-scales in a rather crisp way if we notice that: \Phi_{a:b} =\sum_{i=1}^d \underbrace{\prod_{s=a}^{b-1}(1+\alpha_s \lambda_i)}_{\triangleq \gamma_{a:b}(i)} u_i u_i^\top = \sum_{i=1}^d \gamma_{a:b}(i) u_i u_i^\top. Hence:
v_{k} = \sum_{i=1}^d \gamma_{0:k}(i)\, u_i u_i^\top v_0 + \sum_{i=1}^d \sum_{t=0}^{k-1} \gamma_{t+1:k}(i)\, u_i u_i^\top \epsilon_{t+1}. \tag{6}
Equation 6 again shows that there is a dominant time-scale associated to \lambda_1 as long as the spectral gap \lambda_1 - \lambda_2 is strictly positive. There's a whole method to the madness of linear systems with multiplicative errors in the literature, and it's something I quite frankly haven't looked at much before, because the model (with iid multiplicative errors) always seemed a little contrived to me. My interest is somewhat piqued now, though, that it has naturally appeared in the construction of a practical algorithm.
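As a final sanity check (a small consistency sketch of my own, with arbitrary synthetic data and step-sizes), Equation 6 can be verified numerically against the raw recursion in Equation 4:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10, 200

u1 = rng.standard_normal(d)
u1 /= np.linalg.norm(u1)
Sigma = np.eye(d) + 4.0 * np.outer(u1, u1)
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]                      # descending eigen-pairs of Sigma

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
alpha = 1.0 / (5.0 * (np.arange(n) + 1.0))

# Raw recursion (Equation 4) while recording eps_{k+1} = alpha_k Delta_{k+1} v_k.
v = rng.standard_normal(d)
v0, eps = v.copy(), []
for k in range(n):
    Delta = np.outer(X[k], X[k]) - Sigma
    eps.append(alpha[k] * Delta @ v)
    v = v + alpha[k] * X[k] * (X[k] @ v)

# gamma_{a:b}(i) = prod_{s=a}^{b-1} (1 + alpha_s lambda_i), returned as a length-d vector.
def gamma(a, b):
    return np.prod(1.0 + alpha[a:b, None] * lam[None, :], axis=0)

# Right-hand side of Equation 6 with k = n.
rhs = U @ (gamma(0, n) * (U.T @ v0))
for t in range(n):
    rhs += U @ (gamma(t + 1, n) * (U.T @ eps[t]))   # eps[t] plays the role of eps_{t+1}

print(np.allclose(v, rhs))                          # True
```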