The Streaming PCA Problem and Oja’s Algorithm
The Streaming PCA Problem
There are many situations in which one would like to compute the dominant eigenspace of a positive semidefinite matrix \Sigma \in \mathbb{R}^{d \times d}, that is, to perform principal component analysis. I personally became interested in this topic because of its close connection to factor modelling in quantitative finance. I won't dwell on the finance motivations for now (I hope to write more about this in the future), but will skip to the problem setup I have in mind.
In practical applications, one typically does not have access to the matrix \Sigma, but rather some samples X_i, i \in [n] for which \Sigma is the covariance (or more precisely second moment) matrix: \mathbf{E} X_i X_i^\top = \Sigma \qquad i \in [n]. To make things tractable, I will assume that the data is drawn iid. Let us further assume that we are only interested in the dominant eigenvector u_1 of \Sigma (the 1-PCA problem) associated to the leading eigenvalue \lambda_1 (a future post might touch on the k-PCA problem). For future reference, I will denote the eigenvalues of \Sigma by \lambda_{1:d} (in descending order) and the associated eigenvectors by u_{1:d}.
In principle, a solution to this problem is given by computing the SVD of \begin{bmatrix} X_1, X_2, \dots, X_n \end{bmatrix}. It is not particularly hard to prove (although I must confess to not having worked out the details) that this approach is reasonably statistically efficient, using standard tools from high-dimensional probability and the Davis-Kahan Theorem.
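As a concrete point of reference, here is a minimal sketch of the batch approach on synthetic data; the spiked covariance model, the sample size, and all variable names are illustrative choices of mine rather than anything prescribed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 5000

# A simple spiked covariance: one dominant direction on top of isotropic noise,
# so lambda_1 = 5 and lambda_2 = ... = lambda_d = 1.
u_true = rng.standard_normal(d)
u_true /= np.linalg.norm(u_true)
Sigma = np.eye(d) + 4.0 * np.outer(u_true, u_true)

# iid samples whose second-moment matrix is Sigma.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)  # shape (n, d)

# Batch estimate of u_1: the leading right singular vector of the data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
u_hat = Vt[0]

# Squared sine distance 1 - (u_1^T u_hat)^2 between the estimate and the truth.
print(1.0 - (u_true @ u_hat) ** 2)
```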
However, the computational complexity of this approach scales unfavorably with d (something like O(nd^2)), and moreover, it requires us to keep the entire batch in memory. In this post, I'll take a look at an alternative approach which promises the same statistical efficiency, has lower computational complexity (O(nd)), and moreover is streaming, i.e., only uses O(d) memory, independent of the number of samples n, which is of course highly desirable if we are working with large datasets. The algorithm I will review is due to Oja and to Oja and Karhunen.
The optimization problem and the continuous flow
My starting point is standard, and begins with the observation that the principal component u_1 of \Sigma is characterized by the following optimization problem: u_1 \in \argmax_{w \in \mathbb{S}^{d-1}} \underbrace{\frac{1}{2}w^\top \Sigma w}_{\triangleq J(w)}. The optimization problem naturally lends itself to being solved iteratively by a gradient algorithm such as gradient descent (ascent). At first glance, one might think that the highly non-convex sphere constraint poses some problems for us, but this turns out not to be the case. In fact, another motivation I have for studying Oja's algorithm is that it is an example of gradient descent converging on a non-convex problem.
Before we proceed with the discrete stochastic gradient rule that is Oja's algorithm, it is instructive to first consider the gradient flow. This allows us to gather some intuition before we turn to stochastic approximation. In the case of the continuous flow, the projection operator ensuring that our updates remain on the sphere is absorbed into the Riemannian gradient: \nabla_{\mathbb{S}^{d-1}} J(w) = (I_{d} - w w^{\top})\,\Sigma\, w.
Note that (I_{d} - w w^{\top}) is simply the projection onto the orthogonal complement of the span of the current iterate. In any case, it is straightforward to derive the solution of the corresponding gradient flow.
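Before doing so, a quick sanity check (a small aside of mine, not needed for what follows): the stationary points of the flow are exactly the unit eigenvectors of \Sigma, since for w \in \mathbb{S}^{d-1}, (I_d - w w^\top)\,\Sigma w = 0 \iff \Sigma w = (w^\top \Sigma w)\, w, i.e., w is an eigenvector of \Sigma with eigenvalue w^\top \Sigma w; among these stationary points, u_1 is the maximizer of J.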
Proposition.
Let w_0 \in \mathbb{S}^{d-1} and consider the ODE for t \ge 0:
\frac{d}{dt} w(t) = (I_d - w(t) w(t)^\top)\Sigma w(t), \qquad w(0)=w_0. \tag{1}
Then w(t) \in \mathbb{S}^{d-1} for all t \geq 0, and moreover
w(t) = \frac{\exp(\Sigma t)\, w_0} {\left\| \exp(\Sigma t)\, w_0 \right\|}.
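A quick verification, in lieu of a full proof (my own sketch): writing z(t) = \exp(\Sigma t)\, w_0, we have \dot z = \Sigma z, and the candidate is w = z/\|z\|. Differentiating, \frac{d}{dt} \frac{z}{\|z\|} = \frac{\dot z}{\|z\|} - \frac{z\,(z^\top \dot z)}{\|z\|^3} = \Sigma w - (w^\top \Sigma w)\, w = (I_d - w w^\top)\,\Sigma w, so the candidate satisfies Equation 1 with w(0) = w_0 and lies on the sphere by construction. That the flow itself preserves the norm follows from \frac{d}{dt}\|w\|^2 = 2\, w^\top (I_d - w w^\top)\,\Sigma w = 2(1-\|w\|^2)\, w^\top \Sigma w, which vanishes whenever \|w\|=1.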
This proposition is instructive as it is easy to see from the above that the flow (Equation 1) converges to the principal component u_1. Namely, using the spectral decomposition of \Sigma and representing w_0 in the corresponding eigenbasis, we can write: \begin{aligned} \exp(\Sigma t) w_0 &= \sum_{i=1}^d \exp(\lambda_i t)\, u_i u_i^\top \sum_{j=1}^d (u_j^\top w_0)\, u_j \\ &= \sum_{i=1}^d \exp(\lambda_i t)\, (u_i^\top w_0)\, u_i \\ &= \exp(\lambda_1 t)\!\left( (u_1^\top w_0)\, u_1 + \sum_{i=2}^d \exp((\lambda_i-\lambda_1)t)\, (u_i^\top w_0)\, u_i \right) \end{aligned} \tag{2}
from which it is easy to see that the dominant timescale is aligned with u_1 as long as the spectral gap \lambda_1 - \lambda_2 is strictly positive. Our first sanity check has passed and the continuous version of Oja’s algorithm indeed converges to the desired quantity.
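To see this numerically (a small illustration of my own, with an arbitrary spiked \Sigma), one can evaluate the closed-form flow at a few times and watch the squared sine distance to u_1 shrink; scipy.linalg.expm computes the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 30

# Spiked covariance with spectral gap lambda_1 - lambda_2 = 4.
u1 = rng.standard_normal(d)
u1 /= np.linalg.norm(u1)
Sigma = np.eye(d) + 4.0 * np.outer(u1, u1)

w0 = rng.standard_normal(d)
w0 /= np.linalg.norm(w0)

for t in [0.0, 0.5, 1.0, 2.0, 4.0]:
    z = expm(Sigma * t) @ w0       # exp(Sigma t) w_0
    w = z / np.linalg.norm(z)      # the flow at time t
    print(t, 1.0 - (u1 @ w) ** 2)  # squared sine distance to u_1
```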
Oja’s Algorithm
Oja’s algorithm is the projected stochastic gradient algorithm corresponding to the flow in Equation 1 in which we only have sample access to X_{1:n}, as opposed to knowing \Sigma. Namely, for a sequence of step-sizes \alpha_{0:\infty} \subset \mathbb{R} we write w_{k+1} = \Pi_{\mathbb{S}^{d-1}}\left( w_k +\alpha_k \left( X_{k+1}X_{k+1}^\top w_k \right) \right), \qquad w_0 = w_{\mathrm{init}} \in \mathbb{S}^{d-1}, \tag{3} where we recognize \nabla \hat J_{k+1}(w) = X_{k+1}X_{k+1}^\top w as the stochastic gradient of J(w) = \frac{1}{2} w^\top\mathbf{E}_X [X X^\top ]w. Moreover, it is worth pointing out that for every nonzero x \in \mathbb{R}^d, \Pi_{\mathbb{S}^{d-1}} (x) = x / \|x\|. Note that in contrast to the continuous-time algorithm, the discrete algorithm has an explicit projection step. A trivial but key observation is that Equation 3 can be written in terms of the un-normalized dynamics. In other words, we only need to compute the projection at the final step (although for reasons of numerical precision one may want to do this more frequently in practice: the iterates v_k below grow on a potentially exponential time-scale depending on the step-sizes \alpha_k).
Proposition.
As long as v_0 = w_{\mathrm{init}}, the Oja update is equivalent to:
\begin{aligned}
v_{k+1} &= v_k + \alpha_k\, X_{k+1}X_{k+1}^\top v_k \\
w_k &= v_k / \|v_k\|
\end{aligned}
\tag{4}
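To make Equations 3 and 4 concrete, here is a minimal sketch on synthetic data; the spiked covariance, the step-size schedule \alpha_k = \tfrac{1}{5(k+1)}, and the variable names are illustrative choices of mine. Running both forms on the same sample stream, the normalized iterates of Equation 4 coincide with the projected iterates of Equation 3 up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 20000

u1 = rng.standard_normal(d)
u1 /= np.linalg.norm(u1)
Sigma = np.eye(d) + 4.0 * np.outer(u1, u1)               # lambda_1 = 5, gap = 4
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
v = w.copy()                                              # v_0 = w_init

for k in range(n):
    alpha = 1.0 / (5.0 * (k + 1))                         # one simple 1/k-type schedule
    w = w + alpha * X[k] * (X[k] @ w)                     # stochastic gradient step
    w /= np.linalg.norm(w)                                # projection onto the sphere (Eq. 3)
    v = v + alpha * X[k] * (X[k] @ v)                     # un-normalized dynamics (Eq. 4)

print(1.0 - (w @ (v / np.linalg.norm(v))) ** 2)           # ~0: the two forms agree
print(1.0 - (u1 @ w) ** 2)                                # squared sine distance to u_1
```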

A Roadmap to Controlling the Squared Sine Distance to the Principal Component
In the sequel we seek to control the squared sine distance \mathrm{d}^2_{\sin} between the iterates w_k and the principal component u_1. For two unit vectors u, w \in \mathbb{S}^{d-1}, it is defined as \mathrm{d}^2_{\sin} (u,w) = 1- (u^\top w)^2.
Using our observation in Equation 4, we see that we can write out the squared sine distance between our iterates w_k and the principal component as \mathrm{d}^2_{\sin} (u_1,w_k) = 1- (u_1^\top w_k)^2 =1 - \frac{(u_1^\top v_k)^2}{\|v_k\|^2} = \frac{\|v_k\|^2 - (u_1^\top v_k)^2}{\|v_k\|^2} = \frac{\sum_{i=2}^d(u_i^\top v_k)^2}{\sum_{i=1}^d(u_i^\top v_k)^2}.
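The identity is also easy to check numerically; here is a tiny sketch (the random \Sigma and the vector v, standing in for an un-normalized iterate, are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10

# A random PSD matrix and its eigenbasis, ordered so that u[:, 0] is u_1.
A = rng.standard_normal((d, d))
Sigma = A @ A.T
lam, U = np.linalg.eigh(Sigma)      # eigh returns ascending eigenvalues
u = U[:, ::-1]                      # columns u_1, ..., u_d in descending order

v = rng.standard_normal(d)          # stands in for an un-normalized iterate v_k
w = v / np.linalg.norm(v)

lhs = 1.0 - (u[:, 0] @ w) ** 2
rhs = np.sum((u[:, 1:].T @ v) ** 2) / np.sum((u.T @ v) ** 2)
print(np.isclose(lhs, rhs))         # True
```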
I would like to first prove in-expectation guarantees on the squared distance, as this is typically the easier route.
The approach that I will pursue here is quite simple. Let me first rewrite \mathrm{d}^2_{\sin} (u_1,w_k) = \frac{\sum_{i=2}^d(u_i^\top v_k)^2}{\sum_{i=1}^d(u_i^\top v_k)^2}=\frac{\frac{1}{\gamma_{0:k}(1)}\sum_{i=2}^d(u_i^\top v_k)^2}{\frac{1}{\gamma_{0:k}(1)}\sum_{i=1}^d(u_i^\top v_k)^2}, \qquad \gamma_{0:k}(1) = \prod_{s=0}^{k-1}(1+\alpha_s \lambda_1).
I divide by \gamma_{0:k}(1) to isolate the dominant timescale. I will argue that the numerator decays at a statistical rate with carefully chosen step-sizes, and that the denominator is of constant order of magnitude with high probability provided that we have a good initial guess, say \mathrm{d}^2_{\sin} (u_1,w_0) \approx 1/2. We thus have our work cut out for us:
- Control the expectation of the numerator $\frac{1}{\gamma_{0:k}(1)}\sum_{i=2}^d (u_i^\top v_k)^2$.
- Argue that with sufficiently high probability, the denominator $\frac{1}{\gamma_{0:k}(1)}\sum_{i=1}^d (u_i^\top v_k)^2$ is lower bounded by a constant given a good initial guess.
- Figure out how to make a good initial guess — this will involve running Oja with a different step-size than during the main phase.
I will walk through all of this in the next post (possibly spread over several posts), but before we get there I feel a few remarks are in order. First, let me note that the appendix below clarifies how one might divine that dividing by \gamma_{0:k}(1) is the correct thing to do; it is again a matter of isolating the dominant time-scale. Second, we can get away with arguing that the denominator is of constant order with high probability and bounding the expectation of the numerator on the complement of this event, because the squared sine distance is always trivially bounded by 1. Finally, the need for a good initial guess comes from the observation that a random guess, say uniform over the sphere, has squared alignment (u_1^\top w_0)^2 of order 1/d, which scales unfavorably in high dimensions and would yield a poor dimensional dependence in our analysis.
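The last point is easy to see empirically; a small illustration of my own (the dimension and trial count are arbitrary): for w_0 uniform on the sphere, the squared alignment (u_1^\top w_0)^2 has mean exactly 1/d, so the initial squared sine distance is typically about 1 - 1/d.

```python
import numpy as np

rng = np.random.default_rng(4)
d, trials = 200, 10000

# Uniform directions on the sphere: normalized standard Gaussians.
W = rng.standard_normal((trials, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

u1 = np.zeros(d)
u1[0] = 1.0                              # any fixed unit vector works by symmetry

print(np.mean((W @ u1) ** 2), 1.0 / d)   # both are roughly 1/d
```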
Appendix: The Noisy Linear Systems View
I haven't quite decided whether to use the material below in my future posts on this topic, but I find it a helpful crutch for thinking about the problem from a different angle.
Recall that in Equation 2 we were able to expand the solution of the continuous algorithm in the eigenbasis of \Sigma. This seems like a good first step, but how should we mimic it here? There is clearly no guarantee that the random operators \prod_{s=0}^{k-1} (I+\alpha_{s} X_{s+1}X_{s+1}^\top) are diagonalizable in the basis of \Sigma. A fairly standard trick from optimization and control to get around this issue is to rewrite the update in Equation 4 as:
v_{k+1} = \underbrace{\left(I_d+\alpha_k \Sigma\right)}_{=A_k} v_k + \underbrace{\alpha_k \underbrace{\left(X_{k+1}X_{k+1}^\top-\Sigma \right)}_{=\Delta_{k+1}}v_k}_{=\epsilon_{k+1}} \tag{5} where for convenience I have introduced A_k= I_d+\alpha_k \Sigma, \Delta_k = X_{k}X_{k}^\top-\Sigma and \epsilon_k= \alpha_{k-1} \Delta_k v_{k-1}. The point of the above is that we have isolated a deterministic drift term and a signal-dependent diffusion term. In particular, the products \prod_{s=0}^{k-1} (I+\alpha_{s} \Sigma) are all simultaneously diagonalizable with \Sigma. The fact that the noise term \left(X_{k+1}X_{k+1}^\top-\Sigma \right) hits the state multiplicatively means that this is what we in linear systems theory call a multiplicative noise model. Anyway, we can now leverage the closed-form solution to Equation 5:
v_{k} = \Phi_{0:k} v_0 + \sum_{t=0}^{k-1} \Phi_{t+1:k}\, \epsilon_{t+1}
where \Phi_{a:b} := A_{b-1} \times \dots \times A_a for integers b\geq a and \Phi_{a:a} := I_d. I have kind of lied to you here in calling the above a solution, since the right-hand side still depends on v_{1:k-1} through \epsilon. I do however find the display above useful, as it allows us to isolate the dominant time-scales in a rather crisp way if we notice that: \Phi_{a:b} =\sum_{i=1}^d \underbrace{\prod_{s=a}^{b-1}(1+\alpha_s \lambda_i)}_{\triangleq \gamma_{a:b}(i)} u_i u_i^\top = \sum_{i=1}^d \gamma_{a:b}(i) u_i u_i^\top. Hence:
v_{k} = \sum_{i=1}^d \gamma_{0:k}(i)\, u_i u_i^\top v_0 + \sum_{i=1}^d \sum_{t=0}^{k-1} \gamma_{t+1:k}(i)\, u_i u_i^\top \epsilon_{t+1}. \tag{6}
Equation 6 again shows that there is a dominant time-scale associated to \lambda_1 as long as the spectral gap \lambda_1 - \lambda_2 is strictly positive. There's a whole method to the madness of linear systems with multiplicative errors in the literature, and it's something I quite frankly haven't looked at much before, because the model (with iid multiplicative errors) always seemed a little contrived to me. My interest is somewhat piqued now, though, that it has naturally appeared in the construction of a practical algorithm.
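As a final sanity check (a small consistency sketch of my own, with arbitrary synthetic data and step-sizes), Equation 6 can be verified numerically against the raw recursion in Equation 4:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10, 200

u1 = rng.standard_normal(d)
u1 /= np.linalg.norm(u1)
Sigma = np.eye(d) + 4.0 * np.outer(u1, u1)
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]                      # descending eigen-pairs of Sigma

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
alpha = 1.0 / (5.0 * (np.arange(n) + 1.0))

# Raw recursion (Equation 4) while recording eps_{k+1} = alpha_k Delta_{k+1} v_k.
v = rng.standard_normal(d)
v0, eps = v.copy(), []
for k in range(n):
    Delta = np.outer(X[k], X[k]) - Sigma
    eps.append(alpha[k] * Delta @ v)
    v = v + alpha[k] * X[k] * (X[k] @ v)

# gamma_{a:b}(i) = prod_{s=a}^{b-1} (1 + alpha_s lambda_i), returned as a length-d vector.
def gamma(a, b):
    return np.prod(1.0 + alpha[a:b, None] * lam[None, :], axis=0)

# Right-hand side of Equation 6 with k = n.
rhs = U @ (gamma(0, n) * (U.T @ v0))
for t in range(n):
    rhs += U @ (gamma(t + 1, n) * (U.T @ eps[t]))   # eps[t] plays the role of eps_{t+1}

print(np.allclose(v, rhs))                          # True
```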