The Simons Institute for the Theory of Computing makes its workshop tutorials available online. In 2017, it held the "Bridging Continuous and Discrete Optimization" Boot Camp. In the first talk, on LP/SDP Hierarchies and Sum of Squares Proofs, the speaker Pablo Parrilo presents an interesting example: nuclear norm minimization via semidefinite programming. This post goes over deriving this SDP and an interesting application of nuclear norm minimization.
1. Preliminaries
The nuclear norm (also known as the trace norm) of a matrix \(M\) is defined as \( \|M\|_* = \sum_i \sigma_i(M) \), where \( \sigma_i(M) \) are the singular values of \(M\). To minimize the nuclear norm, Prof. Parrilo proposes the following semidefinite program:
\[ \begin{aligned} \min\;& \frac{1}{2}\bigl(\text{tr } U +\text{tr } V \bigr) \\[6pt] \text{s.t.}\;& \begin{bmatrix} U & M \\[3pt] M^T & V \end{bmatrix} \succeq 0\,. \end{aligned} \]where we have matrix optimization variables \(U \in \mathbf{S}^m\) and \(V \in \mathbf{S}^n\) for \(M \in \mathbf{R}^{m \times n}\). Using tools from ECE236B, we derive the SDP above. Throughout this post, we choose the matrix inner product to be the trace inner product \( \langle A, B \rangle = \text{tr } A^T B \), and the singular value matrix \( \Sigma \) has its singular values in nonincreasing order.
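As a quick numerical sanity check (a NumPy sketch of my own, not from the talk), we can verify that the SDP's optimal value is attained at the nuclear norm: taking \(U = U_M \Sigma U_M^T\) and \(V = V_M \Sigma V_M^T\), where \(M = U_M \Sigma V_M^T\) is the SVD, gives a feasible point whose objective is exactly \(\|M\|_*\).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))

# SVD of M; full_matrices=False gives Um (4x3), s (3,), Vmt (3x3)
Um, s, Vmt = np.linalg.svd(M, full_matrices=False)
Vm = Vmt.T

# Candidate optimal point: U = Um diag(s) Um^T, V = Vm diag(s) Vm^T
U = Um @ np.diag(s) @ Um.T
V = Vm @ np.diag(s) @ Vm.T

# Feasibility: the block matrix [[U, M], [M^T, V]] equals
# [Um; Vm] diag(s) [Um; Vm]^T, hence is PSD
B = np.block([[U, M], [M.T, V]])
min_eig = np.linalg.eigvalsh(B).min()
assert min_eig >= -1e-9

# The objective at this feasible point equals the nuclear norm
obj = 0.5 * (np.trace(U) + np.trace(V))
nuc = np.linalg.norm(M, "nuc")
assert np.isclose(obj, nuc)
```

This only shows the SDP's value is at most \(\|M\|_*\); the matching lower bound is what the duality argument in this post establishes.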
2. Derivation
2.1 Detour: the dual norm
For a given norm \( \|\cdot\| \), its dual norm is defined as $$ \|M\|_* = \sup_{ \|W\| \leq 1} \langle M, W \rangle $$ (note the unfortunate overloading of the asterisk subscript: a norm's dual norm and the nuclear norm are different things, except in one case). It turns out the dual norm of the nuclear norm is the operator norm, or matrix 2-norm, defined as $$ \|M\|_2 = \max_{x \neq 0} \frac{\|Mx\|_2}{\|x\|_2} = \max_{\|x\|_2 = 1} \|Mx\|_2 = \sigma_1(M) $$ which are all equivalent expressions. The operator norm is also used to define the condition number of a matrix (\( \kappa(A) = \| A\|_2 \| A^{-1} \|_2 \)), which is useful for understanding the "stability" of a matrix, i.e. its sensitivity to perturbations. We now prove this duality (largely following a Math Stack Exchange post) by proving both directions of the inequality. First, we prove $$ \|M\|_* = \sum_i \sigma_i \leq \sup_{\sigma_1(W) \leq 1} \text{tr } M^T W $$ Take \(M = U \Sigma V^T\) to be the SVD of \(M\). If we choose \(W = U I V^T = UV^T\), notice that this satisfies \(\sigma_1(W) \leq 1\) with equality (in fact, all singular values of \(W\) are 1). Then, we have $$ \text{tr } M^T W = \text{tr } V \Sigma U^T U V^T = \text{tr } V \Sigma V^T = \text{tr } \Sigma = \sum_i \sigma_i $$ where we apply the orthogonality properties of \(U, V\) and the cyclic property of the trace. This directly means that $$ \sup_{\sigma_1(W) \leq 1} \langle W, M \rangle \geq \|M\|_* $$ We then turn to the other direction. Once again taking the SVD of \(M\), we have the following chain of expressions $$ \sup_{\sigma_1(W) \leq 1} \text{tr } W^T M = \sup_{\sigma_1(W) \leq 1} \text{tr } W^T U \Sigma V^T = \sup_{\sigma_1(W) \leq 1} \text{tr } V^T W^T U \Sigma = \sup_{\sigma_1(W) \leq 1} \langle U^T W V, \Sigma \rangle $$ We can now decompose this inner product directly into a trace expressed via a sum.
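The certificate \(W = UV^T\) used in the first direction is easy to check numerically (a NumPy sketch of my own): its singular values are all 1, its inner product with \(M\) recovers \(\|M\|_*\), and `np.linalg.norm(M, 2)` confirms the operator norm is the top singular value.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))

Um, s, Vmt = np.linalg.svd(M, full_matrices=False)

# The certificate W = U V^T: all of its singular values equal 1
W = Um @ Vmt
assert np.allclose(np.linalg.svd(W, compute_uv=False), 1.0)

# <W, M> = tr(M^T W) recovers the nuclear norm, so the sup is >= ||M||_*
inner = np.trace(M.T @ W)
assert np.isclose(inner, s.sum())

# The operator norm is the largest singular value
assert np.isclose(np.linalg.norm(M, 2), s[0])
```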
$$ \langle U^T W V, \Sigma \rangle = \sum_i \sigma_i (U^T W V)_{ii} = \sum_i \sigma_i u_i^T W v_i \leq \sum_i \sigma_i \sigma_1 (W) $$ where the inequality follows from noticing that \(\|u_i\|_2 = \|v_i\|_2 = 1\) and $$ u_i^T W v_i \leq \sup_{\|u\|_2 = \|v\|_2 = 1} u^T W v = \sigma_1(W) $$ Then, simply applying \(\sigma_1(W) \leq 1\) gives us the desired inequality \( \sup_{\sigma_1(W) \leq 1} \langle W, M \rangle \leq \|M\|_* \). Alternatively, for this bound, we have a quick application of the von Neumann trace inequality: whenever \(\sigma_1(W) \leq 1\), $$ \text{tr } W^T M \leq \sum_i \sigma_i(W) \sigma_i(M) \leq \sum_i \sigma_i(M) = \|M\|_* $$ Thus, we have proven that the dual norm of the nuclear norm is the operator norm.
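The von Neumann trace inequality invoked above can be spot-checked on random matrices (my own NumPy sketch; `np.linalg.svd` already returns singular values in nonincreasing order, which is the ordering the inequality requires):

```python
import numpy as np

rng = np.random.default_rng(2)

# Spot-check the von Neumann trace inequality:
#   tr(W^T M) <= sum_i sigma_i(W) sigma_i(M)   (singular values sorted)
holds = []
for _ in range(100):
    M = rng.standard_normal((4, 4))
    W = rng.standard_normal((4, 4))
    lhs = np.trace(W.T @ M)
    rhs = np.sum(np.linalg.svd(W, compute_uv=False)
                 * np.linalg.svd(M, compute_uv=False))
    holds.append(lhs <= rhs + 1e-9)
all_hold = all(holds)
```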
2.2 The primal
We have just spent too much time proving that $$ \|M\|_* = \sup_{\sigma_1(W) \leq 1} \langle W, M \rangle $$ However, we can now notice that this expression describes an optimization problem: \[ \begin{aligned} \max_W\;& \langle W, M \rangle \\[6pt] \text{s.t.}\;& \|W\|_2 \leq 1\,. \end{aligned} \] The constraint \( \|W\|_2 \leq 1 \) can be rewritten as a constraint on all of its singular values, since restricting \(\sigma_1 \leq 1\) means that all singular values must be at most 1. As a matrix inequality, we can write this as $$ I \succeq W^TW $$ To see this, take the eigendecomposition \(W^TW = Q \Lambda Q^T \). Then, $$ I - W^T W = Q (I - \Lambda) Q^T \succeq 0 \iff \lambda_i(W^T W) = \sigma_i(W)^2 \leq 1 \text{ for all } i \iff \sigma_1(W) \leq 1 $$ Finally, by the Schur complement, \(I \succeq W^T W\) is equivalent to a linear matrix inequality, giving the primal SDP \[ \begin{aligned} \max_W\;& \langle W, M \rangle \\[6pt] \text{s.t.}\;& \begin{bmatrix} I & W \\[3pt] W^T & I \end{bmatrix} \succeq 0\,. \end{aligned} \]
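These equivalent ways of writing the constraint can be checked numerically (a NumPy sketch of my own; the third check uses the standard Schur complement fact that \(I \succeq W^T W\) is equivalent to the block matrix \(\begin{bmatrix} I & W \\ W^T & I \end{bmatrix}\) being PSD):

```python
import numpy as np

rng = np.random.default_rng(3)
W0 = rng.standard_normal((4, 3))
W0 /= np.linalg.norm(W0, 2)          # normalize so sigma_1(W0) = 1

def constraint_checks(W):
    """Check sigma_1(W) <= 1, I - W^T W PSD, and the Schur-complement LMI."""
    m, n = W.shape
    sig = np.linalg.svd(W, compute_uv=False)[0] <= 1 + 1e-9
    psd = np.linalg.eigvalsh(np.eye(n) - W.T @ W).min() >= -1e-9
    lmi = np.linalg.eigvalsh(np.block([[np.eye(m), W],
                                       [W.T, np.eye(n)]])).min() >= -1e-9
    return (bool(sig), bool(psd), bool(lmi))

feasible = constraint_checks(0.9 * W0)     # sigma_1 = 0.9: all hold
infeasible = constraint_checks(1.1 * W0)   # sigma_1 = 1.1: all fail
```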
2.3 The dual
From our primal SDP, we can now derive the dual SDP. We introduce a dual (matrix) variable $$ Z = \begin{bmatrix} Z_{11} & Z_{12} \\[3pt] Z_{12}^T & Z_{22} \end{bmatrix} \succeq 0 $$ for the primal LMI constraint, whose block matrix we call \(B\). Since the primal is a maximization, we form the Lagrangian $$ L(W, Z) = \langle M, W \rangle + \langle B, Z \rangle $$ which upper bounds the primal objective at every feasible \(W\), because \(\langle B, Z \rangle \geq 0\) whenever both \(B\) and \(Z\) are positive semidefinite. Working out the matrix multiplication, taking the trace and gathering the terms, we have $$ L(W, Z) = \text{tr } W^T (M + 2 Z_{12}) + (\text{tr } Z_{11} + \text{tr } Z_{22}) $$ For the supremum over \(W\) to be finite, we require \(M + 2 Z_{12} = 0\), i.e. \(Z_{12} = -\tfrac{1}{2} M\). Then, our dual function is $$ g(Z) = \text{tr } Z_{11} + \text{tr } Z_{22} $$ and the dual problem is to minimize \(\text{tr } Z_{11} + \text{tr } Z_{22}\) over \(Z \succeq 0\) with \(Z_{12} = -\tfrac{1}{2} M\). Conjugating \(Z\) by \(\text{diag}(I, -I)\) flips the sign of the off-diagonal blocks without affecting positive semidefiniteness, so we may just as well take \(Z_{12} = \tfrac{1}{2} M\). Substituting \(U = 2 Z_{11}\) and \(V = 2 Z_{22}\) then recovers exactly the SDP from the preliminaries. Strong duality holds since \(W = 0\) is strictly feasible for the primal (the LMI block matrix becomes the identity).
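The block trace expansion used when gathering terms in the Lagrangian, \( \langle B, Z \rangle = \text{tr } Z_{11} + \text{tr } Z_{22} + 2\,\text{tr } W^T Z_{12} \), can be verified numerically (my own NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 3
W = rng.standard_normal((m, n))

# Random symmetric Z with blocks Z11 (m x m), Z12 (m x n), Z22 (n x n)
Z11 = rng.standard_normal((m, m)); Z11 = (Z11 + Z11.T) / 2
Z22 = rng.standard_normal((n, n)); Z22 = (Z22 + Z22.T) / 2
Z12 = rng.standard_normal((m, n))
Z = np.block([[Z11, Z12], [Z12.T, Z22]])

# The primal LMI block matrix B = [[I, W], [W^T, I]]
B = np.block([[np.eye(m), W], [W.T, np.eye(n)]])

# <B, Z> = tr(B^T Z) = tr(B Z) since B is symmetric
lhs = np.trace(B @ Z)
rhs = np.trace(Z11) + np.trace(Z22) + 2 * np.trace(W.T @ Z12)
assert np.isclose(lhs, rhs)
```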
3. Applications
This section is still in progress; it will discuss Candès and Recht's work on rank minimization and matrix completion, along with other applications as I find them.