The Simons Institute for the Theory of Computing makes its workshop tutorials available online. In 2017, it held the "Bridging Continuous and Discrete Optimization Boot Camp." In the first talk, on LP/SDP Hierarchies and Sum of Squares Proofs, the speaker Pablo Parrilo presents an interesting example: nuclear norm minimization via semidefinite programming. This post goes over deriving this SDP and some applications of nuclear norm minimization.
1. Preliminaries
The nuclear norm (also known as the trace norm) of a matrix \(M\) is defined as \( \|M\|_* = \sum_i \sigma_i(M) \), where \( \sigma_i(M) \) are the singular values of \(M\). To minimize the nuclear norm, Prof. Parrilo proposes the following semidefinite program:
\[ \begin{aligned} \min\;& \frac{1}{2}\bigl(\text{tr } U +\text{tr } V \bigr) \\[6pt] \text{s.t.}\;& \begin{bmatrix} U & M \\[3pt] M^T & V \end{bmatrix} \succeq 0\,. \end{aligned} \]
where we have matrix optimization variables \(U, V \in \mathbf{S}^n\). Using tools from ECE236B, we derive the SDP above. Throughout this post, the matrix inner product is the trace inner product \( \langle A, B \rangle = \text{tr } A^T B \), and the singular value matrix \( \Sigma \) has its singular values sorted in decreasing order.
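As a quick numerical sanity check (a sketch, not from the talk, assuming cvxpy with the SCS solver is available), we can model the block matrix as a single PSD variable whose off-diagonal block is pinned to \(M\), and compare the SDP value against the sum of singular values:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))

# X stands in for the block matrix [[U, M], [M^T, V]]: PSD=True makes it a
# symmetric positive semidefinite variable, and the equality pins the top-right
# block to M (the bottom-left block is then M^T by symmetry).
X = cp.Variable((2 * n, 2 * n), PSD=True)
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(X)),        # (tr U + tr V) / 2
                  [X[:n, n:] == M])
prob.solve(solver=cp.SCS)

print(prob.value)                                        # SDP optimal value
print(np.linalg.svd(M, compute_uv=False).sum())          # nuclear norm via the SVD
```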
2. Derivation
2.1 Detour: the dual norm
For a given norm \( \| \cdot \| \), its dual norm is defined as $$ \|M\|_* = \sup_{ \|W\| \leq 1} \langle M, W \rangle $$ (note the unfortunate overloading of the asterisk subscript: in general a dual norm and the nuclear norm are different things; they coincide only in the case below, where the nuclear norm turns out to be the dual of the operator norm). It turns out the dual norm of the nuclear norm is the operator norm, or matrix 2-norm, defined as $$ \|M\|_2 = \max_{x \neq 0} \frac{\|Mx\|_2}{\|x\|_2} = \max_{\|x\|_2 = 1} \|Mx\|_2 = \sigma_1(M) $$ which are all equivalent expressions. The operator norm is also used to define the condition number of a matrix (\( \kappa(A) = \| A\|_2 \| A^{-1} \|_2 \)), which is useful for understanding the "stability" of a matrix, i.e. its sensitivity to perturbations.

We now prove this equality (the argument largely follows this Math StackExchange post) by establishing both directions of the inequality. First, we prove $$ \|M\|_* = \sum_i \sigma_i \leq \sup_{\sigma_1(W) \leq 1} \text{tr } M^T W $$ Take \(M = U \Sigma V^T\) to be the SVD of \(M\). If we choose \(W = U I V^T\), notice that this satisfies \(\sigma_1(W) \leq 1\) with equality (in fact, all singular values are 1). Then, we have $$ \text{tr } W^T M = \text{tr } V I U^T U \Sigma V^T = \text{tr } V \Sigma V^T = \text{tr } \Sigma = \sum_i \sigma_i $$ where we apply the orthogonality properties of \(U, V\) and the cyclic property of the trace. This directly means that $$ \sup_{\sigma_1(W) \leq 1} \langle W, M \rangle \geq \|M\|_* $$

We then turn to the other direction. Once again taking the SVD of \(M\), we have the following chain of expressions $$ \sup_{\sigma_1(W) \leq 1} \text{tr } W^T M = \sup_{\sigma_1(W) \leq 1} \text{tr } W^T U \Sigma V^T = \sup_{\sigma_1(W) \leq 1} \text{tr } V^T W^T U \Sigma = \sup_{\sigma_1(W) \leq 1} \langle U^T W V, \Sigma \rangle $$ We can now decompose this inner product directly into a trace expressed via a sum: $$ \langle U^T W V, \Sigma \rangle = \sum_i \sigma_i (U^T W V)_{ii} = \sum_i \sigma_i\, u_i^T W v_i \leq \sum_i \sigma_i \sigma_1 (W) $$ where the inequality follows from noticing that \(\|u_i\|_2 = \|v_i\|_2 = 1\) and $$ u_i^T W v_i \leq \sup_{\|u\|_2 = \|v\|_2 = 1} u^T W v = \sigma_1(W) $$ Then, simply applying \(\sigma_1(W) \leq 1\) gives us the desired inequality \( \sup_{\sigma_1(W) \leq 1} \langle W, M \rangle \leq \|M\|_* \). Alternatively, for this bound, we have a quick application of the von Neumann trace inequality: $$ \text{tr } W^T M \leq \sum_i \sigma_i(W) \sigma_i(M) \leq \sum_i \sigma_i(M) = \|M\|_* $$ for any \(W\) with \(\sigma_1(W) \leq 1\). Thus, we have shown that the nuclear norm is the dual of the operator norm (equivalently, since the dual of the dual norm recovers the original norm in finite dimensions, the dual of the nuclear norm is the operator norm).
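The maximizer constructed in the first half of the proof is easy to verify numerically; a small numpy sketch (the matrix is random, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))

# SVD of M; the proof's maximizer is W = U I V^T = U V^T.
U, s, Vt = np.linalg.svd(M)
W = U @ Vt

print(np.trace(W.T @ M))        # equals the sum of singular values of M ...
print(s.sum())                  # ... i.e. the nuclear norm
print(np.linalg.norm(W, 2))     # and the operator norm of W is exactly 1
```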
2.2 The primal
We have just spent too much time proving that $$ \|M\|_* = \sup_{\sigma_1(W) \leq 1} \langle W, M \rangle $$ However, we can now notice that we can describe this expression as an optimization problem: \[ \begin{aligned} \max_W\;& \langle W, M \rangle \\[6pt] \text{s.t.}\;& \|W\|_2 \leq 1\,. \end{aligned} \] The constraint \( \|W\|_2 \leq 1 \) can be rewritten as a constraint on all of its singular values, since restricting \(\sigma_1 \leq 1\) means that all singular values must be less than or equal to 1. As an inequality, we can write this as $$ I \succeq W^TW $$ To see this, take the eigendecomposition \(W^TW = Q \Lambda Q^T \). Then \( I - W^TW = Q (I - \Lambda) Q^T \), which is positive semidefinite exactly when every eigenvalue \( \lambda_i = \sigma_i(W)^2 \) is at most 1, i.e. when \( \sigma_1(W) \leq 1 \). Finally, by the Schur complement, \( I \succeq W^TW \) is equivalent to the linear matrix inequality $$ B := \begin{bmatrix} I & W \\[3pt] W^T & I \end{bmatrix} \succeq 0 $$ so the primal problem is the SDP \[ \begin{aligned} \max_W\;& \langle W, M \rangle \\[6pt] \text{s.t.}\;& \begin{bmatrix} I & W \\[3pt] W^T & I \end{bmatrix} \succeq 0\,. \end{aligned} \]
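The same trick codes up the primal: declare the block matrix \(B\) as a PSD variable, pin its diagonal blocks to the identity, and read off \(W\) as the off-diagonal block (again a sketch using cvxpy, not from the talk):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))

# B stands in for [[I, W], [W^T, I]]; its top-right block plays the role of W.
B = cp.Variable((2 * n, 2 * n), PSD=True)
W = B[:n, n:]
prob = cp.Problem(cp.Maximize(cp.trace(M.T @ W)),        # <W, M> = tr(W^T M)
                  [B[:n, :n] == np.eye(n), B[n:, n:] == np.eye(n)])
prob.solve(solver=cp.SCS)

print(prob.value)                                        # should match ||M||_*
print(np.linalg.svd(M, compute_uv=False).sum())
```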
2.3 The dual
From our primal SDP, we can now derive the dual SDP. We introduce a dual (matrix) variable $$ Z = \begin{bmatrix} Z_{11} & Z_{12} \\[3pt] Z_{12}^T & Z_{22} \end{bmatrix} \succeq 0 $$ for the primal LMI constraint, whose block matrix we called \(B\). Since the primal is a maximization, we form the Lagrangian $$ L(W, Z) = \langle M, W \rangle + \langle B, Z \rangle $$ which upper-bounds the primal objective for every feasible \(W\), because \( \langle B, Z \rangle \geq 0 \) whenever both \(B \succeq 0\) and \(Z \succeq 0\). Working out the matrix multiplication, taking the trace and gathering the terms, we have $$ L(W, Z) = \text{tr } W^T (M + 2 Z_{12}) + (\text{tr } Z_{11} + \text{tr } Z_{22}) $$ In order for the supremum over \(W\) to be finite, we require that \(M + 2 Z_{12} = 0\), i.e. \(Z_{12} = -\tfrac{1}{2} M\). Then, our dual function is $$ g(Z) = \text{tr } Z_{11} + \text{tr } Z_{22} $$ and the dual problem is to minimize \(g(Z)\) over \(Z \succeq 0\) with \(Z_{12} = -\tfrac{1}{2} M\). Substituting \(U = 2 Z_{11}\) and \(V = 2 Z_{22}\), the constraint \(2Z \succeq 0\) reads $$ \begin{bmatrix} U & -M \\[3pt] -M^T & V \end{bmatrix} \succeq 0 $$ and flipping the sign of the second block of coordinates (a congruence by \( \operatorname{diag}(I, -I) \), which preserves positive semidefiniteness) turns this into $$ \begin{bmatrix} U & M \\[3pt] M^T & V \end{bmatrix} \succeq 0 $$ with objective \( \tfrac{1}{2}(\text{tr } U + \text{tr } V) \). This is exactly the SDP from Section 1. Since the primal is strictly feasible (take \(W = 0\)), strong duality holds and the optimal value of this SDP equals \( \|M\|_* \).
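It is also easy to check numerically that this value is attained: writing \(M = U_M \Sigma V_M^T\), the choice \(U = U_M \Sigma U_M^T\), \(V = V_M \Sigma V_M^T\) is feasible (the block matrix factors as \( \begin{bmatrix} U_M \\ V_M \end{bmatrix} \Sigma \begin{bmatrix} U_M \\ V_M \end{bmatrix}^T \succeq 0 \)) and achieves objective value \(\|M\|_*\). A quick numpy check of this certificate (not from the talk, just a sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))

Um, s, Vmt = np.linalg.svd(M)
Vm = Vmt.T
U = Um @ np.diag(s) @ Um.T          # candidate U = U_M Sigma U_M^T
V = Vm @ np.diag(s) @ Vm.T          # candidate V = V_M Sigma V_M^T

block = np.block([[U, M], [M.T, V]])
print(np.linalg.eigvalsh(block).min())     # ~0 up to roundoff: the LMI is feasible
print(0.5 * (np.trace(U) + np.trace(V)))   # objective value of this feasible point
print(s.sum())                             # ... equals the nuclear norm of M
```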
3. Applications
The crux of the objective \( \min \|M\|_* \) is to provide a convex way of inducing low-rank structure in a matrix, and thereby of extracting low-rank structure from data. We discuss applications below.
3.1 Early Origins
An early paper from Boyd’s group in 2001 proposed the nuclear norm as a heuristic for reducing the order of controllers and systems. These problems typically involve seeking some low-rank solution to linear matrix inequalities. The observation made in [Fazel et al., 2001] is that if the decision matrix is positive semidefinite, then minimizing its trace is a convex proxy for minimizing rank — this is our nuclear norm (which the authors note is also sometimes called the Ky Fan norm). Importantly, they show that the nuclear norm is the tightest convex relaxation: for the set of matrices with operator norm (recall: the dual of the nuclear norm) bounded by 1, the nuclear norm is the convex envelope of the rank function, meaning it is the largest convex function that under-approximates rank on this set.
3.2 Low-Rank Matrix Completion
One of the more well-known applications of nuclear norm minimization is the matrix completion problem: one is given only a subset of entries of some (large) matrix and needs to infer the missing entries under the assumption that the complete matrix is low-rank. However, directly enforcing “find the lowest-rank matrix consistent with the known entries” is NP-hard [Recht et al., 2010]. To make this problem tractable, additional assumptions are required.
In [Candès & Recht, 2009] it was shown that a convex relaxation can exactly solve this problem in many cases. Namely, under mild randomness and incoherence assumptions on the positions of the known entries and on the true low-rank matrix, solving the convex optimization problem \[ \begin{aligned} \min_X\;& \|X\|_* \\[6pt] \text{s.t.}\;& X_{ij} = M_{ij}, \quad (i,j) \in \Omega\,, \end{aligned} \] where \(\Omega\) is the set of observed entries, will with high probability recover the exact matrix \(M\). Their analysis showed that exact recovery via nuclear norm minimization is possible with on the order of \(n^{1.2} r \log n\) observed entries for an \(n \times n\) matrix of rank \(r\).
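For concreteness, here is a small sketch of this program in cvxpy, using the built-in normNuc atom rather than the explicit SDP; the size, rank, and sampling rate are made-up values for illustration:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, r = 30, 2
M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r matrix
mask = (rng.random((n, n)) < 0.5).astype(float)                      # 1 where observed

X = cp.Variable((n, n))
# Minimize the nuclear norm while agreeing with M_true on the observed entries.
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.multiply(mask, X) == mask * M_true])
prob.solve(solver=cp.SCS)

print(np.linalg.norm(X.value - M_true) / np.linalg.norm(M_true))     # relative error
```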
Those familiar with SDP solvers know that, generally, solving large SDP formulations does not scale well, especially for matrix completion problems where the matrix variables are of size equal to the full matrix. To approximate the matrix with minimum nuclear norm, one can use Singular Value Thresholding [Cai et al., 2010], which performs iterative soft-thresholding of singular values and scales better than general-purpose SDP solvers.
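A minimal sketch of the SVT iteration described above (the threshold tau, step size delta, and iteration count are illustrative choices, not the tuned values from [Cai et al., 2010]):

```python
import numpy as np

def svt_complete(M_obs, mask, tau=500.0, delta=1.2, iters=200):
    """Approximate nuclear-norm minimization by singular value thresholding.

    M_obs : observed matrix (arbitrary values where mask == 0)
    mask  : 1.0 where an entry is observed, 0.0 otherwise
    """
    Y = np.zeros_like(M_obs)
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        # Shrinkage step: soft-threshold the singular values of Y by tau.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Gradient-like step on the observed entries only.
        Y = Y + delta * mask * (M_obs - X)
    return X
```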
3.3 Robust Principal Component Analysis
Robust PCA deals with separating a matrix into a low-rank component and a sparse component. If our data matrix is largely low-rank but contains a few gross sparse errors or outliers, classical PCA fails because these outliers can skew the estimated low-rank subspace arbitrarily. Robust PCA aims to recover the true low-rank component despite the outliers.
[Candès et al., 2011] introduced Principal Component Pursuit, a convex program that solves \[ \begin{aligned} \min_{L,\,S}\;& \|L\|_* + \lambda \|S\|_1 \\[6pt] \text{s.t.}\;& L + S = D\,, \end{aligned} \]
where \(D\) is the observed data matrix (assumed to be \(L_{\text{true}} + S_{\text{true}}\) with \(L_{\text{true}}\) low-rank and \(S_{\text{true}}\) sparse), \(\|S\|_1\) is the entrywise \(L_1\) norm, and \(\lambda > 0\) is a weighting parameter. The nuclear norm on \(L\) promotes a low-rank solution and the \(L_1\) norm promotes sparsity in \(S\). Under reasonable assumptions, this program exactly recovers both components \(L_{\text{true}}\) and \(S_{\text{true}}\) with high probability. More interestingly, this also implies that among all decompositions \(D = L + S\), the one minimizing this weighted combination of nuclear norm and \(L_1\) norm is the correct one.
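A sketch of Principal Component Pursuit in cvxpy, with \(\lambda = 1/\sqrt{\max(n_1, n_2)}\) as suggested by the analysis in [Candès et al., 2011]; the synthetic low-rank-plus-sparse data is made up for illustration:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, r = 30, 2
L_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))        # low rank
S_true = (rng.random((n, n)) < 0.05) * (10 * rng.standard_normal((n, n))) # sparse, gross
D = L_true + S_true

lam = 1.0 / np.sqrt(n)            # lambda = 1 / sqrt(max(n1, n2)) for an n x n matrix
L = cp.Variable((n, n))
S = cp.Variable((n, n))
prob = cp.Problem(cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(S))),   # entrywise L1
                  [L + S == D])
prob.solve(solver=cp.SCS)

print(np.linalg.norm(L.value - L_true) / np.linalg.norm(L_true))          # low-rank error
```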
References
- Maryam Fazel, Haitham Hindi, and Stephen Boyd. A rank minimization heuristic with application to minimum order system approximation. Proc. American Control Conference, 2001.
- Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
- Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
- Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
- Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.