RMT and outliers
This example examines the effects of outliers on SVD performance for estimating a low-rank matrix from noisy data, from the perspective of random matrix theory, using the Julia language.
This page comes from a single Julia file: outlier1.jl
.
You can access the source code for such Julia documentation using the 'Edit on GitHub' link in the top right. You can view the corresponding notebook in nbviewer here: outlier1.ipynb
, or open it in binder here: outlier1.ipynb
.
Add the Julia packages that are need for this demo. Change false
to true
in the following code block if you are using any of the following packages for the first time.
if false
import Pkg
Pkg.add([
"InteractiveUtils"
"LaTeXStrings"
"LinearAlgebra"
"MIRTjim"
"Plots"
"Random"
"StatsBase"
])
end
Tell Julia to use the following packages for this example. Run Pkg.add()
in the preceding code block first, if needed.
using InteractiveUtils: versioninfo
using LaTeXStrings
using LinearAlgebra: Diagonal, norm, rank, svd, svdvals
using MIRTjim: jim, prompt
using Plots.PlotMeasures: px
using Plots: default, gui, plot, plot!, scatter!, savefig
using Random: seed!
using StatsBase: mean
default(markerstrokecolor=:auto, label="", widen=true, markersize = 6,
labelfontsize = 24, legendfontsize = 18, tickfontsize = 14, linewidth = 3,
)
seed!(0)
Random.TaskLocalRNG()
The following line is helpful when running this file as a script; this way it will prompt user to hit a key after each image is displayed.
isinteractive() && prompt(:prompt);
Image example
Apply an SVD-based low-rank approximation approach to some data with outliers.
Latent matrix
Make a matrix that has low rank:
tmp = [
zeros(1,20);
0 1 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 1 0;
0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0;
0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0;
0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0;
zeros(1,20)
]';
rank(tmp)
4
Turn it into an image:
Xtrue = kron(1 .+ 8*tmp, ones(9,9))
rtrue = rank(Xtrue)
5
plots with consistent size
jim1 = (X ; kwargs...) -> jim(X; size = (700,300),
leftmargin = 10px, rightmargin = 10px, kwargs...);
and consistent display range
jimc = (X ; kwargs...) -> jim1(X; clim=(0,9), kwargs...);
and with NRMSE label
nrmse = (Xh) -> round(norm(Xh - Xtrue) / norm(Xtrue) * 100, digits=1)
args = (xaxis = false, yaxis = false, colorbar = :none) # book
args = (;) # web
jime = (X; kwargs...) -> jimc(X; xlabel = "NRMSE = $(nrmse(X)) %",
args..., kwargs...,
)
bm = s -> "\\mathbf{\\mathit{$s}}"
title = latexstring("\$$(bm(:X))\$ : Latent image")
pt = jimc(Xtrue; title, xlabel = " ", args...)
Helper functions
Bernoulli outliers with magnitude τ
and probability p
:
function outliers(dims::Dims, τ::Real = 6, p::Real = 0.05)
Z = τ * sign.(randn(dims)) .* (rand(dims...) .< p)
return Z
end;
Noisy data
seed!(0)
(M, N) = size(Xtrue)
Z = outliers((M,N))
Y = Xtrue + Z
title = latexstring("\$$(bm(:Y))\$ : Corrupted image matrix\n(with outliers)")
py = jime(Y ; title)
Singular values
The first 3 singular values of $Y$ are well above the "noise floor" caused by outliers.
But $σ₄(X)$ is just barely above the threshold, and $σ₅(X)$ is below the threshold, so we cannot expect a simple SVD approach to recover them well.
ps1 = plot(
title = "Singular values",
xaxis = (L"k", (1, N), [1, 3, 6, N]),
yaxis = (L"σ_k",),
leftmargin = 15px, bottommargin = 20px, size = (600,350), widen = true,
)
sv_x = svdvals(Xtrue)
sv_y = svdvals(Y)
scatter!(sv_y, color=:red, label="Y (data)", marker=:dtriangle)
scatter!(sv_x, color=:blue, label="Xtrue", marker=:utriangle)
prompt()
Low-rank estimate
A simple low-rank estimate of $X$ from the first few SVD components of $Y$ works just so-so here. A simple SVD approach recovers the first 3 components well, but cannot estimate the 4th and 5th components.
r = 5
U,s,V = svd(Y)
Xr = U[:,1:r] * Diagonal(s[1:r]) * V[:,1:r]'
title = latexstring("Rank $r approximation of data \$$(bm(:Y))\$")
pr = jime(Xr ; title)
Examine singular vector estimates. The first 3 are quite good; the next two are poor.
sv1 = [
sum(svd(Xr).U[:,1:r] .* svd(Xtrue).U[:,1:r], dims=1).^2
sum(svd(Xr).V[:,1:r] .* svd(Xtrue).V[:,1:r], dims=1).^2
]
2×5 Matrix{Float64}:
0.997259 0.985004 0.981207 0.85263 0.606551
0.999109 0.993145 0.991356 0.942252 0.83442
Non-iterative "robust" PCA
Try simple outlier removal method. Look at the residual between $\hat{X}$ and $Y$:
residual = Xr - Y
pd = jim1(residual; clim = (-1,1) .* 7, cticks = (-1:1:1) * 8,
title = latexstring("Residual \$$(bm(:Y)) - \\hat{$(bm(:X))}\$"),
)
Identify "bad" pixels with large residual errors
badpixel = @. abs(residual) > 3
jim1(badpixel)
Replace "bad" pixels with typical image values
Ymod = copy(Y)
Ymod[badpixel] .= mean(Y[.!badpixel])
jime(Ymod) # already reduces NRMSE by a lot compared to Y itself!
Examine singular values of modified $Y$. The noise floor is lower.
ps2 = plot(
title = "Singular values",
xaxis = (L"k", (1, N), [1, 3, 6, N]),
yaxis = (L"σ_k",),
leftmargin = 15px, bottommargin = 20px, size = (600,350), widen = true,
)
sv_f = svdvals(Ymod)
scatter!(sv_f, color=:green, label="Y (modified)", marker=:hex)
scatter!(sv_x, color=:blue, label="Xtrue", marker=:utriangle)
prompt()
Applying low-rank matrix approximation to modified $Y$ leads to lower NRMSE.
Um,sm,Vm = svd(Ymod)
Xh = Um[:,1:r] * Diagonal(sm[1:r]) * Vm[:,1:r]'
title = latexstring("Rank $r approximation of modified data \$$(bm(:Y))\$")
ph = jime(Xh ; title)
All of the singular components are better recovered, including the ones that were near or below the noise threshold.
sv2 = [
sum(svd(Xh).U[:,1:r] .* svd(Xtrue).U[:,1:r], dims=1).^2
sum(svd(Xh).V[:,1:r] .* svd(Xtrue).V[:,1:r], dims=1).^2
]
2×5 Matrix{Float64}:
0.998662 0.99398 0.991446 0.964283 0.940057
0.999397 0.998669 0.998008 0.9779 0.954923
Summary
pa = jim(stack((Xtrue, abs.(Z), Y, Xr, 6*badpixel, Xh));
ncol=1, size=(600, 900), clim=(0,9))
More outliers
Now examine a case where the outliers are stronger and more prevalent.
pout2 = 0.1
Z = outliers((M,N), 50, pout2)
Y = Xtrue + Z
title = latexstring("\$$(bm(:Y))\$ : Corrupted image matrix\n(with $(100*pout2)% outliers)")
py2 = jime(Y ; title)
Singular values
Now all of the singular values of $X$ are below the "noise floor" caused by outliers.
ps3 = plot(
title = "Singular values",
xaxis = (L"k", (1, N), [1, 3, 6, N]),
yaxis = (L"σ_k",),
leftmargin = 15px, bottommargin = 20px, size = (600,350), widen = true,
)
sv_x = svdvals(Xtrue)
sv_y = svdvals(Y)
scatter!(sv_y, color=:red, label="Y (data)", marker=:dtriangle)
scatter!(sv_x, color=:blue, label="Xtrue", marker=:utriangle)
prompt()
Low-rank estimate
A simple low-rank estimate of $X$ from the first few SVD components of $Y$ does not work at all now for such heavily corrupted data.
r = 5
U,s,V = svd(Y)
Xr = U[:,1:r] * Diagonal(s[1:r]) * V[:,1:r]'
title = latexstring("Rank $r approximation of data \$$(bm(:Y))\$")
pr2 = jime(Xr ; title)
Examine singular vector estimates. The first one is so-so, the rest are useless.
sv3 = [
sum(svd(Xr).U[:,1:r] .* svd(Xtrue).U[:,1:r], dims=1).^2
sum(svd(Xr).V[:,1:r] .* svd(Xtrue).V[:,1:r], dims=1).^2
]
2×5 Matrix{Float64}:
0.696924 0.0419613 0.0178384 9.77862e-6 0.00170552
0.861749 0.113886 0.0130079 0.0265428 0.00290215
Non-iterative "robust" PCA
Try simple outlier removal method. Look at the residual between $\hat{X}$ and $Y$:
residual = Xr - Y
pd2 = jim1(residual; clim = (-1,1) .* 70, cticks = (-1:1:1) * 8,
title = latexstring("Residual \$$(bm(:Y)) - \\hat{$(bm(:X))}\$"),
)
Identify "bad" pixels with large residual errors. This is a nonlinear operation:
badpixel = @. abs(residual) > 10
jim1(badpixel)
Replace "bad" pixels with typical image values
Ymod = copy(Y)
Ymod[badpixel] .= mean(Y[.!badpixel])
jime(Ymod) # already reduces NRMSE by a lot compared to Y itself!
Examine singular values of modified $Y$. The noise floor is lower.
ps4 = plot(
title = "Singular values",
xaxis = (L"k", (1, N), [1, 3, 6, N]),
yaxis = (L"σ_k",),
leftmargin = 15px, bottommargin = 20px, size = (600,350), widen = true,
)
sv_f = svdvals(Ymod)
scatter!(sv_f, color=:green, label="Y (modified)", marker=:hex)
scatter!(sv_x, color=:blue, label="Xtrue", marker=:utriangle)
prompt()
Applying low-rank matrix approximation to modified $Y$ leads to lower NRMSE.
Um,sm,Vm = svd(Ymod)
Xh = Um[:,1:r] * Diagonal(sm[1:r]) * Vm[:,1:r]'
title = latexstring("Rank $r approximation of modified data \$$(bm(:Y))\$")
ph2 = jime(Xh ; title)
Now the first three singular components are better recovered.
sv4 = [
sum(svd(Xh).U[:,1:r] .* svd(Xtrue).U[:,1:r], dims=1).^2
sum(svd(Xh).V[:,1:r] .* svd(Xtrue).V[:,1:r], dims=1).^2
]
2×5 Matrix{Float64}:
0.981316 0.867786 0.762916 0.0147503 0.0583367
0.987833 0.968847 0.735703 0.00146397 0.139768
Let's try iterating to see if we can refine it. Indeed it does refine it, but at this point it starts to become an ad hoc iterative method. If we going to iterate, then it seems preferable to use a cost function like the one used in robust PCA, with a proper optimization algorithm.
residual = Xh - Y
badpixel = @. abs(residual) > 10
jim1(badpixel)
Ymod = copy(Y)
Ymod[badpixel] .= mean(Y[.!badpixel])
Um,sm,Vm = svd(Ymod)
Xh3 = Um[:,1:r] * Diagonal(sm[1:r]) * Vm[:,1:r]'
title = latexstring("Rank $r approximation of modified data \$$(bm(:Y))\$")
ph3 = jime(Xh3 ; title)
Reproducibility
This page was generated with the following version of Julia:
using InteractiveUtils: versioninfo
io = IOBuffer(); versioninfo(io); split(String(take!(io)), '\n')
11-element Vector{SubString{String}}:
"Julia Version 1.11.1"
"Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)"
"Build Info:"
" Official https://julialang.org/ release"
"Platform Info:"
" OS: Linux (x86_64-linux-gnu)"
" CPU: 4 × AMD EPYC 7763 64-Core Processor"
" WORD_SIZE: 64"
" LLVM: libLLVM-16.0.6 (ORCJIT, znver3)"
"Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)"
""
And with the following package versions
import Pkg; Pkg.status()
Status `~/work/book-la-demo/book-la-demo/docs/Project.toml`
[6e4b80f9] BenchmarkTools v1.5.0
[aaaa29a8] Clustering v0.15.7
[35d6a980] ColorSchemes v3.27.1
⌅ [3da002f7] ColorTypes v0.11.5
⌃ [c3611d14] ColorVectorSpace v0.10.0
[717857b8] DSP v0.7.10
[72c85766] Demos v0.1.0 `~/work/book-la-demo/book-la-demo`
[e30172f5] Documenter v1.7.0
[4f61f5a4] FFTViews v0.3.2
[7a1cc6ca] FFTW v1.8.0
[587475ba] Flux v0.14.25
[a09fc81d] ImageCore v0.10.4
[71a99df6] ImagePhantoms v0.8.1
[b964fa9f] LaTeXStrings v1.4.0
[7031d0ef] LazyGrids v1.0.0
[599c1a8e] LinearMapsAA v0.12.0
[98b081ad] Literate v2.20.1
[7035ae7a] MIRT v0.18.2
[170b2178] MIRTjim v0.25.0
[eb30cadb] MLDatasets v0.7.18
[efe261a4] NFFT v0.13.5
[6ef6ca0d] NMF v1.0.3
[15e1cf62] NPZ v0.4.3
[0b1bfda6] OneHotArrays v0.2.5
[429524aa] Optim v1.10.0
[91a5bcdd] Plots v1.40.9
[f27b6e38] Polynomials v4.0.11
[2913bbd2] StatsBase v0.34.3
[d6d074c3] VideoIO v1.1.0
[b77e0a4c] InteractiveUtils v1.11.0
[37e2e46d] LinearAlgebra v1.11.0
[44cfe95a] Pkg v1.11.0
[9a3f8284] Random v1.11.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`
This page was generated using Literate.jl.