Blind Source Separation
by Sparse Decomposition
Michael Zibulevsky 
michael@cs.unm.edu 
Barak A. Pearlmutter 
bap@cs.unm.edu 
Computer Science Dept, FEC 313
University of New Mexico
Albuquerque, NM 87131 USA
Abstract
The blind source separation problem is to extract the underlying
source signals from a set of their linear mixtures, where the mixing
matrix is unknown. This situation is common, e.g., in acoustics, radio,
and medical signal processing.
We exploit the property of the sources to have a sparse representation
in a corresponding (possibly overcomplete) signal dictionary. Such a
dictionary may consist of wavelets, wavelet packets, etc., or be obtained
by learning from a given family of signals.
Starting from the maximum a posteriori framework, which is applicable
to the case of more sources than mixtures, we derive a few other
categories of objective functions, which provide faster and more
robust computations when there are an equal number of sources and mixtures.
Our experiments with artificial signals and with musical sounds
demonstrate significantly better separation than other known techniques.
Introduction
We consider the problem of blind separation of source signals
from a set of their linear mixtures, including the case when the number of
sources is larger than the number of mixtures.
This work can be considered a natural generalization
of the Bell-Sejnowski Infomax [2] and maximum a posteriori
[13,15] methods of blind source separation.
We assume that the source signals have a sparse representation
in a corresponding (possibly overcomplete) signal dictionary.
In this way independence and sparsity are not required from
the signals themselves, but rather from their
decomposition coefficients, which is more natural in many practical cases.
On the other hand our approach may be considered an extension of
basis pursuit [7] to the case of signal
mixtures.
This paper is organized as follows. Section 2 gives the
problem formulation and assumptions. In Section 3 we
present the maximum a posteriori approach, which is applicable to
the case of more sources than mixtures. In Section 4 we
derive another objective function, which provides more robust
computations when there are an equal number of sources and mixtures.
Section 5 presents sequential source extraction using
quadratic programming with nonconvex quadratic constraints. Finally,
in Section 6 we derive a faster method for non-overcomplete
dictionaries and demonstrate high-quality separation of synthetically
mixed musical sounds.
Problem Formulation and Assumptions
Let x(t) be an N-dimensional vector of sensor signals which is an
instantaneous linear mixture of M unknown ``independent'' sources s(t),
possibly corrupted by additive noise ξ(t):

    x(t) = A s(t) + ξ(t)    (1)

We will estimate the unknown mixing matrix of real numbers A (up to
permutation and scaling of its columns) and the source signals s(t) (up to
component permutation and scaling).
We take advantage of the fact that many natural
sources of signal have a sparse representation in a proper
signal dictionary {φ_k(t)}:

    s_i(t) = Σ_{k=1..K} C_{ik} φ_k(t)    (2)
The functions φ_k(t) are called atoms or elements of the dictionary.
These elements do not have to be linearly independent, and instead may
form an overcomplete dictionary. Important examples are
wavelet-related dictionaries (wavelet packets, etc.) [7,16].
Sparsity means that only a small number of the coefficients C_{ik}
differ significantly from zero.
In the discrete time case t = 1, ..., T we will use matrix
notation. For example, X will be an N × T matrix, with the
signal x_i(t) in row i. S will be an M × T matrix with
underlying source s_j(t) in row j, and Φ a K × T
matrix with atom φ_k(t) in row k, so that

    X = A S + ξ    (3)

    S = C Φ    (4)
We suppose that the coefficients C_{ik} are statistically independent random
variables with a probability density function (pdf) of an exponential type

    p(C_{ik}) ∝ exp( −h(C_{ik}) )    (5)
This kind of distribution is widely used for modeling sparsity
[15,18]. A reasonable choice of h(c) may be

    h(c) = |c|    (6)

or its smooth approximations. In our computations we use a family of convex
smooth approximations to the absolute value,

    h_1(c) = |c| − log(1 + |c|)    (7)

    h_λ(c) = λ h_1(c/λ)    (8)

where λ > 0 is a proximity parameter: h_λ(c) → |c| when λ → 0+.
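To make the smoothing concrete, here is a small sketch (Python standing in for the paper's MATLAB; the closed form below is our reading of (7)–(8), since λ h_1(c/λ) expands to |c| − λ log(1 + |c|/λ)):

```python
import numpy as np

def h_abs_smooth(c, lam):
    """Smooth convex approximation of |c|: h_lam(c) = |c| - lam*log(1 + |c|/lam).

    Unlike |c| it is differentiable at 0, which makes the sparsity
    penalty usable with gradient-based optimizers; as lam -> 0 it
    tends to |c| pointwise.
    """
    a = np.abs(c)
    return a - lam * np.log1p(a / lam)

# The approximation approaches |c| as the proximity parameter shrinks
c = np.linspace(-2.0, 2.0, 101)
for lam in (1.0, 0.1, 0.01):
    err = np.max(np.abs(h_abs_smooth(c, lam) - np.abs(c)))
    print(f"lam={lam:5.2f}  max deviation from |c|: {err:.4f}")
```

The deviation from |c| is at most λ log(1 + |c|/λ), so it shrinks roughly linearly in λ.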
We also suppose a priori that the matrix A is uniformly
distributed over the range of interest, and that the noise ξ(t)
is a spatially and temporally uncorrelated Gaussian process^{1} with
zero mean and variance σ².
Maximum A Posteriori Approach
We wish to maximize the posterior distribution

    max_{A,C} P(A, C | X) ∝ max_{A,C} P(X | A, C) P(A) P(C)    (9)

where P(X | A, C) is the conditional probability of observing X
given A and C. Taking into account (3), (4), and the white
Gaussian noise, we get

    P(X | A, C) ∝ exp( −(1/2σ²) ||A C Φ − X||²_F )    (10)
By the statistical independence of
the coefficients C_{jk} and (5), the prior pdf of C is

    P(C) ∝ Π_{j,k} exp( −h(C_{jk}) )    (11)
If the prior pdf P(A) is uniform, it can be dropped^{2} from
(9). In this way we are left with the problem

    max_{A,C} P(X | A, C) P(C)    (12)
By substituting (10) and (11) into (12), taking the
logarithm, and inverting the sign, we obtain the following
optimization problem

    min_{A,C}  (1/2σ²) ||A C Φ − X||²_F + Σ_{j,k} h(C_{jk})    (13)

where ||·||_F is the Frobenius matrix norm.
One can consider this objective as a generalization of [18,17]
by incorporating the matrix Φ, or as a generalization of [7]
by including the matrix A.
One problem with such a formulation is that it can lead to the degenerate
solution in which C → 0 while A grows without bound.
We can overcome this difficulty in various ways. The first approach is to
force each row A_i of the mixing matrix A to be bounded in norm,

    ||A_i|| ≤ 1,   i = 1, ..., N    (14)

The second way is to restrict the norm of the rows C_j from below:

    ||C_j|| ≥ 1,   j = 1, ..., M    (15)
A third way is to re-estimate scale parameters σ_j of the coefficient
distributions based on the current values of C_j. For example, this can
be done using the sampling variance as follows:
for a given function h(·) in the distribution (5), applied as
h(C_{jk}/σ_j), express the variance of C_{jk} as a function f(σ_j). An
estimate of σ_j can be obtained by applying the corresponding
inverse function to the sampling variance,

    σ̂_j = f⁻¹( (1/K) Σ_k C²_{jk} )    (16)

In particular, when h(c) = |c|, the variance is f(σ_j) = 2σ_j², and

    σ̂_j = ( (1/2K) Σ_k C²_{jk} )^{1/2}    (17)

Substituting σ̂_j and h(c) = |c| into (13), we obtain

    min_{A,C}  (1/2σ²) ||A C Φ − X||²_F + √(2K) Σ_j ( Σ_k |C_{jk}| ) / ( Σ_k C²_{jk} )^{1/2}    (18)
A remarkable property of this objective function is its invariance
to rescaling of the rows of C when a corresponding inverse rescaling
is applied to the columns of A.
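This invariance can be checked numerically. The sketch below (Python; sizes are illustrative, and it assumes that with h(c) = |c| the sparsity term of (18) reduces to a row-wise ℓ1-to-ℓ2 norm ratio, which is scale-invariant):

```python
import numpy as np

rng = np.random.default_rng(0)

def data_term(A, C, Phi, X):
    # Least-squares (Frobenius) fit term of objectives (13)/(18)
    return 0.5 * np.linalg.norm(A @ C @ Phi - X, "fro") ** 2

def sparsity_term(C):
    # Row-wise l1/l2 ratio: invariant to rescaling the rows of C
    return sum(np.linalg.norm(row, 1) / np.linalg.norm(row, 2) for row in C)

M, K, T, N = 3, 8, 16, 3
A = rng.normal(size=(N, M))
C = rng.normal(size=(M, K))
Phi = rng.normal(size=(K, T))
X = A @ C @ Phi

# Scale row j of C by alpha and column j of A by 1/alpha:
alpha, j = 7.3, 1
A2, C2 = A.copy(), C.copy()
C2[j] *= alpha
A2[:, j] /= alpha

print(data_term(A, C, Phi, X), data_term(A2, C2, Phi, X))   # unchanged
print(sparsity_term(C), sparsity_term(C2))                  # unchanged
```

The product A C is unchanged by the compensating rescaling, so the data term is invariant, and each row's ℓ1/ℓ2 ratio cancels the scale factor, so the penalty is invariant too.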
In our experiments we used the standard wavelet packet dictionary with the
basic wavelet symmlet-8. When the signal length is 64 samples, this
dictionary consists of 448 atoms, i.e., it is overcomplete by a factor of
seven. Examples of atoms and their images in the time-frequency phase plane
[9,16] are shown in Figure 1.
We used the ATOMIZER [8]
and WAVELAB [4] MATLAB packages for fast multiplication
by Φ and its transpose.
Figure 1:
Examples of atoms: time-frequency phase plane (left) and time
plot (right).

We created three sparse sources (Figure 2, top), each
composed of two or three atoms.
Figure 2:
Sources (top three), mixtures (center two), reconstructed
sources (bottom three), in both the time-frequency phase plane (left) and
the time domain (right).

The first two sources have significant cross-correlation, equal to
0.34, which makes separation difficult for conventional methods.
Two synthetic sensor signals (Figure 2, center) were obtained
as a linear mixture of the sources.
In order to measure the accuracy of the separation, we normalized the
original sources so that Σ_t s̃²_i(t) = 1,
and the estimated sources so that Σ_t ŝ²_i(t) = 1.
The error was then computed as

    Error_i = 100% · Σ_t ( ŝ_i(t) − s̃_i(t) )²    (19)
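A sketch of this error measure (our reading of (19): both signals normalized to unit ℓ2 norm before comparison; the sign alignment below is our addition to handle the stated scaling ambiguity):

```python
import numpy as np

def separation_error(s_true, s_est):
    """Percent relative error between a true and an estimated source.

    Both signals are normalized to unit l2 norm; the sign of the
    estimate is flipped if needed, since blind separation recovers
    sources only up to scale (including sign).
    """
    a = s_true / np.linalg.norm(s_true)
    b = s_est / np.linalg.norm(s_est)
    b = b if a @ b >= 0 else -b          # resolve the sign ambiguity
    return 100.0 * np.sum((a - b) ** 2)

s = np.array([1.0, -2.0, 3.0, 0.0])
print(separation_error(s, 2.5 * s))      # scaling is factored out (~0)
```

Because of the normalization, any rescaling of the estimate leaves the error unchanged.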
We tested two methods with this data. The first method used the
objective function (13) and the constraints (15), while the second method
used the objective function (18).
As a tool for constrained optimization we used the PBM method [3].
Unconstrained optimization was performed by the method of conjugate gradients
using the TOMLAB package [10]; the same tool was used for the internal
unconstrained optimization in PBM.
In all cases we used the smooth approximation h_λ(·) defined by
(7) and (8).
The resulting errors of the source estimates were 0.09% and 0.02% for the
first and second methods, respectively.
The estimated sources are shown in the bottom three traces of Figure 2.
They are visually indistinguishable from the original sources, shown
in the top three traces of Figure 2.
Equal number of sources and sensors: more robust
formulations
The main difficulty in an optimization problem like (13) is the
bilinear term ACΦ, which destroys the convexity of the objective
function and makes convergence unstable when optimization starts far
from the solution. In this section we consider more robust formulations
for the case when the number of sensors is equal to the number of
sources, N = M, and the mixing matrix is invertible, with W = A⁻¹.
In the case when the noise is small and the matrix A is far
from singular, WX gives a reasonable estimate of the source signals S.
Taking into account (4), we obtain a least-squares term
||WX − CΦ||²_F, so the separation objective may be written

    min_{W,C}  (1/2) ||W X − C Φ||²_F + Σ_{j,k} h(C_{jk})    (20)
We also need to add a constraint which enforces the nonsingularity of W.
For example, we can restrict its minimal singular value
r_min(W) from below:

    r_min(W) ≥ 1    (21)

It can be shown that in the noiseless case the problem (20)-(21)
is equivalent to the maximum a posteriori formulation (13)
with a corresponding bound on the maximal singular value of A.
Another possibility for ensuring the nonsingularity of W is to subtract
a log-determinant term from the objective:

    min_{W,C}  (1/2) ||W X − C Φ||²_F − K log |det W| + Σ_{j,k} h(C_{jk})    (22)

When the noise is zero and Φ is the identity matrix (so K = T),
we can substitute C = WX and obtain the Bell-Sejnowski Infomax
objective [2]

    min_W  −T log |det W| + Σ_{j,t} h( (W X)_{jt} )    (23)
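A minimal sketch of evaluating an objective of this form (our reconstruction: the log-determinant weighted by the number of samples, and a smoothed absolute value standing in for h):

```python
import numpy as np

def h(c, lam=0.01):
    # smooth stand-in for |c| (an assumption, matching the family in (7)-(8))
    a = np.abs(c)
    return a - lam * np.log1p(a / lam)

def infomax_objective(W, X):
    """Objective of the form (23): -T*log|det W| + sum_{j,t} h((W X)_{jt}).

    The sparsity term rewards unmixed signals WX with few large values;
    the log-det term blocks the trivial W -> 0 solution and keeps W
    away from singularity.
    """
    T = X.shape[1]
    sign, logdet = np.linalg.slogdet(W)
    if sign == 0:
        return np.inf                  # singular W is inadmissible
    return -T * logdet + np.sum(h(W @ X))

X = np.array([[1.0, -1.0], [0.5, 2.0]])
print(infomax_objective(np.eye(2), X))   # with W = I this is just sum h(X)
```

Using `slogdet` rather than `det` avoids overflow and gives log|det W| directly.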
We created two sparse sources (Figure 3, top) with a strong
cross-correlation of 0.52.
Figure 3:
Sources (top two), mixtures (center two), reconstructed
sources (bottom two), in both the time-frequency phase plane (left) and
the time domain (right).

Separation produced by minimization of the objective function (22)
gave an error of 0.23%.
For comparison we tested the JADE [6,5],
FastICA [12,11] and Bell-Sejnowski Infomax [2,1]
algorithms on the same signals. The resulting relative errors
(Figure 4) confirm the significant superiority of the sparse
decomposition approach.
Figure 4:
Percent relative error of separation of the artificial sparse
sources recovered by (1) JADE, (2) FastICA, (3) Bell-Sejnowski
Infomax, (4) Equation (22).

Sequential Extraction of Sources via Quadratic Programming
Let us ask what is the most ``sparse'' signal one can obtain
by a linear combination of the sensor signals,
s = wᵀX.
By sparsity, as before, we mean the ability of the signal to be
approximated by a linear combination of a small number of dictionary
elements φ_k. This leads us to the following objective:

    min_{w,c}  (1/2) ||wᵀX − cᵀΦ||² + Σ_k h(c_k)    (24)

where the term Σ_k h(c_k) may be considered as a penalty on
non-sparsity. In order to avoid the trivial solution of w = 0 and c = 0,
we need to add a constraint that separates w from zero.
It could be, for example,

    ||w||² = 1    (25)
A similar constraint can be used as
a tool to extract all the sources sequentially:
the new separation vector wʲ should have a component of unit norm
in the subspace orthogonal to the previously extracted vectors,

    ||P^{j−1} wʲ||² = 1    (26)

where P^{j−1} is an orthogonal projector onto the orthogonal complement
of span{w¹, ..., w^{j−1}}.
When h(c) = |c|, we can use the standard substitution
c = c⁺ − c⁻ with c⁺, c⁻ ≥ 0,
which transforms (24) and (26) into the following quadratic program:

    min_{w, c⁺, c⁻}  (1/2) ||wᵀX − (c⁺ − c⁻)ᵀΦ||² + eᵀ(c⁺ + c⁻)
    subject to  c⁺ ≥ 0,  c⁻ ≥ 0,  ||P^{j−1} w||² = 1

where e is a vector of ones.
Fast Solution in Non-overcomplete Dictionaries
In important applications, the sensor signals may have hundreds of
channels and hundreds of thousands of samples. This may make
separation computationally difficult. Here we present an approach
which compromises between statistical and computational efficiency.
In our experience this approach provides high quality of separation in
reasonable time.
Suppose the dictionary is ``complete'', i.e., it forms a basis in the
space of discrete signals, which means that the matrix Φ is square and
nonsingular. As examples of such a dictionary one can think of the
Fourier basis, Gabor basis, various wavelet-related bases, etc. We can
also obtain an ``optimal'' dictionary by learning from a given family
of signals [15,14,18,17].
Let us denote the dual basis

    Ψ = Φ⁻¹    (27)

and suppose that the coefficients of decomposition of the sources,

    C = S Ψ    (28)

are sparse and statistically independent. This assumption is
reasonable for properly chosen dictionaries, although of course we
would lose the advantages of overcompleteness [15].
Let Y be the decomposition of the sensor signals:

    Y = X Ψ    (29)

Multiplying both sides of (3) by Ψ from the right and taking
into account (28) and (29), we obtain

    Y = A C + ζ    (30)

where ζ is the decomposition of the noise:

    ζ = ξ Ψ    (31)
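A quick numerical check of the transformation (30), under the assumptions of this section (a square dictionary; here a random orthonormal basis, so that the dual basis is simply the transpose — the sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Complete dictionary: a random orthonormal T x T basis, atoms in rows.
T = 32
Phi = np.linalg.qr(rng.normal(size=(T, T)))[0]
Psi = Phi.T                                   # dual basis: Phi @ Psi = I

M = N = 3
C = rng.normal(size=(M, T)) * (rng.random((M, T)) < 0.1)  # sparse coefficients
S = C @ Phi                                   # sources, eq. (4)
A = rng.normal(size=(N, M))
xi = 0.01 * rng.normal(size=(N, T))
X = A @ S + xi                                # mixtures, eq. (3)

Y = X @ Psi                                   # sensor decomposition, eq. (29)
zeta = xi @ Psi                               # noise decomposition, eq. (31)
print(np.allclose(Y, A @ C + zeta))           # eq. (30) holds: True
```

With an orthogonal Ψ the transformed noise ζ remains white, which is exactly the "easy" situation described next.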
In this paper we consider an ``easy''
situation, when ζ is white noise; this requires orthogonality of Φ.
We can see that all the objective functions from Sections
3-5 remain valid if we remove Φ from them
(substituting instead the identity matrix) and replace the sensor
signals X by their decomposition Y. For example, the maximum a posteriori
objectives (13) and (18) are transformed into

    min_{A,C}  (1/2σ²) ||A C − Y||²_F + Σ_{j,k} h(C_{jk})    (32)

and

    min_{A,C}  (1/2σ²) ||A C − Y||²_F + √(2T) Σ_j ( Σ_k |C_{jk}| ) / ( Σ_k C²_{jk} )^{1/2}    (33)

The objective (22) becomes

    min_{W,C}  (1/2) ||W Y − C||²_F − T log |det W| + Σ_{j,k} h(C_{jk})    (34)

When the noise is zero, we can substitute C = WY and obtain
the Bell-Sejnowski Infomax objective [2]

    min_W  −T log |det W| + Σ_{j,k} h( (W Y)_{jk} )    (35)
Also other known methods (for example, [13,15]), which
normally assume sparsity of source signals, may be directly applied to
the decomposition Y of the sensor signals. This may be more
efficient than the traditional approach, and the reason is obvious:
typically, a properly chosen decomposition gives significantly higher
sparsity than the raw signals had originally. Also, statistical
independence of the coefficients is a more reasonable assumption than
statistical independence of the raw signal samples.
In our experiments we artificially mixed seven 5-second fragments of
musical sound recordings taken from commercial
digital audio CDs. Each of them included 40k samples after
downsampling by a factor of 5 (Figure 5).
Figure 5:
Separation of musical recordings taken from commercial
digital audio CDs (five-second fragments). Original
sources (left); mixtures (center); separated sources (right).

The easiest way to perform sparse decomposition of such sources is to
compute a spectrogram, the coefficients of a time-windowed
discrete Fourier transform.
(We used the function SPECGRAM from the MATLAB signal processing toolbox
with a time window of 1024 samples.)
The sparsity of the spectrogram coefficients (the histogram in
Figure 6, right) is much higher than the sparsity of the
original signal (Figure 6, left).
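The sparsity gain from moving to spectrogram coefficients can be illustrated with a toy signal (a sketch using scipy's `stft` in place of MATLAB's SPECGRAM; the test signal and window length are illustrative). Excess kurtosis serves here as a rough sparsity measure — the STFT coefficients of a tonal signal are far more heavy-tailed than its raw samples:

```python
import numpy as np
from scipy.signal import stft
from scipy.stats import kurtosis

fs = 8000
t = np.arange(0, 2.0, 1 / fs)
# A toy "musical" signal: two notes occupying disjoint time intervals
x = np.where(t < 1.0, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t))

_, _, Z = stft(x, fs=fs, nperseg=1024)
# Separate real/imaginary entries, as the paper does for the matrix Y
coeffs = np.concatenate([Z.real.ravel(), Z.imag.ravel()])

print("excess kurtosis of raw samples:       %.1f" % kurtosis(x))
print("excess kurtosis of STFT coefficients: %.1f" % kurtosis(coeffs))
```

A sinusoid's samples have excess kurtosis near −1.5 (flatter than Gaussian), while the STFT concentrates the energy into a few large coefficients, giving a strongly positive value.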
In this case Y (29) is a real matrix, with separate entries
for the real and imaginary components of each spectrogram coefficient
of the sensor signals X. We used the objective function (35) with
h_λ(·) defined by (7) and (8). Unconstrained
minimization was performed by a BFGS quasi-Newton algorithm (MATLAB function
FMINU).
This algorithm separated the sources with a relative error of 0.67%
for the least well separated source (error computed according to
(19)). We also applied the Bell-Sejnowski
Infomax algorithm [2], implemented in [1], to the
spectrogram coefficients Y of the sensor signals. Separation errors
were slightly larger: 0.9%.
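A minimal end-to-end sketch of this step (scipy's BFGS in place of MATLAB's FMINU; the problem size, the smoothed h, and the random data are illustrative, not the paper's setup):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

def h(c, lam=0.01):
    a = np.abs(c)
    return a - lam * np.log1p(a / lam)     # smooth stand-in for |c|

def objective(w_flat, Y, n):
    """Objective of the form (35), with W passed as a flat vector."""
    W = w_flat.reshape(n, n)
    sign, logdet = np.linalg.slogdet(W)
    if sign == 0:
        return 1e12                        # crude barrier against singular W
    return -Y.shape[1] * logdet + np.sum(h(W @ Y))

# Tiny synthetic problem: sparse coefficients mixed by a random A
n, K = 2, 400
C = rng.laplace(size=(n, K)) * (rng.random((n, K)) < 0.2)
A = rng.normal(size=(n, n))
Y = A @ C

W0 = np.eye(n).ravel()
res = minimize(objective, W0, args=(Y, n), method="BFGS")
print("objective: %.1f -> %.1f" % (objective(W0, Y, n), res.fun))
```

BFGS here estimates gradients by finite differences, which is adequate for this tiny sketch; for realistic sizes one would supply the analytic gradient of (35).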
For comparison we tested the JADE [6,5],
FastICA [12,11] and Bell-Sejnowski Infomax algorithms on the
same signals. The resulting relative errors (Figure 7)
confirm the significant (by a factor of more than 10) superiority of
the sparse decomposition approach.
Figure 6:
Histogram of sound source values (left) and
spectrogram coefficients (right), shown with
linear y-scale (top), square-root y-scale (center) and
logarithmic y-scale (bottom).

Figure 7:
Percent relative error of separation of seven musical sources
recovered by (1) JADE, (2) FastICA, (3) Bell-Sejnowski Infomax,
(4) Infomax applied to the spectrogram coefficients,
(5) BFGS minimization of the objective (35) with the spectrogram
coefficients.

We should mention an alternative to the maximum a posteriori approach
(12). Considering the mixing matrix A as a parameter, we can estimate it
by maximizing the probability of the observed signal X,

    max_A ∫ P(X | A, C) P(C) dC

The integral over all possible coefficients C may be approximated,
for example, by Monte-Carlo sampling or by a matching Gaussian,
in the spirit of [15,14]. It would be interesting to
compare this possibility to the other methods presented in this paper.
Another important direction is the problem of blind
separation-deconvolution of convolutive mixtures of signals (see, for
example, [2]). In this case
the matrices A and W will have linear filters as elements,
and multiplication by an element will mean
convolution. Even in this matrix-of-filters context most of the
formulae in this paper remain valid.
In this paper we showed that the use of sparse decomposition
in a corresponding signal dictionary provides high-quality blind source
separation. The maximum a posteriori framework gives the most general
approach, including the situation of more sources than sensors.
Faster and more computationally robust solutions are possible in the
case of an equal number of sources and sensors.
We can also extract the sources sequentially
using quadratic programming with nonconvex quadratic constraints.
Finally, the fastest solution may be obtained using non-overcomplete
dictionaries. Our experiments with artificial
signals and digitally mixed musical sounds demonstrate a high quality
of source separation, compared to other known techniques.
This research was partially supported by NSF CAREER award 9702311 to
BAP and by NSF CISE Research Infrastructure award CDA9503064, and by
postdoctoral funding for MZ from the Albuquerque High Performance
Computing Center.
References

[1] ICA/EEG toolbox. Computational Neurobiology Laboratory, The Salk Institute. http://www.cnl.salk.edu/~tewon/ica_cnl.html.

[2] Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.

[3] A. Ben-Tal and M. Zibulevsky. Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7(2):347-366, 1997.

[4] J. Buckheit, S.S. Chen, D.L. Donoho, I. Johnstone, and J. Scargle. About WaveLab. Technical report, Department of Statistics, Stanford University, 1995. http://www-stat.stanford.edu/~donoho/Reports/.

[5] Jean-François Cardoso. JADE for real-valued data. http://sig.enst.fr/~cardoso/guidesepsou.html.

[6] Jean-François Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11(1):157-192, 1999.

[7] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. 1996. http://www-stat.stanford.edu/~donoho/Reports/.

[8] S.S. Chen, D.L. Donoho, M.A. Saunders, I. Johnstone, and J. Scargle. About Atomizer. Technical report, Department of Statistics, Stanford University, 1995. http://www-stat.stanford.edu/~donoho/Reports/.

[9] R.R. Coifman and M.V. Wickerhauser. Entropy-based algorithms for best-basis selection. IEEE Transactions on Information Theory, 38:713-718, 1992.

[10] K. Holmstrom and M. Bjorkman. The TOMLAB NLPLIB. Advanced Modeling and Optimization, 1:70-86, 1999. http://www.ima.mdh.se/tom/.

[11] A. Hyvärinen. The FastICA MATLAB package. http://www.cis.hut.fi/~aapo/.

[12] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.

[13] T.-W. Lee, M.S. Lewicki, M. Girolami, and T.J. Sejnowski. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 1998. To appear.

[14] M.S. Lewicki and B.A. Olshausen. A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America, 1999. In press.

[15] M.S. Lewicki and T.J. Sejnowski. Learning overcomplete representations. Neural Computation, 1998. To appear.

[16] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.

[17] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.

[18] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.
Footnotes

^{1} The assumption of the noise whiteness is for simplicity of exposition and can be easily removed.

^{2} Otherwise, if P(A) is some other known function, we should use (9) directly.