Blind Source Separation
by Sparse Decomposition
Michael Zibulevsky and Barak A. Pearlmutter
Computer Science Dept, FEC 313
University of New Mexico
Albuquerque, NM 87131 USA
The blind source separation problem is to extract the underlying
source signals from a set of their linear mixtures, where the mixing matrix is unknown.
This situation is common, e.g., in acoustics, radio, and medical signal processing.
We exploit the property of the sources of having a sparse representation
in a corresponding (possibly overcomplete) signal dictionary. Such a dictionary
may consist of wavelets, wavelet packets, etc., or be obtained
by learning from a given family of signals.
Starting from the maximum a posteriori framework, which is applicable
to the case of more sources than mixtures, we derive a few other
categories of objective functions which provide faster and more
robust computations when there are an equal number of sources and mixtures.
Our experiments with artificial
signals and with musical sounds demonstrate significantly better
separation than other known techniques.
We consider the problem of blind separation of source signals
from a set of their linear mixtures, including the case when the number of
sources is larger than the number of mixtures.
This work can be considered a natural generalization
of the Bell-Sejnowski Infomax and maximum a posteriori
[13,15] methods of blind source separation.
We assume that the source signals have a sparse representation
in a corresponding (possibly overcomplete) signal dictionary.
In this way independence and sparsity are not required from
the signals themselves, but rather from their
decomposition coefficients, which is more natural in many practical cases.
On the other hand, our approach may be considered an extension of
basis pursuit to the setting of blind separation, where several
signals are mixed by an unknown matrix.
This paper is organized as follows. Section 2 gives the
problem formulation and assumptions. In Section 3 we
present the maximum a posteriori approach, which is applicable to
the case of more sources than mixtures. In Section 4 we
derive another objective function, which provides more robust
computations when there are an equal number of sources and mixtures.
Section 5 presents sequential source extraction using
quadratic programming with non-convex quadratic constraints. Finally,
in Section 6 we derive a faster method for non-overcomplete
dictionaries and demonstrate high-quality separation of synthetically
mixed musical sounds.
Problem Formulation and Assumptions
Let x(t) be an N-dimensional vector of sensor signals which is an
instantaneous linear mixture of M unknown ``independent'' sources s(t),
possibly corrupted by additive noise xi(t):

    x(t) = A s(t) + xi(t).                                        (1)

We will estimate the unknown mixing matrix of real numbers A (up to
row permutation and scaling) and the source signals s(t) (up to
component permutation and scaling).
We take advantage of the fact that many natural
sources of signals have a sparse representation in a proper
signal dictionary:

    s_i(t) = sum_k C_ik phi_k(t),                                 (2)

where the functions phi_k(t) are called atoms or elements
of the dictionary.
These elements do not have to be linearly independent, and instead may
form an overcomplete dictionary. Important examples are
wavelet-related dictionaries (wavelet packets, etc.) [7,16].
Sparsity means that only a small number of the coefficients Cik
differ significantly from zero.
In the discrete-time case we will use matrix notation. For example,
X will be an N x T matrix, with the sensor signal xi(t) in row i;
S will be an M x T matrix with the underlying source sj(t) in row j;
C will be an M x K matrix of decomposition coefficients; and Phi
will be a K x T matrix with the atom phi_k(t) in row k, so that

    X = A S + xi,                                                 (3)
    S = C Phi.                                                    (4)
We suppose that the coefficients C_ik are statistically independent random
variables with a probability density function (pdf) of an exponential type:

    p(C_ik) ~ exp( -gamma_i h(C_ik) ).                            (5)

This kind of distribution is widely used for modeling sparsity.
A reasonable choice of h(c) may be

    h(c) = |c|                                                    (6)

or its smooth approximations. In our computations we use a family of convex smooth
approximations to the absolute value,

    h_lambda(c) = lambda h1(c / lambda),                          (7)
    h1(c) = |c| - log(1 + |c|),                                   (8)

where lambda is a proximity parameter: h_lambda(c) approaches |c| as lambda -> 0+.
We also suppose a priori that the matrix A is uniformly
distributed over the range of interest, and that the noise xi(t)
is a spatially and temporally uncorrelated Gaussian process^1
with zero mean and variance sigma^2.
Maximum A Posteriori Approach
We wish to maximize the posterior distribution

    max_{A,C} P(A, C | X) ~ max_{A,C} P(X | A, C) P(A) P(C),      (9)

where P(X | A, C) is the conditional probability of observing X
given A and C. Taking into account (3), (4), and the white
Gaussian noise, we get

    P(X | A, C) ~ prod_{i,t} exp( -(A C Phi - X)_{it}^2 / (2 sigma^2) ).   (10)
By the statistical independence of the coefficients C_jk and (5),
the prior pdf of C is

    P(C) ~ prod_{j,k} exp( -gamma_j h(C_jk) ).                    (11)
If the prior pdf P(A) is uniform, it can be dropped^2 from (9).
In this way we are left with the problem

    max_{A,C} P(X | A, C) P(C).                                   (12)
By substituting (10) and (11) into (12), taking the
logarithm, and inverting the sign, we obtain the following
optimization problem:

    min_{A,C}  1/(2 sigma^2) ||A C Phi - X||_F^2 + sum_{j,k} gamma_j h(C_jk),   (13)

where ||.||_F is the Frobenius matrix norm.
One can consider this objective as a generalization of [18,17]
by incorporating the matrix Phi, or as a generalization of basis pursuit
by including the mixing matrix A.
One problem with such a formulation is that it can lead to the degenerate
solution C = 0, reached by letting A grow without bound while C shrinks
correspondingly.
We can overcome this difficulty in various ways. The first approach is to
force each row A_i of the mixing matrix A to be bounded in norm:

    ||A_i|| <= 1,   i = 1, ..., N.                                (15)
The second way is to restrict the norm of the rows C_j from below.
A third way is to reestimate the parameters gamma_j based on the current
values of C_j. For example, this can be done using the sampling
variance as follows: for a given function h(c) in the distribution (5),
express the variance of C_jk as a function of gamma_j; an estimate of
gamma_j can then be obtained by applying the corresponding inverse function
to the sampling variance (1/K) sum_k C_jk^2.
In particular, when h(c) = |c| the distribution (5) is Laplacian with
variance 2 / gamma_j^2, so gamma_j is proportional to (sum_k C_jk^2)^{-1/2}.
Substituting this estimate into (13), we obtain

    min_{A,C}  1/(2 sigma^2) ||A C Phi - X||_F^2 + sum_j ( sum_k |C_jk| ) / ( sum_k C_jk^2 )^{1/2}   (18)

up to constant factors in the second term.
A remarkable property of this objective function is its invariance
to rescaling of the rows of C when a corresponding inverse rescaling
is applied to the columns of A.
In our experiments we used the standard wavelet packet dictionary with the basic
wavelet symmlet-8. When the signal length is 64 samples, this dictionary
consists of 448 atoms, i.e., it is overcomplete by a factor of seven.
Examples of atoms and their images in the time-frequency phase plane
are shown in Figure 1.
We used the ATOMIZER and WAVELAB MATLAB packages for fast multiplication
by Phi and its transpose.
Figure 1: Examples of atoms: time-frequency phase plane (left) and time domain (right).
We created three sparse sources (Figure 2, top), each
composed of two or three atoms.
The first two sources have significant cross-correlation, equal to
0.34, which makes separation
difficult for conventional methods.
Two synthetic sensor signals (Figure 2, center) were obtained
as a linear mixture of the sources.
In order to measure the accuracy of the separation, we normalized the
original sources s_i and the estimated sources s~_i to unit 2-norm.
The error was then computed as

    Error = || s~_i - s_i || x 100%.                              (19)
Figure 2: Sources (top three), mixtures (center two), and reconstructed
sources (bottom three), in both the time-frequency phase plane (left) and the time domain (right).
We tested two methods with this data. The first used the
objective function (13) and the constraints (15), while the second used the
objective function (18).
As a tool for constrained optimization we used the PBM method.
Unconstrained optimization was performed by the method of conjugate gradients
using the TOMLAB package; the same tool was used for the internal
unconstrained optimization in PBM.
In all cases we used the smooth approximation (7)-(8) of the absolute value.
The resulting errors of the source estimates were 0.09% and 0.02% for the first and
second methods, respectively.
The estimated sources are shown in the bottom three traces of Figure 2.
They are visually indistinguishable from the original sources, shown
in the top three traces of Figure 2.
Equal Number of Sources and Sensors: More Robust Formulations

The main difficulty in a maximization problem like (13) is the
bilinear term A C Phi, which destroys the convexity of the objective
function and makes convergence unstable when optimization starts far
from the solution. In this section we consider more robust
formulations for the case when the number of sensors is
equal to the number of sources, N = M, and the mixing matrix is invertible.
In the case when the noise is small and the matrix A is far
from singular, W = A^{-1} exists and WX gives a reasonable estimate of
the source signals S. Taking into account (4), we obtain a least-squares
term ||C Phi - W X||_F^2, so the separation objective may be written

    min_{W,C}  1/(2 sigma^2) ||C Phi - W X||_F^2 + sum_{j,k} h(C_jk).   (20)
We also need to add a constraint which enforces the non-singularity of W.
For example, we can restrict its minimal singular value sigma_min(W)
from below:

    sigma_min(W) >= 1.                                            (21)

It can be shown that in the noiseless case the problem (20)-(21)
is equivalent to the maximum a posteriori formulation (13)
with a corresponding constraint on the norms of the rows of A.

Another possibility for ensuring the non-singularity of W is to subtract
a log-determinant term from the objective:

    min_{W,C}  1/(2 sigma^2) ||C Phi - W X||_F^2 - T log |det W| + sum_{j,k} h(C_jk).   (22)

When the noise is zero and Phi is the identity matrix,
we can substitute C = WX and obtain the Bell-Sejnowski Infomax
objective

    min_W  -T log |det W| + sum_{j,t} h( (W X)_{jt} ).
We created two sparse sources (Figure 3, top) with strong
cross-correlation of 0.52.
Separation, produced by minimization of the objective function (22),
gave an error of 0.23%.
For comparison we tested the JADE [6,5],
FastICA [12,11] and Bell-Sejnowski Infomax
algorithms on the
same signals. The resulting relative errors (Figure 4) demonstrate the
significant superiority of the sparse decomposition approach.
Figure 3: Sources (top two), mixtures (center two), and reconstructed
sources (bottom two), in both the time-frequency phase plane (left) and the time domain (right).
Figure 4: Percent relative error of separation of the artificial sparse
sources recovered by (1) JADE, (2) FastICA, (3) Bell-Sejnowski
Infomax, and (4) minimization of the objective (22).
Sequential Extraction of Sources via Quadratic Programming

Let us ask what is the most ``sparse'' signal one can obtain
by a linear combination of the sensor signals, s = w^T X.
By sparsity, as before, we mean the ability of the signal to be
approximated by a linear combination of a small number of dictionary
elements.
This will lead us to the following objective:

    min_{w,c}  1/2 ||c Phi - w^T X||^2 + sum_k h(c_k),            (24)

where the term sum_k h(c_k) may be considered a penalty on
non-sparsity. In order to avoid the trivial solution of w = 0 and c = 0,
we need to add a constraint that separates w from zero.
It could be, for example,

    ||w|| >= 1.                                                   (25)

A similar constraint can be used as a tool to extract all the sources
sequentially: the new separation vector w_j should have a component of unit norm
in the subspace orthogonal to the previously extracted vectors,

    ||P_{j-1} w_j|| = 1,                                          (26)

where P_{j-1} is an orthogonal projector onto that subspace.
Taking h(c) = |c|, we can use the standard substitution c = c+ - c-,
with c+ >= 0 and c- >= 0, that transforms (24) and (26) into the
following quadratic program:

    min_{w, c+, c-}  1/2 ||(c+ - c-) Phi - w^T X||^2 + e^T (c+ + c-)
    subject to  c+ >= 0,  c- >= 0,  ||P_{j-1} w||^2 = 1,

where e is a vector of ones. The last constraint is quadratic but non-convex.
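A minimal Python sketch of one extraction step. A general-purpose solver (scipy's SLSQP) stands in for a dedicated QP method and does not exploit the problem structure; all names are hypothetical:

    import numpy as np
    from scipy.optimize import minimize

    def extract_one(X, Phi, P):
        # One extraction step: after the substitution c = c+ - c-,
        # minimize 0.5*||(c+ - c-) Phi - w^T X||^2 + e^T (c+ + c-)
        # subject to c+, c- >= 0 and the non-convex constraint
        # ||P w||^2 = 1 (P projects onto the subspace orthogonal to
        # the previously extracted separation vectors).
        N = X.shape[0]
        K = Phi.shape[0]

        def unpack(z):
            return z[:N], z[N:N + K], z[N + K:]

        def obj(z):
            w, cp, cm = unpack(z)
            r = (cp - cm) @ Phi - w @ X
            return 0.5 * r @ r + cp.sum() + cm.sum()

        cons = ({'type': 'eq',
                 'fun': lambda z: unpack(z)[0] @ P @ unpack(z)[0] - 1.0},)
        bounds = [(None, None)] * N + [(0, None)] * (2 * K)
        z0 = np.concatenate([np.ones(N), np.zeros(2 * K)])
        res = minimize(obj, z0, method='SLSQP', bounds=bounds,
                       constraints=cons)
        return unpack(res.x)[0]   # the separation vector w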
Fast Solution in Non-overcomplete Dictionaries

In important applications the sensor signals may have hundreds of
channels and hundreds of thousands of samples, which may make
separation computationally difficult. Here we present an approach
which compromises between statistical and computational efficiency.
In our experience this approach provides high-quality separation in
reasonable time. Suppose the dictionary is ``complete'', i.e., it forms
a basis in the space of discrete signals. This means that Phi
is square and non-singular. As examples of such a
dictionary one can think of the Fourier basis, Gabor basis, various
wavelet-related bases, etc. We can also obtain an ``optimal''
dictionary by learning from a given family of signals [15,14,18,17].
Let us denote the dual basis

    Psi = Phi^{-1},                                               (28)

and suppose that the coefficients of decomposition of the sources,
C = S Psi, are sparse and statistically independent. This assumption is
reasonable for properly chosen dictionaries, although of course we
lose the advantages of overcompleteness.
Let Y be the decomposition of the sensor signals:

    Y = X Psi.                                                    (29)

Multiplying both sides of (3) by Psi from the right and taking
into account (28) and (29), we obtain

    Y = A C + zeta,                                               (30)

where zeta = xi Psi is the decomposition of the noise. In this paper
we consider an ``easy'' case, in which zeta is white
noise; this requires orthogonality of Phi.
We can see that all the objective functions from Sections 3-5
remain valid if we remove Phi from them
(substituting instead the identity matrix) and replace the sensor
signals X by their decomposition Y. For example, the maximum a posteriori
objective (13) is transformed into

    min_{A,C}  1/(2 sigma^2) ||A C - Y||_F^2 + sum_{j,k} gamma_j h(C_jk),

and (18) is transformed analogously. The objective (22) becomes

    min_{W,C}  1/(2 sigma^2) ||C - W Y||_F^2 - K log |det W| + sum_{j,k} h(C_jk).   (35)

When the noise is zero, we can substitute C = WY and obtain
the Bell-Sejnowski Infomax objective.
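A minimal Python sketch of this fast method in the zero-noise form: substitute C = WY in (35) and minimize over W by BFGS (here scipy's implementation rather than the MATLAB routine used in the paper):

    import numpy as np
    from scipy.optimize import minimize

    def h(c, lam=0.01):
        # smooth approximation (7)-(8) to |c|
        return lam * (np.abs(c / lam) - np.log1p(np.abs(c / lam)))

    def separate(Y):
        # Zero-noise form of (35): substitute C = WY and minimize
        # -K log|det W| + sum_jk h((WY)_jk) over the unmixing matrix W.
        N, K = Y.shape

        def obj(wflat):
            W = wflat.reshape(N, N)
            _, logabsdet = np.linalg.slogdet(W)
            return -K * logabsdet + np.sum(h(W @ Y))

        res = minimize(obj, np.eye(N).ravel(), method='BFGS')
        return res.x.reshape(N, N)

    # Usage: for a square non-singular dictionary Phi (K = T),
    # Y = X @ np.linalg.inv(Phi) per (29); then W = separate(Y).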
Other known methods (for example, [13,15]), which
normally assume sparsity of the source signals, may be applied directly to
the decomposition Y of the sensor signals. This may be more
efficient than the traditional approach, and the reason is obvious:
typically, a properly chosen decomposition gives significantly higher
sparsity than the raw signals had originally. Also, statistical
independence of the coefficients is a more reasonable assumption than
statistical independence of the raw signal samples.
In our experiments we artificially mixed seven 5-second fragments of
musical recordings taken from commercial
digital audio CDs. Each of them included 40,000 samples after
down-sampling by a factor of 5 (Figure 5).
Figure 5: Separation of musical recordings taken from commercial
digital audio CDs (five-second fragments). Original
sources (left); mixtures (center); separated sources (right).
The easiest way to perform a sparse decomposition of such sources is to
compute a spectrogram, i.e., the coefficients of a time-windowed
discrete Fourier transform.
(We used the function SPECGRAM from the MATLAB signal processing toolbox
with a time window of 1024 samples.)
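A rough Python analogue of this step (the paper used MATLAB's SPECGRAM; here scipy's STFT stands in, and the stacking of real and imaginary parts follows the description given later in this section):

    import numpy as np
    from scipy.signal import stft

    def spectrogram_coeffs(X, nperseg=1024):
        # Short-time Fourier coefficients of each sensor channel, with
        # real and imaginary parts stored as separate real entries
        # (one row of coefficients per sensor).
        rows = []
        for x in X:
            _, _, Z = stft(x, nperseg=nperseg)
            rows.append(np.concatenate([Z.real.ravel(), Z.imag.ravel()]))
        return np.vstack(rows)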
The sparsity of the spectrogram coefficients (the histogram in
Figure 6, right) is much higher than the sparsity of the
original signals (Figure 6, left).
In this case Y in (29) is a real matrix, with separate entries
for the real and imaginary components of each spectrogram coefficient
of the sensor signals X. We used the objective function (35), with h
defined by the smooth approximation (7)-(8). Minimization was performed
by a BFGS quasi-Newton algorithm (a standard MATLAB function).
This algorithm separated the sources with a relative error of 0.67%
for the least well separated source (error computed according to
(19)). We also applied the Bell-Sejnowski Infomax algorithm to the
spectrogram coefficients Y of the sensor signals. The separation errors
were slightly larger: 0.9%.
For comparison we tested the JADE [6,5],
FastICA [12,11] and Bell-Sejnowski Infomax algorithms on the
same raw signals. The resulting relative errors (Figure 7) demonstrate the
significant (by a factor of more than 10) superiority of the sparse
decomposition approach.
Figure 6: Histograms of sound source values (left) and
spectrogram coefficients (right), shown with
linear y-scale (top), square-root y-scale (center), and
logarithmic y-scale (bottom).
Figure 7: Percent relative error of separation of seven musical sources
recovered by (1) JADE, (2) FastICA, (3) Bell-Sejnowski Infomax,
(4) Infomax applied to the spectrogram coefficients, and (5) BFGS minimization
of the objective (35) with the spectrogram coefficients.
We should mention an alternative to the maximum a posteriori approach.
Considering the mixing matrix A as a parameter, we can estimate it
by maximizing the probability of the observed signal X:

    max_A  P(X | A) = max_A  ∫ P(X | A, C) P(C) dC.

The integral over all possible coefficients C may be approximated,
for example, by Monte-Carlo sampling or by a matching Gaussian,
in the spirit of [15,14].
It would be interesting to
compare this possibility to the other methods presented in this paper.
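A crude Python sketch of the Monte-Carlo variant (purely illustrative; the paper only suggests this direction, and the Laplacian prior and all parameter values are our assumptions):

    import numpy as np

    def log_p_X_given_A(A, X, Phi, sigma=0.1, n_samples=1000, seed=0):
        # Crude Monte-Carlo estimate of P(X|A) = E_{C~P(C)} P(X|A,C),
        # drawing C from a Laplacian prior (h(c) = |c|).
        rng = np.random.default_rng(seed)
        M = A.shape[1]
        K = Phi.shape[0]
        logps = np.empty(n_samples)
        for n in range(n_samples):
            C = rng.laplace(size=(M, K))
            r = A @ C @ Phi - X
            logps[n] = -0.5 / sigma**2 * np.sum(r**2)
        m = logps.max()                      # log-mean-exp for stability
        return m + np.log(np.mean(np.exp(logps - m)))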
Another important direction is the problem of blind
separation-deconvolution of convolutive mixtures of signals.
In this case the matrices A and W have linear filters as elements,
and multiplication by an element means convolution. Even in this
matrix-of-filters context, most of the formulae in this paper remain valid.
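A minimal Python sketch of what the matrix-of-filters product could look like (our illustration, not a formulation from the paper):

    import numpy as np

    def mix_convolutive(A_filt, S):
        # A_filt[i][j] is a 1-D FIR filter; the "product" X = A * S
        # replaces each scalar multiplication with a convolution.
        N, M = len(A_filt), len(A_filt[0])
        T = S.shape[1]
        X = np.zeros((N, T))
        for i in range(N):
            for j in range(M):
                X[i] += np.convolve(S[j], A_filt[i][j], mode='same')[:T]
        return X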
In this paper we showed that the use of sparse decomposition
in a corresponding signal dictionary provides high-quality blind source
separation. The maximum a posteriori framework gives the most general approach,
including the situation of more sources than sensors.
Faster and more computationally
robust solutions are possible in the case of an equal number of sources and
sensors. We can also extract the sources sequentially
using quadratic programming with non-convex quadratic constraints.
Finally, the fastest solution may be obtained using non-overcomplete
dictionaries. Our experiments with artificial
signals and digitally mixed musical sounds demonstrate a high quality
separation, compared to other known techniques.
This research was partially supported by NSF CAREER award 97-02-311 to
BAP and by NSF CISE Research Infrastructure award CDA-9503064, and by
postdoctoral funding for MZ from the Albuquerque High Performance
Computational Neurobiology Laboratory, the Salk Institute.
Anthony J. Bell and Terrence J. Sejnowski.
An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, 7(6):1129-1159, 1995.
A. Ben-Tal and M. Zibulevsky.
Penalty/barrier multiplier methods for convex programming problems.
SIAM Journal on Optimization, 7(2):347-366, 1997.
J. Buckheit, S.S. Chen, D.L. Donoho, I. Johnstone, and J. Scargle.
About WaveLab.
Technical report, Department of Statistics, Stanford University.
J.-F. Cardoso.
JADE for real-valued data.

J.-F. Cardoso.
High-order contrasts for independent component analysis.
Neural Computation, 11(1):157-192, 1999.
S.S. Chen, D.L. Donoho, and M.A. Saunders.
Atomic decomposition by basis pursuit.
SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
S.S. Chen, D.L. Donoho, M.A. Saunders, I. Johnstone, and J. Scargle.
About Atomizer.
Technical report, Department of Statistics, Stanford University.
R.R. Coifman and M.V. Wickerhauser.
Entropy-based algorithms for best-basis selection.
IEEE Transactions on Information Theory, 38:713-718, 1992.
K. Holmström and M. Björkman.
The TOMLAB NLPLIB toolbox for nonlinear programming.
Advanced Modeling and Optimization, 1:70-86, 1999.
The FastICA MATLAB package.

A. Hyvärinen.
Fast and robust fixed-point algorithms for independent component analysis.
IEEE Transactions on Neural Networks, 10(3):626-634, 1999.
T.W. Lee, M.S. Lewicki, M. Girolami, and T.J. Sejnowski.
Blind source separation of more sources than mixtures using overcomplete representations.
IEEE Signal Processing Letters, 1998.
M.S. Lewicki and B.A. Olshausen.
A probabilistic framework for the adaptation and comparison of image codes.
Journal of the Optical Society of America A, 1999.
M.S. Lewicki and T.J. Sejnowski.
Learning overcomplete representations.
Neural Computation, 1998.
S. Mallat.
A Wavelet Tour of Signal Processing.
Academic Press, 1998.
B.A. Olshausen and D.J. Field.
Emergence of simple-cell receptive field properties by learning a
sparse code for natural images.
Nature, 381:607-609, 1996.
B.A. Olshausen and D.J. Field.
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37:3311-3325, 1997.
Footnotes:

1. The assumption of noise whiteness is for simplicity of exposition and can be easily removed.

2. If P(A) is some other known function, we should use (9) directly.