next up previous

Blind Source Separation
by Sparse Decomposition

Michael Zibulevsky
Barak A. Pearlmutter

Computer Science Dept, FEC 313
University of New Mexico
Albuquerque, NM 87131 USA

The blind source separation problem is to extract the underlying source signals from a set of their linear mixtures, where the mixing matrix is unknown. This situation is common, eg in acoustics, radio, and medical signal processing. We exploit the property of the sources to have a sparse representation in a corresponding (possibly overcomplete) signal dictionary. Such a dictionary may consist of wavelets, wavelet packets, etc., or be obtained by learning from a given family of signals. Starting from the maximum posteriori framework, which is applicable to the case of more sources than mixtures, we derive a few other categories of objective functions, which provide faster and more robust computations, when there are an equal number of sources and mixtures. Our experiments with artificial signals and with musical sounds demonstrate significantly better separation than other known techniques.


We consider the problem of blind separation of source signals from a set of their linear mixtures, including the case when the number of sources is larger than the number of mixtures. This work can be considered a natural generalization of the Bell-Sejnowski Infomax [2] and maximum posteriori [13,15] methods of blind source separation. We assume that the source signals have a sparse representation in a corresponding (possibly overcomplete) signal dictionary. In this way independence and sparsity are not required from the signals themselves, but rather from their decomposition coefficients, which is more natural in many practical cases. On the other hand our approach may be considered an extension of basis pursuit [7] to the case of signal mixtures.

This paper is organized as follows. Section 2 gives the problem formulation and assumptions. In Section 3 we present the maximum posteriori approach, which is applicable to the case of more sources than mixtures. In Section 4 we derive another objective function, which provides more robust computations when there are an equal number of sources and mixtures. Section 5 presents sequential source extraction using quadratic programming with non-convex quadratic constraints. Finally, in Section 6 we derive a faster method for non-overcomplete dictionaries and demonstrate high-quality separation of synthetically mixed musical sounds.

Problem Formulation and Assumptions

Let x(t) be an N-dimensional vector of sensor signals which is an instantaneous linear mixture of M unknown ``independent'' sources s(t), possibly corrupted by additive noise $\xi(t)$:

x(t) = A s(t) + \xi(t)
\end{displaymath} (1)

We will estimate the unknown mixing matrix of real numbers A (up to row permutation and scaling) and the source signal s(t) (up to component permutation and scaling.)

We take advantage of the fact that many natural sources of signal have sparse representation in the proper signal dictionary $\varphi $:

s_i(t) = \sum_{k=1}^K C_{ik}\varphi _k(t)
\end{displaymath} (2)

The functions $\varphi _k(t)$ are called atoms or elements of the dictionary. These elements do not have to be linearly independent, and instead may form an overcomplete dictionary. Important examples are wavelet-related dictionaries (wavelet packets, etc.) [7,16]. Sparsity means that only a small number of the coefficients Cik differ significantly from zero.

In the discrete time case $t = 1, 2, \ldots, T$ we will use matrix notation. For example, X will be a N x T matrix, with the signal xi(t) in row i. S will be an M x T matrix with underlying source sj(t) in row j, and $\Phi$ a K x T matrix with atom $\varphi _k(t)$ in row k, so that

X = $\displaystyle AS+\xi$ (3)
S = $\displaystyle C \Phi$ (4)

We suppose that the coefficients Cik are statistically independent random variables with a probability density function (pdf) of an exponential type
p_i(C_{ik}) \propto e^{-\beta_i h(C_{ik})}
\end{displaymath} (5)

This kind of distribution is widely used for modeling sparsity [15,18]. A reasonable choice of h(c) may be
h(c) = \vert c \vert^{1/\gamma} \hspace{5em} \gamma \geq 1
\end{displaymath} (6)

or its smooth approximations. In our computations we use a family of convex smooth approximations to the absolute value
h1(c) = $\displaystyle \vert c\vert-\log(1+\vert c\vert)$ (7)
$\displaystyle h_\lambda (c)$ = $\displaystyle \lambda h_1(c/\lambda )$ (8)

where $\lambda $ is a proximity parameter: $h_\lambda (c) \rightarrow \vert c\vert$ when $\lambda \rightarrow 0$.

We also suppose a priory that the matrix A is uniformly distributed over the range of interest, and that the noise $\xi(t)$ is a spatially and temporally uncorrelated Gaussian process1with zero mean and variance $\sigma^2$.

Maximum Posteriori Approach

We wish to maximize the posterior distribution $P( A,C \mid X)$

\max_{A,C} P( A,C \mid X) \propto \max_{A,C} P(X \mid A,C) P(A) P(C)
\end{displaymath} (9)

where $P(X \mid A,C)$ is the conditional probability of observing X given A and C. Taking into account (3), (4), and the white Gaussian noise, we get
P(X \mid A,C) \propto \prod_{i,t} \exp -\frac{(X_{it}-(AC\Phi)_{it})^2}{2\sigma^2}
\end{displaymath} (10)

By the statistical independence of the coefficients Cjk and (5), the prior pdf of C is
P(C) \propto \prod_{j,k} \exp ( -\beta_j h(C_{jk}) )
\end{displaymath} (11)

If the prior pdf P(A) is uniform, it can be dropped2 from (9). In this way we are left with the problem
\max_{A,C} P(X \mid A,C) P(C).
\end{displaymath} (12)

By substituting (10) and (11) into (12), taking the logarithm, and inverting the sign, we obtain the following optimization problem
\min_{A,C} \frac{1}{2\sigma^2} \Vert A C \Phi - X \Vert _F^2
+ \sum_{j,k}\beta_jh(C_{jk})
\end{displaymath} (13)

where $\Vert A \Vert _F = \sqrt{\sum_{i,j} A_{ij}^2}$ is the Frobenius matrix norm.

One can consider this objective as a generalization of [18,17] by incorporating the matrix $\Phi$, or as a generalization of [7] by including the matrix A. One problem with such a formulation is that it can lead to the degenerate solution C=0 and $A=\infty$. We can overcome this difficulty in various ways. The first approach is to force each row Ai of the mixing matrix A to be bounded in norm,

\Vert A_i \Vert \leq 1 \hspace{5em} i = 1, \ldots, N.
\end{displaymath} (14)

The second way is to restrict the norm of the rows Cj from below
\Vert C_j \Vert \geq 1 \hspace{5em} j = 1, \ldots, M.
\end{displaymath} (15)

A third way is to reestimate the parameters $\beta_j$ based on the current values of Cj. For example, this can be done using sampling variance as follows: for a given function $h(\cdot)$ in the distribution (5), express the variance of Cjk as a function $f_h(\beta)$. An estimate of $\beta$ can be obtained by applying the corresponding inverse function to the sampling variance,
\hat{\beta_j} = f_h^{-1}({K}^{-1}\sum_k{C_{jk}^2})
\end{displaymath} (16)

In particular, when $h(c)=\vert c \vert$, $\mbox{var}(c) = 2\beta^{-2}$ and
\hat{\beta_j} = \frac{2}{\sqrt{{K}^{-1}\sum_k{C_{jk}^2}}}
\end{displaymath} (17)

Substituting $h(\cdot)$ and $\hat{\beta}$ into (13), we obtain
\min_{A,C} \ \frac{1}{2\sigma^2} \Vert AC\Phi-X \Vert _F^2
...rac{2\sum_k\vert C_{jk}\vert}{\sqrt{{K}^{-1}\sum_k{C_{jk}^2}}}
\end{displaymath} (18)

A remarkable property of this objective function is its invariance to rescaling of the rows of C when a corresponding inverse rescaling is applied to the columns of A.

Computational experiments with more sources than mixtures

In our experiments we used the standard wavelet packet dictionary with the basic wavelet symmlet-8. When the signal length is 64 samples, this dictionary consists of 448 atoms i.e. it is overcomplete by a factor of seven. Examples of atoms and their images in the time-frequency phase plane [9,16] are shown in Figure 1. We used the ATOMIZER [8] and WAVELAB [4] MATLAB packages for fast multiplication by $\Phi$ and $\Phi^T$.

Figure: Examples of atoms: time-frequency phase plane (left) and time plot (right.)
\includegraphics[width=0.48\textwidth]{singatoms_plane} \includegraphics[width=0.5\textwidth]{singatoms_plot}

We created three sparse sources (Figure 2, top), each composed of two or three atoms.

Figure: Sources (top three), mixtures (center two), reconstructed sources (bottom three), in both time-frequency phase plane (left) and time domain (right).
\includegraphics[width=0.49\textwidth,height=60ex]{sr_sn_es_plane} \includegraphics[width=0.49\textwidth,height=62.5ex]{sr_sn_es_plot}

The first two sources have significant cross-correlation, equal to 0.34, which makes separation difficult for conventional methods. Two synthetic sensor signals (Figure 2, center) were obtained as a linear mixture of the sources. In order to measure the accuracy of the separation, we normalized the original sources with $\Vert S_j \Vert _2=1$, and the estimated sources with $\Vert
\widetilde {S}_j \Vert _2=1$. The error was then computed as
\mbox{Error} =
\frac{\Vert \widetilde {S}_j - S_j \Vert _2}{\Vert S_j \Vert _2} \cdot 100\%
\end{displaymath} (19)

We tested two methods with this data. The first method used the objective function (13) and the constraints (15), while the second method used the objective function (18). As a tool for constrained optimization we used the PBM method [3]. Unconstrained optimization was produced by the method of conjugate gradients using the TOMLAB package [10]. The same tool was used for internal unconstrained optimization in PBM.

In all the cases we used $h_\lambda (\cdot)$ defined by (7) and (8), with the parameter $\lambda = 0.01$. Another parameter $\sigma^2 = 0.0001$. The resulting errors of the source estimates were 0.09% and 0.02% by the first and the second method respectively. The estimated sources are shown in the bottom three traces of Figure 2. They are visually indistinguishable from the original sources, shown in top three traces of Figure 2.

Equal number of sources and sensors: more robust formulations

The main difficulty in a maximization problem like (13) is the bilinear term $AC\Phi$, which destroys the convexity of the objective function and makes convergence unstable when optimization starts far from the solution. In this section we consider more robust formulations for the case when the number of sensors is equal to the number of sources, N=M, and the mixing matrix is invertible W=A-1.

In the case when the noise is small and the matrix A is far from singular, WX gives a reasonable estimate of the source signals S. Taking into account (4), we obtain a least square term $\Vert C\Phi-WX \Vert _F^2$, so the separation objective may be written

\min_{W,C}\ \frac{1}{2}\Vert C\Phi-WX \Vert _F^2 + \mu\sum_{j,k}\beta_jh(C_{jk})
\end{displaymath} (20)

We also need to add a constraint, which enforces the non-singularity of W. For example, we can restrict from below its minimal singular value rmin(W):
r_{min}(W) \geq 1
\end{displaymath} (21)

It can be shown, that in the noiseless case, $\sigma \approx 0$, the problem (20)-(21) is equivalent to the maximum posteriori formulation (13) with the constraint \(
\Vert A \Vert _2 \leq 1.
\) Another possibility for ensuring the non-singularity of W is to subtract $K\log\vert\det W\vert$ from the objective
\min_{W,C}\ -K\log\vert\det W\vert + \frac{1}{2}\Vert C\Phi-WX \Vert _F^2 +
\end{displaymath} (22)

When the noise is zero and $\Phi$ is the identity matrix, we can substitute C=WX and obtain the Bell-Sejnowski Infomax objective [2]
\min_{W}\ -K\log\vert\det W\vert + \sum_{j,k}\beta_jh((WX)_{jk})
\end{displaymath} (23)

Computational experiments with equal number of sources and sensors

We created two sparse sources (Figure 3, top) with strong cross-correlation of 0.52.

Figure: Sources (top two), mixtures (center two), reconstructed sources (bottom two), in both time-frequency phase plane (left) and time domain (right).
\includegraphics[width=0.49\textwidth,height=60ex]{sr_sn_es_plane_logdet} \includegraphics[width=0.49\textwidth,height=62ex]{sr_sn_es_plot_logdet}

Separation, produced by minimization of the objective function (22), gave an error of 0.23%. For comparison we tested the JADE [6,5], FastICA [12,11] and Bell-Sejnowski Infomax [2,1] algorithms on the same signals. The Resulting relative errors (Figure 4) confirm the significant superiority of the sparse decomposition approach.

Figure: Percent relative error of separation of the artificial sparse sources recovered by (1) JADE, (2) Fast ICA, (3) Bell-Sejnowski Infomax, (4) Equation 22.

Sequential Extraction of Sources via Quadratic Programming

Let us ask what is the most ``sparse'' signal one can obtain by a linear combination of the sensor signals s = wTX. By sparsity, as before, we mean the ability of the signal to be approximated by a linear combination of a small number of dictionary elements $\phi_k$

s\approx c^T\Phi\ , \hspace{2cm} \mbox{$c$ sparse}

This will lead us to the following objective:
\min_{w,c} \frac{1}{2}\Vert c^T\Phi- w^TX \Vert _2^2 + \mu\sum_{k} h(c_k),
\end{displaymath} (24)

where the term $\sum_{k} h(c_k)$ may be considered as a penalty on non-sparsity. In order to avoid the trivial solution of w=0 and c=0, we need to add a constraint that separates w from zero. It could be, for example,
\Vert w \Vert _2^2 \geq 1 \ ,
\end{displaymath} (25)

A similar constraint can be used as a tool to extract all the sources sequentially: the new separation vector wj should have a component of unit norm in the subspace orthogonal to the previously extracted vectors $w^1, \ldots, w^{j-1}$
\Vert (I-P^{j-1})w^j \Vert _2^2 \geq 1 \ ,
\end{displaymath} (26)

where Pj-1 is an orthogonal projector onto $\mbox{Span}\{w^1, \ldots, w^{j-1}\}$.

When $h(c_k)=\vert c_k\vert$, we can use the standard substitution

c &=& c^+ - c^- \ , \ \ \ c^+ \geq 0\ , \ c^- \geq 0\\
...left(\begin{array}{ccc} \Phi\\
\end{array} \right) \ ,

that transforms (24) and (26) into the following quadratic program

\min_{w,\hat{c}} &&\frac{1}{2}\Vert \hat{c}^T\hat{\Phi}- w^TX ...
...o:} && \Vert w \Vert _2^2 \geq 1 \ , \\ \
&& \hat{c}\geq 0 \ ,

where e is a vector of ones.

Fast Solution in Non-overcomplete Dictionaries

In important applications, the sensor signals may have hundreds of channels and hundreds of thousands of samples. This may make separation computationally difficult. Here we present an approach which compromises between statistical and computational efficiency. In our experience this approach provides high quality of separation in reasonable time. Suppose the dictionary is ``complete'', i.e. the it forms a basis in the space of discrete signals. This means that the matrix $\Phi$ is square and non-singular. As examples of such a dictionary one can think of the Fourier basis, Gabor basis, various wavelet-related bases, etc. We can also obtain an ``optimal'' dictionary by learning from given family of signals [15,14,18,17].

Let us denote the dual basis

\end{displaymath} (27)

and suppose that coefficients of decomposition of the sources
\end{displaymath} (28)

are sparse and statistically independent. This assumption is reasonable for properly chosen dictionaries, although of course we would lose the advantages of overcompleteness [15].

Let Y be the decomposition of the sensor signals

\end{displaymath} (29)

Multiplying both sides of (3) by $\Psi$ from the right and taking into account (28) and (29), we obtain
Y= AC+ \zeta \ ,
\end{displaymath} (30)

where $\zeta$ is decomposition of the noise
\zeta = \xi \Psi \ .
\end{displaymath} (31)

In this paper we consider an ``easy'' situation, when $\zeta$ is a white noise, that requires orthogonality of $\Psi$. We can see that all the objective functions from the sections 3-5 remain valid if we remove from them $\Phi$ (substituting instead the identity matrix) and replace the sensor signals X by their decomposition Y. For example, maximum posteriori objectives (13) and (18) are transformed into
\min_{A,C}\ \frac{1}{2\sigma^2} \Vert AC-Y \Vert _F^2 +
\end{displaymath} (32)

\min_{A,C}\ \frac{1}{2\sigma^2} \Vert AC-Y \Vert _F^2 +
...rac{2\sum_k\vert C_{jk}\vert}{\sqrt{{K}^{-1}\sum_k{C_{jk}^2}}}
\end{displaymath} (33)

The objective (22) becomes
\min_{W,C}\ -K\log\vert\det W\vert + \frac{1}{2}\Vert C - WY \Vert _F^2 +
\end{displaymath} (34)

In this case when the noise is zero, we can substitute C=WY and obtain the Bell-Sejnowski Infomax objective [2]
\min_{W}\ -K\log\vert\det W\vert + \sum_{j,k}\beta_jh((WY)_{jk})
\end{displaymath} (35)

Also other known methods (for example, [13,15]), which normally assume sparsity of source signals, may be directly applied to the decomposition Y of the sensor signals. This may be more efficient than the traditional approach, and the reason is obvious: typically, a properly chosen decomposition gives significantly higher sparsity than the raw signals had originally. Also, statistical independence of the coefficients is a more reasonable assumption than statistical independence of the raw signal samples.

Computational experiments with musical sound sources

In our experiments we artificially mixed seven 5-second fragments of musical sound recordings taken from commercial digital audio CDs. Each of them included 40k samples after down-sampling by a factor of 5. (Figure 5).

Figure: Separation of musical recordings taken from commercial digital audio CDs (five second fragments.) Original sources (left); mixtures (center); separated sources (right).
\includegraphics[width=0.32\textwidth,height=40ex]{tf_7src} \includegraphics[width=0.32\textwidth,height=40ex]{tf_7sns} \includegraphics[width=0.32\textwidth,height=40ex]{tf_7est_src}

The easiest way to perform sparse decomposition of such sources is to compute a spectrogram, the coefficients of a time-windowed discrete Fourier transform. (We used the function SPECGRAM from the MATLAB signal processing toolbox with a time window of 1024 samples.) The sparsity of the spectrogram coefficients (the histogram in Figure 6, right) is much higher then the sparsity of the original signal (Figure 6, left)

In this case Y (29) is a real matrix, with separate entries for the real and imaginary components of each spectrogram coefficient of the sensor signals X. We used the objective function (35) with $\beta_j=1$ and $h_\lambda (\cdot)$ defined by (7),(8) with the parameter $\lambda = 10^{-4}$. Unconstrained minimization was performed by a BFGS Quasi-Newton algorithm (MATLAB function FMINU.)

This algorithm separated the sources with a relative error of 0.67% for the least well separated source (error computed according to (19).) We also applied the Bell-Sejnowski Infomax algorithm [2] implemented in [1] to the spectrogram coefficients Y of the sensor signals. Separation errors were slightly larger: 0.9%.

For comparison we tested JADE [6,5], FastIca [12,11] and Bell-Sejnowski Infomax algorithms on the same signals. Resulting relative errors (Figure 7) confirm the significant (by a factor of more than 10) superiority of the sparse decomposition approach.

Figure: Histogram of sound source values (left) and spectrogram coefficients (right), shown with linear y-scale (top), square root y-scale (center) and logarithmic y-scale (bottom).
Figure 7: Percent relative error of separation of seven musical sources recovered by (1) JADE, (2) Fast ICA, (3) Bell-Sejnowski Infomax, (4) Infomax, applied to the spectrogram coefficients, (5) BFGS minimization of the objective (35) with the spectrogram coefficients.
\includegraphics[width=0.49\textwidth]{hist_src} \includegraphics[width=0.49\textwidth]{hist_src_specgram}

Future research

We should mention an alternative to the maximum posteriori approach (12). Considering the mixing matrix A as a parameter, we can estimate it by maximizing the probability of the observed signal X

\begin{displaymath}\max_A \left[ P(X\mid A) =\int P(X\mid A,C)P(C)dC \right] \end{displaymath}

The integral over all possible coefficients C may be approximated, for example, by Monte-Carlo sampling or by a matching Gaussian, in the spirit of [15,14]. It would be interesting to compare this possibility to the other methods presented in this paper.

Another important direction give us the problem of blind separation-deconvolution of convolutive mixtures of signals (see for example [2].) In this case the matrices A and W will have linear filters as an elements, and multiplication by the element will mean convolution. Even in this matrix-of-filters context most of the formulae in this paper remain valid.


In this paper we showed that the use of sparse decomposition in a corresponding signal dictionary provides high-quality blind source separation. The maximum posteriori framework gives the most general approach, including the situation of more sources than sensors. Faster and computationally robust solutions are possible in the case of an equal number of sources and sensors. We can also extract the sources sequentially using quadratic programming with non-convex quadratic constraints. Finally, the fastest solution may be obtained using non-overcomplete dictionaries. Our experiments with artificial signals and digitally mixed musical sounds demonstrate a high quality of source separation, compared to other known techniques.


This research was partially supported by NSF CAREER award 97-02-311 to BAP and by NSF CISE Research Infrastructure award CDA-9503064, and by postdoctoral funding for MZ from the Albuquerque High Performance Computing Center.


ICA/EEG toolbox.
Computational Neurobiology Laboratory, the Salk Institute.

Anthony J. Bell and Terrence J. Sejnowski.
An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, 7(6):1129-1159, 1995.

A. Ben-Tal and M. Zibulevsky.
Penalty/barrier multiplier methods for convex programming problems.
SIAM Journal on Optimization, 7(2):347-366, 1997.

J. Buckheit, S.S. Chen, D.L. Donoho, I. Johnstone, and J. Scargle.
About wavelab.
Technical report, Department of Statistics, Stanford University, 1995.

Jean-François Cardoso.
JADE for real-valued data.
[0]~cardoso/guidesepso u.html.

Jean-François Cardoso.
High-order contrasts for independent component analysis.
Neural Computation, 11(1):157-192, 1999.

S. S. Chen, D. L. Donoho, and M. A. Saunders.
Atomic decomposition by basis pursuit.

S. S. Chen, D. L. Donoho, M. A. Saunders, I. Johnstone, and J. Scargle.
About atomizer.
Technical report, Department of Statistics, Stanford University, 1995.

R.R. Coifman and M.V. Wickerhauser.
Entropy-based algorithms for best-basis selection.
IEEE Transactions on Information Theory, 38:713-718, 1992.

K. Holmstrom and M. Bjorkman.
Advanced Modeling and Optimization, 1:70-86, 1999.

A. Hyvärinen.
The Fast-ICA MATLAB package.

A. Hyvärinen.
Fast and robust fixed-point algorithms for independent component analysis.
IEEE Transactions on Neural Networks, 10(3):626-634, 1999.

T.W. Lee, M.S. Lewicki, M. Girolami, and T.J. Sejnowski.
Blind source separation of more sources than mixtures using overcomplete representations.
IEEE Sig. Proc. Lett., 1998.
to appear.

M.S. Lewicki and B.A. Olshausen.
A probabilistic framework for the adaptation and comparison of image codes.
Journal of the Optical Society of America, 1999.
in press.

M.S. Lewicki and T.J. Sejnowski.
Learning overcomplete representations.
Neural Computation, 1998.
to appear.

S. Mallat.
A Wavelet Tour of Signal Processing.
Academic Press, 1998.

B.A. Olshausen and D.J. Field.
Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature, 381:607-609, 1996.

B.A. Olshausen and D.J. Field.
Sparse coding with an overcomplete basis set: A strategy employed by v1?
Vision Research, 37:3311-3325, 1997.

About this document ...

This document was generated using the LaTeX2HTML translator Version 98.2 beta6 (August 14th, 1998)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 spica7.tex

The translation was initiated by Michael Zibulevsky on 1999-08-26


... process1
The assumption of the noise whiteness is for simplicity of exposition and can be easily removed.
... dropped2
Otherwise, if P(A) is some other known function, we should use (9) directly.

next up previous
Michael Zibulevsky