Theory of Gaussian Mixture models¶

This section details the equations used in each algorithm. We use the same notation as Bishop’s Pattern Recognition and Machine Learning.

Features:

$\{x_1,x_2,...,x_N\}$ is the set of points

Parameters:

$\mu_k$ is the center of the $k^{th}$ cluster
$\pi_k$ is the weight of the $k^{th}$ cluster
$\Sigma_k$ is the covariance matrix of the $k^{th}$ cluster
$K$ is the number of clusters
$N$ is the number of points
$d$ is the dimension of the problem

Other notations specific to the methods will be introduced later.

K-means¶

An iteration of K-means includes:

The E step: a label is assigned to each point (hard assignment) according to the current means.
The M step: the means are recomputed from the labels assigned in the E step.
The computation of the convergence criterion: the algorithm uses the distortion as described below.

E step¶

The algorithm produces a matrix of responsibilities according to the following equation:

$r_{nk} = \left\{ \begin{split} & 1 \text{ if } k = \arg\min_{1 \leq j \leq K}\lVert x_n-\mu_j\rVert^2 \\ & 0 \text{ otherwise} \end{split} \right.$

The entry in the $n^{th}$ row and $k^{th}$ column is 1 if the $n^{th}$ point belongs to the $k^{th}$ cluster and 0 otherwise.
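The hard assignment above can be sketched with NumPy as follows. This is a minimal illustration, not the implementation behind this documentation: the function name `kmeans_e_step` and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_e_step(points, means):
    """Return the (N, K) one-hot responsibility matrix r_nk (hypothetical helper)."""
    # Squared Euclidean distance between every point and every mean: shape (N, K)
    diff = points[:, np.newaxis, :] - means[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Index of the closest mean for each point
    labels = np.argmin(sq_dist, axis=1)
    # One-hot encode the labels: a single 1 per row
    resp = np.zeros((points.shape[0], means.shape[0]))
    resp[np.arange(points.shape[0]), labels] = 1.0
    return resp

# Toy data (assumption): two well-separated groups in 2 dimensions
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.8, 5.1]])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
resp = kmeans_e_step(points, means)
```

Each row of `resp` contains exactly one 1, in the column of the nearest mean.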

M step¶

The mean of a cluster is simply the mean of all the points belonging to it:

$\mu_{k} = \frac{\sum^N_{n=1}r_{nk}x_n}{\sum^N_{n=1}r_{nk}}$

The weight of cluster $k$, which is the number of points belonging to it, can be expressed as:

$N_{k} = \sum^N_{n=1}r_{nk}$

The mixing coefficients, which represent the proportion of points in a cluster, can be expressed as:

$\pi_k = \frac{N_k}{N}$
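The three M-step quantities can be computed together from the responsibility matrix. The sketch below assumes NumPy and a hypothetical function name `kmeans_m_step`; it also assumes that every cluster owns at least one point, since the mean is undefined when $N_k = 0$.

```python
import numpy as np

def kmeans_m_step(points, resp):
    """Return means (K, d), counts N_k (K,) and mixing coefficients pi_k (K,)."""
    # N_k: number of points assigned to each cluster
    counts = resp.sum(axis=0)
    # mu_k: sum of the points of each cluster divided by N_k
    # (assumption: no empty cluster, otherwise this divides by zero)
    means = (resp.T @ points) / counts[:, np.newaxis]
    # pi_k: proportion of points in each cluster
    weights = counts / points.shape[0]
    return means, counts, weights

# Toy data (assumption): four points with a fixed hard assignment
points = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
means, counts, weights = kmeans_m_step(points, resp)
```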

Convergence criterion¶

The convergence criterion is the distortion, defined as the sum of the squared norms of the differences between each point and the mean of the cluster it belongs to:

$D = \sum^N_{n=1}\sum^K_{k=1}r_{nk}\lVert x_n-\mu_k \rVert^2$

The distortion can only decrease or stay the same from one iteration to the next. The algorithm stops when the difference between the value of the convergence criterion at the previous iteration and at the current iteration is less than or equal to a threshold $tol$:

$D_{previous} - D_{current} \leq tol$
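As an illustration, the distortion and the stopping test can be written as two small functions. The names `distortion` and `has_converged` and the numerical values are assumptions chosen for the example.

```python
import numpy as np

def distortion(points, means, resp):
    """D = sum_n sum_k r_nk * ||x_n - mu_k||^2 (hypothetical helper)."""
    diff = points[:, np.newaxis, :] - means[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)        # shape (N, K)
    # r_nk selects, for each point, the distance to its own cluster mean
    return float(np.sum(resp * sq_dist))

def has_converged(d_previous, d_current, tol):
    """Stopping rule: D_previous - D_current <= tol."""
    return d_previous - d_current <= tol

# Toy data (assumption): every point lies at distance 1 from its cluster mean
points = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]])
means = np.array([[1.0, 0.0], [5.0, 6.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
d_current = distortion(points, means, resp)
```

With a previous value of 4.5 and `tol = 1e-3`, `has_converged(4.5, d_current, 1e-3)` is false, so another iteration would be run.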

Gaussian Mixture Model (GMM)¶

An iteration of GMM includes:

The E step: $K$ probabilities, one per cluster, are assigned to each point (soft assignment).
The M step: the weights, means and covariances are recomputed from the responsibilities obtained in the E step.
The computation of the convergence criterion: the algorithm uses the log-likelihood as described below.

E step¶

The algorithm produces a matrix of responsibilities according to the following equation:

$r_{nk} = \frac{\pi_k\mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum^K_{j=1}\pi_j\mathcal{N}(x_n|\mu_j,\Sigma_j)}$

The entry in the $n^{th}$ row and $k^{th}$ column is the probability that point $n$ belongs to cluster $k$.
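A possible NumPy sketch of this E step is given below. It works with logarithms and a log-sum-exp normalization to avoid underflow, which is a design choice of this example and not something stated in the equations above. The names `log_gaussian` and `gmm_e_step` are hypothetical.

```python
import numpy as np

def log_gaussian(points, mean, cov):
    """ln N(x_n | mean, cov) for each row x_n of points."""
    dim = points.shape[1]
    diff = points - mean
    # Solve cov @ y = diff^T rather than forming the inverse explicitly
    solved = np.linalg.solve(cov, diff.T).T
    maha = np.sum(diff * solved, axis=1)       # Mahalanobis terms
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha)

def gmm_e_step(points, weights, means, covs):
    """Return the (N, K) matrix of responsibilities r_nk."""
    log_weighted = np.stack(
        [np.log(weights[k]) + log_gaussian(points, means[k], covs[k])
         for k in range(weights.shape[0])],
        axis=1,
    )
    # Log of the denominator, computed with the log-sum-exp trick
    max_log = log_weighted.max(axis=1, keepdims=True)
    log_norm = max_log + np.log(
        np.sum(np.exp(log_weighted - max_log), axis=1, keepdims=True))
    return np.exp(log_weighted - log_norm)

# Toy model (assumption): two unit-covariance components with equal weights
points = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
covs = np.array([np.eye(2), np.eye(2)])
resp = gmm_e_step(points, weights, means, covs)
```

The third point is equidistant from both means, so its responsibilities are split evenly between the two clusters.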

M step¶

The weight of cluster $k$, which is the effective number of points assigned to it, can be expressed as:

$N_{k} = \sum^N_{n=1}r_{nk}$

The mixing coefficients, which represent the proportion of points in a cluster, can be expressed as:

$\pi_k = \frac{N_k}{N}$

As in the K-means algorithm, the mean of a cluster is the mean of the points, here weighted by their responsibilities:

$\mu_{k} = \frac{\sum^N_{n=1}r_{nk}x_n}{N_k}$

The covariance of the cluster k can be expressed as:

$\Sigma_k = \frac{1}{N_k}\sum^N_{n=1}r_{nk}(x_n-\mu_k)(x_n-\mu_k)^T$
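The four update equations above translate almost line by line into NumPy. The function name `gmm_m_step` and the toy responsibilities are assumptions made for this sketch, which also assumes that every $N_k$ is strictly positive.

```python
import numpy as np

def gmm_m_step(points, resp):
    """Return pi_k (K,), mu_k (K, d) and Sigma_k (K, d, d) from r_nk."""
    n_points, dim = points.shape
    n_components = resp.shape[1]
    counts = resp.sum(axis=0)                              # N_k
    weights = counts / n_points                            # pi_k
    means = (resp.T @ points) / counts[:, np.newaxis]      # mu_k
    covs = np.empty((n_components, dim, dim))
    for k in range(n_components):
        diff = points - means[k]
        # Responsibility-weighted sum of outer products, divided by N_k
        covs[k] = (resp[:, k, np.newaxis] * diff).T @ diff / counts[k]
    return weights, means, covs

# Toy data (assumption): four points with soft responsibilities
points = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]])
resp = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
weights, means, covs = gmm_m_step(points, resp)
```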

These results are obtained by differentiating the log-likelihood described in the following section.

Convergence criterion¶

The convergence criterion used in the Gaussian Mixture Model algorithm is the log-likelihood of the data, which the algorithm maximizes:

$\sum^N_{n=1}\ln{\sum^K_{k=1}\pi_k\mathcal{N}(x_n|\mu_k,\Sigma_k)}$

Setting its derivatives to 0 gives the empirical terms described in the M step.
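The log-likelihood can be evaluated as sketched below, assuming NumPy. The function name `gmm_log_likelihood` is hypothetical, and the inner sum over $k$ is computed with a log-sum-exp for numerical stability, which is a choice of this example.

```python
import numpy as np

def gmm_log_likelihood(points, weights, means, covs):
    """sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    n_points, dim = points.shape
    log_weighted = np.empty((n_points, weights.shape[0]))
    for k in range(weights.shape[0]):
        diff = points - means[k]
        maha = np.sum(diff * np.linalg.solve(covs[k], diff.T).T, axis=1)
        _, logdet = np.linalg.slogdet(covs[k])
        # ln(pi_k) + ln N(x_n | mu_k, Sigma_k)
        log_weighted[:, k] = np.log(weights[k]) - 0.5 * (
            dim * np.log(2.0 * np.pi) + logdet + maha)
    # Log-sum-exp over the components, then sum over the points
    max_log = log_weighted.max(axis=1, keepdims=True)
    per_point = max_log[:, 0] + np.log(
        np.sum(np.exp(log_weighted - max_log), axis=1))
    return float(per_point.sum())

# Sanity check (assumption): two identical standard 1-d components and a
# single point at the mean, so the mixture reduces to one standard Gaussian
points = np.array([[0.0]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0], [0.0]])
covs = np.array([[[1.0]], [[1.0]]])
value = gmm_log_likelihood(points, weights, means, covs)
```

In this degenerate case the result equals $-\frac{1}{2}\ln{2\pi}$, the log-density of a standard Gaussian at its mean.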

Variational Gaussian Mixture Model (VBGMM)¶

In this model, we introduce three new hyperparameters and two distributions which govern the three essential parameters of the model: the mixing coefficients, the means and the covariances.

The mixing coefficients are generated with a Dirichlet Distribution:

$q(\pi) = \text{Dir}(\pi|\alpha) = \text{C}(\alpha)\prod^K_{k=1}\pi_k^{\alpha_k-1}$

The computation of $\alpha_k$ is described in the M step.

Then we introduce an independent Gaussian-Wishart distribution governing the mean and the precision $\Lambda_k = \Sigma_k^{-1}$ of each Gaussian component:

$\begin{split} q(\mu_k,\Lambda_k) & = q(\mu_k|\Lambda_k)q(\Lambda_k) \\ & = \mathcal{N}(\mu_k|m_k,(\beta_k\Lambda_k)^{-1})\mathcal{W}(\Lambda_k|W_k,\nu_k) \end{split}$

The computation of the terms involved in this equation is described in the M step.

The aim of VBGMM is to reduce the variance of the model by introducing a bias in the form of prior distributions on its parameters.

E step¶

It is not possible to compute $r_{nk}$ directly. Another quantity, $\rho_{nk}$, is calculated instead, and $r_{nk}$ is obtained after normalization.

$\ln{\rho_{nk}} = \mathbb{E}[\ln{\pi_k}] + \frac{1}{2}\mathbb{E}[\ln{\det{\Lambda_k}}] - \frac{d}{2}\ln{2\pi} - \frac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}[(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)]$

Evaluating these expectations gives, with $\psi$ the digamma function:

$\begin{split} \ln{\rho_{nk}} & = \psi(\alpha_k) - \psi\left(\sum^K_{i=1}\alpha_i\right) + \frac{1}{2}\sum^d_{i=1}\psi\left(\frac{\nu_k+1-i}{2}\right) + \frac{1}{2}\ln{\det{W_k}} \\ & \quad - \frac{d}{2}\ln{\pi} - \frac{d}{2\beta_k} - \frac{\nu_k}{2}(x_n-m_k)^{T} W_k (x_n-m_k) \end{split}$
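The normalization step, $r_{nk} = \rho_{nk} / \sum^K_{j=1}\rho_{nj}$, is usually carried out on the logarithms, because $\rho_{nk}$ itself can underflow. The sketch below assumes NumPy and an already computed array of $\ln{\rho_{nk}}$ values. The function name `normalize_log_rho` and the input values are assumptions made for the example.

```python
import numpy as np

def normalize_log_rho(log_rho):
    """Turn an (N, K) array of ln(rho_nk) into responsibilities r_nk."""
    # Subtract the row maximum before exponentiating (log-sum-exp trick)
    max_log = log_rho.max(axis=1, keepdims=True)
    log_norm = max_log + np.log(
        np.sum(np.exp(log_rho - max_log), axis=1, keepdims=True))
    return np.exp(log_rho - log_norm)

# Illustrative values (assumption): the first row would underflow to 0/0
# if rho_nk were exponentiated before normalizing
log_rho = np.array([[-1000.0, -1001.0],
                    [np.log(0.2), np.log(0.6)]])
resp = normalize_log_rho(log_rho)
```

Each row of `resp` sums to 1, and the second row keeps the 1:3 ratio of its unnormalized values.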

M step¶

Convergence criterion¶

Dirichlet Process Gaussian Mixture Model (DPGMM)¶

E step¶

M step¶

Convergence criterion¶

Pitman-Yor Process Gaussian Mixture Model (PYPGMM)¶