API Reference¶

Batch versions of the algorithm¶

Kmeans¶

class megamix.batch.kmeans.Kmeans(n_components=1, init='plus', n_jobs=1)¶

Kmeans model.

Parameters:

n_components (int, defaults to 1) – Number of clusters used.
init (str, defaults to 'plus') – Method used to perform the initialization; must be in [‘random’, ‘plus’, ‘AF_KMC’].

name¶: str – The name of the method : ‘Kmeans’

means¶: array of floats (n_components,dim) – Contains the computed means of the model.

log_weights¶: array of floats (n_components,) – Contains the logarithm of the mixing coefficient of each cluster.

iter¶: int – The number of iterations computed with the method fit()

_is_initialized¶: bool – Ensures that the model has been initialized before using other methods such as distortion() or predict_assignements().

Raises:	ValueError: if the parameters are inconsistent, for example if the number of clusters is negative or init is not in [‘random’, ‘plus’, ‘AF_KMC’].

References

‘Fast and Provably Good Seedings for k-Means’, O. Bachem, M. Lucic, S. Hassani, A. Krause
Lloyd’s algorithm <https://en.wikipedia.org/wiki/Lloyd’s_algorithm>
The remarkable k-means++ <https://normaldeviate.wordpress.com/2012/09/30/the-remarkable-k-means/>
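The ‘plus’ initialization presumably refers to k-means++ seeding, as discussed in the last reference; that mapping is an assumption. The sketch below is a generic pure-Python illustration of the seeding idea, not megamix's implementation, and the helper names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_plus_plus_seeds(points, n_components, rng):
    """Pick n_components initial means with k-means++ style sampling."""
    # The first seed is drawn uniformly at random.
    seeds = [list(rng.choice(points))]
    while len(seeds) < n_components:
        # Squared distance from each point to its closest seed so far.
        d2 = [min(squared_distance(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a seed: fall back to uniform sampling.
            seeds.append(list(rng.choice(points)))
            continue
        # Sample the next seed with probability proportional to d2.
        seeds.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return seeds


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
seeds = kmeans_plus_plus_seeds(points, n_components=2, rng=random.Random(0))
```

Points far from the existing seeds are more likely to be chosen, which tends to spread the initial means across the data.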

fit(points_data, points_test=None, n_iter_max=100, n_iter_fix=None, tol=0, saving=None, file_name='model', saving_iter=2)¶

The k-means algorithm

Parameters:

points_data (array (n_points,dim)) – A 2D array of points on which the model will be trained.
tol (float, defaults to 0) – The algorithm stops when the difference in distortion between two consecutive steps is less than or equal to tol.
n_iter_max (int, defaults to 100) – Maximum number of iterations.
saving_iter (int, defaults to 2) – How often the model is saved (see saving below).
file_name (str, defaults to 'model') – The name of the file (including the path).

Other Parameters:

points_test (array (n_points_bis,dim), Optional) – A 2D array of points on which the model will be tested.
n_iter_fix (int, Optional) – If not None, the algorithm performs exactly n_iter_fix iterations and then stops.
saving (str, Optional) – A string in [‘log’, ‘linear’]. In the following, x is the parameter saving_iter (see above). If ‘log’, the model is saved at every iteration for which log(iter)/log(x) is an integer. If ‘linear’, the model is saved at every iteration for which iter/x is an integer.

Return type:	None
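To make the saving rule concrete, here is a minimal sketch that lists the iterations at which a model would be saved for a given saving mode and saving_iter. It assumes iterations are counted from 1, which this page does not state; the function name saved_iterations is hypothetical and is not part of megamix.

```python
def saved_iterations(saving, saving_iter, n_iter_max):
    """List the iterations in 1..n_iter_max at which the model would be saved."""
    if saving_iter < 2:
        raise ValueError("this sketch expects saving_iter >= 2")
    if saving == 'linear':
        # iter / x is an integer: every multiple of x.
        return [it for it in range(1, n_iter_max + 1) if it % saving_iter == 0]
    if saving == 'log':
        # log(iter) / log(x) is an integer: every integer power of x.
        # Integer powers avoid floating-point error in the logarithms.
        saved, power = [], 1
        while power <= n_iter_max:
            saved.append(power)
            power *= saving_iter
        return saved
    # saving=None: the model is never saved.
    return []


log_schedule = saved_iterations('log', saving_iter=2, n_iter_max=20)
linear_schedule = saved_iterations('linear', saving_iter=5, n_iter_max=20)
```

With ‘log’, saves become sparser as training proceeds; with ‘linear’, they are evenly spaced.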

predict_assignements(points)¶: This function returns the hard assignments of the points once the model is fitted.

score(points, assignements=None)¶

This method returns the distortion measured at the end of k-means.

Parameters:

points (array (n_points,dim)) – The points on which the distortion is computed.
assignements (array (n_components,dim)) – An array containing the responsibilities of the clusters.

Returns:	distortion
Return type:	float
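As an illustration of what a distortion measures, the sketch below computes hard assignments and a distortion in pure Python. It assumes the common definition (sum of squared Euclidean distances from each point to its closest mean); megamix may normalize or define the quantity differently, and the helper names here are hypothetical.

```python
import math


def pairwise_distances(points, means):
    """Euclidean distance from every point to every mean (n_points x n_components)."""
    return [[math.dist(p, m) for m in means] for p in points]


def hard_assignments(points, means):
    """Index of the closest mean for each point."""
    return [row.index(min(row)) for row in pairwise_distances(points, means)]


def distortion(points, means):
    """Sum of squared distances from each point to its closest mean."""
    return sum(min(row) ** 2 for row in pairwise_distances(points, means))


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
means = [[0.0, 1.0], [10.0, 0.0]]
assignments = hard_assignments(points, means)
total = distortion(points, means)
```

A lower distortion means the points sit closer to their assigned means.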

megamix.batch.kmeans.dist_matrix(points, means)¶

Gaussian Mixture Model (GMM)¶

class megamix.batch.GaussianMixture(n_components=1, covariance_type='full', init='kmeans', reg_covar=1e-06, type_init='resp', n_jobs=1)¶

Gaussian Mixture Model

Representation of a Gaussian mixture model probability distribution. This class estimates the parameters of a Gaussian mixture distribution.

Parameters:

n_components (int, defaults to 1.) – Number of clusters used.
init (str, defaults to 'kmeans'.) – Method used in order to perform the initialization, must be in [‘random’, ‘plus’, ‘AF_KMC’, ‘kmeans’].
reg_covar (float, defaults to 1e-6) – In order to avoid null covariances this float is added to the diagonal of covariance matrices.
type_init (str, defaults to 'resp'.) – The algorithm is initialized using this data (responsibilities if ‘resp’ or means, covariances and weights if ‘mcw’).
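The effect of reg_covar can be sketched in a few lines: adding a small positive value to the diagonal keeps a degenerate covariance matrix invertible. This is a generic illustration using nested lists; the function name regularize_covariance is hypothetical and is not part of megamix.

```python
def regularize_covariance(cov, reg_covar=1e-6):
    """Return a copy of a square covariance matrix with reg_covar added to its diagonal."""
    return [
        [value + reg_covar if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(cov)
    ]


# A degenerate covariance: the second coordinate has zero variance.
cov = [[2.0, 0.0], [0.0, 0.0]]
regularized = regularize_covariance(cov)

# Determinant of the 2x2 regularized matrix; it is now strictly positive.
determinant = (regularized[0][0] * regularized[1][1]
               - regularized[0][1] * regularized[1][0])
```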

name¶: str – The name of the method : ‘GMM’

cov¶: array of floats (n_components,dim,dim) – Contains the computed covariance matrices of the mixture.

means¶: array of floats (n_components,dim) – Contains the computed means of the mixture.

log_weights¶: array of floats (n_components,) – Contains the logarithm of the mixing coefficient of each cluster.

iter¶: int – The number of iterations computed with the method fit()

convergence_criterion_data¶: array of floats (iter,) – Stores the value of the convergence criterion computed with data on which the model is fitted.

convergence_criterion_test¶: array of floats (iter,) | if _early_stopping only – Stores the value of the convergence criterion computed with test data if it exists.

_is_initialized¶: bool – Ensures that the method _initialize() has been used before using other methods such as score() or predict_log_assignements().

Raises:	ValueError: if the parameters are inconsistent, for example if the number of clusters is negative or type_init is not in [‘resp’, ‘mcw’].

References

‘Pattern Recognition and Machine Learning’, Bishop

fit(points_data, points_test=None, tol=0.001, patience=None, n_iter_max=100, n_iter_fix=None, saving=None, file_name='model', saving_iter=2)¶

The EM algorithm

Parameters:

points_data (array (n_points,dim)) – A 2D array of points on which the model will be trained.
tol (float, defaults to 1e-3) – The EM algorithm stops when the difference in the convergence criterion between two consecutive steps is less than tol.
n_iter_max (int, defaults to 100) – Maximum number of iterations.
saving_iter (int, defaults to 2) – How often the model is saved (see saving below).
file_name (str, defaults to 'model') – The name of the file (including the path).

Other Parameters:

points_test (array (n_points_bis,dim), Optional) – A 2D array of points on which the model will be tested.
patience (int, Optional) – The number of iterations performed after the convergence criterion has been satisfied.
n_iter_fix (int, Optional) – If not None, the algorithm performs exactly n_iter_fix iterations and then stops.
saving (str, Optional) – A string in [‘log’, ‘linear’]. In the following, x is the parameter saving_iter (see above). If ‘log’, the model is saved at every iteration for which log(iter)/log(x) is an integer. If ‘linear’, the model is saved at every iteration for which iter/x is an integer.

Return type:	None
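One plausible reading of how tol, patience and n_iter_max interact is sketched below on a recorded trace of convergence-criterion values. The exact rule megamix applies (for instance whether the test data drives the criterion, or how patience=None is handled) is not stated here, so treat this as an assumption; stopping_iteration is a hypothetical helper and uses patience=0 in place of None.

```python
def stopping_iteration(criterion, tol=1e-3, patience=0, n_iter_max=100):
    """Return the number of iterations performed for a given criterion trace.

    criterion[i] is the value of the convergence criterion after iteration i + 1.
    """
    extra = 0          # iterations run since the criterion was first satisfied
    converged = False
    n_iter = 0
    for i in range(min(len(criterion), n_iter_max)):
        n_iter = i + 1
        if converged:
            # Keep iterating until `patience` extra iterations have been done.
            extra += 1
            if extra >= patience:
                break
        elif i > 0 and criterion[i] - criterion[i - 1] < tol:
            converged = True
            if patience == 0:
                break
    return n_iter


trace = [-10.0, -5.0, -4.0, -3.9995, -3.9994, -3.9993]
without_patience = stopping_iteration(trace)
with_patience = stopping_iteration(trace, patience=2)
capped = stopping_iteration(trace, n_iter_max=3)
```

In this trace the improvement first drops below tol at the fourth iteration; a patience of 2 lets two more iterations run before stopping.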

predict_log_resp(points)¶

This function returns the logarithm of each point’s responsibilities

Parameters:	points (array (n_points_bis,dim)) – a 1D or 2D array of points with the same dimension as the problem
Returns:	log_resp – the logarithm of the responsibilities
Return type:	array (n_points_bis,n_components)
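For intuition, log responsibilities can be computed from log weights and component densities with a log-sum-exp normalization. The sketch below does this for a one-dimensional mixture in pure Python; it illustrates the standard formula rather than megamix's code, and all function names are hypothetical.

```python
import math


def log_gaussian_1d(x, mean, var):
    """Log density of a 1D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def predict_log_resp_1d(points, log_weights, means, variances):
    """Log responsibilities: one row per point, one column per component."""
    log_resp = []
    for x in points:
        # Log of weight times density for each component.
        joint = [lw + log_gaussian_1d(x, m, v)
                 for lw, m, v in zip(log_weights, means, variances)]
        # Normalize so that the responsibilities of a point sum to 1.
        norm = log_sum_exp(joint)
        log_resp.append([j - norm for j in joint])
    return log_resp


log_weights = [math.log(0.5), math.log(0.5)]
log_resp = predict_log_resp_1d([0.0, 4.0, 2.0], log_weights, [0.0, 4.0], [1.0, 1.0])
```

Working in log space avoids underflow when a point lies far from every component.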

read_and_init(group, points)¶

A method that reads a group of an HDF5 file to initialize the model.

Parameters:	group (HDF5 group) – A group of an HDF5 file opened in reading mode

score(points)¶

This function returns the score of the model: the logarithm of the likelihood for GMM, and the logarithm of the lower bound of the likelihood for VBGMM and DPGMM.

Parameters:	points (array (n_points_bis,dim)) – a 1D or 2D array of points with the same dimension as the problem
Returns:	score
Return type:	float

simplified_model(points)¶

A method that creates a new model with simplified parameters: unused clusters are removed.

Parameters:	points (an array (n_points,dim)) –
Returns:	GM
Return type:	an instance of the same type as self: GMM, VBGMM or DPGMM

write(group)¶

A method that creates datasets in a group of an HDF5 file in order to save the model.

Parameters:	group (HDF5 group) – A group of an HDF5 file opened in writing mode

Variational Gaussian Mixture Model (VBGMM)¶

class megamix.batch.VBGMM.VariationalGaussianMixture(n_components=1, init='kmeans', alpha_0=None, beta_0=None, nu_0=None, means_prior=None, cov_wishart_prior=None, reg_covar=1e-06, type_init='resp', n_jobs=1)¶

Variational Bayesian Estimation of a Gaussian Mixture

This class infers an approximate posterior distribution over the parameters of a Gaussian mixture distribution.

The weights distribution is a Dirichlet distribution with parameter alpha (see Bishop’s book p474-486)

Parameters:

n_components (int, defaults to 1.) – Number of clusters used.
init (str, defaults to 'kmeans'.) – Method used in order to perform the initialization, must be in [‘random’, ‘plus’, ‘AF_KMC’, ‘kmeans’, ‘GMM’].
reg_covar (float, defaults to 1e-6) – In order to avoid null covariances this float is added to the diagonal of covariance matrices.
type_init (str, defaults to 'resp'.) – The algorithm is initialized using this data (responsibilities if ‘resp’ or means, covariances and weights if ‘mcw’).

Other Parameters:

alpha_0 (float, Optional | defaults to None) – The prior parameter on the weight distribution (Dirichlet). A high value of alpha_0 will lead to equal weights, while a low value will allow some clusters to shrink and disappear. Must be greater than 0. If None, the value is set to 1/n_components.
beta_0 (float, Optional | defaults to None) – The precision prior on the mean distribution (Gaussian). Must be greater than 0. If None, the value is set to 1.0.
nu_0 (float, Optional | defaults to None) – The prior of the number of degrees of freedom on the covariance distributions (Wishart). Must be greater than or equal to dim. If None, the value is set to dim.
means_prior (array (dim,), Optional | defaults to None) – The prior value used to compute the means. If None, the value is set to the mean of points_data.
cov_wishart_prior (type depends on covariance_type, Optional | defaults to None) – The prior value used to compute the precisions. If covariance_type is ‘full’, it must be an array (dim,dim); if covariance_type is ‘spherical’, it must be a float. If None, the value is set to the covariance of points_data.
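The default rules and constraints listed above can be summarized in a short sketch. It only mirrors what this page states (cov_wishart_prior is left out for brevity) and is not megamix's code; resolve_priors is a hypothetical helper.

```python
def resolve_priors(points_data, n_components, alpha_0=None, beta_0=None,
                   nu_0=None, means_prior=None):
    """Fill in the documented defaults for the priors and check their constraints."""
    n_points = len(points_data)
    dim = len(points_data[0])

    if alpha_0 is None:
        alpha_0 = 1.0 / n_components
    if beta_0 is None:
        beta_0 = 1.0
    if nu_0 is None:
        nu_0 = dim
    if means_prior is None:
        # Coordinate-wise mean of the training points.
        means_prior = [sum(p[d] for p in points_data) / n_points
                       for d in range(dim)]

    if alpha_0 <= 0 or beta_0 <= 0:
        raise ValueError("alpha_0 and beta_0 must be greater than 0")
    if nu_0 < dim:
        raise ValueError("nu_0 must be greater than or equal to dim")

    return {"alpha_0": alpha_0, "beta_0": beta_0,
            "nu_0": nu_0, "means_prior": means_prior}


points_data = [[0.0, 2.0], [2.0, 4.0]]
priors = resolve_priors(points_data, n_components=4)
```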

name¶: str – The name of the method : ‘VBGMM’

alpha¶: array of floats (n_components,) – Contains the parameters of the weight distribution (Dirichlet)

beta¶: array of floats (n_components,) – Contains the coefficients by which the precision matrices are multiplied to form the precision matrix of the Gaussian distribution of the means.

nu¶: array of floats (n_components,) – Contains the number of degrees of freedom on the distribution of covariance matrices.

_inv_prec¶: array of floats (n_components,dim,dim) – Contains the equivalent of the matrix W described in Bishop’s book. It is proportional to cov.

_log_det_inv_prec¶: array of floats (n_components,) – Contains the logarithm of the determinant of W matrices.

cov¶: array of floats (n_components,dim,dim) – Contains the computed covariance matrices of the mixture.

means¶: array of floats (n_components,dim) – Contains the computed means of the mixture.

log_weights¶: array of floats (n_components,) – Contains the logarithm of weights of each cluster.

iter¶: int – The number of iterations computed with the method fit()

convergence_criterion_data¶: array of floats (iter,) – Stores the value of the convergence criterion computed with data on which the model is fitted.

convergence_criterion_test¶: array of floats (iter,) | if _early_stopping only – Stores the value of the convergence criterion computed with test data if it exists.

_is_initialized¶: bool – Ensures that the method _initialize() has been used before using other methods such as score() or predict_log_assignements().

Raises:	ValueError: if the parameters are inconsistent, for example if the number of clusters is negative or type_init is not in [‘resp’, ‘mcw’].

References

‘Pattern Recognition and Machine Learning’, Bishop

fit(points_data, points_test=None, tol=0.001, patience=None, n_iter_max=100, n_iter_fix=None, saving=None, file_name='model', saving_iter=2)¶

The EM algorithm

Parameters:

points_data (array (n_points,dim)) – A 2D array of points on which the model will be trained.
tol (float, defaults to 1e-3) – The EM algorithm stops when the difference in the convergence criterion between two consecutive steps is less than tol.
n_iter_max (int, defaults to 100) – Maximum number of iterations.
saving_iter (int, defaults to 2) – How often the model is saved (see saving below).
file_name (str, defaults to 'model') – The name of the file (including the path).

Other Parameters:

points_test (array (n_points_bis,dim), Optional) – A 2D array of points on which the model will be tested.
patience (int, Optional) – The number of iterations performed after the convergence criterion has been satisfied.
n_iter_fix (int, Optional) – If not None, the algorithm performs exactly n_iter_fix iterations and then stops.
saving (str, Optional) – A string in [‘log’, ‘linear’]. In the following, x is the parameter saving_iter (see above). If ‘log’, the model is saved at every iteration for which log(iter)/log(x) is an integer. If ‘linear’, the model is saved at every iteration for which iter/x is an integer.

Return type:	None

predict_log_resp(points)¶

This function returns the logarithm of each point’s responsibilities

Parameters:	points (array (n_points_bis,dim)) – a 1D or 2D array of points with the same dimension as the problem
Returns:	log_resp – the logarithm of the responsibilities
Return type:	array (n_points_bis,n_components)

read_and_init(group, points)¶

A method that reads a group of an HDF5 file to initialize the model.

Parameters:	group (HDF5 group) – A group of an HDF5 file opened in reading mode

score(points)¶

This function returns the score of the model: the logarithm of the likelihood for GMM, and the logarithm of the lower bound of the likelihood for VBGMM and DPGMM.

Parameters:	points (array (n_points_bis,dim)) – a 1D or 2D array of points with the same dimension as the problem
Returns:	score
Return type:	float

simplified_model(points)¶

A method that creates a new model with simplified parameters: unused clusters are removed.

Parameters:	points (an array (n_points,dim)) –
Returns:	GM
Return type:	an instance of the same type as self: GMM, VBGMM or DPGMM

write(group)¶

A method that creates datasets in a group of an HDF5 file in order to save the model.

Parameters:	group (HDF5 group) – A group of an HDF5 file opened in writing mode

Dirichlet Process Gaussian Mixture Model (DPGMM)¶

class megamix.batch.DPGMM.DPVariationalGaussianMixture(n_components=1, init='kmeans', alpha_0=None, beta_0=None, nu_0=None, means_prior=None, cov_wishart_prior=None, reg_covar=1e-06, type_init='resp', n_jobs=1, pypcoeff=0)¶

Variational Bayesian Estimation of a Gaussian Mixture with Dirichlet Process

This class infers an approximate posterior distribution over the parameters of a Gaussian mixture distribution.

The weights distribution follows a Dirichlet Process with attribute alpha.

Parameters:

n_components (int, defaults to 1.) – Number of clusters used.
init (str, defaults to 'kmeans'.) – Method used in order to perform the initialization, must be in [‘random’, ‘plus’, ‘AF_KMC’, ‘kmeans’, ‘GMM’, ‘VBGMM’].
reg_covar (float, defaults to 1e-6) – In order to avoid null covariances this float is added to the diagonal of covariance matrices.
type_init (str, defaults to 'resp'.) – The algorithm is initialized using this data (responsibilities if ‘resp’ or means, covariances and weights if ‘mcw’).

Other Parameters:

alpha_0 (float, Optional | defaults to None) – The prior parameter on the weight distribution (Beta). A high value of alpha_0 will lead to equal weights, while a low value will allow some clusters to shrink and disappear. Must be greater than 0. If None, the value is set to 1/n_components.
beta_0 (float, Optional | defaults to None) – The precision prior on the mean distribution (Gaussian). Must be greater than 0. If None, the value is set to 1.0.
nu_0 (float, Optional | defaults to None) – The prior of the number of degrees of freedom on the covariance distributions (Wishart). Must be greater than or equal to dim. If None, the value is set to dim.
means_prior (array (dim,), Optional | defaults to None) – The prior value used to compute the means. If None, the value is set to the mean of points_data.
cov_wishart_prior (type depends on covariance_type, Optional | defaults to None) – The prior value used to compute the precisions. If covariance_type is ‘full’, it must be an array (dim,dim); if covariance_type is ‘spherical’, it must be a float.
pypcoeff (float | defaults to 0) – If 0, the weights are generated according to a Dirichlet Process. If >0 and <=1, the weights are generated according to a Pitman-Yor Process.

name¶: str – The name of the method : ‘DPGMM’

alpha¶: array of floats (n_components,2) – Contains the parameters of the weight distribution (Beta)
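To show how pairs of Beta parameters can translate into mixture weights, the sketch below applies the standard stick-breaking construction described by Blei and Jordan. It assumes each row of alpha holds the two parameters (a, b) of a Beta distribution in that order, which this page does not specify; how megamix actually derives log_weights may differ, and expected_log_weights is a hypothetical helper.

```python
import math


def expected_log_weights(alpha):
    """Log of the expected stick-breaking weights.

    alpha is a list of (a_k, b_k) pairs, one Beta distribution per component.
    """
    log_weights = []
    log_remaining = 0.0  # log of the expected stick length not yet assigned
    for a, b in alpha:
        mean_v = a / (a + b)  # mean of Beta(a, b): fraction of the stick taken
        log_weights.append(log_remaining + math.log(mean_v))
        log_remaining += math.log(1.0 - mean_v)
    return log_weights


# Three components, each taking on average half of what remains of the stick.
alpha = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
weights = [math.exp(lw) for lw in expected_log_weights(alpha)]
```

Because the stick is truncated after n_components pieces, these weights sum to less than 1 unless the last component is given the remainder.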

beta¶: array of floats (n_components,) – Contains the coefficients by which the precision matrices are multiplied to form the precision matrix of the Gaussian distribution of the means.

nu¶: array of floats (n_components,) – Contains the number of degrees of freedom on the distribution of covariance matrices.

_inv_prec¶: array of floats (n_components,dim,dim) – Contains the equivalent of the matrix W described in Bishop’s book. It is proportional to cov.

_log_det_inv_prec¶: array of floats (n_components,) – Contains the logarithm of the determinant of W matrices.

cov¶: array of floats (n_components,dim,dim) – Contains the computed covariance matrices of the mixture.

means¶: array of floats (n_components,dim) – Contains the computed means of the mixture.

log_weights¶: array of floats (n_components,) – Contains the logarithm of weights of each cluster.

iter¶: int – The number of iterations computed with the method fit()

convergence_criterion_data¶: array of floats (iter,) – Stores the value of the convergence criterion computed with data on which the model is fitted.

convergence_criterion_test¶: array of floats (iter,) | if _early_stopping only – Stores the value of the convergence criterion computed with test data if it exists.

_is_initialized¶: bool – Ensures that the method _initialize() has been used before using other methods such as score() or predict_log_assignements().

Raises:	ValueError: if the parameters are inconsistent, for example if the number of clusters is negative or type_init is not in [‘resp’, ‘mcw’].

References

‘Variational Inference for Dirichlet Process Mixtures’, D. Blei and M. Jordan

fit(points_data, points_test=None, tol=0.001, patience=None, n_iter_max=100, n_iter_fix=None, saving=None, file_name='model', saving_iter=2)¶

The EM algorithm

Parameters:

points_data (array (n_points,dim)) – A 2D array of points on which the model will be trained.
tol (float, defaults to 1e-3) – The EM algorithm stops when the difference in the convergence criterion between two consecutive steps is less than tol.
n_iter_max (int, defaults to 100) – Maximum number of iterations.
saving_iter (int, defaults to 2) – How often the model is saved (see saving below).
file_name (str, defaults to 'model') – The name of the file (including the path).

Other Parameters:

points_test (array (n_points_bis,dim), Optional) – A 2D array of points on which the model will be tested.
patience (int, Optional) – The number of iterations performed after the convergence criterion has been satisfied.
n_iter_fix (int, Optional) – If not None, the algorithm performs exactly n_iter_fix iterations and then stops.
saving (str, Optional) – A string in [‘log’, ‘linear’]. In the following, x is the parameter saving_iter (see above). If ‘log’, the model is saved at every iteration for which log(iter)/log(x) is an integer. If ‘linear’, the model is saved at every iteration for which iter/x is an integer.

Return type:	None

predict_log_resp(points)¶

This function returns the logarithm of each point’s responsibilities

Parameters:	points (array (n_points_bis,dim)) – a 1D or 2D array of points with the same dimension as the problem
Returns:	log_resp – the logarithm of the responsibilities
Return type:	array (n_points_bis,n_components)

read_and_init(group, points)¶

A method that reads a group of an HDF5 file to initialize DPGMM.

Parameters:	group (HDF5 group) – A group of an HDF5 file opened in reading mode

score(points)¶

This function returns the score of the model: the logarithm of the likelihood for GMM, and the logarithm of the lower bound of the likelihood for VBGMM and DPGMM.

Parameters:	points (array (n_points_bis,dim)) – a 1D or 2D array of points with the same dimension as the problem
Returns:	score
Return type:	float

simplified_model(points)¶

A method that creates a new model with simplified parameters: unused clusters are removed.

Parameters:	points (an array (n_points,dim)) –
Returns:	GM
Return type:	an instance of the same type as self: GMM, VBGMM or DPGMM

write(group)¶

A method that creates datasets in a group of an HDF5 file in order to save the model.

Parameters:	group (HDF5 group) – A group of an HDF5 file opened in writing mode

Online versions of the algorithm¶

Kmeans¶

class megamix.online.kmeans.Kmeans(n_components=1, window=1, kappa=1.0)¶

Kmeans model.

Parameters:

n_components (int, defaults to 1.) – Number of clusters used.
window (int, defaults to 1) – The number of points used at the same time in order to update the parameters.
kappa (double, defaults to 1.0) – A coefficient in ]0.0,1.0] that controls how much weight the new points receive compared to the ones already used.
- If kappa is close to 0, the new points have a large weight and the model may take a long time to stabilize.
- If kappa = 1.0, the new points have little weight and the model may not move far enough from its initialization.
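The role of kappa can be illustrated with a decaying step size. The sketch below assumes the common stochastic-approximation schedule gamma_t = t^(-kappa); the exact formula used by megamix is not given on this page, so this is an assumption, and step_size is a hypothetical helper.

```python
def step_size(t, kappa):
    """Weight given to the t-th batch of new points (t starts at 1)."""
    if not 0.0 < kappa <= 1.0:
        raise ValueError("kappa must be in ]0.0, 1.0]")
    return t ** (-kappa)


# Weight of the new points after 1, 10 and 100 updates.
slow_decay = [step_size(t, kappa=0.1) for t in (1, 10, 100)]
fast_decay = [step_size(t, kappa=1.0) for t in (1, 10, 100)]
```

With kappa close to 0 the weight of new points stays high for a long time, while with kappa = 1.0 it shrinks quickly, matching the two cases described above.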

name¶: str – The name of the method : ‘Kmeans’

log_weights¶: array of floats (n_components,) – Contains the logarithm of the mixing coefficients of the model.

means¶: array of floats (n_components,dim) – Contains the computed means of the model.

N¶: array of floats (n_components,) – The sufficient statistic updated during each iteration used to compute log_weights (this corresponds to the mixing coefficients).

X¶: array of floats (n_components,dim) – The sufficient statistic updated during each iteration used to compute the means.
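A minimal sketch of how sufficient statistics such as N and X could be updated from a batch and turned back into parameters is given below. It assumes a convex-combination update with a step size gamma and hard assignments, which is one standard online scheme and not necessarily the one megamix implements; both function names are hypothetical.

```python
import math


def update_statistics(N, X, batch, assignments, gamma):
    """Blend the statistics of one batch into the running statistics N and X."""
    n_components, dim = len(N), len(X[0])
    # Per-batch statistics: average membership and average assigned point.
    N_batch = [0.0] * n_components
    X_batch = [[0.0] * dim for _ in range(n_components)]
    for point, k in zip(batch, assignments):
        N_batch[k] += 1.0 / len(batch)
        for d in range(dim):
            X_batch[k][d] += point[d] / len(batch)
    # Convex combination of old and new statistics with step size gamma.
    new_N = [(1 - gamma) * n + gamma * nb for n, nb in zip(N, N_batch)]
    new_X = [[(1 - gamma) * x + gamma * xb for x, xb in zip(row, row_b)]
             for row, row_b in zip(X, X_batch)]
    return new_N, new_X


def parameters_from_statistics(N, X):
    """Recover means and log weights (assumes every entry of N is positive)."""
    means = [[x / n for x in row] for n, row in zip(N, X)]
    log_weights = [math.log(n) for n in N]
    return means, log_weights


# Two clusters with current means [0, 0] and [4, 4] and equal weights.
N, X = [0.5, 0.5], [[0.0, 0.0], [2.0, 2.0]]
N, X = update_statistics(N, X, batch=[[1.0, 1.0], [5.0, 5.0]],
                         assignments=[0, 1], gamma=0.5)
means, log_weights = parameters_from_statistics(N, X)
```

Each mean moves halfway toward the new point assigned to its cluster because gamma is 0.5 here.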

iter¶: int – The number of points which have been used to compute the model.

_is_initialized¶: bool – Ensures that the model has been initialized before using other methods such as fit(), distortion() or predict_assignements().

Raises:	ValueError: if the parameters are inconsistent, for example if the number of clusters is negative.

References

‘Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling’, C. Dupuy & F. Bach
The remarkable k-means++ <https://normaldeviate.wordpress.com/2012/09/30/the-remarkable-k-means/>

fit(points, saving=None, file_name='model', saving_iter=2)¶

The k-means algorithm

Parameters:

points (array (n_points,dim)) – A 2D array of points on which the model will be trained.
saving_iter (int, defaults to 2) – How often the model is saved (see saving below).
file_name (str, defaults to 'model') – The name of the file (including the path).

Other Parameters:

saving (str, Optional) – A string in [‘log’, ‘linear’]. In the following, x is the parameter saving_iter (see above). If ‘log’, the model is saved at every iteration for which log(iter)/log(x) is an integer. If ‘linear’, the model is saved at every iteration for which iter/x is an integer.

Return type:	None

get(name)¶

initialize(points)¶

This method initializes the k-means model by setting the values of the means and weights.

Parameters:	points (an array (n_points,dim)) – Data on which the model is initialized.

predict_assignements(points)¶: This function returns the hard assignments of the points once the model is fitted.

score(points, assignements=None)¶

This method returns the distortion measured at the end of k-means.

Parameters:

points (array (n_points,dim)) – The points on which the distortion is computed.
assignements (array (n_components,dim)) – An array containing the responsibilities of the clusters.

Returns:	distortion
Return type:	float

megamix.online.kmeans.dist_matrix(points, means)¶