Contents

Gaussian mixture models

scikits.learn.gmm is a package for creating Gaussian Mixture Models (diagonal, spherical, tied and full covariance matrices are supported), sampling from them, and estimating their parameters from data using the expectation-maximization (EM) algorithm. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion (BIC) to assess the number of clusters in the data. In the near future, we hope to add so-called online EM (i.e. recursive EM) and a variational Bayes implementation.
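
Because the score method (documented below) returns per-sample log likelihoods, the BIC can be computed by hand for a fitted model. Here is a minimal sketch; the gmm_bic helper is not part of the package, g is assumed to be a fitted GMM, obs its training data, and the free-parameter count assumes cvtype='diag':

import numpy as np

def gmm_bic(g, obs):
    # BIC = -2 * log likelihood + n_params * log(n_samples).
    # Free parameters for cvtype='diag': (n_states - 1) mixing
    # weights, n_states * n_dim means, n_states * n_dim variances.
    n, n_dim = obs.shape
    n_params = (g.n_states - 1) + 2 * g.n_states * n_dim
    log_likelihood = g.score(obs).sum()
    return -2 * log_likelihood + n_params * np.log(n)

Fitting models with an increasing number of components and keeping the one with the lowest BIC is one common way to choose the number of clusters.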

It is implemented in Python and uses the excellent numpy and scipy packages. Numpy is a Python package which gives Python fast multi-dimensional array capabilities (à la MATLAB and the like); scipy leverages numpy to build common scientific features for signal processing, linear algebra, statistics, etc.

GMM classifier

class scikits.learn.gmm.GMM(n_states=1, n_dim=1, cvtype='diag', weights=None, means=None, covars=None)

Gaussian Mixture Model

Representation of a Gaussian mixture model probability distribution. This class allows for easy evaluation of, sampling from, and maximum-likelihood estimation of the parameters of a GMM distribution.

Examples

>>> import numpy as np
>>> from scikits.learn.gmm import GMM
>>> g = GMM(n_states=2, n_dim=1)
>>> # The initial parameters are fixed.
>>> np.round(g.weights, 2)
array([ 0.5,  0.5])
>>> np.round(g.means, 2)
array([[ 0.],
       [ 0.]])
>>> np.round(g.covars, 2)
array([[[ 1.]],
<BLANKLINE>
       [[ 1.]]])
>>> # Generate random observations with two modes centered on 0
>>> # and 10 to use for training.
>>> np.random.seed(0)
>>> obs = np.concatenate((np.random.randn(100, 1),
...                       10 + np.random.randn(300, 1)))
>>> g.fit(obs) 
GMM(n_dim=1, cvtype='diag',
  means=array([[ ...],
       [ ...]]),
  covars=[array([[ ...]]), array([[ ...]])], n_states=2,
  weights=array([ 0.75,  0.25]))
>>> np.round(g.weights, 2)
array([ 0.75,  0.25])
>>> np.round(g.means, 2)
array([[ 9.94],
       [ 0.06]])
>>> np.round(g.covars, 2)
array([[[ 0.96]],
<BLANKLINE>
       [[ 1.02]]])
>>> g.predict([[0], [2], [9], [10]])
array([1, 1, 0, 0])
>>> np.round(g.score([[0], [2], [9], [10]]), 2)
array([-2.32, -4.16, -1.65, -1.19])
>>> # Refit the model on new data (initial parameters remain the
>>> # same), this time with an even split between the two modes.
>>> g.fit(20 * [[0]] +  20 * [[10]])
GMM(n_dim=1, cvtype='diag',
  means=array([[ 10.],
       [  0.]]),
  covars=[array([[ 0.001]]), array([[ 0.001]])], n_states=2,
  weights=array([ 0.5,  0.5]))
>>> np.round(g.weights, 2)
array([ 0.5,  0.5])

Attributes

cvtype string Covariance type of the model: ‘spherical’, ‘tied’, ‘diag’ or ‘full’.
n_states int Number of mixture components in the model.
weights array, shape (n_states,) Mixing weights for each mixture component.
means array, shape (n_states, n_dim) Mean parameters for each mixture component.
covars array Covariance parameters for each mixture component, returned as full matrices.
n_dim int Dimensionality of the Gaussians.
labels list, len n_states Optional labels for each mixture component.

Methods

decode(X) Find most likely mixture components for each point in X.
eval(X) Compute the log likelihood of X under the model and the posterior distribution over mixture components.
fit(X) Estimate model parameters from X using the EM algorithm.
predict(X) Like decode, find the most likely mixture component for each observation in X.
rvs(n=1) Generate n samples from the model.
score(X) Compute the log likelihood of X under the model.
covars

Return the covariance parameters as full covariance matrices, regardless of cvtype.

cvtype

Covariance type of the model.

Must be one of ‘spherical’, ‘tied’, ‘diag’, ‘full’.
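
The covariance structure is chosen when the model is constructed, for instance (a short sketch):

from scikits.learn.gmm import GMM

# Two 3-component models over 2-dimensional data, differing only
# in how the component covariances are parametrized.
g_spherical = GMM(n_states=3, n_dim=2, cvtype='spherical')
g_full = GMM(n_states=3, n_dim=2, cvtype='full')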

decode(obs)

Find most likely mixture components for each point in obs.

Parameters :

obs : array_like, shape (n, n_dim)

List of n_dim-dimensional data points. Each row corresponds to a single data point.

Returns :

logprobs : array_like, shape (n,)

Log probability of each point in obs under the model.

components : array_like, shape (n,)

Index of the most likely mixture component for each observation.
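
A short usage sketch, assuming g and obs are the fitted model and observations from the class example above:

# logprobs[i] is the log probability of obs[i] under the model;
# components[i] is the index of its most likely mixture component.
logprobs, components = g.decode(obs)
print(components[:5])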

eval(obs)

Evaluate the model on data

Compute the log probability of obs under the model and return the posterior distribution (responsibilities) of each mixture component for each element of obs.

Parameters :

obs : array_like, shape (n, n_dim)

List of n_dim-dimensional data points. Each row corresponds to a single data point.

Returns :

logprob : array_like, shape (n,)

Log probabilities of each data point in obs

posteriors : array_like, shape (n, n_states)

Posterior probabilities of each mixture component for each observation
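
A sketch of how the two return values relate, again assuming the fitted g and obs from the class example above:

import numpy as np

logprob, posteriors = g.eval(obs)
# Each row of posteriors is a distribution over the n_states
# mixture components, so the rows sum to one.
assert np.allclose(posteriors.sum(axis=1), 1.0)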

fit(X, n_iter=10, min_covar=0.001, thresh=0.01, params='wmc', init_params='wmc', **kwargs)

Estimate model parameters with the expectation-maximization algorithm.

An initialization step is performed before entering the EM algorithm. If you want to avoid this step, set the keyword argument init_params to the empty string ‘’. Likewise, if you just want to perform the initialization, call this method with n_iter=0.

Parameters :

X : array_like, shape (n, n_dim)

List of n_dim-dimensional data points. Each row corresponds to a single data point.

n_iter : int, optional

Number of EM iterations to perform.

min_covar : float, optional

Floor on the diagonal of the covariance matrix to prevent overfitting. Defaults to 1e-3.

thresh : float, optional

Convergence threshold.

params : string, optional

Controls which parameters are updated in the training process. Can contain any combination of ‘w’ for weights, ‘m’ for means, and ‘c’ for covars. Defaults to ‘wmc’.

init_params : string, optional

Controls which parameters are updated in the initialization process. Can contain any combination of ‘w’ for weights, ‘m’ for means, and ‘c’ for covars. Defaults to ‘wmc’.

kwargs : keyword, optional

Keyword arguments passed to scipy.cluster.vq.kmeans2
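
For example, to separate the initialization step from the EM iterations as described above (a sketch, reusing the obs array from the class example):

g = GMM(n_states=2, n_dim=1)
g.fit(obs, n_iter=0)        # initialization only, no EM iteration
g.fit(obs, init_params='')  # run EM from the current parameters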

means

Mean parameters for each mixture component.

n_states

Number of mixture components in the model.

predict(X)

Predict label for data.

Parameters :

X : array-like, shape = [n_samples, n_features]

Returns :

C : array, shape = [n_samples]

Index of the most likely mixture component for each sample.
rvs(n=1)

Generate random samples from the model.

Parameters :

n : int

Number of samples to generate.

Returns :

obs : array_like, shape (n, n_dim)

List of samples
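
A sketch, using the fitted model from the class example:

# Draw 1000 samples; components are selected according to the
# mixing weights in g.weights.
samples = g.rvs(1000)
print(samples.shape)  # (1000, n_dim)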

score(obs)

Compute the log probability under the model.

Parameters :

obs : array_like, shape (n, n_dim)

List of n_dim-dimensional data points. Each row corresponds to a single data point.

Returns :

logprob : array_like, shape (n,)

Log probabilities of each data point in obs
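
Since score returns one log probability per observation, the total log likelihood of a dataset is simply the sum (a sketch):

logprob = g.score(obs)               # shape (n,)
total_log_likelihood = logprob.sum()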

weights

Mixing weights for each mixture component.

Examples

See Gaussian Mixture Model Ellipsoids for an example of plotting the confidence ellipsoids.

See Density Estimation for a mixture of Gaussians for an example of plotting the estimated density.