scikits.learn.cluster.KMeans

class scikits.learn.cluster.KMeans(k=8, init='random', n_init=10, max_iter=300)

K-Means clustering

Parameters :

data : ndarray

An M by N array of M observations in N dimensions, or a length M array of M one-dimensional observations.

k : int or ndarray

The number of clusters to form, which is also the number of centroids to generate. If init is ‘matrix’, or if an ndarray is given instead of an int, k is interpreted as the initial cluster centers to use.

n_iter : int

Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iters parameter to the kmeans function.

init : {‘k-means++’, ‘random’, ‘points’, ‘matrix’}

Method for initialization, defaults to ‘random’ (as in the class signature above):

‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section of k_init for more details.

‘random’: generate k centroids from a Gaussian with mean and variance estimated from the data.

‘points’: choose k observations (rows) at random from data for the initial centroids.

‘matrix’: interpret the k parameter as a k by N array (or a length k array for one-dimensional data) of initial centroids.
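
As a rough illustration of the three generating strategies listed above, the sketch below uses only the Python standard library. The helper names (sq_dist, init_random, init_points, init_kmeans_pp) are hypothetical and are not part of scikits.learn. The ‘random’ variant assumes independent features (per-feature variance only), and the ‘k-means++’ variant is the textbook squared-distance seeding, which may differ in detail from what k_init does.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_random(data, k, rng):
    """'random': draw k centroids from a Gaussian fitted to each feature.

    Assumption: features are treated as independent, so only the
    per-feature mean and variance are estimated.
    """
    m = len(data)
    n_features = len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n_features)]
    stds = [
        (sum((row[j] - means[j]) ** 2 for row in data) / m) ** 0.5
        for j in range(n_features)
    ]
    return [
        [rng.gauss(means[j], stds[j]) for j in range(n_features)]
        for _ in range(k)
    ]


def init_points(data, k, rng):
    """'points': use k distinct observations (rows) as initial centroids."""
    return [list(row) for row in rng.sample(data, k)]


def init_kmeans_pp(data, k, rng):
    """'k-means++' (textbook form): favor seeds far from those already chosen."""
    centers = [list(rng.choice(data))]
    while len(centers) < k:
        # Squared distance from each observation to its nearest chosen center;
        # already-chosen observations get weight 0 and cannot be drawn again.
        d2 = [min(sq_dist(row, c) for c in centers) for row in data]
        centers.append(list(rng.choices(data, weights=d2, k=1)[0]))
    return centers
```

Passing an explicit random.Random instance as rng keeps the draws reproducible. The ‘matrix’ option needs no helper, since the caller supplies the centroids directly.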

Notes

The k-means problem is solved using the Lloyd algorithm.
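
For readers unfamiliar with it, the Lloyd algorithm alternates between assigning each observation to its nearest center and moving each center to the mean of its assigned observations. The following is a minimal standard-library sketch of that loop, not the library's implementation. The names sq_dist and lloyd are hypothetical, max_iter is assumed to be at least 1, and keeping the old center when a cluster becomes empty is one possible convention among several.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(data, centers, max_iter=300):
    """Run Lloyd iterations from the given initial centers.

    Returns (centers, labels, inertia), where inertia is the sum of
    squared distances from each observation to its assigned center.
    """
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every observation.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(row, centers[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(row, centers[lab]) for row, lab in zip(data, labels))
    return centers, labels, inertia
```

Each pass costs on the order of k times n distance evaluations, which is where the k n T average complexity quoted in these notes comes from.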

The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations.

The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can converge to a local minimum, so it can be useful to restart it several times.
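
The restart strategy can be sketched as follows: run the algorithm from several random starting points and keep the run with the lowest inertia. This is a standard-library illustration on one-dimensional data, not the library's code. The names kmeans_1d and best_of are hypothetical, and the n_init argument is named after the constructor parameter on the assumption that it controls the number of restarts.

```python
import random


def kmeans_1d(xs, k, rng, max_iter=300):
    """One Lloyd run on 1-D data, seeded with k randomly chosen observations."""
    centers = rng.sample(xs, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center (an assumed convention).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, inertia


def best_of(xs, k, n_init=10, seed=0):
    """Repeat the run n_init times and return the (centers, inertia) pair
    with the smallest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_1d(xs, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])
```

Because each run starts from a different random draw, a poor start in one run does not determine the final answer; the inertia serves as the criterion for choosing among runs.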

Attributes

cluster_centers_: array, [n_clusters, n_features] Coordinates of cluster centers
labels_: Labels of each point
inertia_: float The value of the inertia criterion associated with the chosen partition.

Methods

fit(X): Compute K-Means clustering
__init__(k=8, init='random', n_init=10, max_iter=300)
fit(X, **params)

Compute k-means