.. _example_cluster_plot_kmeans_silhouette_analysis.py:


===============================================================================
Selecting the number of clusters with silhouette analysis on KMeans clustering
===============================================================================

Silhouette analysis can be used to study the separation distance between the
resulting clusters. The silhouette plot displays a measure of how close each
point in one cluster is to points in the neighboring clusters and thus provides
a way to assess parameters like number of clusters visually. This measure has a
range of [-1, 1].

Silhoette coefficients (as these values are referred to as) near +1 indicate
that the sample is far away from the neighboring clusters. A value of 0
indicates that the sample is on or very close to the decision boundary between
two neighboring clusters and negative values indicate that those samples might
have been assigned to the wrong cluster.

In this example the silhouette analysis is used to choose an optimal value for
``n_clusters``. The silhouette plot shows that the ``n_clusters`` value of 3, 5
and 6 are a bad pick for the given data due to the presence of clusters with
below average silhouette scores and also due to wide fluctuations in the size
of the silhouette plots. Silhouette analysis is more ambivalent in deciding
between 2 and 4.

Also from the thickness of the silhouette plot the cluster size can be
visualized. The silhouette plot for cluster 0 when ``n_clusters`` is equal to
2, is bigger in size owing to the grouping of the 3 sub clusters into one big
cluster. However when the ``n_clusters`` is equal to 4, all the plots are more
or less of similar thickness and hence are of similar sizes as can be also
verified from the labelled scatter plot on the right.



.. rst-class:: horizontal


    *

      .. image:: images/plot_kmeans_silhouette_analysis_001.png
            :scale: 47

    *

      .. image:: images/plot_kmeans_silhouette_analysis_002.png
            :scale: 47

    *

      .. image:: images/plot_kmeans_silhouette_analysis_003.png
            :scale: 47

    *

      .. image:: images/plot_kmeans_silhouette_analysis_004.png
            :scale: 47

    *

      .. image:: images/plot_kmeans_silhouette_analysis_005.png
            :scale: 47


**Script output**::

  For n_clusters = 2 The average silhouette_score is : 0.704978749608
  For n_clusters = 3 The average silhouette_score is : 0.588200401213
  For n_clusters = 4 The average silhouette_score is : 0.650518663273
  For n_clusters = 5 The average silhouette_score is : 0.563764690262
  For n_clusters = 6 The average silhouette_score is : 0.450301208266



**Python source code:** :download:`plot_kmeans_silhouette_analysis.py <plot_kmeans_silhouette_analysis.py>`

.. literalinclude:: plot_kmeans_silhouette_analysis.py
    :lines: 32-

**Total running time of the example:**  1.82 seconds
( 0 minutes  1.82 seconds)