Quick Example of Plotting Dendrograms with scipy-cluster

Author: Damian Eads
Authored: November 17, 2007
Revised: May 24, 2008

This tutorial uses the Iris data set, which first appeared in a paper on optimal discriminants for Gaussian data [1]. The data set contains four measurements (sepal length, sepal width, petal length, and petal width) for each of 150 collected flower specimens. Three species are represented: Iris Setosa, Iris Versicolour, and Iris Virginica. Rather than using Fisher's Discriminant Analysis to classify the data, we will use hcluster to cluster it. Ideally, flowers of the same species should cluster closely together, and flowers of different species should lie farther apart. Attaining a good clustering requires careful consideration of the distance metric. Since the purpose of this document is to demonstrate the library in action, we give only cursory consideration to that choice.

What to do

First, import the hcluster module. If this software package is integrated into SciPy proper, the module name will change. We load Fisher's flower data set using matplotlib's load command. The pdist command then computes the standardized Euclidean distance between each pair of flower specimens. Next, we use the single linkage algorithm to build the agglomerative clustering.

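# load, title, xlabel, ylabel, and legend come from matplotlib's pylab interface
from pylab import *
from hcluster import *
X = load('iris.txt')            # one row of measurements per specimen
Y = pdist(X, 'seuclidean')      # condensed matrix of pairwise distances
Z = linkage(Y, 'single')        # single-linkage hierarchy
dendrogram(Z, color_threshold=0)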

This yields the following dendrogram plot.
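To make the distance step concrete, here is a minimal pure-Python sketch of what a standardized Euclidean pairwise distance computes. The function name seuclidean_pdist is hypothetical, and scaling each measurement by its sample variance (dividing by n - 1) is an assumption, not something confirmed for hcluster. The result is listed in condensed form, one entry per unordered pair, so 150 specimens would give 150*149/2 = 11175 distances.

```python
from itertools import combinations
from math import sqrt

def seuclidean_pdist(X):
    """Condensed list of standardized Euclidean distances between rows of X.

    Assumption: each column is scaled by its sample variance (n - 1 divisor).
    """
    n, dims = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(dims)]
    variances = [sum((row[k] - means[k]) ** 2 for row in X) / (n - 1)
                 for k in range(dims)]
    # combinations() yields pairs in the order (0,1), (0,2), ..., (n-2,n-1)
    return [sqrt(sum((u[k] - v[k]) ** 2 / variances[k] for k in range(dims)))
            for u, v in combinations(X, 2)]

# Toy data standing in for the flower measurements (not real Iris values)
toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
toy_distances = seuclidean_pdist(toy)
print(toy_distances)
```

Dividing by the variance keeps a measurement with a large numeric range from dominating the distance, which is the usual reason to prefer this metric over plain Euclidean distance when columns have different spreads.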

Color Thresholds

Now that we have a dendrogram to inspect, let's find a suitable color threshold. This time we use centroid linkage. Some linkage methods (centroid, ward, and median) do not take condensed distance matrices as arguments but instead require the raw observations as input, so X is passed rather than Y. Cutting the tree at 1.8 forms three clusters. Let's plot another dendrogram using this color threshold. The legend shows the membership of each of the flat clusters formed by the cut.

Z=linkage(X, 'centroid')
dendrogram(Z, color_threshold=1.8)
title('Sir Ronald Fisher\'s Iris Data Set')
xlabel('Flower Specimen Number')
ylabel('Distance')
legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))

The dendrogram plot is shown below.
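The idea of cutting the tree at a height to obtain flat clusters can be sketched in a few lines of plain Python. This is an illustration and not the library's own routine: the function name cut_tree is hypothetical, the linkage rows are assumed to have the layout (left, right, distance, count) with merged clusters numbered from n upward, and merge heights are assumed to increase monotonically (centroid linkage does not always guarantee this). The toy linkage below is made up for the example.

```python
def cut_tree(Z, n, threshold):
    """Assign a flat cluster label to each of n observations.

    Merges whose height is below the threshold are kept; higher merges are cut.
    """
    parent = list(range(n + len(Z)))   # union-find forest over all tree nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for k, (a, b, dist, _count) in enumerate(Z):
        node = n + k                   # id of the cluster created by row k
        if dist < threshold:
            parent[find(int(a))] = node
            parent[find(int(b))] = node

    labels = {}
    # Number the surviving roots in order of first appearance
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

# Hypothetical linkage for four observations: two tight pairs joined late
toy_Z = [(0, 1, 1.0, 2), (2, 3, 1.5, 2), (4, 5, 4.0, 4)]
print(cut_tree(toy_Z, 4, 2.0))   # [0, 0, 1, 1]
```

Raising the threshold above the top merge height yields a single cluster, and lowering it below the first merge leaves every observation on its own.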

Using Complete Linkage

For comparison, we show how the dendrogram for complete linkage differs from the one derived from single linkage. A cutoff threshold of 2.3 is chosen by visual inspection.

Z=linkage(Y, 'complete')
dendrogram(Z, color_threshold=2.3)
title('Sir Ronald Fisher\'s Iris Data Set')
xlabel('Flower Specimen Number')
ylabel('Distance')
legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))

The dendrogram resulting from the program snippet above is shown below.
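The difference between the two linkage rules can be seen on a tiny example. The sketch below is a naive agglomerative loop written for illustration only; naive_linkage is a hypothetical name, it works on 1-D points, and it is far slower than a real implementation. Single linkage scores a pair of clusters by their closest members, complete linkage by their farthest members. The output rows follow the assumed layout (left, right, distance, count).

```python
from itertools import combinations

def naive_linkage(points, method="single"):
    """Agglomerative clustering of 1-D points by brute force."""
    combine = min if method == "single" else max
    clusters = {i: [i] for i in range(len(points))}   # cluster id -> members
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Inter-cluster distance: nearest (single) or farthest (complete) pair
            d = combine(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d, len(clusters[next_id])))
        next_id += 1
    return merges

pts = [0, 1, 3, 7]
print(naive_linkage(pts, "single"))    # [(0, 1, 1, 2), (2, 4, 2, 3), (3, 5, 4, 4)]
print(naive_linkage(pts, "complete"))  # [(0, 1, 1, 2), (2, 4, 3, 3), (3, 5, 7, 4)]
```

The merge order is the same here, but the complete linkage heights are larger, which is one reason a different cutoff threshold is appropriate for each method.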

Using Level Truncation

The number of specimens in the data set is large enough that the dendrogram looks cluttered. Truncation is used to condense the dendrogram.

dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True)

The truncate_mode parameter tells the dendrogram plotting routine which type of truncation to perform. When it is set to 'level' along with p, no more than p levels of the dendrogram tree are displayed. If a non-leaf node lies beyond this level threshold, it and its descendants are contracted into a single node. The show_contracted=True setting plots a marker for each non-singleton cluster that was contracted, placed along the link of contraction. The height of each marker is the distance between that contracted node's descendants.

The contracted dendrogram with contraction markers is shown below.

Contracted leaf nodes are labeled with a number in parentheses, which represents the total number of leaf nodes belonging to the non-singleton cluster represented by the contracted link.
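Level truncation and the parenthesized leaf counts can be sketched as a simple tree walk. This is a hedged illustration and not the plotting routine itself: truncated_leaves is a hypothetical name, the linkage rows are assumed to have the layout (left, right, distance, count), and the convention that the root sits at level 0 is an assumption, so the library may count levels slightly differently.

```python
def truncated_leaves(Z, n, p):
    """Left-to-right leaf labels after contracting nodes deeper than p levels."""
    labels = []

    def visit(node, level):
        if node < n:                        # an original observation
            labels.append(str(node))
            return
        left, right, _dist, count = Z[node - n]
        if level > p:                       # contract the whole subtree
            labels.append("(%d)" % count)   # count of leaves it contains
            return
        visit(int(left), level + 1)
        visit(int(right), level + 1)

    visit(n + len(Z) - 1, 0)                # the last merge is the root
    return labels

# Hypothetical linkage for four observations
demo_Z = [(0, 1, 1.0, 2), (2, 4, 2.0, 3), (3, 5, 4.0, 4)]
print(truncated_leaves(demo_Z, 4, 1))   # ['3', '2', '(2)']
print(truncated_leaves(demo_Z, 4, 0))   # ['3', '(3)']
```

With a large enough p nothing is contracted and every observation appears as its own leaf.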

We can change the orientation of the dendrogram with the orientation parameter.

dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True, orientation='left')

Download

See my software page for more information on downloading the package used for this example.

Documentation

See the API documentation for reference on how to use each function in the scipy-cluster package.

References

[1] Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7(2): 179-188. 1936.