Quick Example of Plotting Dendrograms with scipy-cluster
Author: Damian Eads
Authored: November 17, 2007
Revised: May 24, 2008
This tutorial uses the Iris data set, which first appeared in a paper on optimal discriminants for Gaussian data [1]. The data set contains four measurements (sepal length, sepal width, petal length, and petal width) for each of 150 collected flower specimens. Three species are represented: Iris Setosa, Iris Versicolour, and Iris Virginica. Rather than using Fisher's Discriminant Analysis to classify the data, we will use hcluster to analyze it. Ideally, pairs of flowers of the same species should cluster closely together, and flowers of different species should lie farther apart. Attaining a good clustering requires careful consideration of the distance metric. Since the purpose of this document is to demonstrate the library in action, we give only cursory consideration to the choice of distance metric.
What to do
First, import the hcluster module. If and when this software package is
integrated into SciPy proper, the module name will change. We load Fisher's
flower data set using matplotlib's load command. The pdist command then
computes the standardized Euclidean distance between each pair of flower
specimens. Next, we use the single linkage algorithm to build the
agglomerative clustering.
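from pylab import *          # matplotlib interface, assumed source of load, title, xlabel, ylabel, legend
from hcluster import *
X = load('iris.txt')         # the Iris measurements
Y = pdist(X, 'seuclidean')   # pairwise standardized Euclidean distances
Z = linkage(Y, 'single')     # single linkage agglomerative clustering
dendrogram(Z, color_threshold=0)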
This yields the following dendrogram plot.
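To make the distance step more concrete, here is a minimal pure-Python sketch of what a standardized Euclidean pairwise distance computation might look like. It is not the library's implementation. The function name seuclidean_pdist and the toy data are hypothetical, and the use of the per-feature sample variance (dividing by n - 1) and of the row-by-row pair ordering (0,1), (0,2), ..., (n-2,n-1) are assumptions of this sketch.

```python
import math

def seuclidean_pdist(X):
    """Return pairwise standardized Euclidean distances between rows of X.

    The result is a flat list holding one distance per pair (i, j), i < j,
    ordered (0,1), (0,2), ..., (n-2,n-1). This ordering is an assumption.
    """
    n = len(X)
    dims = len(X[0])
    # Per-feature mean and sample variance (dividing by n - 1 is an assumption).
    means = [sum(row[k] for row in X) / n for k in range(dims)]
    variances = [
        sum((row[k] - means[k]) ** 2 for row in X) / (n - 1)
        for k in range(dims)
    ]
    condensed = []
    for i in range(n):
        for j in range(i + 1, n):
            # Each squared feature difference is scaled by that feature's variance.
            d2 = sum(
                (X[i][k] - X[j][k]) ** 2 / variances[k] for k in range(dims)
            )
            condensed.append(math.sqrt(d2))
    return condensed

# Toy data: three points with two features each (not the Iris data).
toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
toy_distances = seuclidean_pdist(toy)
```

For n observations the list holds n(n-1)/2 distances, so the 150 Iris specimens would give 11175 entries. Scaling by the variance keeps a feature with a large numeric range from dominating the distance.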
Color Thresholds
Some linkage methods (centroid, ward, and median) do not take condensed distance matrices as arguments but instead require the raw observations as input. Now that we have a dendrogram to inspect, let's find a suitable color threshold. Cutting the tree at 1.8 forms three clusters. Let's plot another dendrogram using this color threshold. The legend shows the membership of each of the flat clusters formed by the cut.
Z=linkage(X, 'centroid')
dendrogram(Z, color_threshold=1.8)
title('Sir Ronald Fisher\'s Iris Data Set')
xlabel('Flower Specimen Number')
ylabel('Distance')
legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))
The dendrogram plot is shown below.
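The idea of cutting the tree at a threshold to form flat clusters can be sketched in a few lines of pure Python. This is an illustration only, not the library's routine. It assumes the linkage result is a list of rows (a, b, height, count), where row i merges clusters a and b into a new cluster numbered n + i, and it assumes merge heights never decrease up the tree, which may not hold for centroid or median linkage. The name cut_tree and the toy linkage below are hypothetical.

```python
def cut_tree(Z, n, threshold):
    """Label n observations by applying every merge with height <= threshold.

    Z is assumed to be a list of rows (a, b, height, count); row i creates
    cluster n + i from clusters a and b.
    """
    # One slot per original observation and per merged cluster.
    parent = list(range(n + len(Z)))

    def find(i):
        # Follow parent links to the current top-level cluster of i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for step, (a, b, height, _count) in enumerate(Z):
        if height <= threshold:
            new_id = n + step
            parent[find(int(a))] = new_id
            parent[find(int(b))] = new_id

    # Number the flat clusters 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(leaf), len(roots) + 1) for leaf in range(n)]

# Hypothetical linkage for five observations (not derived from the Iris data).
toy_Z = [
    [0, 1, 0.5, 2],  # creates cluster 5
    [2, 3, 0.7, 2],  # creates cluster 6
    [5, 6, 1.5, 4],  # creates cluster 7
    [4, 7, 3.0, 5],  # creates cluster 8
]
labels_low = cut_tree(toy_Z, 5, 1.0)
labels_high = cut_tree(toy_Z, 5, 2.0)
```

With the toy linkage, a cut at 1.0 leaves three flat clusters and a cut at 2.0 leaves two, which mirrors how raising or lowering the color threshold changes the number of colored groups.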
Using Complete Linkage
For comparison, we show how the dendrogram plot for complete linkage differs from the dendrogram derived from single linkage. A cutoff threshold of 2.3 is chosen by visual inspection.
Z=linkage(Y, 'complete')
dendrogram(Z, color_threshold=2.3)
title('Sir Ronald Fisher\'s Iris Data Set')
xlabel('Flower Specimen Number')
ylabel('Distance')
legend(('Iris Setosa', 'Iris Virginica', 'Iris Versicolour'))
The dendrogram resulting from the program snippet above is shown below.
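The difference between the two methods comes down to how the distance between two clusters is defined: single linkage uses the closest pair of members, while complete linkage uses the farthest pair. The following sketch illustrates this on toy one-dimensional data. The function cluster_distance is hypothetical and is not part of the library.

```python
def cluster_distance(A, B, dist, method):
    """Distance between clusters A and B under single or complete linkage."""
    pair_distances = [dist(a, b) for a in A for b in B]
    if method == 'single':
        return min(pair_distances)   # closest pair of members
    if method == 'complete':
        return max(pair_distances)   # farthest pair of members
    raise ValueError('unsupported method: %r' % (method,))

def absolute_difference(a, b):
    # Distance between two points on a line.
    return abs(a - b)

# Toy one-dimensional clusters (not the Iris data).
A = [0.0, 1.0]
B = [3.0, 6.0]
single = cluster_distance(A, B, absolute_difference, 'single')
complete = cluster_distance(A, B, absolute_difference, 'complete')
```

Because complete linkage merges on the farthest pair, its merge heights are at least as large as those of single linkage on the same clusters, which is one reason a different cutoff is chosen for each dendrogram.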
Using Level Truncation
The number of specimens in the data set is large enough that the dendrogram looks cluttered. Truncation is used to condense the dendrogram.
dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True)
The truncate_mode parameter tells the dendrogram plotting routine
the type of truncation to perform. When set to 'level' along with p, no more
than p levels of the dendrogram tree are displayed. If a non-leaf node
is beyond this level threshold, it and its descendants are contracted into a
single node. The show_contracted=True parameter specification
plots a marker for each non-singleton cluster contracted along the link of
contraction. The height of the marker is the distance between the contracted
node's descendants.
The contracted dendrogram with contraction markers is shown below.
Contracted leaf nodes are labeled with a number in parentheses, which represents the total number of leaf nodes belonging to the non-singleton cluster represented by the contracted link.
dendrogram(Z, color_threshold=0, truncate_mode='level', p=3, show_contracted=True, orientation='left')
We can change the orientation of the dendrogram with the orientation parameter.
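The level truncation described earlier in this section can be approximated with a short recursive walk over the tree. This is a sketch of one plausible reading of the description, not the plotting routine itself. It assumes the same (a, b, height, count) row layout for the linkage result, and counting the root as level 0 is an assumption that should be checked against the API documentation. The name truncate_by_level and the toy linkage are hypothetical.

```python
def truncate_by_level(Z, n, p):
    """Return the leaf labels left after contracting nodes deeper than level p.

    Z is assumed to be a list of rows (a, b, height, count); row i creates
    cluster n + i. Original observations are labeled with their index, and
    contracted clusters with their leaf count in parentheses.
    """
    labels = []

    def visit(node, level):
        if node < n:
            labels.append(str(node))          # an original observation
        elif level > p:
            count = int(Z[node - n][3])       # leaves under this cluster
            labels.append('(%d)' % count)     # contracted into a single node
        else:
            left, right = int(Z[node - n][0]), int(Z[node - n][1])
            visit(left, level + 1)
            visit(right, level + 1)

    root = n + len(Z) - 1                     # the last merge is the root
    visit(root, 0)
    return labels

# Hypothetical linkage for five observations (not derived from the Iris data).
toy_Z = [
    [0, 1, 0.5, 2],  # creates cluster 5
    [2, 3, 0.7, 2],  # creates cluster 6
    [5, 6, 1.5, 4],  # creates cluster 7
    [4, 7, 3.0, 5],  # creates cluster 8
]
shallow = truncate_by_level(toy_Z, 5, 0)
deeper = truncate_by_level(toy_Z, 5, 1)
```

In the toy example, truncating at level 0 leaves observation 4 and one contracted node of four leaves, while level 1 splits that node into two contracted pairs.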
Download
See my software page for more information on downloading the package used for this example.
Documentation
See the API documentation for reference on how to use each function in the scipy-cluster package.
References
[1] Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7(2): 179-188. 1936.




