| | |
- Z_y_correspond(Z, Y)
- yesno = Z_y_correspond(Z, Y)
Returns True if a linkage matrix Z and condensed distance matrix
Y could possibly correspond to one another. They must have the same
number of original observations. This function is useful as a sanity
check in algorithms that make extensive use of linkage and distance
matrices that must correspond to the same set of original observations.
- average(y)
- Z = average(y)
Performs average/UPGMA linkage on the condensed distance matrix Z. See
linkage for more information on the return structure and algorithm.
(a condensed alias for linkage)
- braycurtis(u, v)
- d = braycurtis(u, v)
Computes the Bray-Curtis distance between two n-vectors u and v,
\sum{|u_i-v_i|} / \sum{|u_i+v_i|}.
- canberra(u, v)
- d = canberra(u, v)
Computes the Canberra distance between two n-vectors u and v,
\sum{|u_i-v_i|} / \sum{|u_i|+|v_i}.
- centroid(y)
- Z = centroid(y)
Performs centroid/UPGMC linkage on the condensed distance matrix Z.
See linkage for more information on the return structure and
algorithm.
(a condensed alias for linkage)
Z = centroid(X)
Performs centroid/UPGMC linkage on the observation matrix X using
Euclidean distance as the distance metric. See linkage for more
information on the return structure and algorithm.
- chebyshev(u, v)
- d = chebyshev(u, v)
Computes the Chebyshev distance between two n-vectors u and v,
\max {|u_i-v_i|}.
- cityblock(u, v)
- d = cityblock(u, v)
Computes the Manhattan distance between two n-vectors u and v,
\sum {u_i-v_i}.
- complete(y)
- Z = complete(y)
Performs complete complete/max/farthest point linkage on the
condensed distance matrix Z. See linkage for more information
on the return structure and algorithm.
(a condensed alias for linkage)
- cophenet(*args, **kwargs)
- d = cophenet(Z)
Calculates the cophenetic distances between each observation in the
hierarchical clustering defined by the linkage Z.
Suppose p and q are original observations in disjoint clusters
s and t, respectively and s and t are joined by a direct parent
cluster u. The cophenetic distance between observations i and j
is simply the distance between clusters s and t.
d is cophenetic distance matrix in condensed form. The ij'th
entry is the cophenetic distance between original observations
i and j.
c = cophenet(Z, Y)
Calculates the cophenetic correlation coefficient c of a hierarchical
clustering Z of a set of n observations in m dimensions. Y is the
condensed distance matrix from which Z was generated.
(c, d) = cophenet(Z, Y, [])
Also returns the cophenetic distance matrix in condensed form.
- copying()
- Displays the license for this package.
- correlation(u, v)
- d = correlation(u, v)
Computes the correlation distance between two n-vectors u and v,
1 - (u - n|u|_1)(v - n|v|_1)^T
--------------------------------- ,
|(u - n|u|_1)|_2 |(v - n|v|_1)|^T
where |*|_1 is the Manhattan norm and n is the common dimensionality
of the vectors.
- cosine(u, v)
- d = cosine(u, v)
Computes the Cosine distance between two n-vectors u and v,
(1-uv^T)/(||u||_2 * ||v||_2).
- dendrogram(Z, p=30, truncate_mode=None, colorthreshold=None, get_leaves=True, orientation='top', labels=None, count_sort=False, distance_sort=False, show_leaf_counts=True, no_plot=False, no_labels=False, color_list=None, leaf_font_size=None, leaf_rotation=None, leaf_label_func=None, no_leaves=False, show_contracted=False, link_color_func=None)
- R = dendrogram(Z)
Plots the hiearchical clustering defined by the linkage Z as a
dendrogram. The dendrogram illustrates how each cluster is
composed by drawing a U-shaped link between a non-singleton
cluster and its children. The height of the top of the U-link
is the distance between its children clusters. It is also the
cophenetic distance between original observations in the
two children clusters. It is expected that the distances in
Z[:,2] be monotonic, otherwise crossings appear in the
dendrogram.
R is a dictionary of the data structures computed to render the
dendrogram. Its keys are:
'icoords': a list of lists [I1, I2, ..., Ip] where Ik is a
list of 4 independent variable coordinates corresponding to
the line that represents the k'th link painted.
'dcoords': a list of lists [I2, I2, ..., Ip] where Ik is a
list of 4 independent variable coordinates corresponding to
the line that represents the k'th link painted.
'ivl': a list of labels corresponding to the leaf nodes
R = dendrogram(..., truncate_mode, p)
The dendrogram can be hard to read when the original observation
matrix from which the linkage is derived is large. Truncation
is used to condense the dendrogram. There are several modes:
* None/'none': no truncation is performed
* 'lastp': the last p non-singleton formed in the linkage are
the only non-leaf nodes in the linkage; they correspond to
to rows Z[n-p-2:end] in Z. All other non-singleton clusters
are contracted into leaf nodes.
* 'mlab': This corresponds to MATLAB(TM) behavior. (not implemented yet)
* 'level'/'mtica': no more than p levels of the dendrogram tree
are displayed. This corresponds to Mathematica(TM) behavior.
R = dendrogram(..., colorthreshold=t)
Colors all the descendent links below a cluster node k the same color
if k is the first node below the cut threshold t. All links connecting
nodes with distances greater than or equal to the threshold are
colored blue. If t is less than or equal to zero, all nodes
are colored blue. If t is None or 'default', corresponding with
MATLAB(TM) behavior, the threshold is set to 0.7*max(Z[:,2]).
R = dendrogram(..., get_leaves=True)
Includes a list R['leaves']=H in the result dictionary. For each i,
H[i] == j, cluster node j appears in the i'th position in the
left-to-right traversal of the leaves, where j < 2n-1 and i < n.
R = dendrogram(..., orientation)
Plots the dendrogram in a particular direction. The orientation
parameter can be any of:
* 'top': plots the root at the top, and plot descendent
links going downwards. (default).
* 'bottom': plots the root at the bottom, and plot descendent
links going upwards.
* 'left': plots the root at the left, and plot descendent
links going right.
* 'right': plots the root at the right, and plot descendent
links going left.
R = dendrogram(..., labels=None)
The labels parameter is a n-sized list (or tuple). The labels[i]
value is the text to put under the i'th leaf node only if it
corresponds to an original observation and not a non-singleton
cluster.
When labels=None, the index of the original observation is used
used.
R = dendrogram(..., count_sort)
When plotting a cluster node and its directly descendent links,
the order the two descendent links and their descendents are
plotted is determined by the count_sort parameter. Valid values
of count_sort are:
* False: nothing is done.
* 'ascending'/True: the child with the minimum number of
original objects in its cluster is plotted first.
* 'descendent': the child with the maximum number of
original objects in its cluster is plotted first.
R = dendrogram(..., distance_sort)
When plotting a cluster node and its directly descendent links,
the order the two descendent links and their descendents are
plotted is determined by the distance_sort parameter. Valid
values of count_sort are:
* False: nothing is done.
* 'ascending'/True: the child with the minimum distance
between its direct descendents is plotted first.
* 'descending': the child with the maximum distance
between its direct descendents is plotted first.
Note that either count_sort or distance_sort must be False.
R = dendrogram(..., show_leaf_counts)
When show_leaf_counts=True, leaf nodes representing k>1
original observation are labeled with the number of observations
they contain in parentheses.
R = dendrogram(..., no_plot)
When no_plot=True, the final rendering is not performed. This is
useful if only the data structures computed for the rendering
are needed or if matplotlib is not available.
R = dendrogram(..., no_labels)
When no_labels=True, no labels appear next to the leaf nodes in
the rendering of the dendrogram.
R = dendrogram(..., leaf_label_rotation):
Specifies the angle to which the leaf labels are rotated. When
unspecified, the rotation based on the number of nodes in the
dendrogram.
R = dendrogram(..., leaf_font_size):
Specifies the font size in points of the leaf labels. When
unspecified, the size based on the number of nodes
in the dendrogram.
R = dendrogram(..., leaf_label_func)
When a callable function is passed, leaf_label_func is passed
cluster index k, and returns a string with the label for the
leaf.
Indices k < n correspond to original observations while indices
k >= n correspond to non-singleton clusters.
For example, to label singletons with their node id and
non-singletons with their id, count, and inconsistency coefficient,
we simply do
# First define the leaf label function.
llf = lambda id:
if id < n:
return str(id)
else:
return '[%d %d %1.2f]' % (id, count, R[n-id,3])
# The text for the leaf nodes is going to be big so force
# a rotation of 90 degrees.
dendrogram(Z, leaf_label_func=llf, leaf_rotation=90)
R = dendrogram(..., show_contracted=True)
The heights of non-singleton nodes contracted into a leaf node
are plotted as crosses along the link connecting that leaf node.
This feature is only useful when truncation is used.
R = dendrogram(..., link_color_func)
When a link is painted, the function link_color_function is
called with the non-singleton id. This function is
expected to return a matplotlib color string, which represents
the color to paint the link.
For example:
dendrogram(Z, link_color_func=lambda k: colors[k])
colors the direct links below each untruncated non-singleton node
k using colors[k].
- dice(u, v)
- d = dice(u, v)
Computes the Dice dissimilarity between two boolean n-vectors
u and v, which is
c_{TF} + c_{FT}
----------------------------
2 * c_{TT} + c_{FT} + c_{TF}
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n.
- euclidean(u, v)
- d = euclidean(u, v)
Computes the Euclidean distance between two n-vectors u and v, ||u-v||_2
- fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
- T = fcluster(Z, t, criterion, depth=2, R=None, monocrit=None):
Forms flat clusters from the hierarchical clustering defined by
the linkage matrix Z. The threshold t is a required parameter.
T is a vector of length n; T[i] is the flat cluster number to which
original observation i belongs.
The criterion parameter can be any of the following values,
* 'inconsistent': If a cluster node and all its decendents have an
inconsistent value less than or equal to c then all its leaf
descendents belong to the same flat cluster. When no non-singleton
cluster meets this criterion, every node is assigned to its
own cluster. The depth parameter is the maximum depth to perform
the inconsistency calculation; it has no meaning for the other
criteria.
* 'distance': Forms flat clusters so that the original
observations in each flat cluster have no greater a cophenetic
distance than t.
* 'maxclust': Finds a minimum threshold r so that the cophenetic
distance between any two original observations in the same flat
cluster is no more than r and no more than t flat clusters are
formed.
* 'monocrit': Forms a flat cluster from a cluster node c with
index i when monocrit[j] <= t. monocrit must be monotonic.
monocrit is a (n-1) numpy vector of doubles; monocrit[i] is
the criterion upon which non-singleton i is thresholded. The
monocrit vector must be monotonic, i.e. given a node c with
index i, for all node indices j corresponding to nodes below c,
monocrit[i] >= monocrit[j].
For example, to threshold on the maximum mean distance as computed
in the inconsistency matrix R with a threshold of 0.8 do
MR = maxRstat(Z, R, 3)
cluster(Z, t=0.8, criterion='monocrit', monocrit=MR)
* 'maxclust_monocrit': Forms a flat cluster from a non-singleton
cluster node c when monocrit[i] <= r for all cluster indices i below
and including c. r is minimized such that no more than t flat clusters
are formed. monocrit must be monotonic.
For example, to minimize the threshold t on maximum inconsistency
values so that no more than 3 flat clusters are formed, do:
MI = maxinconsts(Z, R)
cluster(Z, t=3, criterion='maxclust_monocrit', monocrit=MI)
- fclusterdata(X, t, criterion='inconsistent', linkage='single', distance='euclid', d=2)
- T = fclusterdata(X, t)
Clusters the original observations in the n by m data matrix X
(n observations in m dimensions), using the euclidean distance
metric to calculate distances between original observations,
performs hierarchical clustering using the single linkage
algorithm, and forms flat clusters using the inconsistency
method with t as the cut-off threshold.
A one-dimensional numpy array T of length n is returned. T[i]
is the index of the flat cluster to which the original
observation i belongs.
T = fclusterdata(X, t, criterion='inconsistent', linkage='single',
dist='euclid', depth=2, R=None)
Clusters the original observations in the n by m data matrix X using
the thresholding criterion, linkage method, and distance metric
specified.
Named parameters are described below.
criterion: specifies the criterion for forming flat clusters.
Valid values are 'inconsistent', 'distance', or
'maxclust' cluster formation algorithms. See
cluster for descriptions.
lmethod: the linkage method to use. See linkage for
descriptions.
dmethod: the distance metric for calculating pairwise
distances. See pdist for descriptions and
linkage to verify compatibility with the linkage
method.
t: the cut-off threshold for the cluster function.
depth: the maximum depth for the inconsistency calculation.
See inconsistent for more information.
R: the inconsistency matrix. It will be computed if
necessary if it is not passed.
This function is similar to MATLAB(TM) clusterdata function.
- from_mlab_linkage(Z)
- Z2 = from_mlab_linkage(Z)
Converts a linkage matrix Z generated by MATLAB(TM) to a new linkage
matrix Z2 compatible with this module. The conversion does two
things:
* the indices are converted from 1..N to 0..(N-1) form, and
* a fourth column Z[:,3] is added where Z[i,3] is equal to
the number of original observations (leaves) in the non-singleton
cluster i.
- hamming(u, v)
- d = hamming(u, v)
Computes the Hamming distance between two n-vectors u and v,
which is simply the proportion of disagreeing components in u
and v. If u and v are boolean vectors, the hamming distance is
(c_{01} + c_{10}) / n
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n.
- inconsistent(Z, d=2)
- R = inconsistent(Z, d=2)
Calculates statistics on links up to d levels below each
non-singleton cluster defined in the (n-1)x4 linkage matrix Z.
R is a (n-1)x5 matrix where the i'th row contains the link
statistics for the non-singleton cluster i. The link statistics
are computed over the link heights for links d levels below the
cluster i. R[i,0] and R[i,1] are the mean and standard deviation of
the link heights, respectively; R[i,2] is the number of links
included in the calculation; and R[i,3] is the inconsistency
coefficient, (Z[i, 2]-R[i,0])/R[i,2].
This function behaves similarly to the MATLAB(TM) inconsistent
function.
- is_isomorphic(T1, T2)
- Returns True iff two different cluster assignments T1 and T2 are
equivalent. T1 and T2 must be arrays of the same size.
- is_monotonic(Z)
- is_monotonic(Z)
Returns True if the linkage Z is monotonic. The linkage is monotonic
if for every cluster s and t joined, the distance between them is
no less than the distance between any previously joined clusters.
- is_valid_dm(D, t=0.0)
- is_valid_dm(D)
Returns True if the variable D passed is a valid distance matrix.
Distance matrices must be 2-dimensional numpy arrays containing
doubles. They must have a zero-diagonal, and they must be symmetric.
is_valid_dm(D, t)
Returns True if the variable D passed is a valid distance matrix.
Small numerical differences in D and D.T and non-zeroness of the
diagonal are ignored if they are within the tolerance specified
by t.
is_valid_dm(..., warning=True, name='V')
Invokes a warning if the variable passed is not a valid distance matrix.
The warning message explains why the distance matrix is not valid. 'name'
is used when referencing the offending variable.
is_valid_dm(..., throw=True, name='V')
Throws an exception if the varible passed is not valid. The message
explains why the variable is not valid. 'name' is used when referencing
the offending variable.
- is_valid_im(R, warning=False, throw=False, name=None)
- is_valid_im(R)
Returns True if the inconsistency matrix passed is valid. It must
be a n by 4 numpy array of doubles. The standard deviations R[:,1]
must be nonnegative. The link counts R[:,2] must be positive and
no greater than n-1.
- is_valid_linkage(Z, warning=False, throw=False, name=None)
- is_valid_linkage(Z, t)
Returns True if Z is a valid linkage matrix. The variable must
be a 2-dimensional double numpy array with n rows and 4 columns.
The first two columns must contain indices between 0 and 2n-1. For a
given row i, 0 <= Z[i,0] <= i+n-1 and 0 <= Z[i,1] <= i+n-1 (i.e.
a cluster cannot join another cluster unless the cluster being joined
has been generated.)
is_valid_linkage(..., warning=True, name='V')
Invokes a warning if the variable passed is not a valid linkage. The message
explains why the distance matrix is not valid. 'name' is used when referencing
the offending variable.
is_valid_linkage(..., throw=True, name='V')
Throws an exception if the variable passed is not a valid linkage. The message
explains why variable is not valid. 'name' is used when referencing the offending
variable.
- is_valid_y(y, warning=False, throw=False, name=None)
- is_valid_y(y)
Returns True if the variable y passed is a valid condensed
distance matrix. Condensed distance matrices must be
1-dimensional numpy arrays containing doubles. Their length
must be a binomial coefficient {n \choose 2} for some positive
integer n.
is_valid_y(..., warning=True, name='V')
Invokes a warning if the variable passed is not a valid condensed distance
matrix. The warning message explains why the distance matrix is not valid.
'name' is used when referencing the offending variable.
is_valid_y(..., throw=True, name='V')
Throws an exception if the variable passed is not a valid condensed distance
matrix. The message explains why variable is not valid. 'name' is used when
referencing the offending variable.
- jaccard(u, v)
- d = jaccard(u, v)
Computes the Jaccard-Needham dissimilarity between two boolean
n-vectors u and v, which is
c_{TF} + c_{FT}
------------------------
c_{TT} + c_{FT} + c_{TF}
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n.
- kulsinski(u, v)
- d = kulsinski(u, v)
Computes the Kulsinski dissimilarity between two boolean n-vectors
u and v, which is
c_{TF} + c_{FT} - c_{TT} + n
----------------------------
c_{FT} + c_{TF} + n
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n.
- leaders(Z, T)
- (L, M) = leaders(Z, T):
For each flat cluster j of the k flat clusters represented in the
n-sized flat cluster assignment vector T, this function finds the
lowest cluster node i in the linkage tree Z such that:
* leaf descendents belong only to flat cluster j (i.e. T[p]==j
for all p in S(i) where S(i) is the set of leaf ids of leaf
nodes descendent with cluster node i)
* there does not exist a leaf that is not descendent with i
that also belongs to cluster j (i.e. T[q]!=j for all q not in S(i)).
If this condition is violated, T is not a valid cluster assignment
vector, and an exception will be thrown.
Two k-sized numpy vectors are returned, L and M. L[j]=i is the linkage
cluster node id that is the leader of flat cluster with id M[j]. If
i < n, i corresponds to an original observation, otherwise it
corresponds to a non-singleton cluster.
- linkage(y, method='single', metric='euclidean')
- Z = linkage(y, method)
Performs hierarchical/agglomerative clustering on the
condensed distance matrix y. y must be a {n \choose 2} sized
vector where n is the number of original observations paired
in the distance matrix. The behavior of this function is very
similar to the MATLAB(TM) linkage function.
A (n - 1) * 4 matrix Z is returned. At the i'th iteration,
clusters with indices Z[i, 0] and Z[i, 1] are combined to
form cluster n + i. A cluster with an index less than n
corresponds to one of the n original observations. The
distance between clusters Z[i, 0] and Z[i, 1] is given by
Z[i, 2]. The fourth value Z[i, 3] represents the number of
original observations in the newly formed cluster.
The following linkage methods are used to compute the
distance dist(s, t) between two clusters s and t. The
algorithm begins with a forest of clusters that have yet
to be used in the hierarchy being formed. When two clusters
s and t from this forest are combined into a single cluster u,
s and t are removed from the forest, and u is added to
the forest. When only one cluster remains in the forest,
the algorithm stops, and this cluster becomes the root.
A distance matrix is maintained at each iteration. The
d[i,j] entry corresponds to the distance between cluster
i and j in the original forest.
At each iteration, the algorithm must update the distance
matrix to reflect the distance of the newly formed cluster
u with the remaining clusters in the forest.
Suppose there are |u| original observations u[0], ..., u[|u|-1]
in cluster u and |v| original objects v[0], ..., v[|v|-1]
in cluster v. Recall s and t are combined to form cluster
u. Let v be any remaining cluster in the forest that is not
u.
The following are methods for calculating the distance between
the newly formed cluster u and each v.
* method='single' assigns dist(u,v) = MIN(dist(u[i],v[j])
for all points i in cluster u and j in cluster v.
(also called Nearest Point Algorithm)
* method='complete' assigns dist(u,v) = MAX(dist(u[i],v[j])
for all points i in cluster u and j in cluster v.
(also called Farthest Point Algorithm
or the Voor Hees Algorithm)
* method='average' assigns dist(u,v) =
\sum_{ij} { dist(u[i], v[j]) } / (|u|*|v|)
for all points i and j where |u| and |v| are the
cardinalities of clusters u and v, respectively.
(also called UPGMA)
* method='weighted' assigns
dist(u,v) = (dist(s,v) + dist(t,v))/2
where cluster u was formed with cluster s and t and v
is a remaining cluster in the forest. (also called WPGMA)
Z = linkage(X, method, metric='euclidean')
Performs hierarchical clustering on the objects defined by the
n by m observation matrix X.
If the metric is 'euclidean' then the following methods may be
used:
* method='centroid' assigns dist(s,t) = euclid(c_s, c_t) where
c_s and c_t are the centroids of clusters s and t,
respectively. When two clusters s and t are combined into a new
cluster u, the new centroid is computed over all the original
objects in clusters s and t. The distance then becomes
the Euclidean distance between the centroid of u and the
centroid of a remaining cluster v in the forest.
(also called UPGMC)
* method='median' assigns dist(s,t) as above. When two clusters
s and t are combined into a new cluster u, the average of
centroids s and t give the new centroid u. (also called WPGMC)
* method='ward' uses the Ward variance minimization algorithm.
The new entry dist(u, v) is computed as follows,
dist(u,v) =
----------------------------------------------------
| |v|+|s| |v|+|t| |v|
| ------- d(v,s)^2 + ------- d(v,t)^2 - --- d(s,t)^2
\| T T T
where u is the newly joined cluster consisting of clusters
s and t, v is an unused cluster in the forest, T=|v|+|s|+|t|,
and |*| is the cardinality of its argument.
(also called incremental)
Warning to MATLAB(TM) users: when the minimum distance pair in
the forest is chosen, there may be two or more pairs with the
same minimum distance. This implementation may chose a
different minimum than the MATLAB(TM) version.
- lvlist(Z)
- L = lvlist(Z):
Returns a list of leaf node ids as they appear in the tree from
left to right. Z is a linkage matrix.
- mahalanobis(u, v, VI)
- d = mahalanobis(u, v, VI)
Computes the Mahalanobis distance between two n-vectors u and v,
(u-v)VI(u-v)^T
where VI is the inverse covariance matrix.
- matching(u, v)
- d = matching(u, v)
Computes the Matching dissimilarity between two boolean n-vectors
u and v, which is
(c_{TF} + c_{FT}) / n
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n.
- maxRstat(Z, R, i)
- MR = maxRstat(Z, R, i)
Calculates the maximum statistic for the i'th column of the
inconsistency matrix R for each non-singleton cluster node. MR[j]
is the maximum over R[Q(j)-n, i] where Q(j) the set of all node ids
corresponding to nodes below and including j.
- maxdists(Z)
- MD = maxdists(Z)
MD is a (n-1)-sized numpy array of doubles; MD[i] represents the
maximum distance between any cluster (including singletons) below
and including the node with index i. More specifically,
MD[i] = Z[Q(i)-n, 2].max() where Q(i) is the set of all node indices
below and including node i.
Note that when Z[:,2] is monotonic, Z[:,2] and MD should not differ.
See linkage for more information on this issue.
- maxinconsts(Z, R)
- MI = maxinconsts(Z, R)
Calculates the maximum inconsistency coefficient for each node
and its descendents. Z is a valid linkage matrix and R is a valid
inconsistency matrix. MI is a monotonic (n-1)-sized numpy array of
doubles.
- median(y)
- Z = median(y)
Performs median/WPGMC linkage on the condensed distance matrix Z.
See linkage for more information on the return structure and
algorithm.
Z = median(X)
Performs median/WPGMC linkage on the observation matrix X using
Euclidean distance as the distance metric. See linkage for more
information on the return structure and algorithm.
(a condensed alias for linkage)
- minkowski(u, v, p)
- d = minkowski(u, v, p)
Returns the Minkowski distance between two vectors u and v,
||u-v||_p = (\sum {|u_i - v_i|^p})^(1/p).
- numobs_dm(D)
- numobs_dm(D)
Returns the number of original observations that correspond to a
square, non-condensed distance matrix D.
- numobs_linkage(Z)
- Returns the number of original observations that correspond to a
linkage matrix Z.
- numobs_y(Y)
- numobs_y(Y)
Returns the number of original observations that correspond to a
condensed distance matrix Y.
- pdist(X, metric='euclidean', p=2, V=None, VI=None)
- Y = pdist(X, method='euclidean', p=2)
Computes the distance between m original observations in
n-dimensional space. Returns a condensed distance matrix Y.
For each i and j (i<j), the metric dist(u=X[i], v=X[j]) is
computed and stored in the ij'th entry. See squareform
to learn how to retrieve this entry.
1. Y = pdist(X)
Computes the distance between m points using Euclidean distance
(2-norm) as the distance metric between the points. The points
are arranged as m n-dimensional row vectors in the matrix X.
2. Y = pdist(X, 'minkowski', p)
Computes the distances using the Minkowski distance ||u-v||_p
(p-norm) where p>=1.
3. Y = pdist(X, 'cityblock')
Computes the city block or Manhattan distance between the
points.
4. Y = pdist(X, 'seuclidean', V=None)
Computes the standardized Euclidean distance. The standardized
Euclidean distance between two n-vectors u and v is
sqrt(\sum {(u_i-v_i)^2 / V[x_i]}).
V is the variance vector; V[i] is the variance computed over all
the i'th components of the points. If not passed, it is
automatically computed.
5. Y = pdist(X, 'sqeuclidean')
Computes the squared Euclidean distance ||u-v||_2^2 between
the vectors.
6. Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v,
1 - uv^T
-----------
|u|_2 |v|_2
where |*|_2 is the 2 norm of its argument *.
7. Y = pdist(X, 'correlation')
Computes the correlation distance between vectors u and v. This is
1 - (u - n|u|_1)(v - n|v|_1)^T
--------------------------------- ,
|(u - n|u|_1)|_2 |(v - n|v|_1)|^T
where |*|_1 is the Manhattan (or 1-norm) of its argument *,
and n is the common dimensionality of the vectors.
8. Y = pdist(X, 'hamming')
Computes the normalized Hamming distance, or the proportion
of those vector elements between two n-vectors u and v which
disagree. To save memory, the matrix X can be of type boolean.
9. Y = pdist(X, 'jaccard')
Computes the Jaccard distance between the points. Given two
vectors, u and v, the Jaccard distance is the proportion of
those elements u_i and v_i that disagree where at least one
of them is non-zero.
10. Y = pdist(X, 'chebyshev')
Computes the Chebyshev distance between the points. The
Chebyshev distance between two n-vectors u and v is the maximum
norm-1 distance between their respective elements. More
precisely, the distance is given by
d(u,v) = max {|u_i-v_i|}.
11. Y = pdist(X, 'canberra')
Computes the Canberra distance between the points. The
Canberra distance between two points u and v is
|u_1-v_1| |u_2-v_2| |u_n-v_n|
d(u,v) = ----------- + ----------- + ... + -----------
|u_1|+|v_1| |u_2|+|v_2| |u_n|+|v_n|
12. Y = pdist(X, 'braycurtis')
Computes the Bray-Curtis distance between the points. The
Bray-Curtis distance between two points u and v is
|u_1-v_1| + |u_2-v_2| + ... + |u_n-v_n|
d(u,v) = ---------------------------------------
|u_1+v_1| + |u_2+v_2| + ... + |u_n+v_n|
13. Y = pdist(X, 'mahalanobis', VI=None)
Computes the Mahalanobis distance between the points. The
Mahalanobis distance between two points u and v is
(u-v)(1/V)(u-v)^T
where (1/V) is the inverse covariance. If VI is not None,
VI will be used as the inverse covariance matrix.
14. Y = pdist(X, 'yule')
Computes the Yule distance between each pair of boolean
vectors. (see yule function documentation)
15. Y = pdist(X, 'matching')
Computes the matching distance between each pair of boolean
vectors. (see matching function documentation)
16. Y = pdist(X, 'dice')
Computes the Dice distance between each pair of boolean
vectors. (see dice function documentation)
17. Y = pdist(X, 'kulsinski')
Computes the Kulsinski distance between each pair of
boolean vectors. (see kulsinski function documentation)
17. Y = pdist(X, 'rogerstanimoto')
Computes the Rogers-Tanimoto distance between each pair of
boolean vectors. (see rogerstanimoto function documentation)
18. Y = pdist(X, 'russellrao')
Computes the Russell-Rao distance between each pair of
boolean vectors. (see russellrao function documentation)
19. Y = pdist(X, 'sokalmichener')
Computes the Sokal-Michener distance between each pair of
boolean vectors. (see sokalmichener function documentation)
20. Y = pdist(X, 'sokalsneath')
Computes the Sokal-Sneath distance between each pair of
boolean vectors. (see sokalsneath function documentation)
21. Y = pdist(X, f)
Computes the distance between all pairs of vectors in X
using the user supplied 2-arity function f. For example,
Euclidean distance between the vectors could be computed
as follows,
dm = pdist(X, (lambda u, v: numpy.sqrt(((u-v)*(u-v).T).sum())))
Note that you should avoid passing a reference to one of
the distance functions defined in this library. For example,
dm = pdist(X, sokalsneath)
would calculate the pair-wise distances between the vectors
in X using the Python function sokalsneath. This would result
in sokalsneath being called {n \choose 2} times, which is
inefficient. Instead, the optimized C version is more
efficient, and we call it using the following syntax.
dm = pdist(X, 'sokalsneath')
- rogerstanimoto(u, v)
- d = rogerstanimoto(u, v)
Computes the Rogers-Tanimoto dissimilarity between two boolean
n-vectors u and v,
R
-------------------
c_{TT} + c_{FF} + R
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n, and
R = 2.0 * (c_{TF} + c_{FT}).
- russellrao(u, v)
- d = russellrao(u, v)
Computes the Russell-Rao dissimilarity between two boolean n-vectors
u and v, (n - c_{TT}) / n where c_{ij} is the number of occurrences
of u[k] == i and v[k] == j for k < n.
- set_link_color_palette(palette)
- set_link_color_palette(palette):
Changes the list of matplotlib color codes to use when coloring
links with the dendrogram colorthreshold feature.
- seuclidean(u, v, V)
- d = seuclidean(u, v, V)
Returns the standardized Euclidean distance between two
n-vectors u and v. V is a m-dimensional vector of component
variances. It is usually computed among a larger collection vectors.
- single(y)
- Z = single(y)
Performs single/min/nearest linkage on the condensed distance
matrix Z. See linkage for more information on the return structure
and algorithm.
(a condensed alias for linkage)
- sokalmichener(u, v)
- d = sokalmichener(u, v)
Computes the Sokal-Michener dissimilarity between two boolean vectors
u and v, 2R / (S + 2R) where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j for k < n and R = 2 * (c_{TF} + c{FT}) and
S = c_{FF} + c_{TT}.
- sokalsneath(u, v)
- d = sokalsneath(u, v)
Computes the Sokal-Sneath dissimilarity between two boolean vectors
u and v, 2R / (c_{TT} + 2R) where c_{ij} is the number of occurrences
of u[k] == i and v[k] == j for k < n and R = 2 * (c_{TF} + c{FT}).
- sqeuclidean(u, v)
- d = sqeuclidean(u, v)
Computes the squared Euclidean distance between two n-vectors u and v,
(||u-v||_2)^2.
- squareform(X, force='no', checks=True)
- ... = squareform(...)
Converts a vector-form distance vector to a square-form distance
matrix, and vice-versa.
v = squareform(X)
Given a square d by d symmetric distance matrix X, v=squareform(X)
returns a d*(d-1)/2 (or {n \choose 2}) sized vector v.
v[{n \choose 2}-{n-i \choose 2} + (j-i-1)] is the distance
between points i and j. If X is non-square or asymmetric, an error
is returned.
X = squareform(v)
Given a d*d(-1)/2 sized v for some integer d>=2 encoding distances
as described, X=squareform(v) returns a d by d distance matrix X. The
X[i, j] and X[j, i] values are set to
v[{n \choose 2}-{n-i \choose 2} + (j-u-1)] and all
diagonal elements are zero.
As with MATLAB(TM), if force is equal to 'tovector' or 'tomatrix',
the input will be treated as a distance matrix or distance vector
respectively.
If checks is set to False, no checks will be made for matrix
symmetry nor zero diagonals. This is useful if it is known that
X - X.T is small and diag(X) is close to zero. These values are
ignored any way so they do not disrupt the squareform
transformation.
- to_mlab_linkage(Z)
- Z2 = to_mlab_linkage(Z)
Converts a linkage matrix Z generated by the linkage function of this
module to one compatible with MATLAB(TM). Z2 is the same as Z with the
last column removed and the cluster indices converted to use
1..N indexing.
- totree(Z, rd=False)
- r = totree(Z)
Converts a hierarchical clustering encoded in the matrix Z
(by linkage) into an easy-to-use tree object. The reference r
to the root cnode object is returned.
Each cnode object has a left, right, dist, id, and count
attribute. The left and right attributes point to cnode
objects that were combined to generate the cluster. If
both are None then the cnode object is a leaf node, its
count must be 1, and its distance is meaningless but set
to 0.
(r, d) = totree(Z, rd=True)
Same as totree(Z) except a tuple is returned where r is
the reference to the root cnode and d is a reference to a
dictionary mapping cluster ids to cnodes. If a cluster id
is less than n, then it corresponds to a singleton cluster
(leaf node).
Note: This function is provided for the convenience of the
library user. cnodes are not used as input to any of the
functions in this library.
- ward(y)
- Z = ward(y)
Performs Ward's linkage on the condensed distance matrix Z. See
linkage for more information on the return structure and algorithm.
Z = ward(X)
Performs Ward's linkage on the observation matrix X using Euclidean
distance as the distance metric. See linkage for more information
on the return structure and algorithm.
(a condensed alias for linkage)
- weighted(y)
- Z = weighted(y)
Performs weighted/WPGMA linkage on the condensed distance matrix Z.
See linkage for more information on the return structure and
algorithm.
(a condensed alias for linkage)
- yule(u, v)
- d = yule(u, v)
Computes the Yule dissimilarity between two boolean n-vectors u and v,
R
---------------------
c_{TT} + c_{FF} + R/2
where c_{ij} is the number of occurrences of
u[k] == i and v[k] == j
for k < n, and
R = 2.0 * (c_{TF} + c_{FT}).
|