The reader may check that the result of the incremental-clustering process will not be the same if the order of the samples is different. Usually, this algorithm is not iterative (although it could be) and the clusters generated after all the samples have been analyzed in one iteration are the final clusters. If the iterative approach is used, the centroids of the clusters computed in the previous iteration are used as a basis for the partitioning of samples in the next iteration.

For most partitional-clustering algorithms, including the iterative approach, a summarized representation of the cluster is given through its clustering feature (CF) vector. This vector of parameters is given for every cluster as a triple consisting of the number of points (samples) in the cluster, the centroid of the cluster, and the radius of the cluster. The cluster's radius is defined as the square root of the average squared distance from the points in the cluster to the centroid (the averaged within-cluster variation). When a new point is added to or removed from a cluster, the new CF can be computed from the old CF. It is very important that we do not need the set of points in the cluster to compute the new CF.
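Because the centroid and radius can be derived from simple aggregates, the incremental CF update can be sketched in a few lines. The following Python sketch is only an illustration of this idea, not the book's code; it assumes numeric samples given as Python lists and internally stores the point count, the linear sum, and the sum of squared norms, from which the centroid and the radius defined above are recovered.

```python
import math

class ClusteringFeature:
    """Summarized cluster representation: centroid and radius are derived
    from the count, the linear sum, and the sum of squared norms."""

    def __init__(self, dim):
        self.n = 0                # number of points in the cluster
        self.ls = [0.0] * dim     # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add(self, x):
        # updating the CF requires only the new point, not the stored members
        self.n += 1
        self.ls = [s + xi for s, xi in zip(self.ls, x)]
        self.ss += sum(xi * xi for xi in x)

    def remove(self, x):
        # removing a point is the symmetric update
        self.n -= 1
        self.ls = [s - xi for s, xi in zip(self.ls, x)]
        self.ss -= sum(xi * xi for xi in x)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # square root of the average squared distance from the points to the centroid
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(ci * ci for ci in c), 0.0))


cf = ClusteringFeature(dim=2)
cf.add([1.0, 2.0])
cf.add([3.0, 4.0])
print(cf.centroid(), cf.radius())   # [2.0, 3.0] and about 1.41
```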

If the samples contain categorical data, there is no direct way to calculate centroids as representatives of the clusters. In that case, an additional algorithm called K-nearest neighbor may be used to estimate distances (or similarities) between samples and existing clusters. The basic steps of the algorithm are

1. to compute the distances between the new sample and all previous samples already classified into clusters;

2. to sort the distances in increasing order and select the K samples with the smallest distance values; and

3. to apply the voting principle: the new sample is added (classified) to the cluster that holds the majority among the K selected samples.

For example, given six 6-D categorical samples:

X1 = {A, B, A, B, C, B}
X2 = {A, A, A, B, A, B}
X3 = {B, B, A, B, A, B}
X4 = {B, C, A, B, B, A}
X5 = {B, A, B, A, C, A}
X6 = {A, C, B, A, B, B}

they are gathered into two clusters: C1 = {X1, X2, X3} and C2 = {X4, X5, X6}. How does one classify the new sample Y = {A, C, A, B, C, A}?

To apply the K-nearest neighbor algorithm, it is necessary, as the first step, to find all distances between the new sample and the samples already clustered. Using the SMC (simple matching coefficient) measure, we can compute similarities instead of distances between samples.

Similarities with Elements in C1        Similarities with Elements in C2
SMC(Y, X1) = 4/6 = 0.66                 SMC(Y, X4) = 4/6 = 0.66
SMC(Y, X2) = 3/6 = 0.50                 SMC(Y, X5) = 2/6 = 0.33
SMC(Y, X3) = 2/6 = 0.33                 SMC(Y, X6) = 2/6 = 0.33

Using the 1-nearest neighbor rule (K = 1), the new sample cannot be classified because two samples (X1 and X4) have the same, highest similarity (smallest distance), and one of them is in class C1 and the other in class C2. On the other hand, using the 3-nearest neighbor rule (K = 3) and selecting the three largest similarities in the set, we can see that two samples (X1 and X2) belong to class C1, and only one sample (X4) to class C2. Therefore, using a simple voting system we can classify the new sample Y into the C1 class.
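A short Python sketch of this whole example is given below. It is only an illustrative implementation: the sample values are those from the listing above, the smc() helper is an assumed implementation of the simple matching coefficient (fraction of matching attribute positions), and the voting follows steps 1 to 3 of the algorithm.

```python
def smc(a, b):
    """Simple matching coefficient: fraction of attribute positions that agree."""
    return sum(ai == bi for ai, bi in zip(a, b)) / len(a)

# the six samples, already grouped into the two clusters from the example
clusters = {
    "C1": {"X1": ["A", "B", "A", "B", "C", "B"],
           "X2": ["A", "A", "A", "B", "A", "B"],
           "X3": ["B", "B", "A", "B", "A", "B"]},
    "C2": {"X4": ["B", "C", "A", "B", "B", "A"],
           "X5": ["B", "A", "B", "A", "C", "A"],
           "X6": ["A", "C", "B", "A", "B", "B"]},
}
y = ["A", "C", "A", "B", "C", "A"]

# steps 1 and 2: similarities to every clustered sample, keep the K largest
similarities = [(smc(y, x), name, label)
                for label, members in clusters.items()
                for name, x in members.items()]
nearest = sorted(similarities, reverse=True)[:3]   # K = 3: X1, X4 (0.66) and X2 (0.50)

# step 3: simple voting over the cluster labels of the K nearest samples
votes = {}
for _, _, label in nearest:
    votes[label] = votes.get(label, 0) + 1
print(max(votes, key=votes.get))                   # C1 (two votes against one)
```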

9.6 DBSCAN ALGORITHM

The density-based approach to clustering regards clusters as dense regions of objects in the data space, separated by regions of low object density (noise). These regions may have an arbitrary shape. The crucial concepts of this approach are density and connectivity, both measured in terms of the local distribution of nearest neighbors. The DBSCAN algorithm, targeting low-dimensional data, is the major representative of this category of density-based clustering algorithms. The main reason why DBSCAN can recognize clusters is that within each cluster there is a typical density of points that is considerably higher than outside the cluster. Furthermore, the density of points within the areas of noise is lower than the density in any of the clusters.

DBSCAN is based on two main concepts: density reachability and density connectivity. Both concepts depend on two input parameters of DBSCAN clustering: the size of the epsilon neighborhood (ε) and the minimum number of points in a cluster (m). The key idea of the DBSCAN algorithm is that, for each point of a cluster, the neighborhood of a given radius ε has to contain at least a minimum number of points m; that is, the density in the neighborhood has to exceed some predefined threshold. For example, in Figure 9.9 point p has only two points in its ε neighborhood, while point q has eight. Obviously, the density around q is higher than around p.

Figure 9.9.
Neighborhood (ε) for points p and q.

Density reachability defines whether two close points belong to the same cluster. Point p1 is density-reachable from p2 if two conditions are satisfied: (1) the points are close enough to each other, distance(p1, p2) < ε; and (2) there are enough points in the ε neighborhood of p2, that is, the number of database points r with distance(r, p2) < ε is at least m. In the example represented in Figure 9.9, point p is density-reachable from point q.
Density connectivity is the next building step of DBSCAN. Points p0 and pn are density connected if there is a sequence of density-reachable points (p0, p1, p2, …) from p0 to pn such that pi+1 is density-reachable from pi. These ideas are translated into the DBSCAN definition of a cluster as a set of all density-connected points.
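These two definitions translate almost directly into code. The sketch below is only an illustration under assumed conventions (points as 2-D tuples, Euclidean distance via Python's math.dist); it tests density reachability and then follows the chain-based definition of density connectivity given above.

```python
import math

def neighborhood(points, p, eps):
    """All points within distance eps of p (its eps-neighborhood)."""
    return [q for q in points if math.dist(p, q) <= eps]

def density_reachable(points, p1, p2, eps, m):
    """p1 is density-reachable from p2: p1 lies in the eps-neighborhood of p2
    and the eps-neighborhood of p2 contains at least m points."""
    return math.dist(p1, p2) <= eps and len(neighborhood(points, p2, eps)) >= m

def density_connected(points, p0, pn, eps, m):
    """Follow chains p0, p1, ..., pn in which each point is density-reachable
    from the previous one; pn is density-connected to p0 if such a chain exists."""
    reached, frontier = {p0}, [p0]
    while frontier:
        q = frontier.pop()
        if len(neighborhood(points, q, eps)) < m:
            continue                      # chains can only be extended from dense points
        for r in neighborhood(points, q, eps):
            if r not in reached:
                reached.add(r)
                frontier.append(r)
    return pn in reached
```

Practical DBSCAN implementations do not test individual pairs this way; they compute each ε neighborhood once, usually with the help of a spatial index, and grow clusters outward from core points.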

The clustering process is based on the classification of the points in the data set as core points, border points, and noise points (examples are given in Fig. 9.10, and a code sketch of this classification follows the figure caption):

  • A point is a core point if it has at least a specified number of points (m) within its neighborhood ε. These are the points in the interior of a cluster.
  • A border point has fewer than m points within its neighborhood ε, but it lies within the ε neighborhood of a core point.
  • A noise point is any point that is neither a core point nor a border point.

Figure 9.10.
Examples of core, border, and noise points. (a) ε and m determine the type of the point; (b) core points build dense regions.
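Continuing the same sketch, the three point types from Figure 9.10 can be assigned in one pass. The function below reuses the neighborhood helper and the distance convention from the previous sketch and is, again, only a minimal illustration rather than a complete DBSCAN implementation.

```python
def classify_points(points, eps, m):
    """Label every point as 'core', 'border', or 'noise' using the rules above."""
    # core points: at least m points (counting the point itself) in the eps-neighborhood
    core = {p for p in points if len(neighborhood(points, p, eps)) >= m}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"                                  # interior of a dense region
        elif any(math.dist(p, c) <= eps for c in core):
            labels[p] = "border"                                # not dense itself, but near a core point
        else:
            labels[p] = "noise"                                 # neither core nor border
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(classify_points(points, eps=1.5, m=3))   # the four neighboring points are core, (5, 5) is noise
```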
