Distance functions
Even if generic definitions of clustering are normally based on the concept of similarity, it's quite easy to employ its inverse, which is represented by distance function (dissimilarity measure). The most common choice is the Euclidean distance, but before choosing it, it's necessary to consider its properties and their behaviors in high-dimensional spaces. Let's start by introducing the Minkowski distance as a generalization of the Euclidean one. If the sample is xi ∈ ℜN, it is defined as:
For p=1, we obtain the Manhattan (or city block) distance, while p=2 corresponds to the standard Euclidean distance. We want to understand the behavior of dp when p → ∞. Let's suppose we're working in a bidimensional space and have a cluster whose center is xc=(0, 0) and a sample point x=(5, 3), the distances dp(xc, x) with respect to different values of p are:
It's clear (and very simple to prove) that if |x1j - x2j| is the largest component absolute difference, p → ∞, dp(xc, x) → |x1j - x2j|. This means that, if we are considering the similarity (or dissimilarity) as the result of the differences due to all components, we need to pick a small value for p (for example, p=1 or 2). On the other hand, if two samples must be considered different only according to the largest absolute difference between components, higher values for p are suitable. In general, this choice is very context-dependent and cannot be easily generalized. For our purposes, we often take into account only the Euclidean distance, which is reasonable in the majority of cases. On the other hand, choosing larger values for p has important consequences when N → ∞. Let's start with an example. We want to measure the distance between the 1N-vector (a vector belonging to ℜN with all components equal to 1) from the origin for different values of p and N (using a log-scale to compress the y axis), which can be done as follows:
import numpy as np
from scipy.spatial.distance import cdist
distances = np.zeros(shape=(8, 100))
for i in range(1, distances.shape[0] + 1):
for j in range(1, distances.shape[1] + 1):
distances[i - 1, j - 1] = np.log(cdist(np.zeros(shape=(1, j)), np.ones(shape=(1, j)),
metric='minkowski', p=i)[0][0])
The distances are shown in the following plot:
The first result is that if we pick a value for N, the distances contract and saturate when p → ∞. This is a normal consequence of the structure of the Minkowski distance, but there's another element that sharp-eyed readers could have noticed. Let's imagine setting one of the components of the 1N-vector equal to 0.0. This is equivalent to moving from a vertex of the N-dimensional hypercube to another one. What happens to the distance? Well, it's easy to prove with an example that, when p → ∞, the two distances converge to the same value. In particular, Aggarwal, Hinneburg, and Keim (in On the Surprising Behavior of Distance Metrics in High Dimensional Space, Aggarwal C. C., Hinneburg A., Keim D. A., ICDT 2001) proved an important result.
Let's suppose we have a distribution p(x) of M binary samples xi ∈ (0, 1)d. If we employ the Minkowski metric, we can compute the maximum (Dmaxp) and minimum (Dminp) distance between two points sampled from p(x) and the origin (in general, this distance can be computed analytically, but it's also possible to use an iterative procedure that continues sampling until Dmaxp and Dminp stop changing). The authors proved that the following inequality holds:
In the previous formula, Cp is a constant dependent on p. When p → ∞, the limit of the expected value E[Dmaxp - Dminp] is captured between the boundaries k1Cpd1/p-1/2 and (M-1)Cpd1/p-1/2. As the term d1/p-1/2 → 0 when p > 2 and d → ∞, the expected value of the difference between the maximum and minimum distances converges to 0. This means that, independently from the samples, when the dimensionality is high enough and p > 2, it's almost impossible to distinguish between two samples using the Minkowski distance. As we are finding the similarity on a distance function, this theorem warns us about the choice of large values for p when d >> 1. The common choice of the Euclidean metric is quite reliable also when d >> 1 (even if p=1 would be the optimal choice) because it has a minimum effect on the weight of the components (it's possible to assume that they have the same weight) and guarantees distinguishability in high-dimensional spaces. Conversely, p >> 2 in a high-dimensional space yields indistinguishable distances for all samples where the maximum component is kept fixed and all the other ones are modified (for example, if x=(5, 0) → (5, a) where |a| < 5), as shown in the following example:
import numpy as np
from scipy.spatial.distance import cdist
distances = []
for i in range(1, 2500, 10):
d = cdist(np.array([[0, 0]]), np.array([[5, float(i/500)]]), metric='minkowski', p=15)[0][0]
distances.append(d)
print('Avg(distances) = {}'.format(np.mean(distances)))
print('Std(distances) = {}'.format(np.std(distances)))
The output is as follows:
Avg(distances) = 5.0168687736484765 Std(distances) = 0.042885311128215066
Hence, for p = 15, all samples (5, x) for x ∈ [0.002, 5.0) has distance from the origin whose mean is about 5.0 and the standard deviation is about 0.04. When p becomes larger, Avg(distances) = 5.0 and Std(distances) = 0.04.
At this point, we can start discussing one of the most common and widely adopted clustering algorithms: K-means.