Clustering and Distribution Estimation

Clustering and Distribution Estimation#

Clustering and Distribution Estimation are important techniques in unsupervised learning that allow us to partition data into groups, find regions of high density in a dataset, and even detect outliers and anomalies in our data. Although there is a rich array of unsupervised models and learning techniques we can apply to solve these problems, in this section we will briefly review some of the most commonly used techniques for basic clustering.

The K-Means Algorithm#

One of the most popular algorithms for identifying clusters in data is the \(k\)-means algorithm. It is a simple yet effective algorithm that aims to partition a dataset into \(k\) distinct clusters based on similarity or proximity. \(k\)-means is widely employed in various domains, including data mining, image processing, customer segmentation, and pattern recognition. The way that this algorithm works is by initializing the cluster centers (called centroids) at randomly selected data points, and then iteratively improving the fit by re-assigning points to clusters and recalculating the centroid locations. We can summarize this procedure as follows:

Randomly select \(k\) points from the dataset as the initial centroid positions: \(\mathbf{c}_1, \mathbf{c}_2, ..., \mathbf{c}_k\). (Alternatively, the programmer can initialize the centroids positions by hand.)

Assign each point to the cluster with the nearest centroid.

For each cluster, set the centroid to be the mean of all assigned points.

If none of the cluster centroids moved, stop. Otherwise, repeat steps 2-4.

Because this is a relatively easy algorithm to implement, let’s write some Python code that performs the \(k\)-means algorithm:

import numpy as np

def k_means(data_x, k, max_steps=10**7, centroids=None):
    
    # initialize centroid positions (if not given):
    if centroids is None:
        centroids = np.array([
            data_x[i] 
            for i in np.random.choice(len(data_x), size=k, replace=False)
        ])
   
    # do a maximum of `max_steps` steps:
    for _ in range(max_steps):

        # assign all points to closest centroid:
        assignments = np.array([
            np.argmin(np.sum((x - centroids)**2,axis=1))
            for x in data_x   
        ])

        # compute the location of new centroids:
        new_centroids = np.array([
            np.mean(data_x[assignments == n], axis=0)
            for n in range(k)
        ])
 
        # if new centroids are the same as before, stop:
        if np.max(np.abs(new_centroids - centroids)) == 0:
            return new_centroids, assignments
        
        # otherwise, update centroids and do next step:
        centroids = new_centroids
    
    # if the maximum steps are reached, return centroids:
    return centroids, assignments

# initialize dataset with centers roughly at the following coordinates:
centers = np.array([ [3,3], [-1,2], [1,-4] ]).T
data_x = np.random.normal(centers.reshape(2,3,1),0.6,(2,3,100)).reshape(2,-1).T

../_images/65a3a8140c72fa3691a481b011f48bbfcc328fe3fdd7c845800fb3ee5e929dc2.png

Kernel Density Estimation (KDE)#

The K-means algorithm is quite useful for segmenting data into clusters, which can be useful for classifying both known and unknown data points. However, when encountering new or unseen data points it can be helpful instructive to identify new data points that lie outside the distribution of previously data points. This task is referred to as anomaly detection, and can be necessary for identifying outlier datapoints that may require special treatment or further investigation. In the materials sciences, outlier detection is crucial for identifying materials with exceptionally unique properties, or for checking that experimental data lies within the range of values predicted by some model or underlying theory.

A simple way of identifying anomalous datapoints is to estimate the underlying probability distribution \(p(x)\) of known data points over the space of possible model inputs \(\mathcal{X}\). This allows for the estimation of the likelihood \(p(x')\) for some new data point \(x'\) relative to the likelihood \(p(x)\) of a previously seen datapoint \(x\). If the space of features \(\mathcal{X}\) contains continuous quantities, the underlying distribution \(p(x)\) can be straightforwardly approximated using kernel density estimation (KDE). In KDE, \(p(x)\) is estimated as an equally-weighted sum of probability distributions \(K(x)\) centered at each data point \(x_n\) with a fixed width parameter \(\alpha\):

\[p(\mathbf{x}) = \frac{1}{N \alpha^d}\sum_{n=1}^N K\left(\frac{\mathbf{x} - \mathbf{x}_n}{\alpha}\right)\]

Above, \(d\) is the dimension of \(\mathcal{X}\). A popular choice of \(K(\mathbf{x})\) is the Gaussian function:

\[ K(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}}\exp\left(-\frac{1}{2}\Vert \mathbf{x}\Vert^2\right).\]

Although KDE is relatively simple, evaluating \(p(x)\) with a naive implementation requires evaluating \(K((x-x_n)/\alpha)\) for each \(x_n\) in the dataset. More efficient implementations use data structures that avoid this linear scaling issue. A good implementation of Gaussian KDE in Python can be found in the scipy.stats.gaussian_kde module. Below, we apply this to the same dataset as above, relying on the gaussian_kde function’s automatic methods for selecting the bandwidth parameter \(\alpha\):

../_images/3928d5305e346807a68057a2d62b99a286b81377c1e36ed8624d21e6a8fdf3c4.png

Gaussian Mixture Model#

The \(k\)-means algorithm is useful for segmenting data into clusters, while KDE is useful for estimating the probability distribution of the dataset as a whole. In some cases, however, we might want to combine aspects of these two methods, such as to estimate the likelihood of a given point belonging to a specific cluster. In other words, we wish to perform both clustering and distribution estimation simultaneously. One model that can accomplish this task is the Gaussian Mixture Model. A Gaussian Mixture model can be especially useful when multiple clusters overlap in the same region of space.

Below, we give some code that uses the sklearn.mixture.GaussianMixture model to identify clusters and determine their associated probability distributions:

../_images/a19a943f8cfcaf6c739818cded52e4aaacf2f2352bcd6b543f786ad0abc2cf44.png

Clustering and Distribution Estimation

Contents

Clustering and Distribution Estimation#

The K-Means Algorithm#

Kernel Density Estimation (KDE)#

Gaussian Mixture Model#

Exercises#

Solutions#

Exercise 1: Estimating Density of States with KDE#