There are a variety of ways to create a new machine learning (ML) model. Supervised learning is the simplest of these learning processes, but it requires human input and curated data sets. In a supervised learning process, you classify the data with labels, then train an ML model on the labeled data. This ML model can then be used to classify new data in real time.
But what if you only have unclassified data (i.e., data without any labels)? Is it possible to train a model with a data set like this? Can this be done without human curation?
Yes. Training a model on unclassified data sets is known as “unsupervised learning”.
Clustering and Unsupervised Machine Learning
Unsupervised learning is also known as self-organization. It is a machine learning process that applies an algorithm to data sets that are neither classified nor labeled. In unsupervised learning, algorithms act on the data without guidance and operate autonomously to discover interesting structures in the data, based primarily on similarities and differences between data points.
Let’s take a look at two of the most popular clustering methods used in unsupervised machine learning.
2 Popular Clustering Approaches
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (data without defined categories or groups). The goal of this algorithm is to find groups in the data. It is intended to partition “N” objects into “K” clusters in which each object belongs to the cluster with the nearest mean.
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set.
Clustering data into K groups, where K is predefined:
1. Select K points at random as cluster centers.
2. Assign objects to their closest cluster center according to the Euclidean distance function.
3. Calculate the centroid (mean) of all objects in each cluster and use it as the new cluster center.
4. Repeat steps 2 and 3 until the same points are assigned to each cluster in consecutive rounds.
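The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the rule for empty clusters, and the toy data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: select K points at random as the initial cluster centers.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments repeat in consecutive rounds.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its previous center
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical toy data: two visibly separate groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)     # cluster index per point; which group gets index 0 depends on the seed
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. On real data the result can depend on the random starting centers, so it is common to run the algorithm several times.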
In general, there is no method for determining the exact value of K, but an estimate can be obtained by finding an “elbow point”. The metric used is the mean distance from each data point to its cluster centroid. Increasing K tends to reduce this metric, and it reaches zero when K equals the number of data points, because every point is then its own centroid. For that reason the metric cannot be used as the sole target. Instead, plot the mean distance to the centroid as a function of K, identify where the rate of decrease shifts sharply (the “elbow point”), and use that value of K.
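As a rough sketch of the elbow approach, the snippet below computes the mean distance to the centroid for several values of K and prints the resulting curve instead of plotting it. It includes a compact K-means so that it runs on its own. The helper name `mean_distance_to_centroid`, the `restarts` parameter (keeping the best of several random starts), and the three-blob toy data are assumptions for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means with random initial centers; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def mean_distance_to_centroid(points, k, restarts=20):
    """Lowest mean point-to-centroid distance found over several random starts."""
    best = math.inf
    for seed in range(restarts):
        centroids, labels = kmeans(points, k, seed=seed)
        total = sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels))
        best = min(best, total / len(points))
    return best


# Hypothetical toy data: three well-separated blobs of four points each.
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 0), (5, 9)]
          for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]

curve = {k: mean_distance_to_centroid(points, k) for k in range(1, 7)}
for k, d in curve.items():
    print(k, round(d, 3))
```

For this data the curve drops steeply up to K = 3 and flattens afterwards, which is the elbow. With real data the bend is often less pronounced, so the choice of K remains a judgment call.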
Hierarchical clustering is an algorithm that groups similar objects into clusters, where each cluster is distinct from every other cluster and the objects within each cluster are broadly similar to each other. The clusters are nested in a hierarchy, much like the files and folders on your personal computer: stepping into a folder reveals more folders and files.
How Hierarchical Clustering Works
1. Start by assigning each observation as a separate cluster.
2. Find the clusters that are closest together.
3. Merge them into a single cluster, so that now you have one fewer cluster.
4. Repeat steps 2 and 3 until all items are clustered together.
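The merge loop described above might look like the following sketch. The steps do not say how the closeness of two clusters is measured, so this example assumes the shortest distance between any pair of points drawn from the two clusters. The function names `cluster_distance` and `agglomerate` and the sample points are hypothetical.

```python
import math


def cluster_distance(a, b):
    """Assumed closeness measure: shortest distance between any cross-cluster pair."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Merge clusters pairwise until one remains; return the merge history."""
    # Step 1: every observation starts as its own cluster.
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters that are closest together.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Step 3: merge them, leaving one fewer cluster.
        merged = clusters[i] + clusters[j]
        history.append((clusters[i], clusters[j]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    # Step 4 is the loop itself: repeat until everything is in one cluster.
    return history


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
for left, right in agglomerate(points):
    print(left, "+", right)
```

The merge history is what a dendrogram visualizes: reading it from the first merge to the last shows which observations were grouped early (very similar) and which were joined only at the end. This naive version recomputes all pairwise distances in every round, so it is only suitable for small data sets.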
Types of Hierarchical Clustering
In the divisive (top-down) clustering method, we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. We proceed recursively on each cluster until there is one cluster for each observation. Divisive clustering is conceptually more complex and is therefore rarely used to solve real-life problems.
Agglomerative (bottom-up) hierarchical clustering starts by assigning each observation to its own cluster. Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters according to some similarity criterion, until all of the data has been merged into one cluster.
To determine the closest pair of clusters, the distances between points in different clusters are calculated using a distance function. The way these distances are combined into a distance between clusters is generally called the linkage. Three common methods to determine the distance (linkage) between clusters are:
i. Single Linkage: In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between a point in one cluster and a point in the other.
ii. Complete Linkage: In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between a point in one cluster and a point in the other.
iii. Average Linkage: In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster.
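To make the three linkage definitions concrete, here is a small sketch that computes each of them for two clusters, assuming Euclidean distance between points. The function names and the two sample clusters are illustrative only.

```python
import math
from itertools import product


def pairwise(a, b):
    """All distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    return min(pairwise(a, b))      # closest cross-cluster pair


def complete_linkage(a, b):
    return max(pairwise(a, b))      # farthest cross-cluster pair


def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)          # mean over all cross-cluster pairs


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(single_linkage(a, b))    # 3.0
print(complete_linkage(a, b))  # sqrt(13), about 3.606
print(average_linkage(a, b))   # lies between the two values above
```

For any pair of clusters the single linkage distance is never larger than the average, and the average is never larger than the complete linkage distance. In practice, single linkage tends to produce long, chained clusters, while complete linkage favors compact ones.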
Using unsupervised learning to generate a machine learning model is now an accepted and practical way to work with unclassified data sets. While an unsupervised learning process is more complex to set up and tune, the benefit is that the source data does not have to be curated by a human team. This makes it a good fit when curating the source learning data is not feasible or economical. In this article, we’ve outlined the core clustering methods used to set up an unsupervised machine learning algorithm. We use unsupervised learning at IceCream Labs as one of the many machine learning processes for our products.