Clustering is one of the most common exploratory data analysis techniques, used to build an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The choice of similarity measure is application-specific.
Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on their features, or on the basis of samples, where we try to find subgroups of features based on the samples. Here we cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behavior or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.
Unlike supervised learning, clustering is considered an unsupervised learning method, since we have no ground-truth labels to compare the clustering algorithm's output against in order to evaluate its performance. We simply investigate the structure of the data by grouping the data points into distinct subgroups.
II. K-Means Clustering
K-Means is a data clustering algorithm that can be used for unsupervised machine learning. It is capable of grouping similar non-labeled data into a pre-determined number of clusters (k). The K-Means clustering problem can be formally stated as: "Given an integer k and a set of n data points in R^d, the goal is to choose k centers so as to minimize the total squared distance between each data point and its closest center."
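The objective in that formal statement, the total squared distance from each point to its closest center, can be written as a small NumPy helper (the function name is illustrative):

```python
import numpy as np

def kmeans_cost(X, centers):
    """K-Means objective: total squared distance from each point in X
    (an (n, d) array) to its closest center (a (k, d) array)."""
    # Pairwise squared distances, shape (n, k), via broadcasting
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep the distance to its closest center, then sum
    return d2.min(axis=1).sum()
```

Minimizing this quantity over the placement of the k centers is exactly what K-Means attempts.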
In layman's terms, imagine you have been given a data set of purchasing patterns of customers in a mall, and your task is to identify the HNIs (high net-worth individuals) who can be given selective discounts or persuaded to purchase expensive items. By applying clustering methods, one can split the customers into categories using different parameters, find those HNIs, and provide customized offers to them.
1. Choose the number of clusters (k) and obtain the data points
2. Place the centroids c_1, c_2, ..., c_k randomly
3. Repeat steps 4 and 5 until convergence, or until a fixed number of iterations is reached
4. For each data point x_i: find the nearest centroid among c_1, c_2, ..., c_k and assign the point to that cluster
5. For each cluster j = 1..k: set the new centroid to the mean of all points assigned to that cluster
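The steps above can be sketched in Python with NumPy (a minimal illustration, not an optimized implementation):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: place the centroids randomly (here: k distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 4: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that the result depends on the random initial placement of the centroids; production implementations typically run the algorithm several times from different initializations and keep the best result.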
In some cases we do not know the number of clusters in advance. This can be resolved with the elbow method: plot the within-cluster sum of squares (WCSS) against the number of clusters k. The curve decreases sharply at first and then flattens; the point where it bends, e.g. k=4, marks the optimal number of clusters. The method is named the elbow method because this bend in the curve resembles an elbow joint.
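The elbow method can be sketched with scikit-learn (assumed available here; any K-Means implementation that exposes the WCSS works), using synthetic data with four true clusters so that the elbow appears near k=4:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters, so the elbow should appear near k=4
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the within-cluster sum of squared distances (WCSS)
    inertias.append(km.inertia_)

# The "elbow" is where the WCSS stops dropping sharply as k increases
for k, w in zip(range(1, 10), inertias):
    print(f"k={k}: WCSS={w:.1f}")
```

Plotting `inertias` against k would reproduce the elbow curve described above.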
Example applications of the K-Means clustering algorithm:
- Behavioral Segmentation
- Anomaly Detection
- Social Network Analysis
- Market Segmentation
K-Means is an introductory clustering algorithm and the simplest of them. It is easy to implement and handy. Note, however, that the K-Means approach produces clusters with a particular shape: they are roughly spherical and of approximately the same size.
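As an illustration of that last point, fitting scikit-learn's KMeans (assumed available) to three well-separated spherical blobs recovers compact clusters of roughly equal size; on elongated or differently sized clusters, by contrast, K-Means tends to split or merge groups incorrectly:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated spherical blobs of 100 points each
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster sizes come out roughly equal on data that matches K-Means's assumptions
sizes = np.bincount(labels)
print(sizes)
```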