K-Means Clustering in the Security Domain

Aug 1, 2021

Hello Everyone,

This post introduces k-means clustering and some of its use cases in the security world.

Before getting into k-means, let’s look at what unsupervised machine learning means.

Unsupervised Machine Learning:

Unsupervised machine learning aims to uncover previously unknown patterns in data, but these patterns are often weaker approximations of what supervised machine learning can achieve when labels are available. Additionally, since you do not know what the outcomes should be, there is no direct way to measure how accurate the results are, which is why supervised machine learning is usually more applicable to real-world problems where labeled data exists.

The best time to use unsupervised machine learning is when you do not have data on desired outcomes, such as determining a target market for an entirely new product that your business has never sold before. However, if you are trying to better understand your existing consumer base, supervised learning is usually the better fit.

What is k-means clustering?

K-means clustering is an unsupervised machine learning algorithm. In the more familiar supervised approach, a model learns from data containing both “x” and “y”, where “y” is the target value that is later used to predict the next value. K-means is meant for use cases where the “y” value is not known in advance: it groups the data points into K clusters based only on how similar their “x” values are.

Where to use k-means?

K-means can typically be applied to data that has a small number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.

How does the K-Means Algorithm Work?

The algorithm works in the following steps:

1. Select the number K, which decides the number of clusters.
2. Select K random points as the initial centroids. (They do not have to be points from the input dataset.)
3. Assign each data point to its closest centroid, which forms the K clusters.
4. Compute the mean of the points in each cluster and place that cluster's new centroid there.
5. Repeat step 3, reassigning each data point to the new closest centroid.
6. If any reassignment occurred, go back to step 4; otherwise, finish.

Hurray, your model is ready.
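The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names `squared_distance` and `kmeans`, the `max_iters` and `seed` parameters, and the sample points are all assumptions made for this example. It picks the initial centroids from the data points themselves, which is one common choice.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` into `k` groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 2: use K distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)

    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Finish once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]

    return centroids, assignments


# Hypothetical 2-D data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The result depends on the random starting centroids, so in practice it is common to run the algorithm several times with different seeds and keep the best clustering.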

These are some interesting use cases of k-means:

1. Document classification

Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard grouping problem, and k-means is a highly suitable algorithm for it. The documents first need some initial processing so that each one is represented as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to help identify similarities between groups of documents.
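The vectorization step described above might look like the following sketch. The function names, the whitespace tokenizer, and the three sample documents are assumptions for illustration; real pipelines usually add steps such as stop-word removal. Each document becomes a vector of term frequencies over a shared vocabulary, and documents with similar wording end up closer together.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Turn each document into a term-frequency vector over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


def squared_distance(a, b):
    """Squared Euclidean distance, the measure k-means uses to compare vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical documents: two security alerts and one unrelated note.
documents = [
    "malware detected on host",
    "malware found on server",
    "invoice payment overdue",
]
vocabulary, vectors = term_frequency_vectors(documents)

# The two alerts share terms, so their vectors are closer to each other
# than either is to the unrelated document.
print(squared_distance(vectors[0], vectors[1]) < squared_distance(vectors[0], vectors[2]))
```

The resulting vectors can then be passed to a k-means implementation to form the document groups.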

2. Delivery store optimization

Optimize the process of goods delivery with trucks and drones by using a combination of k-means, to find the optimal number of launch locations, and a genetic algorithm, to solve the truck route as a traveling salesman problem.

3. Identifying crime localities

When data about crimes in specific localities of a city is available, clustering on the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.

4. Customer segmentation

Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. For example, one white paper describes how telecom providers can cluster pre-paid customers to identify patterns in terms of money spent on recharging, sending SMS, and browsing the internet. The segmentation helps the company target specific clusters of customers with specific campaigns.

5. Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, detecting fraud is crucial. There is a white paper on using clustering in automobile insurance to detect fraud.
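The proximity idea can be sketched as follows. Everything here is hypothetical: the two features (claim amount and number of claims in the past year), the centroid values, and the choice of which cluster counts as fraud-heavy are made up for illustration and would come from clustering real historical claims. In practice the features should also be scaled to comparable ranges first, since Euclidean distance is dominated by the feature with the largest values.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def flag_claim(claim, centroids, fraud_clusters):
    """Flag a claim whose nearest cluster was marked as fraud-heavy."""
    index, _ = nearest_centroid(claim, centroids)
    return index in fraud_clusters


# Hypothetical centroids from clustering past claims:
# (claim amount, number of claims in the past year).
centroids = [(1000.0, 1.0), (9000.0, 6.0)]
# Assumption: analysts found that cluster 1 mostly contains known fraud.
fraud_clusters = {1}

print(flag_claim((8500.0, 5.0), centroids, fraud_clusters))
print(flag_claim((1200.0, 0.0), centroids, fraud_clusters))
```

A flag like this is only a signal for further review, not proof of fraud.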

6. Cyber-profiling criminals

Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiles, which provide information to the investigation division to classify the types of criminals who were at the crime scene. There is an interesting white paper on how to cyber-profile users in an academic environment based on user data preferences.

Thanks for Reading …
Keep Learning, Keep Sharing !!