The Role of the Confusion Matrix in the Cyber Security World

Yagyandatta Murmu
7 min read · Jun 6, 2021

“At the end of the day, the goals are simple: safety and security.”

Hello connections,

This article focuses on the confusion matrix and its role in the world of cyber security.

Understanding Cyber Security

Today, the internet has become an important part of our lives. Almost everything we do requires it. Let’s face it: instant messaging, banking, emailing, shopping, and even traveling all depend on the internet. As our reliance on it grows, protecting our information and data has become just as necessary. Whether you own a company, run a business, or are simply a habitual internet user, you should know how to minimize threats, risks, and cybercrime, and stay cautious, proactive, and informed about cybercriminals.

The number of cyberattacks has grown steadily over the last few years. According to Kaspersky Lab, 758 million malicious attacks occurred in 2016 (an attack launched every 40 seconds), and the cost of cybercrime damages was expected to hit $5 trillion by 2020.

Below are a few examples of companies that have fallen victim and paid a high price for it.

Adobe (October 2013)

Adobe announced in October 2013 the massive hacking of its IT infrastructure. Personal information of 2.9 million accounts was stolen (logins, passwords, names, credit card numbers, and expiration dates). Another file discovered on the internet later brought the number of accounts affected by the attack to 150 million (only 38 million active accounts). The most worrying problem for Adobe was the theft of over 40GB of source code.

Canva (May 2019)

In May 2019, the Australian graphic design tool website Canva suffered an attack that exposed the email addresses, usernames, names, cities of residence, and salted and bcrypt-hashed passwords (for users not using social logins, around 61 million) of 137 million users. Canva says the hackers managed to view, but not steal, files with partial credit card and payment data.

Sina Weibo (March 2020)

With over 500 million users, Sina Weibo is China’s answer to Twitter. However, in March 2020 it was reported that the real names, site usernames, gender, location, and — for 172 million users — phone numbers had been posted for sale on dark web markets. Passwords were not included, which may indicate why the data was available for just ¥1,799 ($250).

LinkedIn (2012 and 2016)

In 2012 the company announced that 6.5 million unassociated passwords (unsalted SHA-1 hashes) were stolen by attackers and posted onto a Russian hacker forum. However, it wasn’t until 2016 that the full extent of the incident was revealed. The same hacker selling MySpace’s data was found to be offering the email addresses and passwords of around 165 million LinkedIn users for just 5 bitcoins (around $2,000 at the time).

Looking at the cybercrimes of recent years makes it clear how important cybersecurity has become.

Understanding Confusion Matrix

Once we have the data, and after cleaning, pre-processing, and wrangling it, the first step is to feed it to a model and get an output, usually as probabilities. But hold on! How can we measure the effectiveness of our model? The better the effectiveness, the better the performance, and that is exactly what we want. This is where the confusion matrix comes into the limelight. The confusion matrix is a very popular tool for evaluating models on classification problems in machine learning. It can be applied to binary as well as multiclass classification problems.

Let’s understand it with an example confusion matrix for a binary classifier:

n = 165         Predicted: no    Predicted: yes
Actual: no      TN = 50          FP = 10
Actual: yes     FN = 5           TP = 100

From this matrix, we can see:

  • There are two possible predicted classes: “yes” and “no”. If we were predicting the presence of a disease, for example, “yes” would mean they have the disease, and “no” would mean they don’t have the disease.
  • The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
  • Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
  • In reality, 105 patients in the sample have the disease, and 60 patients do not.
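
The totals in the list above can be recovered from the four cells of the matrix. Here is a minimal sketch in plain Python, assuming the cell values TN = 50, FP = 10, FN = 5, and TP = 100 (the ones used in the calculations later in this article); the variable names are only illustrative.

```python
# Rows are actual classes, columns are predicted classes, both ordered ("no", "yes").
# Cell values are taken from the example used in this article.
matrix = [
    [50, 10],   # actual "no":  50 predicted "no", 10 predicted "yes"
    [5, 100],   # actual "yes":  5 predicted "no", 100 predicted "yes"
]

total = sum(sum(row) for row in matrix)
actual_no, actual_yes = (sum(row) for row in matrix)              # row totals
predicted_no, predicted_yes = (sum(col) for col in zip(*matrix))  # column totals

print("total predictions:", total)
print("predicted yes / no:", predicted_yes, "/", predicted_no)
print("actual yes / no:", actual_yes, "/", actual_no)
```

Row sums give the actual class totals and column sums give the predicted class totals, which is where the 105/60 and 110/55 splits come from.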

Let’s define some basic terms:

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)
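
To make the four terms concrete, here is a small sketch that counts them from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the six example labels are made up for illustration and are not part of the 165-patient example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted yes, actually yes
        elif p != positive and a != positive:
            tn += 1          # predicted no, actually no
        elif p == positive:
            fp += 1          # predicted yes, actually no (Type I error)
        else:
            fn += 1          # predicted no, actually yes (Type II error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels for six patients, purely for illustration.
actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

counts = confusion_counts(actual, predicted)
print(counts)
```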

Let’s add row and column totals to the matrix above and compute a list of rates that are often derived from a confusion matrix:

  • Accuracy: Overall, how often is the classifier correct?
--> (TP+TN)/total = (100+50)/165 = 0.91
  • Misclassification Rate: Overall, how often is it wrong?
--> (FP+FN)/total = (10+5)/165 = 0.09
--> equivalent to 1 minus Accuracy
--> also known as "Error Rate"
  • True Positive Rate: When it’s actually yes, how often does it predict yes?
--> TP/actual yes = 100/105 = 0.95
--> also known as "Sensitivity" or "Recall"
  • False Positive Rate: When it’s actually no, how often does it predict yes?
--> FP/actual no = 10/60 = 0.17
  • True Negative Rate: When it’s actually no, how often does it predict no?
--> TN/actual no = 50/60 = 0.83
--> equivalent to 1 minus False Positive Rate
--> also known as "Specificity"
  • Precision: When it predicts yes, how often is it correct?
--> TP/predicted yes = 100/110 = 0.91
  • Prevalence: How often does the yes condition actually occur in our sample?
--> actual yes/total = 105/165 = 0.64
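
The rates above can be checked with a few lines of Python. This is a minimal sketch using the same counts (TP = 100, TN = 50, FP = 10, FN = 5); the function name `rates_from_counts` and the dictionary keys are illustrative choices, not a standard API.

```python
def rates_from_counts(tp, tn, fp, fn):
    """Compute the rates listed above from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    actual_yes = tp + fn
    actual_no = tn + fp
    predicted_yes = tp + fp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / actual_yes,    # sensitivity / recall
        "false_positive_rate": fp / actual_no,
        "true_negative_rate": tn / actual_no,     # specificity
        "precision": tp / predicted_yes,
        "prevalence": actual_yes / total,
    }


result = rates_from_counts(tp=100, tn=50, fp=10, fn=5)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Note that the sketch does not guard against division by zero, which would occur if a class never appears in the sample.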

Cyber Attack Detection and Classification Using Parallel Support Vector Machine

Cyber-attack detection and classification is an important area of research in the field of network systems. The many variants of cyber-attacks make detection considerably harder. Current research applies data mining techniques such as clustering, classification, and model validation to cyber-attack detection, and some authors have used graph-based techniques to collect different features. Feature extraction and feature selection are important aspects of cyber-attack detection. Depending on the algorithm used, cyber-attack classification can take either a single-variable or a multi-variable approach.

The rapid increase in connectivity and accessibility of computer systems has created frequent opportunities for cyber-attacks, and attacks on computer infrastructures are becoming an increasingly serious problem. Cyber-attack detection is a classification problem in which we separate the normal pattern of the system from the abnormal pattern (attack). The subset selection decision fusion (SDF) method plays a key role in cyber-attack detection. It has been shown that redundant and/or irrelevant features may severely affect the accuracy of learning algorithms. SDF is a powerful and popular data mining algorithm for decision-making and classification problems. It has been used in many real-life applications, such as medical diagnosis, radar signal classification, weather prediction, credit approval, and fraud detection.
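
Because attack detection is framed here as a binary classification problem, a detector can be evaluated with the same four counts, with "attack" as the positive class. The sketch below is only an illustration: the anomaly scores, labels, and thresholds are invented, and the simple threshold rule stands in for whatever classifier (an SVM, for example) is actually used.

```python
# Hypothetical anomaly scores for ten network events (higher = more suspicious)
# and their true labels: 1 = attack, 0 = normal. All values are invented.
scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    1,    0,    0,    0,    0,    0]


def evaluate(threshold):
    """Flag events scoring at or above the threshold and count the outcomes."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1   # attack correctly detected
        elif flagged:
            fp += 1   # false alarm: normal traffic flagged as an attack
        elif label == 1:
            fn += 1   # missed attack
        else:
            tn += 1   # normal traffic correctly ignored
    return tp, tn, fp, fn


for threshold in (0.75, 0.50, 0.25):
    tp, tn, fp, fn = evaluate(threshold)
    print(f"threshold={threshold}: TP={tp} TN={tn} FP={fp} FN={fn}")
```

On this toy data, lowering the threshold reduces missed attacks (false negatives) but raises false alarms (false positives). In a security setting both errors have a cost, which is why looking at the full confusion matrix is more informative than accuracy alone.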

Conclusion:

Though not everyone is a victim of cybercrime, everyone is still at risk. Crimes by computer vary, and they don’t always occur behind the computer, but they are executed by the computer. Hackers range in age from 12 to 67 years old. A hacker could live three continents away from the victim, who wouldn’t even know they were being hacked. Crimes committed behind the computer are the 21st century’s problem. As technology advances, criminals don’t have to rob banks, nor do they have to go outside to commit a crime. They have everything they need on their lap. Their weapons aren’t guns anymore; they attack with mouse cursors and passwords.

Happy Coding!
