Precision vs Recall: Demystifying the Accuracy Paradox in Machine Learning
Anurag : Oct 31, 2017 10:00:00 PM
Machine learning (ML) is a field of data science and artificial intelligence that has generated massive buzz in the business community. Developers and researchers come up with new algorithms and ideas every day, and ML technologies have become highly sophisticated and versatile in terms of information retrieval. Machine learning applications range from banking to healthcare to marketing. Unfortunately, we haven’t reached a level of artificial intelligence where we can say that our algorithms are one hundred percent accurate. Computers running machine learning aren’t as smart as humans, and it takes rigorous coding to make them show some level of intelligence. That said, data-driven companies are working hard to get the best from their algorithms by aiming for relevant results with the highest accuracy possible. But is accuracy really all you should be aiming for, or is it just a fad? Let’s go over a few terms and then look at an example to understand this:
Some general terms that you need to know:
Before diving into the details of machine learning algorithms, it is important that you understand the basic, standard terms used in this blog. At first, you might feel overwhelmed by the information, but rest assured that it is not as complicated as it might appear. The terms used to describe the outputs of any classification algorithm are as follows:
- Classes – Actual class is the real output set while predicted class is the set of outputs given by the machine learning algorithms. For example, the actual class can be the data of actual rains in a season while a predicted class can be the data of expected rains in the season.
- True Positives (TP) - The number of true positives refers to the correctly predicted positive values i.e. the value of both the actual class and predicted class is yes. E.g. if the actual class value indicates that it rained and predicted class depicts the same thing.
- True Negatives (TN) - These are the correctly predicted negative values i.e. the value of both actual class and predicted class is no. E.g. if actual class says that it didn’t rain and predicted class depicts the same thing.
- False Positives (FP) - It indicates values where the actual class is no but the predicted class is yes. E.g. if actual class says that it didn’t rain but predicted class showed that there would be rain.
- False Negatives (FN) - It indicates values where the actual class is yes but the predicted class is no. E.g. if actual class says that it rained but predicted class showed that there would be no rain.
- Confusion Matrix - A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
- F1 Score - The F1 score is the harmonic mean of precision and recall. The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.
- Area Under the ROC Curve (AUC) - The area under the ROC curve (AUC) is a measure of how well a classifier can distinguish between two classes (for example, diseased vs. normal in a diagnostic test).
- Accuracy - It is the fraction of all predictions that the model got right, and the most commonly quoted measure of a model’s effectiveness.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
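To make these terms concrete, here is a minimal Python sketch that counts the four outcomes and applies the accuracy formula above. The six days of rain data and the function names are made up purely for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels, where True means 'yes'."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # it rained, and rain was predicted
        elif not a and not p:
            tn += 1          # no rain, and none was predicted
        elif not a and p:
            fp += 1          # no rain, but rain was predicted
        else:
            fn += 1          # it rained, but no rain was predicted
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical data: did it rain on each of six days, and what was forecast?
actual_rain = [True, True, False, False, True, False]
predicted_rain = [True, False, False, True, True, False]

tp, tn, fp, fn = confusion_counts(actual_rain, predicted_rain)
print(tp, tn, fp, fn)                      # 2 2 1 1
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.667
```

The four counts are exactly the cells of the confusion matrix, so every metric discussed later can be computed from them.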
Drilling Down the Accuracy Metric:
Here we are going to analyze a classification model to understand accuracy. Say we have a classifier that identifies spam, and it shows the following results:
| | Classified Positive | Classified Negative |
| Actual Positive | 10 (TP) | 15 (FN) |
| Actual Negative | 25 (FP) | 100 (TN) |
In this case, accuracy = (10 + 100) / (10 + 100 + 25 + 15) = 73.3%. Looks like a decent algorithm. Now let’s see what happens when we switch it for a dumb classifier that marks everything as “no spam”:
| | Classified Positive | Classified Negative |
| Actual Positive | 0 (TP) | 25 (FN) |
| Actual Negative | 0 (FP) | 125 (TN) |
Now accuracy = (0 + 125) / (0 + 125 + 0 + 25) = 83.3%. Did you see what happened? Although we moved to a dumb classifier with exactly zero predictive power, accuracy went up. This is called the accuracy paradox.
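The two calculations can be reproduced in a few lines of Python. This is just a sketch that plugs the counts from the two tables into the accuracy formula; the variable names are ours.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts taken from the two tables above (same 150 emails in both cases).
spam_classifier = accuracy(tp=10, tn=100, fp=25, fn=15)
always_no_spam = accuracy(tp=0, tn=125, fp=0, fn=25)

print(f"Spam classifier:  {spam_classifier:.1%}")  # 73.3%
print(f"Always 'no spam': {always_no_spam:.1%}")   # 83.3%
```

Note that both classifiers are scored on the same data: 25 actual spam emails and 125 actual non-spam emails.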
This happens because a rule that always outputs “negative” turns every false positive into a true negative and every true positive into a false negative. So whenever TP < FP, accuracy will always increase when we move to a classification rule that always gives “negative” output. Similarly, whenever TN < FN, the same will happen when we move to a rule that always gives “positive” output. Fortunately, there is a way to solve this issue. Here come precision, recall, and F1 to the rescue:
What are Precision, Recall and F1 Score?
Precision:
Precision = TP / (TP + FP)
Precision is the ratio of correctly predicted positive values to the total predicted positive values. This metric highlights the correct positive predictions out of all the positive predictions. High precision means that few of the predicted positives are false positives.
Recall (Sensitivity):
Recall = TP / (TP + FN)
Recall is the ratio of correctly predicted positive values to all actual positive values. It highlights the sensitivity of the algorithm, i.e. out of all the actual positives, how many were caught by the program. High recall means that an algorithm returns most of the relevant results (whether or not irrelevant ones are also returned).
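Applying the two formulas to the spam example from earlier gives a very different picture than accuracy did. The sketch below is illustrative only; returning None when a ratio is undefined (nothing was predicted positive, or there are no actual positives) is our own assumption, not a fixed rule.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); None if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    """Recall = TP / (TP + FN); None if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else None


# Spam classifier from the first table: TP=10, FP=25, FN=15.
print(round(precision(10, 25), 3))  # 0.286
print(recall(10, 15))               # 0.4

# "Always no spam" classifier from the second table: TP=0, FP=0, FN=25.
print(precision(0, 0))              # None (it never predicts spam)
print(recall(0, 25))                # 0.0  (it catches no spam at all)
```

The dumb classifier that scored 83.3% accuracy has a recall of zero, which is exactly the weakness that accuracy was hiding.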
F1 Score:
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
It is the harmonic mean of Precision and Recall. At first glance, F1 might appear complicated, but it is a more informative metric than accuracy because it takes both false positives and false negatives into account. Accuracy is suitable only when false positives and false negatives have similar costs (which is quite unlikely). Precision, Recall, and F1 Score offer a suitable alternative to the traditional accuracy metric and give more detailed insight into the algorithm under analysis.
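Here is a small sketch of the F1 formula in Python, using the spam example again. Returning 0 when precision and recall are both 0 is a common convention that we assume here; the second pair of values (precision 1.0, recall 0.1) is hypothetical and only there to show how the harmonic mean behaves.

```python
def f1_score(precision, recall):
    """F1 = 2 * (Recall * Precision) / (Recall + Precision)."""
    if precision + recall == 0:
        return 0.0  # assumed convention: avoid dividing by zero
    return 2 * (recall * precision) / (recall + precision)


# Spam classifier from the first table: precision = 10/35, recall = 10/25.
p, r = 10 / 35, 10 / 25
print(round(f1_score(p, r), 3))  # 0.333
print(round((p + r) / 2, 3))     # 0.343 (plain average, for comparison)

# Hypothetical lopsided classifier: very precise, but misses most positives.
print(round(f1_score(1.0, 0.1), 3))  # 0.182
print(round((1.0 + 0.1) / 2, 3))     # 0.55
```

Unlike a plain average, the harmonic mean is pulled toward the smaller of the two values, so a classifier cannot earn a high F1 by doing well on only one of them.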
Precision vs Recall - Time to Make a Business Decision:
Every business executive would like to maximize both precision and recall, and that is perfectly logical. But machine learning technologies are not yet sophisticated enough to deliver both at once. Any algorithm can be tuned to focus on one metric more than the other, and making it more sensitive (higher recall) usually makes it less precise, and vice versa. Which metric matters more depends on your business goal.
For instance, for a fraud detection algorithm, recall is the more important metric. It is important to catch every possible fraud, even if it means that the authorities need to go through some false positives. On the other hand, if the algorithm is built for sentiment analysis and all you need is a high-level idea of the emotions expressed in tweets, then aiming for precision is the way to go.
The ultimate aim is to reach the highest possible F1 score, but we usually hit a point beyond which we can’t go any further. Whenever you decide to build a machine learning algorithm, define your priorities from the very beginning. Hopefully, our guide on precision vs recall will help you define your targets.
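One common way this tuning shows up is the decision threshold of a classifier that outputs a score. The sketch below uses ten made-up fraud scores and labels (all hypothetical) to show how lowering the threshold trades precision for recall.

```python
# Hypothetical fraud scores (higher = more suspicious) and true labels
# (1 = fraud, 0 = legitimate). These numbers are invented for illustration.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def precision_recall_at(threshold):
    """Flag every score >= threshold as fraud and return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.50
# threshold=0.50  precision=0.60  recall=0.75
# threshold=0.25  precision=0.57  recall=1.00
```

With this data, a fraud team that cares most about recall would pick the low threshold and accept more false alarms, while a use case that needs clean positives would pick the high one.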
At NewGenApps we specialize in developing Machine Learning applications whether on mobile or web. If you have a project like this then feel free to get in touch.