Metrics in Machine Learning

Ehsan Yousefzadeh-Asl-Miandoab
10 min read · Jun 15, 2023


This post concisely reviews machine learning and its main subbranches as an introduction, then delves into metrics. It is important to be careful when choosing, using, or defining new metrics to evaluate how well an ML approach performs.

Machine Learning (ML)

ML is about learning patterns from data instead of programming them explicitly. For example, for the following figure, ML can give us the equation (model) that estimates the number of cricket chirps per minute for a given temperature. ML's help in finding patterns becomes especially valuable as the number of dimensions of the data grows and it is no longer easy to inspect the data by eye and detect the patterns.

Image from [https://developers.google.com/machine-learning/crash-course/descending-into-ml/linear-regression]
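As a sketch of the chirps-vs-temperature example above, we can fit a simple linear model with scikit-learn. The data values here are made up for illustration, not taken from the figure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical measurements: temperature (°C) vs. cricket chirps per minute
temps = np.array([[20.0], [22.0], [25.0], [27.0], [30.0]])
chirps = np.array([88.0, 93.0, 100.0, 105.0, 112.0])

# Fit a line: chirps ≈ slope * temperature + intercept
model = LinearRegression().fit(temps, chirps)
print(model.coef_[0], model.intercept_)

# Use the learned model to estimate chirps at a new temperature
print(model.predict(np.array([[24.0]])))
```

The learned slope and intercept are exactly the "equation (model)" the figure refers to: a compact pattern extracted from the data.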

Learning paradigms in ML are commonly categorized as follows, with a concise description of each.

(1) supervised: the main tasks in this category are classification and regression. In classification, all training data are labelled; in regression, every input has a corresponding desired (target) value.

(2) unsupervised: the data is not labelled in these practices, and the goal is to find structure or to figure out how data points relate to each other. Keep in mind that in real-world practice, labeling data can become really expensive (time, hiring people to label the data!). For example, consider that we have millions of unlabelled images of different animals and we want to categorize them.

(3) semi-supervised (weakly supervised): this approach aims to tackle the challenge of having only a limited amount of labelled data in supervised learning. It uses the labelled data for training in a supervised manner, and uses the large amount of unlabelled data to improve performance further by learning the structure of the larger dataset.

(4) reinforcement learning: this approach works by rewarding and penalizing the behavior of a learner (agent) in a specific environment.

Data preprocessing is the phase after data collection: the process of understanding, cleaning, and transforming data for better learning. ML practitioners usually try to understand the data by visualizing it, aiming to detect obvious patterns where possible. They look at central tendencies and at measures of shape and dispersion. They transform the data, encode it in a different space, and reduce or augment it before it is finally fed into models. Keep in mind that the data is ultimately converted into tensors.
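As a minimal preprocessing sketch, consider standardizing numeric features (zero mean, unit variance) with NumPy before feeding them to a model. The feature matrix here is made up for illustration:

```python
import numpy as np

# Hypothetical raw feature matrix: rows are samples, columns are features
# (note the very different scales of the two columns)
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0],
              [4.0, 240.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for each feature
print(X_std.std(axis=0))   # ~1 for each feature
```

Transformations like this are one concrete instance of "encoding data in a different space" so that no single feature dominates learning merely because of its scale.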

Nowadays, artificial neural networks (ANNs), which are inspired by how the human brain functions, are common practice for problems where traditional approaches struggled to reach high performance, like computer vision tasks (for example, image classification). The ANN concept was tried earlier but was constrained by limited computing power and largely abandoned. Today, ANNs are popular thanks to large amounts of data (referred to as big data), giant modern GPUs, more efficient algorithms, and easier software development with very high-level frameworks, e.g., TensorFlow, PyTorch, Keras, Hugging Face. When the number of layers of neurons increases, this is called deep learning (DL). Keep in mind that

AI (mimicking human behavior) > ML (fitting a model to data without explicit programming) > DL (learning patterns in data using neural networks)

The following figure shows how a neuron is modeled.

image credit [https://www.cs.toronto.edu/~lczhang/aps360_20191/lec/w02/terms.html]

Why Metrics?

When we build something new in general, and especially when we develop a new mechanism to address a challenge, we ultimately want to evaluate it and demonstrate its effectiveness to other people compared to available methods. For this, we use metrics that give a numerical understanding of how well a mechanism performs compared to what we had. Furthermore, we should be careful when using metrics defined by other people: we must make sure the metric actually reflects what we care about, or the overall goodness of the product. We will look at metrics for some ML practices and understand how they are developed and used.

Metrics for Classification Algorithm (supervised)

In classification tasks (this article focuses on binary classification to clarify the metrics' importance), the goal is to identify which class an input, e.g., an image, belongs to. When evaluating a classifier, we want a number representing how well it does its job. For example, we may want to report, on average, how many images are classified correctly over a given set of tests.

Accuracy

This metric shows the percentage of inputs that the classifier classifies correctly: the number of correct predictions divided by the total number of predictions. However, accuracy does not weight the importance of particular mistakes, which can be vital in some applications. We will check those cases with the coming metrics that do consider them.

False Negative Rate (FNR) or Miss Rate / True Positive Rate (TPR) or Hit Rate or Recall

To understand this metric, we first need the concepts of False Positive (FP), False Negative (FN), True Positive (TP), and True Negative (TN). True/False indicates whether an estimation was correct/incorrect; Positive/Negative indicates whether the output of the classification was Yes/No.

image credit [https://en.wikipedia.org/wiki/Confusion_matrix]

The accuracy metric, which we reviewed earlier, can be restated as Accuracy = (TP + TN) / (TP + TN + FP + FN).

The false negative rate (FNR), or miss rate, is defined as FNR = FN / (FN + TP).

As we can see, it is just a formula dividing the false negatives by the sum of false negatives and true positives. But the most important part is understanding the logic behind it. The metric's other name, miss rate, shows how much we lose by predicting Negative and dismissing something, which can be vital in some applications. Let's consider an example: assume we build an ML-based system that, based on patients' diagnostic data, categorizes them as needing serious cure and care, or sends them home. Negative means the person does not need help and should leave and rest; Positive means the person should be taken care of. In such vital matters, if the system makes wrong decisions, human lives are lost or very high expenses are imposed. So we need a metric showing how often an ML-based system misses (this is why it is called the miss rate), and in vital scenarios we would like the miss rate to be ~zero. The complementary metric is the True Positive Rate (TPR, also called hit rate, recall, or sensitivity), defined as TPR = TP / (TP + FN) = 1 − FNR.
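A small sketch of the patient-triage example above, with made-up counts, shows how the miss rate and hit rate are computed from a confusion matrix:

```python
# Hypothetical confusion-matrix counts for the patient-triage example
tp, fn = 90, 10   # sick patients: correctly flagged vs. missed (sent home!)
tn, fp = 80, 20   # healthy patients: correctly sent home vs. falsely flagged

fnr = fn / (fn + tp)                     # miss rate: sick patients we miss
tpr = tp / (tp + fn)                     # hit rate / recall / sensitivity
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(fnr, tpr, accuracy)  # 0.1 0.9 0.85
```

Note that TPR is simply 1 − FNR, so driving the miss rate toward ~0 is the same as driving recall toward ~1.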

For our example, we want our hit rate (recall) to be ~1 (very high).

True Negative Rate (TNR) or Selectivity / False Positive Rate (FPR) or Fall-out

Selectivity is defined as TNR = TN / (TN + FP).

To understand it, consider spam/ham email classification, where Positive means a spam email is detected. Flagging a legitimate (ham) email as spam means the user loses it, so we want very few false positives: high selectivity (~1) in this application. Fall-out, FPR = FP / (FP + TN) = 1 − TNR, is the complementary metric of selectivity.
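A sketch of the spam-filter example with hypothetical counts:

```python
# Hypothetical spam-filter counts: Positive = "spam detected"
tp, fn = 70, 30    # spam emails: caught vs. slipped into the inbox
tn, fp = 195, 5    # ham emails: delivered vs. wrongly flagged as spam

tnr = tn / (tn + fp)   # selectivity: ham correctly delivered
fpr = fp / (fp + tn)   # fall-out: ham wrongly flagged as spam

print(tnr, fpr)  # 0.975 0.025
```

Here selectivity is high (~1) and fall-out low, so legitimate mail rarely gets lost; catching more of the remaining spam would be a recall (TPR) concern.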

Precision

This metric tells us what portion of the cases predicted Positive are actually correct: Precision = TP / (TP + FP). A method might have a good hit rate simply by saying yes to every sample; such a method never misses a patient, but it can be very expensive for society! Its low precision reveals that it is not trustworthy. Precision has another name: PPV (positive predictive value).

As precision focuses on positive predictions, another metric, the negative predictive value (NPV = TN / (TN + FN)), focuses on negative predictions.

Recall

Remember that recall is the hit rate (TPR); see the earlier miss-rate section for details.

The classification quiz from Google's Machine Learning Crash Course really helps to deepen the understanding of the metrics reviewed earlier.

F1-score

This metric combines the recall and precision metrics, showing how well an ML method hits precisely. It is their harmonic mean: F1 = 2 · (precision · recall) / (precision + recall).

Good values for F1-score can be considered as follows:

table credit [https://stephenallwright.com/interpret-f1-score/]
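As a sketch, precision, recall, and F1 can be computed directly with scikit-learn. The labels below are made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground-truth and predicted labels (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # TP=3, FP=1 -> 0.75
r = recall_score(y_true, y_pred)      # TP=3, FN=1 -> 0.75
f1 = f1_score(y_true, y_pred)         # harmonic mean of p and r

print(p, r, f1)
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is exactly why it is used as a single combined score.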

Metrics for Regression

In regression tasks, we want to estimate the continuous value of a variable from a set of features, e.g., the price of a car from its features, like maximum speed, brand, fuel consumption per 100 km, etc. Contrary to classification, we use error metrics for regression because we do not have a set of classes. We review the three most common ones here, but there are other metrics implemented within Scikit-learn [linked].

Mean Squared Error (MSE)

It is important to keep in mind that MSE is also a loss function, used by optimization algorithms to find the best model for a set of data points. It is defined as MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²: for every data point, the predicted value is subtracted from the actual value, the difference is squared, and then the mean of these squared differences is taken. By squaring the errors, this metric punishes models for large errors. Its units are the squared units of the target variable.

The question that comes to mind is what range of MSE indicates a good mechanism. The common practice in regression is to compute the MSE of a baseline model first, then compute the MSE of each candidate method, compare them, and choose the best one.
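A sketch of the baseline-comparison practice above, with made-up numbers, using "always predict the mean" as the baseline model:

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
baseline = np.full_like(y_true, y_true.mean())   # baseline: always predict the mean
model_pred = np.array([3.5, 4.5, 7.5, 8.5])      # some candidate model's predictions

mse_baseline = np.mean((y_true - baseline) ** 2)
mse_model = np.mean((y_true - model_pred) ** 2)

print(mse_baseline, mse_model)  # 5.0 0.25 -> the model beats the baseline
```

The raw MSE value (0.25) means little on its own; it becomes meaningful only relative to the baseline's 5.0.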

Root Mean Squared Error (RMSE)

RMSE is the square root of MSE: RMSE = √MSE. Its unit is the same as the dataset's, while it still punishes large errors. When using this metric, we should be careful to respect the same rules we had for MSE.

Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|. It shares its unit with the dataset and does not weight any error more than another.
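The three regression metrics can be computed side by side on hypothetical data:

```python
import numpy as np

# Hypothetical actual and predicted car prices (in thousands)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

errors = y_true - y_pred           # [-2, 2, -3, 1]
mse = np.mean(errors ** 2)         # squared units; punishes large errors
rmse = np.sqrt(mse)                # back to the original unit
mae = np.mean(np.abs(errors))      # original unit; all errors weighted equally

print(mse, rmse, mae)  # 4.5 ~2.12 2.0
```

Note how the single large error (3) pulls RMSE above MAE: that gap is exactly the "punishing large errors" behavior described above.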

Metrics for Image Segmentation

Image segmentation is the task of dividing an image into multiple regions, which are in fact sets of pixels; we can also view image segmentation as pixel labeling. There are various approaches to this task. Our focus here is on some of this area's metrics.

Pixel Accuracy and Mean Pixel Accuracy

In the following metrics, Pᵢᵢ refers to the number of pixels from class i classified correctly as belonging to class i; Pᵢⱼ refers to the number of pixels from class i classified as belonging to class j. K is the total number of classes (segments). Pixel accuracy is the ratio of correctly classified pixels to all pixels: PA = Σᵢ Pᵢᵢ / Σᵢ Σⱼ Pᵢⱼ.

When the per-class pixel accuracy is averaged over all classes (segments + background), the MPA metric comes out: MPA = (1/(K+1)) Σᵢ (Pᵢᵢ / Σⱼ Pᵢⱼ).

For these metrics, values closer to 1 are better. The per-class accuracy gives a sense of how well a segmentation approach works for a specific class, and MPA gives a holistic view over all classes, as it is averaged.
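A small sketch computing PA and MPA on made-up 4×4 masks with two classes:

```python
import numpy as np

# Hypothetical 4x4 ground-truth and predicted masks: 0 = background, 1 = object
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
pred = np.array([[0, 1, 1, 1],
                 [0, 0, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])

# Pixel accuracy: correctly labelled pixels over all pixels
pa = np.mean(gt == pred)

# Mean pixel accuracy: per-class accuracy, averaged over the classes
per_class = [np.mean(pred[gt == c] == c) for c in (0, 1)]
mpa = np.mean(per_class)

print(pa, mpa)
```

Here PA looks good (14/16 correct), but the smaller object class drags MPA down, which is exactly why the averaged metric is useful when classes are imbalanced.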

Intersection over Union (IoU, Jaccard Index) or Mean Intersection over Union (Mean-IoU)

This metric shows how much overlap there is between the predicted segmentation and the ground truth: IoU = |A ∩ B| / |A ∪ B|. Values closer to 1 indicate better performance; Mean-IoU averages the IoU over all classes.
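A minimal IoU sketch on hypothetical binary masks:

```python
import numpy as np

# Hypothetical binary masks (1 = object) for ground truth and prediction
gt = np.array([[0, 1, 1],
               [0, 1, 1],
               [0, 0, 0]])
pred = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [0, 1, 0]])

intersection = np.logical_and(gt, pred).sum()  # pixels where both masks are 1
union = np.logical_or(gt, pred).sum()          # pixels where either mask is 1
iou = intersection / union

print(iou)  # 3 / 5 = 0.6
```

Unlike pixel accuracy, IoU ignores the (often huge) correctly-predicted background, so it cannot be inflated by predicting "background everywhere".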


Metrics for Unsupervised

In unsupervised learning, we focus on finding structure in the dataset because it does not include any labels for the data points. For example, we crawl the internet, gather billions of images on various topics, and want to cluster them. Some metrics used in these approaches are, e.g., Minkowski distances and intra-/inter-cluster distance; the following link provides more information about them.

https://machinelearningmastery.com/distance-measures-for-machine-learning/
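As a sketch, the Minkowski distance of order p generalizes the familiar Manhattan (p=1) and Euclidean (p=2) distances:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, 1))  # p=1: Manhattan distance -> 7.0
print(minkowski(a, b, 2))  # p=2: Euclidean distance -> 5.0
```

Clustering algorithms use such distances to decide which points belong together, so the choice of p directly shapes the clusters that come out.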

Conclusion

In this article, we briefly reviewed ML, then delved into metrics as a means of showing the performance of a method, and looked at some concrete metrics to develop a better sense and understanding of them.


