Cracking the Machine Learning Interview — Binary Classification Metrics

Zhibing Zhao
4 min read · Aug 18, 2022

Binary classification is a classic problem, and probably the simplest supervised learning problem. It is also the most popular problem in machine learning interviews. In a binary classification task, you want to predict a yes/no label from given features. Binary classification has broad applications, including disease diagnosis, spam detection, and more. This post is not a tutorial on binary classification; rather, it goes over the aspects of binary classification metrics that come up most often in interviews.

If binary classification is part of the interview, you will almost certainly be asked about metrics. This is fundamental knowledge, but it can be very confusing if you don't have enough hands-on experience. The core concept is the confusion matrix, which I will not draw in this post because I can never memorize exactly what the matrix looks like, and no interviewer has ever asked me to draw it. So let's skip the matrix and dive into each of its elements.

Each element is a two-word phrase. The first word is “True” or “False”, which tells whether the model's prediction is correct, and the second word is “Positive” or “Negative”, which tells what the model predicts. More concretely, “True Positive” counts the cases where the model predicts “Positive” and the label is also “Positive”; “True Negative” counts the cases where the model predicts “Negative” and the label is also “Negative”; “False Positive” counts the cases where the model predicts “Positive” but the label is “Negative”; and “False Negative” counts the cases where the model predicts “Negative” but the label is “Positive”. Hopefully you are not confused so far. If you are, read the definitions in this paragraph again. These counts are the building blocks for the popular binary classification metrics.
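To make the definitions concrete, here is a minimal sketch in Python that counts the four quantities from a list of true labels and a list of predictions. The y_true and y_pred lists are made-up example data, with 1 meaning positive and 0 meaning negative.

```python
# Counting TP, TN, FP, FN from hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)

print(tp, tn, fp, fn)  # 3 3 1 1
```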

The first important metric is accuracy, the fraction of correctly predicted examples: (True Positive+True Negative)/(True Positive+True Negative+False Positive+False Negative). Knowing the definition is usually sufficient.
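As a quick sanity check, accuracy follows directly from the four counts; the numbers below are made up just to illustrate the arithmetic.

```python
# Accuracy from the four confusion-matrix counts (illustrative numbers).
tp, tn, fp, fn = 40, 45, 5, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```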

The most important pair of metrics is precision and recall, and you may be asked to explain them in plain language. Precision (True Positive/(True Positive+False Positive)) describes how likely a data point is to be truly positive among the cases where the model predicts positive. Recall (True Positive/(True Positive+False Negative)) describes how likely a data point is to be predicted positive among the cases whose labels are positive. Both precision and recall lie between 0 and 1, and higher values mean better models. I suggest memorizing the explanations and using them to recall the formulas, which is exactly what I did while typing this paragraph. You may have heard of Type I Error and Type II Error if you have some background in statistics. They are closely related to precision and recall but much less intuitive, and in my experience I was never asked about them.
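Here is a small sketch of both formulas, again with made-up counts, so you can see how the plain-language explanations map to code.

```python
# Precision and recall from the four counts (illustrative numbers).
tp, tn, fp, fn = 40, 45, 5, 10

precision = tp / (tp + fp)  # of everything predicted positive, how much is truly positive
recall = tp / (tp + fn)     # of everything truly positive, how much does the model find

print(round(precision, 3), round(recall, 3))  # 0.889 0.8
```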

There is a tradeoff between precision and recall. For example, if your model predicts more positives, you are likely to observe an improvement in recall but a regression in precision. Yet in a typical machine learning problem, you can only optimize for one metric: if you focus solely on precision, the recall of your model can be very bad. If you don't want either precision or recall to be too bad, consider the F1 score, which is the harmonic mean of precision and recall (2*precision*recall/(precision+recall)).
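And the F1 score follows from those two numbers; the precision and recall values here are the illustrative ones from the previous sketch.

```python
# F1 score: the harmonic mean of precision and recall (illustrative numbers).
precision, recall = 0.889, 0.8
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.842
```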

Another important metric is AUC (Area Under the Curve), which is the area under the ROC (Receiver Operating Characteristic) curve. I find this part the hardest to memorize, and I always review an online tutorial about AUC before a technical interview. For completeness, I would like to cite a plot:

By cmglee, MartinThoma — Roc-draft-xkcd-style.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=109730045

This plot has everything I want to include, so it saves me some energy. I would like to thank the authors for the great work. Please see the caption for the source of this figure.

A machine learning model usually does not predict positive or negative directly, but outputs a score between 0 and 1. You can pick a threshold so that a data point is predicted positive if the model output is above the threshold and negative otherwise. The ROC curve is plotted by varying this threshold: for each threshold value, you can calculate the False Positive Rate and True Positive Rate, which give one point on the curve, and the whole curve is obtained as the threshold sweeps from 0 to 1. AUC is the area under the ROC curve, a measure of the model's performance aggregated over different threshold values. AUC is usually interpreted as the probability that the model ranks a randomly chosen positive example above a randomly chosen negative example. The larger the AUC, the better the model.

A few things are worth emphasizing about this ROC curve. The X-axis is the False Positive Rate (False Positive/(False Positive+True Negative)), and the Y-axis is the True Positive Rate, which is the same as recall. The perfect point is the top left corner, and a better curve bends closer to that corner.
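To tie the pieces together, here is a minimal, self-contained sketch that sweeps a threshold over some made-up model scores, collects (FPR, TPR) points, and integrates the curve with the trapezoidal rule. In practice you would typically reach for sklearn.metrics.roc_curve and roc_auc_score, but the hand-rolled version makes the threshold-sweeping idea explicit.

```python
# A minimal sketch of ROC and AUC on made-up scores and labels.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]                   # true labels
scores = [0.1, 0.4, 0.35, 0.8, 0.5, 0.7, 0.2, 0.9]  # model outputs in [0, 1]

def tpr_fpr(threshold):
    """True Positive Rate and False Positive Rate at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, y_true))
    return tp / (tp + fn), fp / (fp + tn)

# Sweep thresholds from high to low; start at (0, 0), where nothing is predicted positive.
points = [(0.0, 0.0)]
for t in sorted(set(scores), reverse=True):
    tpr, fpr = tpr_fpr(t)
    points.append((fpr, tpr))

# AUC via the trapezoidal rule over the (FPR, TPR) points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))  # 0.875 for this toy data
```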

There are many other metrics as well, but I believe the ones covered in this post are sufficient for most machine learning interviews. I would also like to mention that I have another post which provides an overview of machine learning interviews:

And I will write about other special topics in machine learning interviews. If you are interested, please stay tuned!
