When introduced to machine learning, practically oriented textbooks and online courses focus on two major loss functions, the *squared error* for regression tasks and *cross entropy* for classification tasks, usually with no justification for *why* these two are important. I’ll here show that they’re both instances of the same concept: maximum likelihood estimation.

## Revisiting old friends

Before we dive into why we might be interested in these loss functions, let’s ensure that we’re on the same page and quickly recall how they are defined. Whenever we have two vectors $x,y\in\mathbb R^n$, the **mean squared error** is given as

and, assuming now that $x\in{0,1}^n$ and $y\in(0,1]^n$, the **binary cross-entropy** is defined as

In Scikit-learn we find these as `sklearn.metrics.mean_squared_error`

and `sklearn.metrics.log_loss`

, in Keras as `keras.losses.mean_squared_error`

and `keras.binary_crossentropy`

, and in PyTorch as `torch.nn.MSELoss`

and `torch.nn.BCELoss`

.

## A statistical point of view

To explain why these two losses achieve what we want, we first need to agree on what *exactly* it is that we want to achieve. Let’s consider a running regression example. In this case we’re trying to estimate the value of a variable, which for instance could be the number of active Twitter users worldwide in given quarter:

We assume here that there is a *true* answer, meaning that there is a distribution which will accurately model the number of Twitter users throughout all time. We only know these true values for historic data and we’d like to be able to predict the future values.

Let us write $X$ for the random variable following the true distribution, meaning that $X(q)$ will be the actual number of Twitter users in a given quarter $q$. Say that we have chosen our favourite model $\hat X_{\theta}$ with parameter $\theta$, which attempts to model the historic data we have available.

## Mean squared error

We will now assume that the errors our model is making is *normally distributed* around the true values. This assumes that our model is a reasonable choice, of course. If our model simply output the value $1$ for all quarters then this would be far from true, so we’re simply excluding that possibility.

If we further assume that our observations are independently drawn from this distribution, then this means that we can write the density function of $\hat X_{\theta}$’s distribution as

\[\prod_{q=1}^{37}\mathcal{N}(X(q),\sigma^2) = \prod_{q=1}^{37}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(\hat X_{\theta}(q)-X(q))^2\right).\]By finding the value for our model parameter $\theta$ which maximises this value we’re thus effectively maximising the probability of our model being correct, which is what we want! This is called **maximum likelihood estimation**, where the above expression is our likelihood function in this case.

The exponentials are a bit annoying to deal with, so instead of dealing with the above expression we’ll apply a logarithmic transform and we wind up with the **log-likelihood function**

As the logarithm is strictly increasing our job is equivalent to maximising this new expression, and as we’re maximising with respect to $\theta$ we can ignore the constants. If we also flip the sign then we’re left with the job to *minimise*

which is exactly the squared error! This shows that

*Minimising mean squared error is equivalent to maximising likelihood, assuming that our errors are normally distributed.*

## Cross entropy

The story about cross entropy turns out to be surprisingly similar to the squared error. To see this, recall the notion of expected value of a random variable:

\[EX := \int xp(x)d\mu,\]where $p(x)$ is the density function of $X$’s distribution. If we assume that our probability space is ${1,2,\dots,n}$ with $\mu$ being the discrete probability measure, then the expectation is simply the average:

\[EX = \sum_{k=1}^n p(k)X(k) = \frac{1}{n}\sum_{k=1}^nX(k).\]With this is mind, we can rewrite the mean squared error as $E[(X-\hat X_\theta)^2]$, i.e. when we’re minimising the squared error then we’re reducing the expected value of the squared distance between our predictions and the true values.

Here I was taking $\mu$ to be the discrete probability measure, giving equal probabilities to all the $n$ outcomes. In the mean squared case this translates to giving equal weight to the squared differences throughout all quarters: whether our prediction is wrong in the beginning of 2015 or end of 2018 doesn’t really matter to us.

But what if we’re dealing with a different distribution? When we’re dealing with classifications our true variable $X$ follows a probability distribution: If, say, 25% of the values are true and the rest false, then $X$ follows the distribution with density function

\[p(k) = \left\{\begin{array}{ll}\tfrac{4}{n} & \text{if the $k$'th observation is true}\\ 0 & \text{otherwise}\end{array}\right.\]and $1-X$ then follows the distribution with density function $1-p$. Using these facts, and letting $\text{supp}(X)$ stand for the **support** of $X$, meaning the values for which it is non-zero, we can now rewrite the cross entropy as

which is precisely the two (negative) log-likehood functions corresponding to our two classes! This means that, just as before, *minimising* the cross entropy between the true variable $X$ and the predicted variable $\hat X_\theta$ is equivalent to *maximising* the likelihood that the predicted variable follow the two distributions of the true variable, corresponding to the two classes. The case with more than two classes is completely analogous, we’ll just end up with more distributions.

So there you go, it’s really all about maximum likelihood estimation!