Chapter 2: Assessing Performance
In the last chapter, we introduced the general machine learning pipeline in the context of linear regression. We learned about the regression model, using gradient descent to learn a predictor that minimizes our quality metric. We also introduced the important concept of the features used by the model, where you can transform your input data to learn more complex relationships (e.g. polynomial regression)11. A quick review of these models:
When we introduced this flexibility of learning more complex relationship in the specific context of polynomial regression, we introduced a subtle challenge that we needed to identify a solution to: If we are able to train a regression model with a polynomial of any degree , how do we know which one to use? Remember, we only have access to the given data, not the true function.
If you have prior information, or a domain expert you're working with gives you information about the model you should use, you should start with that. For example, if you have evidence to support that the underlying function is, say linear22. Example: There is strong empirical evidence that shows there is a linear relationship between femur length and your height., then you should start by trying . Remember models are always assumptions about how the world works, so in some contexts, you might want to be skeptical and try other options.
If you don't have as much expertise in the context your working in, you might need to identify what the degree should be. A fundamental question to choosing the right is discussing how to assess the performance of a predictor so that you can compare these various models.
Given what we've discussed so far, our first instinct might be to us the quality metric (e.g., RSS) on the data the predictor was trained from. Given the predictors (colored curves) in the animation above, which one will have the lowest RSS on the training dataset (the black dots)?33. Occasionally, we will have these expandable boxes that prompt you to think about the answer before reading through. There is an explanation of the answer inside the box, but try to think through it before you click!
The model with the highest degree !
Why is that? Well, with a higher degree polynomial, it is allowed to "wiggle" up and down more. If you keep letting the degree grow, it will eventually be able to be complex enough so that the curve passes perfectly through every point. In other words, a sufficiently high degree polynomial might be able to get an RSS of 0. So following that approach would lead to use selecting the model with the highest degree every time!
This happens because there is a mismatch in how we are assessing our predictor and the goal we set out to accomplish originally. Remember, in many contexts the goal of training a predictor is to use it out in the wild on new data as it comes in44. Like Redfin/Zillow trying to predict the price of a house on a new listing.. If we choose the model that minimizes RSS of the data it learned from, we are just rewarding models that have sufficient complexity to memorize the dataset, and have no guarantee on how well it will generalize.
An analogy: Suppose you studied a specific practice exam for a few hours and afterwards, you were able to answer 100% of the questions correctly after taking it multiple times. Would you expect to get 100% on the real exam based on that practice exam alone? Not necessarily! It's entirely possible that you could have just memorized the specific answers on the practice exam, rather than learning general concepts that enable you to answer related questions that you haven't seen before.
The key idea here is that assessing your predictor on data it encountered while training will likely overestimate its true performance on future, unseen data. The predictor is able to shape its knowledge around these specific examples, so it's more likely to get those ones correct. This is exactly the same as it being easier for you to answer a question on the test that also showed up on the practice test.
So if we care about future performance, how might we go about assessing the predictor? Instead of only considering the error metric like the RSS on the training dataset, we will also consider the true error of our predictor. The true error tries to quantify how severe the errors are that we might expect to see in the future.
A brief segue into theory-land
In order to understand what we intend to capture in a notion of "true error", we have to highlight some additional statistical assumptions we are making. For this discussion, we will stick with the housing example, but these ideas apply more broadly to other contexts.
Not all house square footages are equally likely to show up in the wild. There are probably no homes that have fewer than 10 square feet (although some New York City apartments might feel like an exception). We might expect that there is a distribution over the possible square footages, indicating that some square footages are more likely than others.
On top of that, for any particular square footage, we might expect to see a range of possible prices for the house of that size55. This is why our model always includes a in the relationship between features/output.. It's entirely expected that each square footage has its own distribution of prices; if this were not the case, we would predict the same price for every house, regardless of their size. For example, we would expect the prices for larger homes to trend to be more expensive. This forms what statisticians call a "join distribution", where there is a distribution over the combinations of square footage and price.
To get a visual intuition for what we mean by these distributions, we have a picture below. There is a distribution on the left for the how likely it is to see any particularly square footage 66. 📝Notation: The in the label of the right-hand graph is a conditional. This bar, when talking about probability, indicates we are conditioning on some event occurring. In our example, this is conditioning on one possible square footage.. Then for any one possible square footage, there is another distribution of possible prices. This distribution for the prices is specific to that one square footage: it is entirely expected that different square footages will have different distributions as we said before! Important Note: This picture shows normal distributions for both sides, but this does not necessarily need to be the case. Different assumptions of the distribution yield different models.
One other concept that's generally added when discussing true error is allowing the idea of a loss function . A loss function is a generalization of our concept of RSS we discussed before. The loss function is a function that takes the true outcome and the prediction made by our predictor, and outputs a value to quantify the error made 77. Our RSS used earlier can be used as a loss function for a single input/output . This generalization allows us to consider a broader class of loss function other than just RSS.
With these ideas, we can now define the true error as the expected loss we would see over all possible pairs from the possible inputs () and possible outputs (). This tries to capture how wrong our model will be "on average" over all possible inputs/outputs we can see in the future. The true error is defined as:
This notation should be read exactly as our last paragraph states. It's an expected value of the loss incurred over all pairs.
If the inputs and outputs take on discrete values, we can write the to mean the probability of seeing the pair . We can write the idea of the average loss incurred weighted by the probability with the formula88. 📝 Notation: We use to say some element in a set of possible elements . Then, the sum is the sum of values "looping" over every possible element in .:
This definition should be reminiscent of a formula for an expectation (since that is what it is computing), with a few modifications. Now, there is the added complexity of dealing with the pairs which requires the nested sum. If this sum is large, that means "on average", the model incurs high loss (i.e. has lots of error)
The details of the specific notation is not the main point here. It's important to get the intuition behind what this value is trying to compute. So with that in mind, our task of selecting the model that generalizes best, is exactly the task of finding the model with the lowest true error.
Unfortunately in most real-life circumstances, it's not possible to compute this true error! You might not know the exact distributions of the houses or the distribution of prices conditioned on a particular house size. Since we don't know the distribution, there is no way we can exactly compute this expectation. So without access to all future data, how can we actually compute the true error?
Back to practice
A very common technique in machine learning suggests that if you aren't able to exactly compute something, you can try to estimate it using data you have. That's what we will do here. The idea is to hide part of our dataset from our ML algorithm and use that hidden data as a proxy for "all future data" after the predictor is trained.
The most common way to accomplish this is to split your dataset into a training set and a test set.
- The training set is used by the ML algorithm to train the predictor.
- The test set is used the estimate the performance of how we expect the predictor to do on future data.
So even though the value we really care about is the true error, we will use this error computed from the test set as a stand-in for that value. We call the error made by the model on the test set the test error, which we can compute. In the case of regression using RSS as the loss function99. 📝 Notation: We use a new notation for to signify that it is the predictor defined by our estimates for the coefficients by saying that . Just two notations for the same thing, but the second is more explicit in what is estimated!
We use to mean the same thing as or ., the test error is defined as:
More generally, a common definition of the test error is the average loss for whichever loss function you are using1010. 📝Notation: We use the notation to mean the number of elements in the set . So is the number of test examples we are using..
Our hope is that by using a large enough test set, we can reasonable approximate the true error. So how big of a test set is big enough?
Well in some sense, you want as big of a test set as possible since the more examples in your test set, the better estimate of the true error you will get. You can think of the extreme case where your test set contains all possible input/outputs, then you will exactly be able to compute the true error.
However, because you only have finite data, by making your test set larger you will need to make your training set smaller. This can cause problems since we want as much training data possible to give us the best possible estimate of the true function. Consider the animation below that compares a small training dataset to a large one.
TODO(manim): Train data size
In practice, people generally use a ratio of 80% train and 20% test or 90% train and 10% test, but this really depends on your context and how much data you have! Two very important points about this test set:
- When splitting a train/test set, you should do so randomly. If you selected the last 20% of the data as a test set you could potentially introduce biases in the test set. Imagine your data was sorted by square footage from smallest to largest. If you selected the last 20% of the rows as a test set, your test set would be entirely larger houses than the ones you trained on. This is not ideal since we wanted the test set to be a stand in for "all future data", which is not entirely large houses.
- Once you have put data in your test set, you must never train your predictor using those examples. You have to guard those test examples away from the ML algorithm with your life! If you fail to keep them separate and you accidentally train your predictor on the test set, you will no longer have a good estimate of the true error!
In this section, let's explore the relationship between a model's complexity (e.g., its degree and its various types of error. By complexity of the model, we have a hand-wavy notion of the learned predictor's ability to learn more complicated relationships in the data (like a high degree polynomial); so in our regression example, a simple model is one like a line while a complex model is a high degree polynomial.
The animation below shows the calculation of the training error as we increase the degree of the polynomial . We draw a small number of points in the animation and then extrapolate the learning curve for all complexities in between. As we increase the complexity, the model has more and more capacity to perfectly match the data, so as we might expect, the training error would decrease.
Now consider what happens to the true error1111. We will mention what happens to test error in a bit as we change this complexity. Remember, we can't compute the true error in most contexts, but it's still useful to think about.
At first, when the model is too simple, the true error is higher. Suppose the simple model you tried at first, was a linear function but the true function was a quadratic. This linear predictor will accrue true error since it is incapable of learning the curve-like pattern, no matter how much training data it had. As the complexity of the model approaches the complexity of the true function, we expect the true error to go down1212. This is assuming our training set is representative of the population. Usually, an assumption we have to make for the idea of using ML in the first place..
As the complexity continues to increase, we see the true error go up again. Think to the curve learned from the high-degree polynomial. Because it is capable of hitting the training points perfectly, the areas in-between get these wild wiggles in the learned predictor. These in-between areas are a source of error since the learned predictor doesn't match the true function. In the next section, we explain a more formal way of articulating these sources of error.
The animation below shows the same learning curves but on the same axes. The same patterns we saw before are present. One addition is the test error being drawn over the true error. We assume, with a large enough test set, that the test error is a good estimate of the true error so they should be close, but not necessarily the same.
TODO(manim): Combined animation of train/true/test
The model with the lowest true error is the optimal model, which we notate as 1313. 📝 Notation: A denoting a variable usually means "optimal". . Unfortunately, we normally don't have access to the true error, which poses a challenge for selecting the optimal model. You might think that we can use the test error to choose the best model since it's estimating the true error. While that does seem reasonable, we will show later in this chapter why that won't work out 1414. Never use the test error to select which complexity model you should use..
We generally have special terms to identify the performance of a model: overfitting and underfitting. We will give a more formal definition of overfitting below, but as a general intuition, models less complex than tend to underfit while models more complex than tend to overfit.
Overfitting happens when your model matches too closely to the noise in the training data rather than learning the generalized patterns. The formal definition of overfitting says a predictor is overfit if there exists another predictor that has the following properties:
In English, this says a model is overfit if there is another model that has higher training error, but lower true error. This means that the model you are using is too specific the training data you have: hence the term "over fit".
You can imagine the definition of underfit is similar, and we leave it out as an exercise for you to generate.
So for many models, there is essentially a knob we can tune to balance the complexity between underfitting and overfitting. For our polynomial regression, it is the degree of the polynomial . As you make the model more complex, it is easier to overfit. As you make the model less complex, it is harder to overfit and more likely to underfit. In the next section, we will talk about how to tune this knob so it is "just right", but first we will introduce some terms to describe why overfitting and underfitting happen.
Whenever we are using machine learning to model the world, we need to balance the signal and the noise that are present in our data1515. The Signal and the Noise: Why So Many Predictions Fail, but Some Don't - Nate Silver, 2012 . It's impossible to tell from a single example what parts of it result from the underlying model and what are contributed by noise. Whenever are a learning, we need to balance trying to fit to the data we are training on with the idea that we don't want to overfit to the specific noise patterns in this one dataset. In the regression model, we can decompose our true error into three components: bias, variance, and irreducible noise.
Before defining these terms, we have to highlight a specific assumption we have been making. We have been assuming the training dataset is a random sample from some underlying population distribution. Since it is a random sample, you could imagine it is just as likely that we would receive another dataset drawn from the same distribution that will look slightly different1616. For example, if I gave you a dataset of 100 coin flips it's just as likely to see a dataset of 52 heads and 48 tails as it is to see a dataset with 48 heads and 52 trails; both are drawn from the same underlying distribution, but slight differences are expected to happen from chance..
When thinking about machine learning, we are thinking that the data is generated from a process following the model we assume. So for the regression model, we assume that for each in our dataset, its corresponding for some unknown . So since there is randomness not only in which inputs we receive, but in their associated output, we will expect to learn different predictors if we trained on different datasets drawn from the same distribution. We can think about what is the "average predictor" if we drew a bunch of different training sets, trained a predictor from each one, and averaged the results. The animation below shows this process and what this average predictor looks like1717. 📝 Notation: We use the notation to mean average..
The bias of a model comes from it being too simple (or a mismatch with reality) that it fails to fit the signal in the data1818. This does not necessarily line up with our every-day use of the word bias (e.g., discriminatory actions/views against certain groups). While it is crucial to avoid that type of bias in our models, this is not what the statistical notion of the term "bias" necessarily means. We will talk about our every-day notion of bias in the Fairness in ML chapter.. In some sense, this signifies a fundamental limitation of the model we are using to fail to fit the underlying phenomena. The bias tries to measure how much the average predictor we will expect differs from the true function .
Mathematically we write this as the following. This definition is tries to capture how much this average predictor differs over the true function. The expectation is taken over the possible inputs you could receive (i.e. weighted to care more about inputs that you are more likely to see).
Low complexity (simple) models tend to have high bias which is why the tend to have higher true error if they are too simple.
A model that is too complex for the task at hand has the potential to overfit to the noise in the data. The flexibility the complex model allows the model to potentially memorize rather than generalize. This error that comes from fitting to noise is attributed as variance of the model.
For our very complex model, a slightly different dataset will lead to a wildly different predictor since the function wiggles a lot between the points. The differences we see in the predictors learned on slightly different datasets is another sign of error. These differences account for fitting to specific artifacts in the one training set we learned on.
Mathematically we write the variance as the following. It tries to capture how much we expect any particular fit on a dataset to differ from the average. If this value is high, it means that we will expect to learn wildly different functions from dataset to dataset (even if they are relatively similar). If this value is low, it means that on each dataset, we learned about the same function (close to the average ). The expectation here is over all possible datasets you could receive as training data.
High complexity models tend to have high variance. Or in other words, we call models that have high variance "complex" to describe that behavior.
It turns out that bias and variance live on this complexity spectrum: decreasing one of bias/variance tends to increase the other.
- Simple models tend to have high bias and low variance
- Complex models tend to have low bias and high variance
In fact, in the case of measuring squared error with regression, you can exactly decompose the true error into contributions from bias and variance1919. Don't worry about the squared business in the equation, just the idea that we can decompose the error.. The noise in this equation corresponds to the distribution of the : if there is lots of underlying noise, there isn't much we can do about making a good predictor.
The following animation shows how the bias and variance change with model complexity, and how those two with noise (which is independent of model complexity) add up to the true error curve we saw earlier.2020. Notice this graph has some of the common things we mentioned earlier about the tendency of low vs high complexity models and their bias/variance.
One subtle point we have been leaving out is the discussion of the amount of data we have to train on in the first place. All of this earlier discussion assumes some fixed, finite dataset size. In fact, the model's complexity is relative to how much data you have. For example, it's really easy for a 20 degree polynomial to overfit to 2 data points, but very difficult for it to overfit to 2 billion data points (it doesn't have the capacity to wiggle 2 billion times).
You can imagine thinking about what happens to our training and true error as we increase the amount of data we have in our training set. For this animation, we are considering some fixed model complexity (e.g., a 20 degree polynomial) and showing train/true error as function of the size of the training set.
TODO(manim) animating fitting 2 points vs 100
The training error starts out small when it is easy for the model to overfit to a small training set. When the training set is small and the model can overfit, we expect to have a higher true error (because it is overfitting). As you increase the training set size, it becomes harder and harder for a fixed-complexity model to overfit once the amount of data exceeds its flexibility2121. Remember our example of 20-degree polynomial's complexity is relative to how much data you have.. This why we see the training error increase as you tend to have more training data. Conversely, since the model struggles to overfit with a large dataset, you see the true error decrease because the variance is actually going down: with a very large dataset, you learn approximately the same function each time.
In the limit, when you have infinite training data, you would expect these curves to meet. This is because having every possible input/output means your training error is just computing the true error! Notice, they don't converge to 0. There is a floor they converge to based on the bias and noise of the model. Irreducible noise will never go away. If you are using a model with high bias2222. Using a linear model when the true function is, say, a cubic function then, no matter how much training data you have, you will not be able to learn a sufficiently complex function so you will always have some non-zero error.
Choosing Model Complexity
So we introduced this idea of training error and true error (and how we approximate it with test error), and where that error can manifest as overfitting or underfitting from the contributions of the model's bias and variance. Now that we understand that model complexity can impact the performance of the model, how do we actually go about picking the right complexity if we don't have access to this true error?
What about choosing the model with the lowest training error? Hopefully from our discussion in the last sections, you can see why this won't find the model that will perform the best in the future.
So maybe we could just choose the model that has the lowest test error. This seems more intuitive since the test error is our approximation of the true error and our goal is to find a model that will do best in the future (i.e. lowest true error). This approach is right in some sense, since we are unlikely to choose a model that is overfit. However, this approach completely ruins the point of the test set.
Remember, we introduced the idea of a test set to approximate how our model would do in the future. If you found that your predictor got 90% of the examples in the test set wrong, you would be pretty confident that the future performance will be close to 90% error too, assuming the test error is a good approximation of true error.
However, if we use the test set to choose the model complexity, the test error is no longer a good estimate of the true error. If we used the test set to select the model complexity, the test set is no longer a good stand-in for "the unknown", since we implicitly chose the model that does best on that one particular test set.
This is a fairly subtle point that we should restate for emphasis. Many people intuitively understand why we have to separate the test set out from the training set: in order to keep it as good estimate of future performance. They tend to get a little confused by the introduction of this second restriction that you can't use the test set to select the model complexity. Remember the dataset that we receive (and the test set we select from it) are just one possible dataset from a large number of possible datasets that could be drawn from the population distribution. We are just using the test set as a stand-in for "the future", but this only works if we never look at it or train the model on it. But by using the test set to select model complexity, you are implicitly choosing which model is best based on the data in that specific test set we used. Even though you never explicitly gave it to the model to train, you might be choosing a model that happens to do well on that specific test set.
Fear not though! There are many principled ways for helping you choose this model complexity. The two popular techniques we discuss are using a validation set and cross validation.
It turns out that your intuition for using a dataset separate from the training set is a very good intuition! The only shortcoming we had earlier was to use the test set as that separate dataset. So instead, let's break off yet another part of our data into a validation set. This validation set is also withheld from training, and we use it to select which model we think will perform best in the future. By using this validation set to select the best model complexity, we can still then use the test set afterwards to get an unbiased estimate of the true error.
The process for selecting the best model using a validation set almost always follows the following pseudocode.
train, test, validation = split_data(dataset)
for each model complexity p:
predictor_p = ml_algorithm(train, p)
val_error = error(predictor_p, validation)
keep track of predictor_p with lowest val_error
return best predictor_p and the error(predictor_p, test)
This process of using a validation set is just one many techniques to select the best model complexity from a set of possible ones (we will explore one more in the next section). Like any technique, it has pros and cons of using it.
- The pros of using a validation set are its simplicity. It's relatively simple to explain. Additionally, it's fairly efficient. For each model complexity, we only have to train one predictor (our next technique requires multiple trainings per complexity class).
- The cons of using a validation set come from the fact that we are forced to set aside another chunk from our data. There was already this tension of needing to balance between the amount of training and test data. Now, we have that same tension but with the additional validation set!
To avoid having to cut up our dataset even further, another common technique is to do cross validation. Cross validation is a clever idea to split up the training set (after we have split off the test set) into many small datasets that we use for training and validation.
We use each chunk of the training data in a process to validate the model. We repeatedly train a model using all but one chunk of the available chunks, and then validate the learned predictor on the left-out chunk. This process is repeated, leaving each chunk out once while training on the others.
So for the image above, in order to evaluate a single model complexity, we will end up training four separate predictors:
- Train on Chunks 1,2,3 and Validate on Chunk 4
- Train on Chunks 1,2,4 and Validate on Chunk 3
- Train on Chunks 1,3,4 and Validate on Chunk 2
- Train on Chunks 2,3,4 and Validate on Chunk 1
We can then average the validation error over those 4 chunks to estimate which model will perform best. This is just to assess a single model complexity, so this process needs to be repeated for each model complexity. In general, we don't necessarily use 4 chunks but decide a setting of chunks to use. More on how we go about picking later.
We specify this process a little more formally with pseudo-code:2323. We use the notation
chunks \ chunk_i to signify all chunks that aren't
chunk_1, ..., chunk_k, test = split_data_cv(dataset)
for each model complexity p:
for i in [1, k]:
predictor_p = ml_algorithm(chunks \ chunk_i, p)
val_error = error(predictor_p, chunk_i)
avg_val_error = average val_error over chunks
keep track of p with smallest avg_val_error
return a new predictor trained on the whole dataset (all chunks)
and evaluate it on the test set
2424. It's okay to train on the whole training set now that we have selected a model. Don't train on test though!
Just like with using a validation set, cross validation has its own pros and cons. It turns out the pros and cons are essentially swapped from the validation set case.
- The pros of using cross validation come from the fact that you don't have to throw out any more data for a validation set. Cross validation is then a good technique to use if you have a fairly small dataset where it would be too much to separate off a whole validation set.
The cons of using cross validation come from performance. For each model complexity, we need to train separate predictors. This can be quite slow if you make large or are evaluating many possible model complexities.
Unfortunately, it turns out that the larger is (i.e., evaluating on more chunks), the better the final predictor ends up being. The theoretically best use of cross validation is to use (where each chunk is a single point). This is called Leave One Out Cross Validation since each time, we are leaving out one example.
For large datasets, you can imagine Leave One Out Cross Validation can be quite slow 2525. If you have 20,000 training points, you would need to train 20,000 predictors per model complexity!which means it is impractical in many contexts. A common choice then is something like or to balance good estimation with the feasibility of it actually running.
In this chapter, we introduced the ideas behind assessing the performance of our models. Specifically, in the context of comparing models of different complexities.
We looked at understanding how the complexity of our model (in our case the degree of the polynomial) impacts the error. We saw how a model can underfit or overfit and how that interacts with how much data we have. We finally saw how this true error can decomposed into these sources of underfitting and overfitting in the form of a model's bias and variance (and irreducible noise).
We then discussed techniques for choosing the right model complexity. Namely, we discussed using a validation set and cross validation as possible approaches to choosing the right model complexity. We compared the two approaches for their pros and cons. Regardless of which technique you use, it is extremely important to understand why we need other approaches like this instead of relying on the test set.
While we focused on the context of polynomial regression and choosing a degree polynomial in this chapter, you will see in this course almost every machine learning problem we will encounter requires these ideas from model selection2626. Another term for "model selection" or "model complexity selection" is hyperparameter tuning. We will introduce this terminology later.. In fact, many current threads of research in the space of machine learning (deep learning in particular) are all focused on how to tune the model's complexity in more efficient ways and how to prevent overfitting2727. We will briefly talk about some more advanced approaches near the end of the course..