Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised.
Suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2, \ldots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, \ldots, X_p)$, which can be written in the very general form:

$$Y = f(X) + \epsilon.$$

Here $f$ is some fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero. In this formulation, $f$ represents the systematic information that $X$ provides about $Y$. Since we generally do not know $f$, statistical learning is the set of approaches for estimating $f$.
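To make the formulation concrete, the following minimal sketch simulates data from this model with a known $f$ and a mean-zero Gaussian error term. The particular choice of $f$, the sample size, and the noise level are purely illustrative assumptions (Python with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative "true" f, which in practice would be unknown.
def f(x):
    return 2.0 + 3.0 * np.sin(x)

n = 200
X = rng.uniform(0.0, 10.0, size=n)            # predictor values
eps = rng.normal(loc=0.0, scale=1.0, size=n)  # error term: mean zero, independent of X
Y = f(X) + eps                                # the general model Y = f(X) + eps

# Because the error has mean zero, averaging Y for observations near a point x
# approximates f(x) -- the systematic information that X provides about Y.
x0 = 5.0
near_x0 = np.abs(X - x0) < 0.5
print("f(5.0) =", f(x0), " local average of Y near 5.0 =", Y[near_x0].mean())
```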
There are two main reasons that we may wish to estimate $f$: prediction and inference.
One might reasonably ask: Why would we ever choose to use a more restrictive method instead of a very flexible approach?
If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model is clearly a good choice since it will be quite easy to understand the relationship between $Y$ and $X_1, X_2, \ldots, X_p$.
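As a sketch of this point (assuming Python with scikit-learn and simulated data whose linear structure is chosen purely for illustration), the fitted coefficients of a linear model can be read directly as the association between each predictor and the response:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated predictors X1, X2, X3 and a response with a known linear structure.
n = 500
X = rng.normal(size=(n, 3))
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)  # X3 is irrelevant

fit = LinearRegression().fit(X, Y)

# Each coefficient is directly interpretable: the expected change in Y for a
# one-unit increase in that predictor, holding the other predictors fixed.
print("intercept:", round(fit.intercept_, 3))
for name, coef in zip(["X1", "X2", "X3"], fit.coef_):
    print(f"{name}: {coef:.3f}")
```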
In contrast, very flexible approaches, such as support vector machines, bagging, and boosting, can lead to such complicated estimates of $f$ that it is difficult to understand how any individual predictor is associated with the response.
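By way of contrast, the following sketch fits one such flexible method, gradient boosting via scikit-learn (an illustrative choice, not a prescribed one), to the same kind of simulated data. The resulting estimate of $f$ is an ensemble of many trees, with no per-predictor coefficient to read off:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

# Same style of simulated data as above, now fit with a flexible boosting model.
n = 500
X = rng.normal(size=(n, 3))
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

boost = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X, Y)

# The fitted estimate of f is a sum of hundreds of regression trees; it can
# predict well, but there is no single coefficient per predictor to interpret.
print("number of trees in the fitted estimate:", boost.estimators_.shape[0])
print("predicted Y at X = (1, 0, 0):", boost.predict([[1.0, 0.0, 0.0]]))
```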