by Daniel Hsu

The practice of deep learning has attracted new attention to several basic questions in statistical machine learning. One such question is how fitting machine learning models to relatively small “training” data sets can lead to accurate predictions on new data. In machine learning jargon, this is the question of *generalization*.

The conventional wisdom in machine learning offers the following about generalization:

- A model that is too simple will
*underfit*the true patterns in the training data, and thus, it will predict poorly on new data. - A model that is too complicated will
*overfit*spurious patterns in the training data; such a model will also predict poorly on new data.

Consequently, one should choose a model that balances these concerns of underfitting and overfitting [30]. A textbook corollary is that a model that exactly fits — i.e., *interpolates *— the training data “is overfit to the training data and will typically generalize poorly” [17].

Recent machine learning practice appears to eschew this conventional wisdom in a dramatic fashion. For instance, a common starting point for training a neural network is to find a model that exactly fits the training data [27]. (Typically, the model is subsequently fine-tuned using different criteria, but the starting point is already nontrivial.) While this may be difficult or impossible with small neural networks, it becomes substantially easier after making the network very large. It is remarkable that this practice is compatible with generalization, despite the concerns about overfitting.

This apparent contradiction of conventional wisdom is not new. In the last century, machine learning practitioners were already aware of the efficacy (and advantages) of using “oversized” neural networks where the number of model parameters exceeds the number of training data [9, 20]. Similar observations were made about large ensembles of decision trees: it was found that increasing the number of decision trees in an ensemble could improve accuracy, even beyond the point at which the ensemble exactly fit the training data [28]. These empirical observations motivated the development of a new theory that greatly sharpens the existing body of knowledge in statistical machine learning [1, 28], and this theory has been recently extended to deep neural networks [2, 15, 26].

However, the new theory does not fully explain recent observations about interpolating neural networks. First, in more recent experiments, neural networks are trained to interpolate noisy data [34]. By noisy data, we mean data in which the prediction targets (labels) may be erroneous. In fact, experiments have been conducted in which noise is deliberately injected in the data by assigning random labels to subsets of the training data. Neural networks that interpolate these data were nevertheless found to have nontrivial accuracy on new data. Second, similar phenomena have been observed in the context of predicting real numerical targets. (The aforementioned theoretical and empirical studies mostly focused on classification targets.) When predicting real numbers, it is almost always a given that training data will have noisy labels, so the thought of interpolating training data seems even more preposterous. But again, in recent experimental studies, interpolating (or nearly interpolating) models were found to have nontrivially accurate predictions on new data [4, 7].

CONTINUE READING