Overfitting is inherent to machine learning in general and is one of the most significant challenges you will face when applying deep learning specifically. Overfitting occurs when a model fits patterns in the training data so closely that it incorporates non-generalizable noise or scientifically irrelevant perturbations into the relationships it learns. In other words, the model fits patterns that are overly specific to the data it is trained on rather than learning general relationships that hold across similar datasets. This subtle distinction is made clearer by seeing what happens when a model is tested on data to which it was not exposed during training: just as a student who memorizes exam materials struggles to correctly answer questions for which they have not studied, a machine learning model that has overfit to its training data will perform poorly on unseen test data. Deep learning models are particularly susceptible to overfitting due to their relatively large number of parameters and the associated representational capacity. Just as some students may have greater potential for memorization, deep learning models seem more prone to overfitting than machine learning models with fewer parameters. However, having a large number of parameters does not always imply that a neural network will overfit [@doi:10.1073/pnas.1903070116].
In general, one of the most effective ways to combat overfitting is to detect it in the first place. One way to do this is to split the main dataset being worked on into three independent parts: a training set, a tuning set (also commonly called a validation set in the machine learning literature), and a test set. These three partitions allow you to optimize models by iterating between model learning on the training set and hyperparameter evaluation on the tuning set without affecting the final model assessment on the test set. A researcher can compare the model's performance on the training and tuning data to assess how overfit (i.e. non-generalizable) the model is. The data used for testing should be "locked away" and used only once to evaluate the final model after all training and tuning steps are completed. This type of approach is necessary for evaluating the generalizability of models without the biases that can arise from learning and testing on the same data [@arxiv:1811.12808; @doi:10.1162/089976698300017197]. While a slight drop in performance from the training set to the test set is normal, a significant drop is a clear sign of overfitting. See Figure @fig:overfitting-fig for a visual demonstration of an overfit model that performs poorly on test data.
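As an illustration, the sketch below uses scikit-learn to carve out a single-use test set and a tuning (validation) set and then compares training and tuning performance; the dataset, model, and split proportions are arbitrary choices for demonstration rather than a prescription.

```python
# Minimal sketch of a three-way split for detecting overfitting.
# The dataset, model, and split proportions are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve out the single-use test set, then split the remainder
# into training and tuning (validation) sets.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# A large gap between training and tuning accuracy signals overfitting.
print("train accuracy: ", model.score(X_train, y_train))
print("tuning accuracy:", model.score(X_tune, y_tune))
# The test set is evaluated only once, after all training and tuning
# steps are complete:
# print("test accuracy:", model.score(X_test, y_test))
```

In practice the training/tuning comparison is repeated across every round of hyperparameter selection, while the commented-out test evaluation happens exactly once at the end.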
There are a variety of techniques to reduce overfitting, including data augmentation and regularization [@url:http://jmlr.csail.mit.edu/papers/v15/srivastava14a.html; @tag:krogh-weight-decay]. Another approach, described by Chuang and Keiser, is to estimate the baseline level of memorization by training on data whose labels have been randomly shuffled [@doi:10.1021/acschembio.8b00881]. If the model performs no better on the real data than on the label-shuffled data, then any apparent predictive capacity likely reflects memorization rather than a data-driven signal. One important caveat when working with partitioned data is the need to apply transformation and normalization procedures equally to all partitions. The parameters required for such procedures (for example, quantile normalization, a common standardization method for gene-expression data) should be derived only from the training data, not from the tuning or test data.
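A minimal sketch of both ideas follows, assuming scikit-learn with an arbitrary example dataset and model: the label-shuffled baseline estimates how much apparent performance memorization alone can produce, and placing the scaler inside a pipeline keeps its parameters derived from the training data only.

```python
# Sketch of a label-shuffling baseline and train-only preprocessing.
# Dataset, model, and preprocessing choices are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tune, y_train, y_tune = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training data alone and merely
# applies the learned parameters to the tuning data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)
real_score = model.score(X_tune, y_tune)

# Baseline: train on randomly shuffled labels to estimate how much
# apparent performance memorization alone can produce.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y_train)
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
baseline.fit(X_train, y_shuffled)
shuffled_score = baseline.score(X_tune, y_tune)

print(f"real labels:     {real_score:.3f}")
print(f"shuffled labels: {shuffled_score:.3f}")  # should be near chance
```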
Additionally, many conventional metrics for classification (for example, the area under the receiver operating characteristic curve) have limited utility in cases of extreme class imbalance [@pmid:25738806]. Consider a dataset of mammograms in which 99% of the samples do not have breast cancer: a model could achieve 99% accuracy by classifying every sample as negative. Model performance should therefore be evaluated with a carefully chosen panel of relevant metrics that make minimal assumptions about the composition of the test data [@doi:10.1021/acs.molpharmaceut.7b00578]. One alternative is to use the precision-recall curve rather than the receiver operating characteristic curve, since the former is more robust to class imbalance [@doi:10.1145/1143844.1143874].
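The following sketch, built on an entirely synthetic imbalanced dataset and an arbitrary classifier, illustrates how a precision-recall summary (average precision) and ROC AUC can tell different stories when roughly 99% of samples are negative.

```python
# Sketch comparing ROC AUC with average precision (a precision-recall
# summary) on a heavily imbalanced dataset; all settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# ~99% negatives, mimicking the mammogram example above.
X, y = make_classification(
    n_samples=20000, n_features=20, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# The trivial "always negative" classifier already reaches ~99% accuracy,
# and ROC AUC can look optimistic; average precision is anchored to the
# positive-class prevalence and penalizes false positives more harshly.
print("positive prevalence:", y_test.mean())
print("ROC AUC:            ", roc_auc_score(y_test, scores))
print("average precision:  ", average_precision_score(y_test, scores))
```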
When working with biological and medical data, one must also carefully consider potential sources of bias and non-independence when defining training and test sets. For example, a deep learning model for pneumonia detection in chest X-rays appeared to perform well within the hospitals that provided the training data but failed to generalize to other hospitals [@doi:10.1371/journal.pmed.1002683]. The model had picked up on signals indicating which hospital an image came from, a type of artifact or batch effect that practitioners must guard against. When dealing with sequence data, holding out test data that are evolutionarily related to, or share structural homology with, the training data can produce overfitting that is hard to detect because the partitions are inherently related [@doi:10.1093/bib/bbv082]. In such situations, simply holding out test data from a random partition of the dataset can be insufficient. The best remedy for identifying confounding variables is to know your data and to test models on truly independent data.
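One practical safeguard is a group-aware split, sketched below with scikit-learn's GroupShuffleSplit on hypothetical random data; the group labels here stand in for hospital of origin, patient, or sequence-homology cluster and would come from real metadata or clustering in practice.

```python
# Sketch of a group-aware split that keeps all samples from the same
# source (e.g., hospital, patient, or sequence cluster) in one partition.
# X, y, and the group labels are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
groups = rng.integers(0, 20, size=1000)  # e.g., 20 hospitals

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides of the split, so the model cannot
# exploit source-specific artifacts shared across partitions.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```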
In essence, practitioners should split data into training, tuning, and single-use test sets so that model performance is assessed on data that provides a reliable estimate of its generalization ability. Furthermore, practitioners should be cognizant of the danger of skewed or biased data artificially inflating performance.