Which cross-validation strategy should the Data Scientist adopt?
A k-fold cross-validation strategy with k=5
A stratified k-fold cross-validation strategy with k=5
A k-fold cross-validation strategy with k=5 and 3 repeats
An 80/20 stratified split between training and validation
Explanations:
A standard k-fold cross-validation strategy does not ensure that the class distribution is preserved across folds. Given that the disease occurs in only 3% of the population, some folds may end up with very few or even no positive cases, leading to an unreliable evaluation of the model’s performance.
A stratified k-fold cross-validation strategy maintains the proportion of classes in each fold, which is crucial for imbalanced datasets like this one (3% positive cases). This ensures that each fold is representative of the overall class distribution, providing a better estimate of the model’s performance.
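As a rough illustration (not part of the question), the sketch below builds a synthetic dataset with roughly 3% positive cases and compares how scikit-learn's KFold and StratifiedKFold distribute those positives across five validation folds; the dataset size and variable names are assumptions chosen only for demonstration.

    # Minimal sketch, assuming scikit-learn: compare how plain and stratified
    # 5-fold CV spread a rare (~3%) positive class across validation folds.
    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))             # hypothetical feature matrix
    y = (rng.random(1000) < 0.03).astype(int)  # ~3% positive cases

    splitters = [
        ("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
        ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
    ]
    for name, cv in splitters:
        # Count positives in each validation fold; stratification keeps these
        # counts nearly equal, plain k-fold leaves them to chance.
        positives = [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]
        print(f"{name:16s} positives per validation fold: {positives}")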
While k-fold cross-validation with repeats can provide a more robust estimate, it does not address the class imbalance on its own. Because standard k-fold does not guarantee that each fold preserves the class distribution, the repeated runs can still produce unreliable performance metrics.
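A hedged sketch of the repeats option, assuming scikit-learn's RepeatedKFold and RepeatedStratifiedKFold on the same kind of synthetic data as above: repeating plain k-fold still leaves the fold composition to chance, whereas combining repeats with stratification keeps every fold representative.

    # Minimal sketch, assuming scikit-learn: repeats alone vs. repeats plus
    # stratification on a synthetic ~3% positive dataset.
    import numpy as np
    from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))             # hypothetical features
    y = (rng.random(1000) < 0.03).astype(int)  # ~3% positive cases

    plain = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
    strat = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

    # The smallest number of positives seen in any validation fold across all
    # repeats; plain repeated k-fold can still produce very sparse folds.
    print("RepeatedKFold, fewest positives in any fold:",
          min(int(y[val].sum()) for _, val in plain.split(X, y)))
    print("RepeatedStratifiedKFold, fewest positives in any fold:",
          min(int(y[val].sum()) for _, val in strat.split(X, y)))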
An 80/20 stratified split does preserve the class distribution, but it is a single split rather than a cross-validation strategy. Relying on one validation set limits the robustness of the performance estimate and does not make full use of the available data for both training and validation.
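For completeness, a minimal sketch of the single 80/20 stratified split, assuming scikit-learn's train_test_split with stratify on synthetic data: the 3% class ratio is preserved in both partitions, but the model is evaluated on only one validation set.

    # Minimal sketch, assuming scikit-learn: one stratified 80/20 split.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))             # hypothetical features
    y = (rng.random(1000) < 0.03).astype(int)  # ~3% positive cases

    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=0)
    # Both partitions keep roughly the same positive rate, but there is only
    # one validation estimate instead of k of them.
    print("positive rate, train:", round(y_tr.mean(), 3),
          "validation:", round(y_val.mean(), 3))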