A lot of my clients think more data lead to more accurate models, but that isn't always true. Here’s why.
- Noise in the predictors or the response is also likely to increase as more data are collected, reducing any positive impact of the larger sample on model quality.
- The law of diminishing returns. Beyond a certain point, more of the same data from the same population won’t improve models by much.
- More data lead to more computational burden and cost.
Instead of collecting more data, it’s often better to collect different data. In other words, increase P (number of predictors) instead of N (sample size).