by Guangming Lang

A lot of my clients think that more data always lead to more accurate models, but this is not necessarily true. Here’s why.

  • Noise in the predictors or the response is also likely to increase as more data are collected, which can offset any positive impact of the larger sample on model quality.
  • The law of diminishing returns. Beyond a certain point, more of the same data from the same population won’t improve models by much.
  • More data lead to more computational burden and cost.
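
To make the diminishing-returns point concrete, here is a minimal sketch using the standard error of a sample mean, which shrinks as sigma / sqrt(N). It stands in for model error only as an illustration; the noise level `sigma` and the sample sizes are assumed values, not figures from a real project.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean given noise level sigma and n observations."""
    return sigma / math.sqrt(n)

sigma = 1.0  # assumed noise standard deviation (illustrative only)
sizes = [100, 1_000, 10_000, 100_000]  # assumed sample sizes, each 10x the last

errors = [standard_error(sigma, n) for n in sizes]
# Absolute improvement gained by each 10x increase in sample size.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]

for n, se in zip(sizes, errors):
    print(f"N={n:>7}: standard error = {se:.4f}")
for i, g in enumerate(gains):
    print(f"{sizes[i]:>7} -> {sizes[i + 1]:>7}: improvement = {g:.4f}")
```

Each tenfold increase in N costs ten times as much data but buys a smaller absolute improvement than the one before it.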

Instead of collecting more data, it’s often better to collect different data. In other words, increase P (number of predictors) instead of N (sample size).
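
The sketch below compares the two options on simulated data. Everything in it is an assumption made for illustration: the response depends on two predictors (`x1` and `x2`) with made-up coefficients, and the initial model only sees `x1`. It is built so that the missing predictor matters, so treat it as a demonstration of the idea rather than evidence that more predictors always win.

```python
import random

def simulate(n, rng):
    """Draw n rows of (x1, x2, y) from an assumed process y = 2*x1 + 3*x2 + noise."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        y = 2.0 * x1 + 3.0 * x2 + rng.gauss(0, 1)
        rows.append((x1, x2, y))
    return rows

def fit_one(rows):
    """Least squares for y ~ b1*x1 (no intercept; all variables have mean zero)."""
    sxx = sum(x1 * x1 for x1, _, _ in rows)
    sxy = sum(x1 * y for x1, _, y in rows)
    return sxy / sxx

def fit_two(rows):
    """Least squares for y ~ b1*x1 + b2*x2 via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    s2y = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def mse(rows, b1, b2=0.0):
    """Mean squared prediction error on the given rows."""
    return sum((y - b1 * x1 - b2 * x2) ** 2 for x1, x2, y in rows) / len(rows)

rng = random.Random(0)
small = simulate(200, rng)     # original sample
large = simulate(2_000, rng)   # 10x more rows, same single predictor
test = simulate(2_000, rng)    # held-out data for evaluation

mse_baseline = mse(test, fit_one(small))      # N = 200,  P = 1
mse_more_rows = mse(test, fit_one(large))     # N = 2000, P = 1
mse_more_cols = mse(test, *fit_two(small))    # N = 200,  P = 2

print(f"baseline  (N=200,  P=1): {mse_baseline:.2f}")
print(f"more rows (N=2000, P=1): {mse_more_rows:.2f}")
print(f"more cols (N=200,  P=2): {mse_more_cols:.2f}")
```

In this setup, ten times more rows barely moves the test error, because the error is dominated by the predictor the model never sees. Adding that predictor at the original sample size removes most of it.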