Data Preprocessing and XGBoost


By Guangming Lang

We all know we need to pre-process data before we build models. However, if you’re building an XGBoost model, you can skip many of the usual pre-processing steps. I did some Google searching and book reading, and the following list summarizes what I found.

  • Removing zero or near-zero variance predictors. (Don’t do it for XGBoost or any tree-based method.)
  • Removing highly correlated predictors. (Not needed for XGBoost or other tree-based ensemble methods.)
  • Box-Cox, Yeo-Johnson, the exponential transformation of Manly (1976), and other transformations of the predictors in the same spirit. (Not needed for XGBoost.)
  • Centering and scaling the predictors. (Not needed for XGBoost.)
  • Missing data imputation. (Not needed for XGBoost, which handles missing values internally by default.)
  • One-hot encoding or dummy variable creation for categorical predictors. (Needed for XGBoost.)
  • PCA. (May help XGBoost’s performance.)
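As a minimal sketch of the one-hot encoding item in the list, here is what the step does, written in plain Python. The helper `one_hot` and the `colors` column are hypothetical names made up for illustration; in practice you would use a library routine rather than hand-rolled code.

```python
def one_hot(values):
    """Return (levels, rows): one 0/1 indicator column per distinct level.

    `one_hot` is a hypothetical helper for illustration only.
    """
    levels = sorted(set(values))  # fixed, reproducible column order
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# A made-up categorical predictor with three levels.
colors = ["red", "green", "blue", "green"]
levels, encoded = one_hot(colors)

for color, row in zip(colors, encoded):
    print(color, dict(zip(levels, row)))
```

Each categorical value becomes a row with exactly one indicator set to 1, which gives the model a purely numeric input.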

In the list above, “not needed” means that, by and large, the step will neither improve nor worsen model performance.
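One way to see why centering, scaling, and other monotone transformations make no difference to tree-based methods: a threshold split depends only on the ordering of a predictor's values, not on their magnitudes. The sketch below is a toy single-split regression stump, not XGBoost's actual algorithm; the data and the helpers `sse` and `best_split` are made up for illustration.

```python
import math


def sse(y, idx):
    """Sum of squared errors of y over the rows in idx, around their mean."""
    mean = sum(y[i] for i in idx) / len(idx)
    return sum((y[i] - mean) ** 2 for i in idx)


def best_split(x, y):
    """Return the row indices sent left by the best threshold split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_err, best_left = float("inf"), None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # cannot place a threshold between equal values
        left, right = order[:k], order[k:]
        err = sse(y, left) + sse(y, right)
        if err < best_err:
            best_err, best_left = err, frozenset(left)
    return best_left


# Made-up data: one skewed predictor and a response with a jump.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

# Centered and scaled copy of x, and a log-transformed copy.
mean_x = sum(x) / len(x)
sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (len(x) - 1))
x_scaled = [(v - mean_x) / sd_x for v in x]
x_logged = [math.log(v) for v in x]

split_raw = best_split(x, y)
split_scaled = best_split(x_scaled, y)
split_logged = best_split(x_logged, y)
print(split_raw == split_scaled == split_logged)
```

The threshold value itself changes with the transformation, but the rows that fall on each side of it do not, so the fitted stump makes the same predictions on the training rows.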
