Boosting Techniques for Machine Learning — XGBoost for Regression and Classification
XGBoost is closely related to Gradient Boosting. The name stands for Extreme Gradient Boosting.
XGBoost is designed for large and complex datasets, which is why it is so popular on Kaggle.
The procedure for a basic XGBoost for Regression
1. Initial Prediction: 0.5 for both regression and classification.
2. Similarity Score and Gain: we start by putting all the residuals into one leaf and calculating the similarity score:
Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)
where lambda is the regularization parameter, which helps prevent overfitting.
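Then we try splitting the observations at the midpoint between every two adjacent observations. Because the residuals are added up before being squared, residuals with opposite signs cancel each other out. So the less similar the residuals in a leaf are, the smaller its similarity score.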
Then we calculate the Gain value, which measures how much better the two leaves group similar residuals compared to the root:
Gain = Similarity Score(left leaf) + Similarity Score(right leaf) - Similarity Score(root)
We calculate the Gain for every candidate split position. The larger the Gain, the better the split, so we choose the split position with the largest Gain value.
When a leaf still holds more than one residual, we can continue to split it. By default, the tree is limited to 6 levels.
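The split search above can be sketched in a few lines of Python. This is only an illustration, not how the XGBoost library is implemented: the function names `similarity` and `best_split` and the toy dosage data are assumptions made up for this example.

```python
def similarity(residuals, lam=0.0):
    """Regression similarity score: (sum of residuals)^2 / (count + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)


def best_split(xs, residuals, lam=0.0):
    """Try the midpoint between each pair of adjacent x values.

    Returns (threshold, gain) for the split with the largest gain,
    or None if no split is possible.
    """
    pairs = sorted(zip(xs, residuals))
    root = similarity([r for _, r in pairs], lam)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        gain = similarity(left, lam) + similarity(right, lam) - root
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: residuals = observed value - initial prediction (0.5).
dosage = [10, 20, 25, 35]
residuals = [-10.5, 6.5, 7.5, -7.5]

threshold, gain = best_split(dosage, residuals)
print(threshold, round(gain, 2))  # 15.0 120.33
```

Here the root has a similarity score of only 4 because the positive and negative residuals cancel out, while splitting at 15 isolates the large negative residual and gives the largest Gain.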
3. Prune the Tree:
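Pick a value for gamma and compare it with the Gain of each branch, starting from the lowest one.
Gain - Gamma > 0: keep the branch.
Gain - Gamma < 0: prune the branch.
If you want to fight overfitting, set gamma larger, because more branches will then be pruned.
Note: If a lower branch is kept, the branches above it are kept too, even if their Gain is smaller than gamma.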
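As a rough sketch of this bottom-up rule, the snippet below prunes a tiny tree stored as nested dictionaries. The dictionary layout, the `prune` and `is_leaf` names, and the gain values are assumptions for illustration only; the case where Gain equals gamma exactly is treated as "keep" here, which the rule above does not specify.

```python
def is_leaf(node):
    """A leaf only stores the residuals that ended up in it."""
    return "residuals" in node


def prune(node, gamma):
    """Return a pruned copy of the tree, working from the bottom up.

    A split is collapsed into a single leaf only when both of its children
    are leaves (after pruning) and gain - gamma < 0.
    """
    if is_leaf(node):
        return node
    left = prune(node["left"], gamma)
    right = prune(node["right"], gamma)
    if is_leaf(left) and is_leaf(right) and node["gain"] - gamma < 0:
        return {"residuals": left["residuals"] + right["residuals"]}
    return {"gain": node["gain"], "left": left, "right": right}


# Hypothetical two-level tree: the lower split has a larger gain than the root.
tree = {
    "gain": 120.33,
    "left": {"residuals": [-10.5]},
    "right": {
        "gain": 140.17,
        "left": {"residuals": [6.5, 7.5]},
        "right": {"residuals": [-7.5]},
    },
}

kept = prune(tree, gamma=130)       # lower split survives, so the root survives too
collapsed = prune(tree, gamma=150)  # lower split is pruned, then the root is pruned
```

With gamma = 130 the root's Gain is below gamma, but the root is kept because the branch under it was kept. With gamma = 150 everything collapses back into one leaf.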
4. Predicting:
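Predicted Value = Initial prediction + eta(learning rate) * output from the leaves
For regression, the output of a leaf is (sum of residuals) / (number of residuals + lambda), which is the mean of the residuals when lambda is 0.
Then the procedure repeats: we compute new residuals from the new predictions and build the next tree on them. Each tree takes a small step in the right direction.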
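One prediction update can be sketched as follows. The helper names and the observed value are made up for illustration, and eta = 0.3 is used as an example learning rate.

```python
def leaf_output(residuals, lam=0.0):
    """Regression leaf output: sum of residuals / (count + lambda)."""
    return sum(residuals) / (len(residuals) + lam)


def predict(initial_prediction, output, eta=0.3):
    """New prediction = initial prediction + learning rate * leaf output."""
    return initial_prediction + eta * output


# Hypothetical observation with value -10, so its first residual is -10 - 0.5 = -10.5.
observed = -10.0
initial = 0.5
first_residual = observed - initial

new_prediction = predict(initial, leaf_output([first_residual]))
new_residual = observed - new_prediction

print(round(new_prediction, 2), round(new_residual, 2))  # -2.65 -7.35
```

The residual shrinks from -10.5 to -7.35 after one tree, and the next tree is built on these smaller residuals.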
The procedure for a basic XGBoost for Classification
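The similarity score changes to:
Similarity Score = (sum of residuals)^2 / (sum of [previous probability_i * (1 - previous probability_i)] + lambda)
Same as regression, the larger the score, the more similar the residuals in the leaf.
Same as regression, the larger the lambda, the smaller the similarity scores and Gains, so the easier it is for branches to be pruned.
The Gain value is calculated the same way as well:
Gain = Similarity Score(left leaf) + Similarity Score(right leaf) - Similarity Score(root)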
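Here is a new term: Cover. It sets the minimum amount of residuals a leaf must hold. The cover is the denominator of the similarity score minus lambda, which is:
Cover = sum of [previous probability_i * (1 - previous probability_i)]
(For regression, the cover is simply the number of residuals in the leaf.)
If this number is smaller than the minimum cover we set in the function (the min_child_weight parameter in XGBoost), we remove the leaf.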
Then we pick the gamma and prune the tree.
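The classification similarity score and the cover check can be sketched like this. The function names and the toy numbers are assumptions for illustration; every observation starts from the initial predicted probability of 0.5.

```python
def cover(prev_probs):
    """Classification cover: sum of p * (1 - p) over the observations in the leaf."""
    return sum(p * (1 - p) for p in prev_probs)


def similarity_clf(residuals, prev_probs, lam=0.0):
    """Classification similarity score: (sum of residuals)^2 / (cover + lambda)."""
    return sum(residuals) ** 2 / (cover(prev_probs) + lam)


def leaf_allowed(prev_probs, min_cover=1.0):
    """A leaf is kept only if its cover is at least the chosen minimum."""
    return cover(prev_probs) >= min_cover


# Hypothetical leaf with two observations, both observed as 1 and predicted as 0.5.
residuals = [0.5, 0.5]
prev_probs = [0.5, 0.5]

print(cover(prev_probs))                  # 0.5
print(similarity_clf(residuals, prev_probs))  # 2.0
print(leaf_allowed(prev_probs))           # False
```

With a minimum cover of 1, a leaf needs at least four observations at probability 0.5 to survive, so on a small dataset like this one the minimum cover would have to be lowered to grow any tree at all.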
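Then we calculate the output value of each leaf, which is
Output Value = (sum of residuals) / (sum of [previous probability_i * (1 - previous probability_i)] + lambda)
We need to convert the initial probability into the “log of odds”, where the odds are p/(1-p):
log(odds) = log(p / (1 - p))
Predicted log(odds) = log of odds of Initial prediction + eta(learning rate) * output from the leaves
Then we convert this value back to a probability with the logistic function:
Probability = e^log(odds) / (1 + e^log(odds))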
Finally, we use the new predicted probabilities to compute new residuals and repeat the steps above to add more trees.
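A minimal sketch of one classification update, from leaf output to new probability, is shown below. The helper names and the toy leaf are assumptions for illustration, and eta = 0.3 is used as an example learning rate.

```python
import math


def leaf_output_clf(residuals, prev_probs, lam=0.0):
    """Classification leaf output: sum of residuals / (sum of p * (1 - p) + lambda)."""
    return sum(residuals) / (sum(p * (1 - p) for p in prev_probs) + lam)


def update_probability(prev_prob, output, eta=0.3):
    """Move to log(odds), add the scaled leaf output, and convert back to a probability."""
    log_odds = math.log(prev_prob / (1 - prev_prob)) + eta * output
    # 1 / (1 + e^-x) is the same logistic function as e^x / (1 + e^x).
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical leaf: two observations, both observed as 1 and previously predicted as 0.5.
output = leaf_output_clf([0.5, 0.5], [0.5, 0.5])
new_prob = update_probability(0.5, output)

print(output, round(new_prob, 4))  # 2.0 0.6457
```

The predicted probability moves from 0.5 toward the observed value of 1, and the new residual (1 - 0.6457) is what the next tree is fitted on.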
The Math Part for XGBoost
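1. Loss Function:
For Regression:
L(y_i, p_i) = 1/2 * (y_i - p_i)^2
For Classification:
L(y_i, p_i) = -[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
Here y_i is the observed value and p_i is the predicted value (the predicted probability for classification). The total loss is the sum of this function over all observations.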
2. The goal is to minimize:
[sum over i of L(y_i, p_i + O)] + gamma * T + 1/2 * lambda * O^2
Here p_i is the previous prediction, T is the number of leaves (Terminal Nodes), and gamma is a user-defined penalty, which encourages pruning.
O is the output value of a leaf. The 1/2 * lambda * O^2 term is very similar to the penalty in ridge regression. We get the optimal output value by minimizing the formula above with respect to O.
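For the regression loss, this minimization can be done by hand: setting the derivative with respect to O to zero gives -(sum of (r_i - O)) + lambda * O = 0, where r_i = y_i - p_i is the residual, so O = (sum of residuals) / (number of residuals + lambda). The sketch below checks this numerically for a single leaf. The function names and numbers are assumptions for illustration, and the gamma * T term is left out because it does not depend on O.

```python
def objective(output, residuals, lam):
    """Regression objective for one leaf: sum of 1/2 * (r - O)^2 plus 1/2 * lambda * O^2."""
    return sum(0.5 * (r - output) ** 2 for r in residuals) + 0.5 * lam * output ** 2


def optimal_output(residuals, lam):
    """Closed-form minimizer: sum of residuals / (count + lambda)."""
    return sum(residuals) / (len(residuals) + lam)


# Hypothetical leaf with two residuals and lambda = 1.
residuals = [6.5, 7.5]
lam = 1.0

best = optimal_output(residuals, lam)  # 14 / 3, about 4.667
# Moving away from the closed-form value in either direction increases the objective.
assert objective(best, residuals, lam) < objective(best + 0.1, residuals, lam)
assert objective(best, residuals, lam) < objective(best - 0.1, residuals, lam)
```

Note that with lambda = 0 the optimal output is just the mean of the residuals (7.0 here), and a larger lambda shrinks it toward zero, which is the ridge-like effect described above.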