Hunting the paper tiger: from boosting to XGBoost, intuitively, mathematically, implementably

We will use this simple data set \{x, y\} throughout the tutorial. If we use the fourth column as the label, the third column is the feature; if we use the third column as the label, the fourth column is the feature.

Sleep data table

Boosting

Boosting is an additive modeling strategy. Instead of training one complex model at once, we train many simple models sequentially. Each new model tries to correct what the current ensemble still gets wrong.

Write the prediction after step m as

\[F_m(x) = F_{m-1}(x) + \eta f_m(x),\]

where F_{m-1} is the current model, f_m is the new weak learner, and \eta is the learning rate. In tree boosting, f_m is usually a regression tree. The important idea is that the tree does not learn the original target directly; it learns the direction in which the current prediction should move.

For squared-error regression this direction is easy to see:

\[L(y, F(x)) = \frac{1}{2}(y - F(x))^2,\]

so the negative gradient is

\[-\frac{\partial L}{\partial F} = y - F(x).\]

That is just the residual. This is why gradient boosting can be introduced as “fit the next tree to the residuals.” For other losses, the same idea still works, but the residual becomes a gradient-derived pseudo-residual.
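
To make this concrete, here is a tiny numeric check with illustrative numbers (not taken from the sleep data set): for squared error the negative gradient is the plain residual, while for absolute error it collapses to the sign of the residual.

import numpy as np

y = np.array([3.0, 1.0, 4.0])   # targets (illustrative)
F = np.array([2.5, 1.5, 3.0])   # current predictions (illustrative)

# Squared error: negative gradient equals the plain residual.
resid_squared = y - F            # [ 0.5, -0.5,  1.0]

# Absolute error: the pseudo-residual is only the sign of the residual.
resid_absolute = np.sign(y - F)  # [ 1.0, -1.0,  1.0]

print(resid_squared, resid_absolute)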

Gradient Boost for Regression

The loss function of gradient boosting can be written as

\[\mathcal{L}^{(t)} = \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t).\]

Here \hat{y}_i^{(t-1)} is the prediction before adding the new tree, f_t(x_i) is the update from the new tree, and \Omega(f_t) is a regularization term. To build intuition, ignore regularization first:

\[\mathcal{L}^{(t)} \approx \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right).\]

For squared error,

\[l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2.\]

The derivative with respect to the current prediction is

\[\frac{\partial l}{\partial \hat{y}_i} = \hat{y}_i - y_i.\]

Therefore the negative gradient is

\[y_i - \hat{y}_i,\]

which is exactly the residual. A regression boosting iteration is therefore:

  1. Predict with the current ensemble.
  2. Compute residuals or negative gradients.
  3. Fit a new tree to those residuals.
  4. Add the tree to the ensemble, usually multiplied by a learning rate.

This explains the bridge from ordinary residual fitting to gradient boosting: residuals are a special case of negative gradients.
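
A minimal sketch of that four-step loop, assuming scikit-learn's DecisionTreeRegressor; the one-dimensional data, learning rate, tree depth, and number of rounds are all illustrative choices, not the tutorial's sleep data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (illustrative).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

eta = 0.3                              # learning rate
F = np.full_like(y, y.mean())          # start from the mean prediction
trees = []

for m in range(10):
    residuals = y - F                  # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    F = F + eta * tree.predict(X)      # additive update F_m = F_{m-1} + eta * f_m

print(np.round(F, 2))                  # predictions drift toward y round by round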

Gradient Boost for Classification

Caveat: to avoid heavy notation, the summation symbol \sum is sometimes omitted below. The derivation is for binary classification.

For a binary classification problem, define odds as

\[\text{odds} = \frac{\text{win}}{\text{lose}}.\]

The probability is

\[p = \frac{\text{win}}{\text{win}+\text{lose}}.\]

With simple algebra,

\[p = \frac{\text{odds}}{1 + \text{odds}} = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}.\]

This is the logistic sigmoid applied to the log-odds. Let

\[F_m = \log(\text{odds}) = F_{m-1} + \gamma.\]

Then

\[p = \frac{e^{F_m}}{1 + e^{F_m}}.\]
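
A quick numeric check of this equivalence, with made-up win/lose counts: computing the probability directly and passing the log-odds through the sigmoid give the same value.

import numpy as np

win, lose = 3.0, 1.0
odds = win / lose                                               # 3.0
p_direct = win / (win + lose)                                   # 0.75
p_sigmoid = np.exp(np.log(odds)) / (1 + np.exp(np.log(odds)))   # sigmoid of log-odds

print(p_direct, p_sigmoid)                                      # both 0.75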

Use binary cross entropy as the loss:

\[\begin{aligned} \mathcal{L}(y, F_{m-1} + \gamma) &= -\left[y\log(p) + (1-y)\log(1-p)\right] \\ &= -\left[y\left(\log(p)-\log(1-p)\right) + \log(1-p)\right] \\ &= -\left[y\log\frac{p}{1-p} - \log(1+\text{odds})\right] \\ &= -\left[y\log(\text{odds}) - \log\left(1+e^{\log(\text{odds})}\right)\right]. \end{aligned}\]

We want to find \gamma that minimizes the loss:

\[\operatorname*{arg\,min}_{\gamma}\; \mathcal{L}.\]

We could work directly on the loss and solve

\[\frac{\partial \mathcal{L}}{\partial \gamma} = 0,\]

but that becomes cumbersome. A second-order Taylor approximation gives a cleaner update. Around the current score F_{m-1},

\[\mathcal{L}(y, F_{m-1}+\gamma) \approx \mathcal{L}(y, F_{m-1}) + g\gamma + \frac{1}{2}h\gamma^2,\]

where

\[g = \frac{\partial \mathcal{L}}{\partial F}, \qquad h = \frac{\partial^2 \mathcal{L}}{\partial F^2}.\]

Set the derivative with respect to \gamma to zero:

\[\frac{\partial}{\partial \gamma}\left(\mathcal{L}(y, F_{m-1}) + g\gamma + \frac{1}{2}h\gamma^2\right) = 0.\]

This gives

\[g + h\gamma = 0,\]

so the optimal local update is

\[\gamma = -\frac{g}{h}.\]

For binary cross entropy, differentiating the loss above with respect to F = \log(\text{odds}) gives

\[\begin{aligned} \frac{\partial \mathcal{L}}{\partial \log(\text{odds})} &= -\frac{\partial}{\partial \log(\text{odds})}\left[y\log(\text{odds}) - \log\left(1+e^{\log(\text{odds})}\right)\right] \\ &= -\left(y - \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}\right) \\ &= p - y. \end{aligned}\]

The second derivative of \mathcal{L} with respect to \log(\text{odds}) is

\[\begin{aligned} \frac{\partial^2 \mathcal{L}}{\partial \log(\text{odds})^2} &= \frac{\partial}{\partial \log(\text{odds})} \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} \\ &= \frac{e^{\log(\text{odds})}\left(1 + e^{\log(\text{odds})}\right) - e^{2\log(\text{odds})}}{\left(1 + e^{\log(\text{odds})}\right)^2} \\ &= \frac{1}{1 + e^{\log(\text{odds})}} \cdot \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} \\ &= p(1-p). \end{aligned}\]

Therefore, for one observation,

\[\gamma = -\frac{p-y}{p(1-p)} = \frac{y-p}{p(1-p)}.\]
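
As a toy check of this single-observation update, start from a neutral score F = 0 (so p = 0.5) with a positive label; the numbers are illustrative.

import numpy as np

y = 1.0
F = 0.0                                  # current log-odds
p = 1 / (1 + np.exp(-F))                 # 0.5

gamma = (y - p) / (p * (1 - p))          # -g/h = 2.0
F_new = F + gamma
p_new = 1 / (1 + np.exp(-F_new))         # sigmoid(2) ≈ 0.88

print(gamma, p_new)                      # the Newton step pushes p toward the label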

For a leaf containing many observations, the Newton-style update aggregates gradients and Hessians:

\[\gamma_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i}.\]

This is the practical bridge from gradient boosting classification to XGBoost.

XGBoost

XGBoost keeps the additive tree model but makes the objective more explicit and regularized. At boosting round t, it minimizes

\[\mathcal{L}^{(t)} = \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),\]

where the tree regularization is commonly written as

\[\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2.\]

Here T is the number of leaves, w_j is the score of leaf j, \gamma penalizes adding leaves, and \lambda is L2 regularization on leaf weights.
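
As a small illustration of how this penalty is tallied, take a toy tree with T = 3 leaves and leaf scores w = (0.2, -0.1, 0.4); the gamma and lambda values below are illustrative, not recommended settings.

import numpy as np

gamma_leaf = 1.0                        # penalty per leaf (illustrative)
lambda_l2 = 1.0                         # L2 penalty on leaf weights (illustrative)
w = np.array([0.2, -0.1, 0.4])          # leaf scores of a toy 3-leaf tree

omega = gamma_leaf * len(w) + 0.5 * lambda_l2 * np.sum(w**2)
print(omega)                            # 3 * 1.0 + 0.5 * 0.21 = 3.105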

Using the second-order Taylor approximation,

\[\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2}h_i f_t(x_i)^2\right] + \Omega(f_t),\]

where

\[g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}.\]

For a fixed tree structure, each sample falls into one leaf. Let I_j be the set of samples in leaf j, and define

\[G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i.\]

The objective for leaf j becomes

\[G_j w_j + \frac{1}{2}(H_j + \lambda)w_j^2.\]

The best leaf weight is therefore

\[w_j^* = -\frac{G_j}{H_j + \lambda}.\]

This is the XGBoost version of the -g/h update. The regularization term \lambda prevents very large leaf scores when the Hessian is small.

The corresponding score of a tree structure is

\[-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T.\]

When splitting a leaf into left and right children, XGBoost evaluates the gain:

\[\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma.\]

A split is useful when this gain is positive and large enough to justify the extra complexity. This is why XGBoost is not just “gradient boosting with trees”; it is gradient boosting with second-order information, explicit leaf scoring, and regularized split selection.

A minimal way to inspect these ideas in Python is:

import numpy as np

# Binary labels and current probabilities from the current ensemble.
y = np.array([1, 0, 0, 1, 1], dtype=float)
p = np.array([0.55, 0.40, 0.48, 0.70, 0.62], dtype=float)

# Logistic-loss gradient and Hessian with respect to the log-odds score F.
g = p - y
h = p * (1 - p)

lambda_l2 = 1.0
leaf_weight = -g.sum() / (h.sum() + lambda_l2)

print(g)
print(h)
print(leaf_weight)

The exact numbers depend on the current predictions and on which samples fall into the leaf, but the recipe is always the same: compute gradients, compute Hessians, aggregate them by leaf, then use -G / (H + lambda).
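
The same aggregates also drive split selection. The sketch below reuses the toy labels, probabilities, and lambda from above and evaluates one hypothetical split that sends the first two samples left and the rest right; the split and the gamma penalty are illustrative.

import numpy as np

# Same toy labels and probabilities as above.
y = np.array([1, 0, 0, 1, 1], dtype=float)
p = np.array([0.55, 0.40, 0.48, 0.70, 0.62], dtype=float)
g, h = p - y, p * (1 - p)

lambda_l2 = 1.0
gamma_split = 0.0                       # complexity penalty per split (illustrative)

left = np.array([0, 1])                 # hypothetical split: first two samples go left
right = np.array([2, 3, 4])

G_L, H_L = g[left].sum(), h[left].sum()
G_R, H_R = g[right].sum(), h[right].sum()

gain = 0.5 * (G_L**2 / (H_L + lambda_l2)
              + G_R**2 / (H_R + lambda_l2)
              - (G_L + G_R)**2 / (H_L + H_R + lambda_l2)) - gamma_split

print(gain)

With these illustrative numbers the gain comes out slightly negative, so this particular split would be rejected; a real training run evaluates many candidate splits and keeps only those with sufficiently positive gain.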

Reference

  1. Dana D. Sleep Data: Personal Sleep Data from Sleep Cycle iOS App. Kaggle. https://www.kaggle.com/danagerous/sleep-data#
