XGboost

regularizer
7 min read · Oct 1, 2019

A brief review of boosting, gradient boosting, gradient boosting decision tree (GBDT) and XGboost

Boosting

Boosting is a statistical ensemble method, in contrast to bagging (bootstrap aggregating). Bagging trains each base classifier independently and averages their predictions. Boosting trains the base classifiers sequentially and uses the “residuals” from the previous classifiers to train the next classifier. The generic framework of boosting consists of an additive model and forward stepwise learning.

Forward stepwise method

Although Step 2.a just represents the goal of this step with a single equation, it is the essential step that actually trains a new classifier. Step 2.a depends on the loss function L(y, f(x)), the base classifier beta*b(x; gamma), and the optimization method. Different choices of the optimization process for training the new classifier yield different boosting algorithms. For example, if the base classifier is a decision tree, Step 2.a is the process of training a decision tree.
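For reference, here is a sketch of the forward stepwise (stagewise) framework in the notation above. It is the standard formulation and should match the figure; the step labels are my own:

```latex
% Additive model: a weighted sum of base classifiers
f_M(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m)

% Step 2.a at round m: fit one new base classifier while keeping f_{m-1} fixed
(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)

% Update the additive model
f_m(x) = f_{m-1}(x) + \beta_m \, b(x; \gamma_m)
```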

AdaBoost

AdaBoost is one of the boosting algorithms. It has specific rules to (1) compute the error/residual, (2) update the data distribution based on the error, and (3) compute the weight of each model. In the forward stepwise learning framework above, AdaBoost is equivalent to minimizing the exponential loss at Step 2.a. AdaBoost Tree uses a decision tree as the base classifier.

How AdaBoost fits into the forward stepwise method.
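A sketch of why this equivalence holds (a standard result, reconstructed here rather than taken from the figure): with the exponential loss, the terms involving the previous model become sample weights, which are exactly the AdaBoost weights.

```latex
% Exponential loss for labels y in {-1, +1}
L(y, f(x)) = \exp\big(-y\, f(x)\big)

% Step 2.a with exponential loss: the factor from f_{m-1} acts as a sample weight w_i^{(m)}
\sum_{i=1}^{N} \exp\big(-y_i (f_{m-1}(x_i) + \beta\, b(x_i; \gamma))\big)
  = \sum_{i=1}^{N} \underbrace{\exp\big(-y_i f_{m-1}(x_i)\big)}_{w_i^{(m)}} \exp\big(-y_i \beta\, b(x_i; \gamma)\big)
```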

Boosting Tree with MSE Loss

Let’s keep the decision tree as the base classifier and develop the procedure to train a boosting tree. For a regression problem, a natural choice of loss function is MSE. After substituting the ground truth y and the prediction y^_(t-1) + f_t(x), the objective function becomes a quadratic function of the new model f_t(x).

The residual r is used to fit a regression tree T(x; theta).

The same idea, from the XGBoost official website:

The objective function for the MSE loss

Notice that the objective function is basically a quadratic function of the new model f_t(x). The coefficient of the first-order term is (up to a constant factor) the negative residual (y^_(t-1) - y). Therefore it is straightforward to find the optimal f_t(x) that minimizes the objective function.
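A reconstruction of that expansion (standard algebra, in the notation above):

```latex
\mathrm{obj}^{(t)} = \sum_{i=1}^{n} \big(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\big)^2
                   = \sum_{i=1}^{n} \Big[\, 2\big(\hat{y}_i^{(t-1)} - y_i\big)\, f_t(x_i) + f_t(x_i)^2 \,\Big] + \mathrm{const}
```

Minimizing it pointwise gives f_t(x_i) = y_i - y^_i^(t-1), i.e. the new tree should fit the residuals.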

Gradient Boosting (Gradient Boosting Decision Tree-GBDT)

However, it is challenging to transform the objective function into this quadratic form for other loss functions, such as the hinge loss, logistic loss, etc. One approach is to use the (negative) gradient to approximate the residual. Why does that work? For the squared loss, the negative gradient of the loss with respect to the current prediction is exactly (proportional to) the residual, so the negative gradient serves as a generalized “residual” that can be computed for any differentiable loss.

r_mi is the negative first-order derivative (negative gradient) of the loss with respect to f(x_i), evaluated at the current model; it approximates the residual.
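As an illustration (a minimal sketch, not any library’s implementation), here is gradient boosting with regression trees where each tree is fit to the negative gradient; for the squared loss, the negative gradient is just the residual:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_rounds=50, learning_rate=0.1, max_depth=3):
    """Gradient boosting with squared loss: each tree fits the negative gradient."""
    f = np.full(len(y), y.mean())   # initial constant prediction
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - f            # negative gradient of (1/2)(y - f)^2 is the residual y - f
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        f += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict_gbdt(init, trees, X, learning_rate=0.1):
    return init + learning_rate * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
init, trees = fit_gbdt(X, y)
print(np.mean((predict_gbdt(init, trees, X) - y) ** 2))  # training MSE
```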

Extreme Gradient Boosting (XGboost) takes a further step to transform the original objective function. It is inspired by the quadratic form obtained for MSE. For other loss functions, the original objective function can also be approximated by a quadratic function of f_t(x) using Taylor’s expansion. The coefficients of the first- and second-order terms, g_i and h_i, are computed from y^_(t-1) and do not depend on f_t(x). But before diving into the detailed derivation, let’s take a step back and review CART, because it is the most common base classifier used in XGboost.

Classification and Regression Tree (CART)

Although any base classifier can be used in boosting, the decision tree is the most common one. CART was published in 1984 with a few features that differ from C4.5 and ID3: (1) it uses the Gini index to split features; (2) features can be both numeric and categorical; (3) it can be used for both classification and regression problems. Based on the content of many textbooks and courses, my understanding was that the target variable (the ground-truth y of each data point) ends up in the leaf nodes: a common way is to take the majority vote for classification and the average for regression.
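A quick sanity check of that “average in the leaf” view, sketched with scikit-learn (not mentioned in the post, but it implements a CART-style regression tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
leaf_ids = tree.apply(X)  # index of the leaf each sample falls into
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    # the leaf's predicted score equals the mean of y over the samples routed to it
    print(f"leaf {leaf}: prediction={tree.predict(X[mask][:1])[0]:.3f}, "
          f"mean(y)={y[mask].mean():.3f}")
```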

But more generally, in the slides by Tianqi Chen, the score in a leaf node does not merely represent the target variable (a label or a value). It can be interpreted more generally as a quantity that depends on the objective function. Here is an example of the leaf scores from a CART.

The +2 and -1 scores are the prediction scores in the leaves, given the features of the input.

What is +2? It looks like the tree classifies whether a person likes computer games or not. Wouldn’t the prediction just be a label, ‘yes’ or ‘no’? But think about it this way: if the problem is defined so as to predict a “like” score (turning the label into some sort of score), and the objective function is simply to minimize the difference between the predicted and ground-truth scores, then the leaf score depends on the f(x) or y_i^ in the objective function.

Different interpretations of the predicted score y_i^, which is used to define the objective function.

The regression tree is just a function that maps the features to the leaf score. The parameters of the tree consist of the structure and leaf scores.

Interpret a tree with math

Another part that is a bit confusing is that although a tree has a structure and leaf scores as parameters, it doesn’t seem to have “weights”, which are the common parameters of ML algorithms. For example, a neural network consists of a structure and weights, which compute the scores. Unlike other algorithms, a tree doesn’t have “weights”, but it has many regions R_k that are split by features, with a score in each region. Here is an example of the mathematical representation of a tree.

R_k is a region; c_k is the score in R_k. The regions are split based on a feature j and a split value s.
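A reconstruction of that representation (the standard CART formula, in the notation of the caption above):

```latex
% A regression tree partitions the input space into K disjoint regions R_1, ..., R_K
% and outputs a constant score c_k inside each region.
f(x) = \sum_{k=1}^{K} c_k \, \mathbb{1}\{x \in R_k\}

% For a single split on feature j at value s:
R_1(j, s) = \{\, x \mid x_j \le s \,\}, \qquad R_2(j, s) = \{\, x \mid x_j > s \,\}
```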

It is like a step function.

Analogy: tree and step function

Here, the author of XGboost refers to a more general form of CART.

A more general understanding of regression tree
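In Tianqi Chen’s notation (reconstructed here; it should match the slide):

```latex
% A tree maps an input x in R^d to one of T leaves via the structure q,
% and outputs the score of that leaf.
f_t(x) = w_{q(x)}, \qquad q : \mathbb{R}^d \to \{1, \dots, T\}, \qquad w \in \mathbb{R}^{T}
```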

I wrote down the most confusing parts in this section when I studied boosting and XGboost. The confusion mainly came from my previously narrow understanding of how CART works and of the leaf score, and from a lack of knowledge about how to represent a tree mathematically. This section clears all of that up!

XGboost

Taylor expansion of the objective function. What is the engineering benefit? Notice the indices of g and h. What is the size of g and h? Right: each data point has its own g_i and h_i, which can be computed in parallel.
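A reconstruction of that expansion (it follows the XGBoost tutorial; constants from the previous round are dropped):

```latex
\mathrm{obj}^{(t)} \approx \sum_{i=1}^{n} \Big[\, g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \,\Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \quad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
```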

Regularization on the tree structure

There are two questions: how to define the mathematical representation of a tree, and how to measure the complexity of a tree.

Now let’s dive into the regularization term in the objective function. Regularization commonly takes two forms, L1 and L2, both of which restrict the parameters/weights. What are the parameters of a decision tree?

The regularization term is defined with the number of leaves T and the leaf scores w_j.

As discussed in the CART section, the mathematical representation of a CART is like a step function. T is the number of leaves. q(x) maps an input x in R^d to one of the T leaf indices; q contains the tree structure, characterized by specific features and split values. w_j is the leaf score, where j = q(x) is in {1, …, T}, so w_{q(x)} is the output of f(x). Regularizing T and w_j is intuitively like pruning the tree: it discourages it from growing too many branches. Substitute the regularization term into the Taylor-expanded objective function, and replace f(x) with w_j:

The objective function changes the index of summation from the index of the data point, i, to the index of the tree leaf, j.
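A reconstruction of that step (following the XGBoost tutorial; I_j denotes the set of data points assigned to leaf j):

```latex
\Omega(f_t) = \gamma\, T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

\mathrm{obj}^{(t)} \approx \sum_{i=1}^{n} \Big[\, g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2 \,\Big] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
 = \sum_{j=1}^{T} \Big[\, G_j w_j + \tfrac{1}{2}\big(H_j + \lambda\big) w_j^2 \,\Big] + \gamma T,
\qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i
```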

G_j and H_j can be computed from the data. Thus, if T and q are given (so I_j is known), the only unknowns w_j can be determined easily, because the objective function is a quadratic form in w_j. The optimal w_j* and obj* are:
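Reconstructed from the quadratic form above (minimizing each leaf independently):

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\mathrm{obj}^{*} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```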

The objective function measures how good a tree is in terms of how well it minimizes the total loss. Up to this point, the only question left is how to find the tree structure, i.e., q(x) and T.

An interesting observation is that in XGboost the objective function evaluated on the data is the same objective used to score feature splits at each node. Decision trees and random forests are similar in spirit: the criterion for splitting a feature is the entropy/Gini index or MSE, and the objective measured on the data is also the entropy or MSE.

A new way of learning the tree structure

Here is how to use G_j²/(H_j + lambda) to split a node, computing a “gain” similar to the one in a decision tree.
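The split gain (reconstructed from the XGBoost tutorial; L and R denote the candidate left and right children):

```latex
\mathrm{Gain} = \tfrac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma
```

The trailing -gamma means a split is kept only if its gain exceeds the complexity cost of adding a leaf, which is pruning built directly into the objective.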

An example:

XGboost redefines the metric used to split features, instead of information gain or the Gini index. The new metric is not arbitrary; it follows from the derivation above, including the original objective function, Taylor’s expansion, and the regularization term. The parameters of the XGboost package are listed here.
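As a quick illustration of where these quantities surface in the xgboost package (a minimal sketch; the data is synthetic and the parameter values are arbitrary):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

params = {
    "objective": "reg:squarederror",
    "max_depth": 3,
    "eta": 0.1,      # shrinkage applied to each new tree f_t
    "gamma": 1.0,    # minimum split gain; corresponds to the gamma*T penalty
    "lambda": 1.0,   # L2 penalty on the leaf scores w_j
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=100)
print(booster.predict(xgb.DMatrix(X[:5])))
```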

Perhaps the biggest confusion lies in what the leaf score in CART actually is. Having learned decision trees, one tends to assume that a leaf holds the majority vote of the labels or the average of y; either way, it is the target variable of the training data. But the author gives a more general definition: the score depends entirely on the objective function. It can be the original target variable of the training data, or the residual in the additive model. In fact it does not matter: as long as the optimization follows the derivation, each round’s score minimizes the residual, and the final summed model is guaranteed to be the one closest to the original target variable.

Another confusion, for me, is that whenever I see CART I assume the Gini index is used. But when you think about it, the choice of entropy or the Gini index also seems somewhat arbitrary; they are just more strictly convex than the misclassification error. XGboost instead derives a Gain from the objective function above and the Taylor approximation, and it works better than Gini precisely because it is the result of optimization.

Finally, there is the mathematical formulation of CART, which folds pruning into the Gain. Although XGboost is still greedy, it is more quantitative than the old pre-/post-pruning.

Useful references:

  1. 一步一步理解GB,GBDT,xgboost
  2. 分类与回归树CART
  3. xgboost的原理没你想象的那么难
  4. 数据挖掘面试题之决策树必知必会
  5. GBDT迭代决策树入门教程|简介
  6. 统计学习方法-李航
  7. 提升树GBDT详解
  8. XGboost parameters
