XGBoost
A brief review of boosting, gradient boosting, gradient boosting decision tree (GBDT) and XGBoost
Boosting
Boosting is a statistical ensemble method, in contrast to Bagging (bootstrap aggregation). Bagging trains each base classifier independently and averages the predictions. Boosting trains the base classifiers sequentially and uses the “residuals” from the previous classifiers to train the next classifier. The generic framework of Boosting consists of additive models and forward stepwise learning.
Although Step 2.a just represents the goal of this step mathematically with a single equation, it is the essential step that actually trains a new classifier. Step 2.a depends on the loss function L(y, f(x)), the base classifier β·b(x; γ), and the optimization method. Different choices of the optimization process for training the new classifier generate different boosting algorithms. For example, it could be the process of training a decision tree if the base classifier is a decision tree.
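The framework above can be sketched in code. This is a minimal illustration, not any particular library's implementation: it assumes MSE loss, a depth-1 regression tree (stump) as the base classifier, and exhaustive split search; the function names are hypothetical.

```python
import numpy as np

def fit_stump(X, y):
    # Exhaustively search one (feature, threshold) split minimizing squared error.
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:   # drop max so the right side is non-empty
            left = X[:, j] <= t
            lv, rv = y[left].mean(), y[~left].mean()
            err = ((y[left] - lv) ** 2).sum() + ((y[~left] - rv) ** 2).sum()
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    return best

def stump_predict(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def forward_stagewise(X, y, n_rounds=10, lr=0.5):
    # Step 1: initialize f_0(x) = 0.
    pred = np.zeros(len(y))
    model = []
    for _ in range(n_rounds):
        # Step 2.a: train a new base classifier on the current residuals
        # (for MSE loss, the residual is exactly y - f_{t-1}(x)).
        stump = fit_stump(X, y - pred)
        model.append(stump)
        # Step 2.b: add the new classifier to the additive model.
        pred += lr * stump_predict(stump, X)
    return model, pred
```

Each round fits the part of y that the current ensemble has not yet explained, which is exactly the additive-model, forward-stepwise idea.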
AdaBoost
AdaBoost is one of the boosting algorithms. It has specific rules to (1) compute the error/residual, (2) update the data distribution based on the error, and (3) compute weights for the models. In the forward stepwise learning framework above, AdaBoost is equivalent to minimizing the exponential loss at step 2.a. AdaBoost Tree uses a decision tree as the base classifier.
Boosting Tree with MSE Loss
Let’s keep the decision tree as the base classifier and develop the procedure for training boosting trees. For regression problems, a natural loss function is MSE. After substituting the ground truth y and the prediction ŷ^(t-1) + f_t(x), the objective function becomes a quadratic function of the new model f_t(x).
The same idea from the XGBoost official website:
Notice that the objective function is basically a quadratic function of the new model f_t(x). The coefficient of the first-order term is proportional to the residual, (ŷ^(t-1) − y). Therefore, it is straightforward to find the optimal f_t(x) that minimizes the objective function.
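Expanding the square makes this explicit (dropping terms that do not involve f_t):

```latex
\begin{aligned}
\text{obj}^{(t)}
&= \sum_{i=1}^{n}\left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 \\
&= \sum_{i=1}^{n}\left[\,2\left(\hat{y}_i^{(t-1)} - y_i\right) f_t(x_i) + f_t(x_i)^2\,\right] + \mathrm{const}
\end{aligned}
```

The first-order coefficient is twice the negated residual, and the second-order coefficient is a constant, so minimizing over f_t(x) is just minimizing a parabola.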
Gradient Boosting (Gradient Boosting Decision Tree, GBDT)
However, it is challenging to transform the objective function into this quadratic form for other loss functions, such as hinge loss, logistic loss, etc. One approach is to use the gradient to approximate the residual. Why does that work? For MSE, the negative gradient of the loss with respect to the current prediction, 2(y − ŷ^(t-1)), is exactly proportional to the residual; for a general loss, the negative gradient therefore serves as a “pseudo-residual” for the next classifier to fit.
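A small sketch of this idea, using my own (hypothetical) function names. For squared error the negative gradient recovers the residual up to a factor of 2; for logistic loss it gives the familiar y − p pseudo-residual.

```python
import numpy as np

def neg_gradient_mse(y, y_pred):
    # L = (y - y_pred)^2  =>  -dL/dy_pred = 2 * (y - y_pred), i.e. 2x the residual
    return 2 * (y - y_pred)

def neg_gradient_logloss(y, y_pred):
    # y in {0, 1}; y_pred is a raw score (logit), p = sigmoid(y_pred).
    # L = -y*log(p) - (1-y)*log(1-p)  =>  -dL/dy_pred = y - p
    return y - 1 / (1 + np.exp(-y_pred))
```

Gradient boosting fits the next tree against these values instead of the raw residual, which works for any differentiable loss.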
Extreme Gradient Boosting (XGBoost) takes a further step to transform the original objective function. It is inspired by the quadratic form for MSE. For other loss functions, the original objective function can also be approximated by a quadratic function of f_t(x) with a Taylor expansion. The coefficients of the first- and second-order terms, g_i and h_i, are computed from ŷ^(t-1) and do not depend on f_t(x). But before diving into the detailed derivation, let’s take a step back and review CART, because it is the most common base classifier used in XGBoost.
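Concretely, the second-order Taylor expansion of the loss around ŷ^(t-1) gives (this is the standard form from the XGBoost derivation):

```latex
\text{obj}^{(t)} \approx \sum_{i=1}^{n}\left[ L\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}} L\!\left(y_i, \hat{y}_i^{(t-1)}\right),
\quad
h_i = \partial^2_{\hat{y}^{(t-1)}} L\!\left(y_i, \hat{y}_i^{(t-1)}\right)
```

The first term is a constant for round t, so the effective objective is quadratic in f_t(x), just like the MSE case.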
Classification and Regression Tree (CART)
Although any base classifier can be used in Boosting, the decision tree is the most common one. CART was published in 1984 with a few features different from C4.5 and ID3: (1) it uses the Gini index to split features; (2) features can be both numeric and categorical; (3) it can be used for both classification and regression problems. Based on the content of many textbooks and courses, my understanding was that the target variable (the ground-truth y of each data point) ends up in the leaf nodes. A common way is to take the majority vote for classification and the average for regression.
But more generally, in the slides by Tianqi Chen, the score in a leaf node does not just represent the target variable (a label or a value). It can be interpreted more generally as a value that depends on the objective function. Here is an example of the leaf score from a CART.
What is +2? It looks like the tree classifies whether a person likes computer games or not. Wouldn’t the leaf just hold a label, ‘yes’ or ‘no’? But think about it this way: if the problem is defined so as to predict a “like” score (turning the label into some sort of score), and the objective function is simply to minimize the difference between the predicted and ground-truth scores, then the leaf score depends on the f(x) or ŷ_i from the objective function.
The regression tree is just a function that maps the features to the leaf score. The parameters of the tree consist of the structure and leaf scores.
Another part that confused me a bit: although the tree has a structure and leaf scores as parameters, it doesn’t seem to have “weights”, the common kind of parameter in ML algorithms. For example, a neural network consists of a structure and weights, which compute the scores. Unlike other algorithms, a tree doesn’t have “weights”; instead it has many regions R_j that are split by the features, with a score in each region. Here is an example of the mathematical representation of a tree.
It is like a step function.
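A toy version of f(x) = w_{q(x)}: q maps an input to a leaf index, and w holds one score per leaf. The tree structure below (age split, gender split) and the leaf scores follow the “likes computer games” example from Tianqi Chen’s slides, but the exact thresholds here are my own illustration.

```python
import numpy as np

# Leaf scores w_j, one per leaf: +2, 0.1, -1 as in the slide example.
w = np.array([2.0, 0.1, -1.0])

def q(x):
    """Map an input x = (age, is_male) to a leaf index in {0, 1, 2}."""
    age, is_male = x
    if age < 20:
        return 0 if is_male else 1   # young male -> leaf 0, young female -> leaf 1
    return 2                          # everyone else -> leaf 2

def f(x):
    # The whole tree is just a step function: look up the leaf score.
    return w[q(x)]
```

The “parameters” of this model are exactly the structure inside q (which features, which thresholds) plus the vector w.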
Here, the author refers to a more general CART.
I wrote down the most confusing parts in this section when I studied Boosting and XGBoost. The confusion mainly came from my previous narrow understanding of how CART works, of the leaf score, and from a lack of knowledge about how to represent a tree mathematically. This section clears all of that up!
XGBoost
Taylor expansion of the objective function. What is the engineering benefit? Notice the indices of g and h. What is the size of g and h? Right, each data point has its own g_i and h_i, which can be computed in parallel.
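Since each (g_i, h_i) depends only on (y_i, ŷ_i), the n pairs can be computed independently; here vectorized with NumPy, and in general distributed across workers. The sketch below uses the logistic loss on raw scores (the standard g = p − y, h = p(1 − p) form); the function name is my own.

```python
import numpy as np

def grad_hess_logloss(y, y_hat):
    """Per-point gradient and Hessian of the logistic loss w.r.t. the raw score.

    y in {0, 1}; y_hat is the current raw prediction (logit).
    g_i = p_i - y_i, h_i = p_i * (1 - p_i), computed elementwise,
    so every data point's pair is independent of the others.
    """
    p = 1 / (1 + np.exp(-y_hat))
    return p - y, p * (1 - p)
```

This independence is what lets XGBoost parallelize the statistics-gathering step across data points.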
Regularization on the tree structure
There are two questions: how to define the mathematical representation of the tree, and how to define the complexity of the tree.
Now let’s dive into the regularization term in the objective function. Regularization has two common forms, L1 and L2, both restricting the parameters/weights. What are the parameters in a decision tree?
As discussed in the CART section, the mathematical representation of a CART is like a step function. T is the number of leaves. q(x) maps an input x ∈ R^d to one of the T leaf indices; it encodes the tree structure characterized by the specific features and splits. w_j is the leaf score, where j = q(x) ∈ [1, T], so w_j is the output of f(x). Regularizing T and w_j intuitively acts like pruning, restricting the tree from growing too many branches. Substitute the regularization term into the Taylor-expanded objective function and replace f(x) with w_j:
G_j and H_j can be computed from the data. Thus, if T and q are given (so each leaf’s instance set I_j is known), the only unknowns, the w_j, can be determined easily because the objective function is quadratic in each w_j. The optimal w_j* and obj* are
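In the standard notation, with G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i, grouping the expanded objective by leaf gives:

```latex
\text{obj}^{(t)} = \sum_{j=1}^{T}\left[ G_j\, w_j + \tfrac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T,
\qquad
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
```

Each leaf’s parabola is minimized independently, which is why a fixed structure q immediately determines the best leaf scores.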
The objective function represents how good the tree is in terms of how well it minimizes the total loss. Up to this point, the only question left is how to find the tree structure, i.e., q(x) and T.
An interesting observation is that the objective function XGBoost uses on the data is the same one used to evaluate a feature split at each node. Decision trees and random forests are similar in this respect: the objective for a feature split is entropy/Gini index or MSE, and the objective on the data is also entropy or MSE.
A new way of learning the tree structure
Here is how to use G_j²/(H_j + lambda) to split a node, computing the “gain” similarly to a decision tree.
An example:
XGBoost redefines the metric used to split features, replacing information gain or the Gini index. The new metric is not arbitrary; it follows from the derivation, including the original objective function, the Taylor expansion, and the regularization term. The parameters in the XGBoost package are documented here.
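For reference, the symbols from the derivation map onto package parameters roughly as below. The names follow the xgboost scikit-learn API as I understand it; double-check against the official parameter documentation.

```python
# Hedged mapping from derivation symbols to xgboost (sklearn API) parameter names:
params = {
    "reg_lambda": 1.0,     # lambda: L2 penalty on leaf scores w_j
    "gamma": 0.0,          # gamma: penalty per leaf, the gamma * T term
    "max_depth": 6,        # caps T indirectly by limiting tree depth
    "learning_rate": 0.3,  # eta: shrinkage on each new f_t(x)
}
```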
Perhaps the biggest confusion lies in what the leaf score in CART actually is. Having learned decision trees, one tends to assume that a leaf holds the majority vote of the labels or the average of y; in any case, the target variable of the training data. But the author gives a more general definition: the score depends entirely on the objective function. It can be the original target variable of the training data, or the residual in the additive model. In fact it doesn’t matter: as long as the optimization follows the derivation, each round’s scores minimize the residual, and the final summed model is guaranteed to be closest to the original target variable.
Another confusion, for me: seeing CART, I assumed the Gini index would be used. But on reflection, the choice of entropy or the Gini index also seems somewhat arbitrary; they are just more strictly convex than the misclassification rate. XGBoost, by contrast, derives its Gain from the objective function above and the Taylor approximation, and it works better than Gini precisely because it is the result of optimization.
There is also the mathematical representation of CART, which folds pruning into the Gain itself. Although XGBoost is still greedy, this is more quantitative than the older pre-/post-pruning approaches.
Useful references: