Surrogate Splits Everywhere

Writing The Art of Feature Engineering was quite intense, and there were parts that I enjoyed more than others. By far the bit I enjoyed the most was systematically filling any gaps I had on the topic, for example rediscovering details of tools I have used continuously for years.

This is the case with surrogate splits, the ability of decision trees to do smart imputation together with classification. This feature, often left unimplemented, is part of the original formulation of Classification And Regression Trees (CART):

Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software

When building a decision tree, the best feature to split the dataset is chosen at each node. With surrogate splits, a second feature is also chosen, to be used in case the best feature is undefined. In CART, the surrogate is the split on another feature that most closely reproduces the partition made by the best split. This way, a decision tree with surrogate splits can handle some degree of missing values in a principled way. Compare that to feature imputation, where we try to fill the missing values in a way that will not attract too much attention from the classifier (because if the classifier bases its prediction on an imputed value, it is just imagining the result; the prediction is not based on observed data).
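To make the idea concrete, here is a minimal sketch in plain Python of how a single surrogate split could be picked and used at one node. It is not taken from CART, rpart or OpenCV: the function names (find_surrogate, route), the toy data and the use of None for a missing value are all assumptions made for illustration. It also keeps only one surrogate, whereas real implementations typically rank several and compare them against a simple majority rule.

```python
def goes_left(value, threshold):
    """A row goes to the left child when its value is at or below the threshold."""
    return value <= threshold


def find_surrogate(rows, primary, threshold):
    """Return (agreement, feature, threshold, flipped) for the split on another
    feature that best reproduces the primary split, or None if there is none."""
    # Only rows where the primary feature is observed tell us the target direction.
    observed = [r for r in rows if r[primary] is not None]
    best = None
    for f in range(len(rows[0])):
        if f == primary:
            continue
        usable = [r for r in observed if r[f] is not None]
        for t in sorted({r[f] for r in usable}):
            agree = sum(
                goes_left(r[f], t) == goes_left(r[primary], threshold)
                for r in usable
            )
            # A flipped surrogate sends rows the opposite way, so it agrees
            # exactly where the unflipped one disagrees.
            for flipped, score in ((False, agree), (True, len(usable) - agree)):
                if best is None or score > best[0]:
                    best = (score, f, t, flipped)
    return best


def route(row, primary, threshold, surrogate, default="left"):
    """Send a row left or right, falling back to the surrogate when the
    primary feature is missing, and to a default when both are missing."""
    if row[primary] is not None:
        return "left" if goes_left(row[primary], threshold) else "right"
    if surrogate is not None:
        _, f, t, flipped = surrogate
        if row[f] is not None:
            left = goes_left(row[f], t) != flipped
            return "left" if left else "right"
    return default


# Toy data (hypothetical): feature 0 is the primary split feature,
# feature 1 happens to be correlated with it. None marks a missing value.
rows = [
    (1.0, 10.0),
    (2.0, 20.0),
    (3.0, 30.0),
    (4.0, 40.0),
    (None, 15.0),
    (None, 35.0),
]
surrogate = find_surrogate(rows, primary=0, threshold=2.5)
```

With this toy data, the two rows missing feature 0 are routed by their value of feature 1 instead of by a filled-in guess for feature 0, which is the contrast with imputation drawn above.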

As far as I know, surrogate splits are implemented in the OpenCV ML library (it turns out that OpenCV has a very efficient ML library that can be used outside computer vision) and the rpart R package.

It would be great to add this functionality to other decision tree and random forest libraries, like scikit-learn, Spark ML or even Weka.