2018-06-26/ Amarnath Mishra

Random Forest is an ensemble technique that brings down over-fitting while developing a prediction model, i.e. the tendency of a model to fit noise in the training data rather than the underlying pattern.

A prediction model comes with its bias and variance factors. Bias is the difference between the actual value and the expected predicted value. Variance, as we know, is the mean of squared differences between the predicted values and their mean value. One would definitely like to aim for a model with minimum variance (i.e. a suitably narrow range of predicted values) that is also unbiased, known as a minimum variance unbiased estimator (MVUE). In order to move toward such a model, however, randomness is at times deliberately introduced into the training process.
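To make the two notions concrete, here is a toy computation (the numbers are invented purely for illustration). The predictions below happen to average exactly to the true value, so this estimator is unbiased, with a variance of 0.5:

```python
import numpy as np

# Hypothetical numbers: the true value is 10.0, and a model produces
# these predictions across repeated samples.
true_value = 10.0
predictions = np.array([9.0, 10.5, 11.0, 9.5, 10.0])

# Bias: difference between the mean prediction and the actual value.
bias = predictions.mean() - true_value

# Variance: mean of squared differences between the predictions and their mean.
variance = np.mean((predictions - predictions.mean()) ** 2)

print(f"bias = {bias:.2f}, variance = {variance:.2f}")  # bias = 0.00, variance = 0.50
```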

Thus, when a model is being trained, in addition to the training error (or bias, which is removed through cross-validation in general machine learning algorithms), there is scope for inducing generalization error (or noise). This phenomenon is called over-fitting and, unless treated, will add error to predictions on new data. Random Forests are known to substantially reduce this error.

As the name suggests, Random Forest is a collection of (decision) trees. It will add value to this discussion if we stop briefly at the concept of a decision tree before we proceed further. In fact, you will start to appreciate the Random Forest technique all the more after this brief stop.

Decision Trees are classifier models that help in segregating the instance space, i.e. the dataset at hand. They work on the basic fact that instances in such spaces are described by attributes, the features that can segregate the space usefully. At a given node, the attribute that best separates the instances is taken in priority, followed by the subsequent ones at deeper nodes; this choice is guided by entropy, a measure of impurity that quantifies how informative a division of the instance space is. The more a split reduces entropy, the more information is received about the space and hence the better the division. Let's look at an instance space and the adjoining decision tree to understand this better.

The basketball game data below is the instance space, which we shall use to train and develop a decision tree. We use the value of information gain to choose an attribute, using each attribute at most once along any route of this decision tree. For that, we always divide the instances at a given node into 2 classes, say W and L, as per the final target concept (Won and Lost in this case).

Gain (at a node) = [Entropy at that node, i.e. -p(W) × log2(p(W)) - p(L) × log2(p(L))] - [Expected entropy subsequent to partitioning, i.e. the sum over the subsets S created by the split of (WS + LS)/(W + L) × H(S), where H(S) = -p(WS) × log2(p(WS)) - p(LS) × log2(p(LS))]

For example, at the root node we find that the maximum value of information gain is achieved by using the "When" attribute.

Entropy of 1st set, H(5pm) = -1/4 log2(1/4) - 3/4 log2(3/4) ≈ 0.81

Entropy of 2nd set, H(7pm) = -9/12 log2(9/12) - 3/12 log2(3/12) ≈ 0.81

Entropy of 3rd set, H(9pm) = -0/4 log2(0/4) - 4/4 log2(4/4) = 0

Expected entropy after partitioning = 4/20 × H(5pm) + 12/20 × H(7pm) + 4/20 × H(9pm) ≈ 0.65

Information Gain = 1 - 0.65 = 0.35 (the entropy at the root is 1, since the 20 games split evenly into 10 Won and 10 Lost)
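The worked example above can be reproduced in a few lines. The (W, L) counts per subset below are an assumption inferred from the entropy expressions (entropy alone cannot tell which class holds the 1 versus the 3 at 5pm, but the result is unaffected either way):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts, e.g. (wins, losses)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (W, L) counts per value of the "When" attribute, as implied by the
# entropy calculations above: 20 games, split 10 Won / 10 Lost overall.
subsets = {"5pm": (1, 3), "7pm": (9, 3), "9pm": (0, 4)}

root = entropy([10, 10])                                        # 1.0
expected = sum(sum(c) / 20 * entropy(c) for c in subsets.values())
gain = root - expected

print(f"expected entropy = {expected:.2f}")                     # 0.65
print(f"information gain = {gain:.2f}")                         # 0.35
```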

Division along a certain path in the tree stops when all the attributes on that path have been expended, or when the entropy at a node is 0 (essentially, all the instances at that node belong to the same class).

This is how the division at each node happens, and the complete tree shown above is built. Notice that an attribute (other than the root attribute, which in the above case was "When") may be used recursively across different paths of the tree, with each choice again driven by information gain. You can refer to my earlier blog to see the CHAID algorithm at work in developing a decision tree.
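As a sketch of the recursive construction just described, here is a minimal ID3-style routine (the function names are illustrative, and instances are assumed to be dicts of attribute values). Note that the chosen attribute is removed only along the current path, so other branches remain free to reuse it deeper down:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain = node entropy minus expected entropy after splitting on attr."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    expected = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - expected

def id3(rows, labels, attrs):
    # Stop if the node is pure or no attributes remain on this path.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]     # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        # The chosen attribute is expended only along this path; other
        # branches may still pick it again further down.
        branches[value] = id3([r for r, _ in sub], [y for _, y in sub],
                              attrs - {best})
    return (best, branches)
```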

Random Forest was introduced to build upon the laborious task of formulating a decision tree, and to chisel away the noise that a single tree picks up.

The trees in a Random Forest are shaped with the help of randomly selected training data, sampled with replacement. Here is an illustration to that effect.

In Random Forest there is no separate test set for cross-validation to remove the bias; instead, through the above-shown process of bootstrapping, subsets of the training data are randomly drawn (with replacement) and an individual decision tree is shaped by each subset. As a rule of thumb, each such subset covers about 2/3rd of the distinct training observations. These decision trees are combined to give the resulting prediction, and the model thus formed is called a Random Forest. The combination, often termed aggregation, is done either by averaging the outputs (for regression-type problems) or by majority voting (for classification-type problems, e.g. whether the next branch's mango is sweet or sour). In the case of a binary dependent variable, the votes are either YES or NO and the proportion of YES votes gives the RF score; in regression, the score is the average of the trees' predicted values. The total process of generating the RF score is termed bagging (Bootstrap Aggregating).
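The bagging procedure described above can be sketched as follows, using scikit-learn's DecisionTreeClassifier as the base learner (the function name and parameters are illustrative; a real Random Forest additionally samples a random subset of features at each split, which plain bagging omits):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_trees=50, seed=0):
    """Bagging sketch: fit each tree on a bootstrap sample (drawn with
    replacement) and combine predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)      # bootstrap sample, with replacement
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    # Majority vote across trees; for regression one would average instead.
    return (np.stack(votes).mean(axis=0) >= 0.5).astype(int)
```

For a binary target, the mean of the votes taken before thresholding is exactly the RF score described above.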

An example where this bootstrapping process is applied to a regression-based problem can be found at this link.

The observations that remain outside a given bootstrap sample are referred to as OOB (out-of-bag) and are used for further validation and for approximating the generalization error. By popular practice, roughly 1/3rd of the data is out-of-bag for any given tree and serves this validation purpose. This also satisfies the requirement that the data used to build the forest and the OOB data used to validate it come from the same source.
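scikit-learn's RandomForestClassifier exposes this OOB validation directly via oob_score=True (the dataset below is synthetic and the parameter values are illustrative). Each bootstrap sample leaves out roughly 1/3rd of the rows (about 36.8%, since (1 - 1/n)^n tends to 1/e), matching the rule of thumb above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# oob_score=True asks the forest to evaluate each tree on the rows it
# did not see in its bootstrap sample (the out-of-bag rows).
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42).fit(X, y)

print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```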

A very naïve example to explain the purpose of Random Forest is the approach to answering this question: are teenagers always up to trouble? If your responding set of people consists mostly of old folks, you are sure to get a biased collective response. However, if the set contains a diversified ensemble of people from different age groups, the bias is removed and the validity of the response improves. Similarly, if you have a bagged ensemble of diverse decision trees created from one training dataset, as described in the sections above, the chances of getting accurate predictions almost all the time increase manifold.