Kernel random forest

In machine learning, kernel random forests establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.^[1]

History

Leo Breiman^[2] was the first person to notice the link between random forest and kernel methods. He pointed out that random forests which are grown using i.i.d random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon ^[3] established the connection between random forests and adaptive nearest neighbor, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani^[4] proposed Random Forest Kernel and show that it can empirically outperform state-of-art kernel methods. Scornet^[1] first defined KeRF estimates and gave the explicit link between KeRF estimates and random forest. He also gave explicit expressions for kernels based on centred random forest^[5] and uniform random forest,^[6] two simplified models of random forest. He named these two KeRFs by Centred KeRF and Uniform KeRF,and proved upper bounds on their rates of consistency.

Notations and definitions

Preliminaries: Centred forests

Centred forest^[5] is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level $k$ is built, where $k\in {\mathbb {N}}$ is a parameter of the algorithm.

Uniform forest

Uniform forest^[6] is another simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at a point uniformly drawn on the side of the cell, along the preselected attribute.

From random forest to KeRF

Given a training sample ${\mathcal {D}}_{n}=\{({\mathbf {X}}_{i},Y_{i})\}_{{i=1}}^{n}$ of $[0,1]^{p}\times {\mathbb {R}}$ -valued independent random variables distributed as the independent prototype pair $({\mathbf {X}},Y)$ , where ${\mathbb {E}}[Y^{2}]<\infty$ . We aim at predicting the response $Y$ ,associated with the random variable $\mathbf {X}$ , by estimating the regression function $m({\mathbf {x}})={\mathbb {E}}[Y|{\mathbf {X}}={\mathbf {x}}]$ . A random regression forest is an ensemble of $M$ randomized regression trees. Denote $m_{n}({\mathbf {x}},{\mathbf {\Theta }}_{j})$ the predicted value at point $\mathbf {x}$ by the $j$ -th tree, where ${\mathbf {\Theta }}_{1},\cdots ,{\mathbf {\Theta }}_{M}$ are independent random variables, distributed as a generic random variable ${\mathbf {\Theta }}$ , independent of the sample $\mathcal{D}_n$ . This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate $m_{{M,n}}({\mathbf {x}},\Theta _{1},\cdots ,\Theta _{M})={\frac {1}{M}}\sum _{{j=1}}^{M}m_{n}({\mathbf {x}},\Theta _{j})$ . For regression trees, we have $m_{n}=\sum _{{i=1}}^{n}{\frac {Y_{i}{\mathbf {1}}_{{{\mathbf {X}}_{i}\in A_{n}({\mathbf {x}},\Theta _{j})}}}{N_{n}({\mathbf {x}},\Theta _{j})}}$ , where $A_{n}({\mathbf {x}},\Theta _{j})$ is the cell containing $\mathbf {x}$ , designed with randomness $\Theta _{j}$ and dataset $\mathcal{D}_n$ , and $N_{n}({\mathbf {x}},\Theta _{j})=\sum _{{i=1}}^{n}{\mathbf {1}}_{{{\mathbf {X}}_{i}\in A_{n}({\mathbf {x}},\Theta _{j})}}$ .

Thus random forest estimates satisfy, for all ${\mathbf {x}}\in [0,1]^{d}$ , $m_{{M,n}}({\mathbf {x}},\Theta _{1},\ldots ,\Theta _{M})={\frac {1}{M}}\sum _{{j=1}}^{M}\left(\sum _{{i=1}}^{n}{\frac {Y_{i}{\mathbf {1}}_{{{\mathbf {X}}_{i}\in A_{n}({\mathbf {x}},\Theta _{j})}}}{N_{n}({\mathbf {x}},\Theta _{j})}}\right)$ . Random regression forest has two level of averaging, first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than that of observations which belong to less populated cells. In order to improve the random forest methods and compensate the misestimation, Scornet^[1] defined KeRF by

{\tilde {m}}_{{M,n}}({\mathbf {x}},\Theta _{1},\ldots ,\Theta _{M})={\frac {1}{\sum _{{j=1}}^{M}N_{n}({\mathbf {x}},\Theta _{j})}}\sum _{{j=1}}^{M}\sum _{{i=1}}^{n}Y_{i}{\mathbf {1}}_{{{\mathbf {X}}_{i}\in A_{n}({\mathbf {x}},\Theta _{j})}}

which is equal to the mean of the $Y_{i}$ 's falling in the cells containing $\mathbf {x}$ in the forest. If we define the connection function of the $M$ finite forest as $K_{M,n}(\mathbf {x} ,\mathbf {z} )={\frac {1}{M}}\sum _{j=1}^{M}\mathbf {1} _{\mathbf {z} \in A_{n}(\mathbf {x} ,\Theta _{j})}$ , i.e. the proportion of cells shared between $\mathbf {x}$ and $\mathbf {z}$ , then almost surely we have ${\tilde {m}}_{M,n}(\mathbf {x} ,\Theta _{1},\ldots ,\Theta _{M})={\frac {\sum _{i=1}^{n}Y_{i}K_{M,n}(\mathbf {x} ,\mathbf {x} _{i})}{\sum _{\ell =1}^{n}K_{M,n}(\mathbf {x} ,\mathbf {x} _{\ell })}}$ , which defines the KeRF.

Centred KeRF

The construction of Centred KeRF of level $k$ is the same as for centred forest, except that predictions are made by ${\tilde {m}}_{{M,n}}({\mathbf {x}},\Theta _{1},\ldots ,\Theta _{M})$ , the corresponding kernel function, or connection function is

K_{k}^{{cc}}({\mathbf {x}},{\mathbf {z}})=\sum _{{k_{1},\ldots ,k_{d},\sum _{{j=1}}^{d}k_{j}=k}}{\frac {k!}{k_{1}!\ldots k_{d}!}}\left({\frac {1}{d}}\right)^{k}\prod _{{j=1}}^{d}{\mathbf {1}}_{{\lceil 2^{{k_{j}}}x_{j}\rceil =\lceil 2^{{k_{j}}}z_{j}\rceil }}

, for all

{\mathbf {x}},{\mathbf {z}}\in [0,1]^{d}

Uniform KeRF

Uniform KeRF is built in the same way as uniform forest, except that predictions are made by ${\tilde {m}}_{{M,n}}({\mathbf {x}},\Theta _{1},\ldots ,\Theta _{M})$ , the corresponding kernel function, or connection function is

K_{k}^{{uf}}({\mathbf {0}},{\mathbf {x}})=\sum _{{k_{1},\ldots ,k_{d},\sum _{{j=1}}^{d}k_{j}=k}}{\frac {k!}{k_{1}!\ldots k_{d}!}}\left({\frac {1}{d}}\right)^{k}\prod _{{m=1}}^{d}\left(1-|x_{m}|\sum _{{j=0}}^{{k_{m}-1}}{\frac {(-\ln |x_{m}|)^{j}}{j!}}\right)

, for all

{\mathbf {x}}\in [0,1]^{d}

Properties

Relation between KeRF and random forest

Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

{\text{Assume that there exist sequences }}(a_{n}),(b_{n}){\text{ such that, a.s.}},a_{n}\leq N_{n}({\mathbf {x}},\Theta )\leq b_{n}{\text{ and }}a_{n}\leq {\frac {1}{M}}\sum _{{m=1}}^{M}N_{n}{{\mathbf {x}},\Theta _{m}}\leq b_{n},{\text{ then almost surely}},

|m_{{M,n}}({\mathbf {x}})-{\tilde {m}}_{{M,n}}({\mathbf {x}})|\leq {\frac {b_{n}-a_{n}}{a_{n}}}{\tilde {m}}_{{M,n}}({\mathbf {x}})

Relation between infinite KeRF and infinite random forest

When the number of trees $M$ goes to infinity, then we have infinite random forest and infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

{\text{Assume that there exist sequences }}(\varepsilon _{n}),(a_{n}),(b_{n}){\text{ such that, a.s.}}

${\mathbb {E}}[N_{n}({\mathbf {x}},\Theta )]\geq 1$ ,
${\mathbb {P}}[a_{n}\leq N_{n}({\mathbf {x}},\Theta )\leq b_{n}\mid {\mathcal {D}}_{n}]\geq 1-\varepsilon _{n}/2$ ,
${\mathbb {P}}[a_{n}\leq {\mathbb {E}}_{{\Theta }}[N_{n}({\mathbf {x}},\Theta )]\leq b_{n}\mid {\mathcal {D}}_{n}]\geq 1-\varepsilon _{n}/2$ ,

${\text{Then almost surely,}}|m_{{\infty ,n}}({\mathbf {x}}-{\tilde {m}}_{{\infty ,n}}({\mathbf {x}})|\leq {\frac {b_{n}-a_{n}}{a_{n}}}{\tilde {m}}_{{\infty ,n}}({\mathbf {x}})+n\varepsilon _{n}\left(\max _{{1\leq i\leq n}}Y_{i}\right)$ .

Consistency results

Assume that $Y=m({\mathbf {X}})+\varepsilon$ , where $\varepsilon$ is a centred Gaussian noise, independent of $\mathbf {X}$ , with finite variance $\sigma ^{2}<\infty$ . Moreover, $\mathbf {X}$ is uniformly distributed on $[0,1]^{d}$ and $m$ is Lipschitz. Scornet^[1] proved upper bounds on the rates of consistency for centred KeRF and uniform KeRF.

Consistency of centred KeRF

Providing $k\rightarrow\infty$ and $n/2^{k}\rightarrow \infty$ , there exists a constant $C_{1}>0$ such that, for all $n$ , ${\mathbb {E}}[{\tilde {m}}_{n}^{{cc}}({\mathbf {X}})-m({\mathbf {X}})]^{2}\leq C_{1}n^{{-1/(3+d\log 2)}}(\log n)^{2}$ .

Consistency of uniform KeRF

Providing $k\rightarrow\infty$ and $n/2^{k}\rightarrow \infty$ , there exists a constant $C>0$ such that, ${\mathbb {E}}[{\tilde {m}}_{n}^{{uf}}({\mathbf {X}})-m({\mathbf {X}})]^{2}\leq Cn^{{-2/(6+3d\log 2)}}(\log n)^{2}$ .

References

1 2 3 4 Scornet, Erwan (2015). "Random forests and kernel methods". arXiv:1502.03836.
↑ Breiman, Leo (2000). "Some infinity theory for predictor ensembles" (PDF). Technical Report 579, Statistics Dept. UCB.
↑ Lin, Yi; Jeon, Yongho (2006). "Random forests and adaptive nearest neighbors". Journal of the American Statistical Association. 101 (474): 578–590. doi:10.1198/016214505000001230.
↑ Davies, Alex; Ghahramani, Zoubin (2014). "The Random Forest Kernel and other kernels for big data from random partitions". arXiv:1402.4293.
1 2 Breiman, Leo; Ghahramani, Zoubin (2004). "Consistency for a simple model of random forests". Statistical Department, University of California at Berkeley. Technical Report (670).
1 2 Arlot, Sylvain; Genuer, Robin (2014). "Analysis of purely random forests bias". arXiv:1407.3939.

This article is issued from Wikipedia - version of the 7/8/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.