|
About
NeuroCOLT
Papers
Archive
Books
info@neurocolt.org
|
NeuroCOLT
Technical Report NC-TR-96-041
Finding
Optimal Multi-Splits for Numerical Attributes in Decision Tree Learning
Tapio
Elomaa
University of Helsinki
Finland
Juho
Rousu
University of Helsinki
Finland
Abstract
Handling continuous attribute ranges remains a deficiency of
top-down induction of Decision Trees. They require special treatment
and do not fit the learning scheme as well as one could hope for.
Nevertheless, they are common in practical tasks and, therefore, need
to be taken into account. This topic has attracted abundant
attention in recent years. In particular, Fayyad and Irani showed
how optimal binary partitions can be found efficiently. Later, they
based a greedy heuristic multipartitioning algorithm on these results.
Recently, Fulton, Kasif, and Salzberg attempted to develop algorithms
for finding the optimalmulti-split for a numerical attribute in one
phase. We prove that, similarly as in the binary partitioning,
only boundary points need to be inspected in order to find the optimal
multipartition of a numerical value range. We develop efficient algorithms
for finding the optimal splitting into more than two intervals. The
resulting partition is guaranteed to be optimal w.r.t. the function
that is used to evaluate the attributes' utility in class prediction.
We contrast our method with alternative approaches in initial empirical
experiments. They show that the new method surpasses the greedy heuristic
approach of Fayyad and Irani constantly in the goodness of the produced
multi-split, but, with small data sets, cannot quite attain the efficiency
of the greedy approach. Furthermore, our experiments reveal that one
of the techniques proposed by Fulton, Kasif, and Salzberg is of scarce
use in practical tasks, since its time consumption falls short of
all demands. In addition, it categorically fails in finding the optimal
multi-split because of an error in the rationale of the method.
Download Compressed
Postscript
|