NeuroCOLT

Neural Networks and Computational Learning Theory

 

About NeuroCOLT

Papers Archive

1994 1995
1996 1997
1998 1999
2000 2001

Books

info@neurocolt.org

NeuroCOLT Technical Report NC-TR-96-041

Finding Optimal Multi-Splits for Numerical Attributes in Decision Tree Learning

Tapio Elomaa
University of Helsinki
Finland

Juho Rousu
University of Helsinki
Finland

Abstract
Handling continuous attribute ranges remains a deficiency of top-down induction of Decision Trees. They require special treatment and do not fit the learning scheme as well as one could hope for. Nevertheless, they are common in practical tasks and, therefore, need to be taken into account.  This topic has attracted abundant attention in recent years. In particular, Fayyad and Irani showed how optimal binary partitions can be found efficiently. Later, they based a greedy heuristic multipartitioning algorithm on these results. Recently, Fulton, Kasif, and Salzberg attempted to develop algorithms for finding the optimalmulti-split for a numerical attribute in one phase.  We prove that, similarly as in the binary partitioning, only boundary points need to be inspected in order to find the optimal multipartition of a numerical value range. We develop efficient algorithms for finding the optimal splitting into more than two intervals. The resulting partition is guaranteed to be optimal w.r.t. the function that is used to evaluate the attributes' utility in class prediction.  We contrast our method with alternative approaches in initial empirical experiments. They show that the new method surpasses the greedy heuristic approach of Fayyad and Irani constantly in the goodness of the produced multi-split, but, with small data sets, cannot quite attain the efficiency of the greedy approach. Furthermore, our experiments reveal that one of the techniques proposed by Fulton, Kasif, and Salzberg is of scarce use in practical tasks, since its time consumption falls short of all demands. In addition, it categorically fails in finding the optimal multi-split because of an error in the rationale of the method.

Download Compressed Postscript