C4.5 algorithm
From Wikipedia, the free encyclopedia
c4.5 is a decision tree generating algorithm, based on the ID3 algorithm. It contains several improvements, especially needed for software implementation. Improvements contains:
- Choosing an appropriate attribute selection measure.
- Handling training data with missing attribute values.
- Handling attributes with differing costs.
- Handling continuous attributes
Contents |
[edit] Data Structures
Whole tree is kept in _tree_record defined as follows:
typedef struct _tree_record { short NodeType; /* 0=leaf 1=branch 2=cut 3=subset */ ClassNo Leaf; /* most frequent class at this node */ ItemCount Items; /* no of items at this node */ ItemCount *ClassDist; /* class distribution of items */ ItemCount Errors; /* no of errors at this node */ Attribute Tested; /* attribute referenced in test */ short Forks; /* number of branches at this node */ float Cut; /* threshold for continuous attribute */ float Lower; /* lower limit of soft threshold */ float Upper; /* upper limit ditto */ Set *Subset; /* subsets of discrete values */ Tree *Branch; /* Branch[x] = (sub)tree for outcome x */ }
Algorithm works as follows:
(Additional file-line information are based on Quinlan original implementation)
[edit] Tree building
whole algorithm is written in <build.c:FormTree:162>:
- if all cases are of the same class, the tree is a leaf and so the leaf is returned labelled with this class it counts class frequency distribution, ClassFreq.
ClassFreq is used further to determine what to do with this leaf. We stop iteration if all test cases are in the same class (so we have all knowledge that we need) or if possible cases are less than MINOBJS.
- for each attribute, calculate the potential information provided by a test on the attribute (see gain counting below) <discr.c:EvalDiscreteAtt:19> <contin.c:EvalContinuousAtt:28> (based on the probabilities of each case having a particular value for the attribute), and the gain in information that would result from a test on the attribute (based on the probabilities of each case with a particular value for the attribute being of a particular class).
- on the basis of these figures, and depending on the current selection criterion, find the best attribute to branch on and check out whether the decision of branching the tree on this particular example would not be worse than making a leaf. Here are also some improvements like not using test cases with unknown attribute value (all tests are )
[edit] Pruning
Try to prune the tree and show the result if pruning succeeds <prune.c:Prune:31>. All pruning is done by removing unnecessary subset tests on missing values <prune.c:CheckPossibleValues:260>.
[edit] Iteration case
In iteration case we try to generate TRIALS trees <besttree.c:Iterate:275>, first one with initial window and every next with window bigger by increment. Each one constructs a classifier tree using the data items in the window (part of test data) , then test for the successful classification of other data items by this tree. If there are misclassified items, put them immediately after the items in the window, increase the size of the window and build another classifier tree, and so on until we have a tree which successfully classifies all of the test items or no improvement is apparent.
Than best of such trees is chosen and saved.
[edit] Counting gain
In short:
We define Entropy of T as:
where we iterate over all possible values of T. We define conditional Entropy as:
We define Gain by:
- Gain(T,V) = Entropy(T) − Entropy(V | T)
Maximizing Gain prefers arguments with more possible values, so we are trying to maximize Gain divided by overall entropy due to split argument T by value V.
[edit] References
- Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[edit] External links
- Original implementation on Ross Quinlan's homepage: http://www.rulequest.com/Personal/
- Source: http://www2.cs.uregina.ca