C4.5 algorithm
From Wikipedia, the free encyclopedia
C4.5 is a decision tree generation algorithm based on the ID3 algorithm. It contains several improvements over ID3, many of them needed to make the algorithm practical to implement in software. The improvements include:
- Choosing an appropriate attribute selection measure.
- Handling training data with missing attribute values.
- Handling attributes with differing costs.
- Handling continuous attributes.
Data Structures
The whole tree is kept in a _tree_record structure, defined as follows:
struct _tree_record {
    short      NodeType;    /* 0=leaf 1=branch 2=cut 3=subset      */
    ClassNo    Leaf;        /* most frequent class at this node    */
    ItemCount  Items;       /* no of items at this node            */
    ItemCount *ClassDist;   /* class distribution of items         */
    ItemCount  Errors;      /* no of errors at this node           */
    Attribute  Tested;      /* attribute referenced in test        */
    short      Forks;       /* number of branches at this node     */
    float      Cut;         /* threshold for continuous attribute  */
    float      Lower;       /* lower limit of soft threshold       */
    float      Upper;       /* upper limit ditto                   */
    Set       *Subset;      /* subsets of discrete values          */
    Tree      *Branch;      /* Branch[x] = (sub)tree for outcome x */
};
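For illustration, a classifier might walk such a tree roughly as follows. This is a sketch, not a transcription of Quinlan's classify.c: the typedefs, the value encoding and the 1-based branch numbering are simplifying assumptions, only the fields of _tree_record used by the walk are repeated, and unknown or unseen values simply fall back to the most frequent class at the node (the real implementation treats missing values more carefully).

/* Simplified stand-ins for the typedefs used above (assumptions for this sketch). */
typedef short ClassNo;
typedef short Attribute;
typedef long  Set;                       /* assumed: bit mask over discrete values */

typedef struct _tree_record *Tree;
struct _tree_record {                    /* only the fields needed by the walk */
    short     NodeType;                  /* 0=leaf 1=branch 2=cut 3=subset      */
    ClassNo   Leaf;                      /* most frequent class at this node    */
    Attribute Tested;                    /* attribute referenced in test        */
    short     Forks;                     /* number of branches at this node     */
    float     Cut;                       /* threshold for continuous attribute  */
    Set      *Subset;                    /* subsets of discrete values          */
    Tree     *Branch;                    /* Branch[x] = (sub)tree for outcome x */
};

/* A case is assumed to provide a discrete value index and a continuous
   value for each attribute. */
typedef struct { short dval; float cval; } AttValue;

ClassNo Classify(Tree t, const AttValue *item)
{
    while (t->NodeType != 0) {                       /* descend until a leaf */
        AttValue v = item[t->Tested];
        int b = 0;

        switch (t->NodeType) {
        case 1:                                      /* one branch per discrete value */
            b = v.dval;
            break;
        case 2:                                      /* threshold on a continuous value */
            b = (v.cval <= t->Cut) ? 1 : 2;
            break;
        case 3:                                      /* branch whose value subset holds v */
            for (b = 1; b <= t->Forks; b++)
                if (t->Subset[b] & (1L << v.dval)) break;
            break;
        }
        if (b < 1 || b > t->Forks || !t->Branch[b])
            return t->Leaf;                          /* fall back to the majority class */
        t = t->Branch[b];
    }
    return t->Leaf;
}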
The algorithm works as follows (the file and line references below are based on Quinlan's original implementation):
Tree building
The core of the algorithm is implemented in <build.c:FormTree:162>:
- First, the class frequency distribution ClassFreq of the cases is counted. If all cases belong to the same class, the tree is a leaf, and the leaf labelled with this class is returned.
ClassFreq is also used to decide what to do with the node otherwise: recursion stops when all cases at the node are in the same class (so we already have all the information we need) or when the number of cases is less than MINOBJS.
- For each attribute, calculate the potential information provided by a test on the attribute (see Counting gain below) <discr.c:EvalDiscreteAtt:19> <contin.c:EvalContinuousAtt:28>, based on the probabilities of each case having a particular value for the attribute, and the gain in information that would result from a test on the attribute, based on the probabilities of each case with a particular value for the attribute belonging to a particular class.
- On the basis of these figures, and depending on the current selection criterion, find the best attribute to branch on, and check whether branching on this attribute would actually do better than simply making a leaf. There are also some refinements here, such as not counting cases with an unknown value of the tested attribute. A simplified sketch of this recursion is given below.
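The following is a minimal, self-contained sketch of the tree-building recursion on a toy problem with discrete attributes only. The data layout, constants and helper names are assumptions made for the illustration, not Quinlan's build.c; continuous attributes, unknown values and the finer points of the selection criterion are omitted.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NCLASSES 2              /* toy problem: two classes           */
#define NATTS    2              /* two discrete attributes            */
#define NVALS    3              /* each attribute takes values 0..2   */
#define MINOBJS  2              /* minimum number of cases to split   */

typedef struct Case { int att[NATTS]; int cls; } Case;

typedef struct Node {
    int leaf;                   /* 1 if this node is a leaf            */
    int cls;                    /* most frequent class at this node    */
    int tested;                 /* attribute tested here (if not leaf) */
    struct Node *branch[NVALS]; /* one subtree per attribute value     */
} Node;

/* Entropy of a class frequency table over `total` cases. */
static double Entropy(const int *freq, int n, int total)
{
    double h = 0.0;
    for (int i = 0; i < n; i++)
        if (freq[i] > 0) {
            double p = (double)freq[i] / total;
            h -= p * log2(p);
        }
    return h;
}

/* Information gain obtained by splitting cases[0..n) on attribute a. */
static double Gain(const Case *cases, int n, int a)
{
    int classFreq[NCLASSES] = {0};
    int valFreq[NVALS][NCLASSES] = {{0}};
    int valTotal[NVALS] = {0};
    double after = 0.0;

    for (int i = 0; i < n; i++) {
        classFreq[cases[i].cls]++;
        valFreq[cases[i].att[a]][cases[i].cls]++;
        valTotal[cases[i].att[a]]++;
    }
    for (int v = 0; v < NVALS; v++)
        if (valTotal[v] > 0)
            after += (double)valTotal[v] / n * Entropy(valFreq[v], NCLASSES, valTotal[v]);

    return Entropy(classFreq, NCLASSES, n) - after;
}

static Node *FormTree(const Case *cases, int n)
{
    Node *t = calloc(1, sizeof *t);
    int classFreq[NCLASSES] = {0};

    /* class frequency distribution and most frequent class */
    for (int i = 0; i < n; i++) classFreq[cases[i].cls]++;
    for (int c = 1; c < NCLASSES; c++)
        if (classFreq[c] > classFreq[t->cls]) t->cls = c;

    /* stop: all cases in one class, or too few cases to split */
    if (classFreq[t->cls] == n || n < MINOBJS) { t->leaf = 1; return t; }

    /* choose the attribute with the highest gain */
    int best = 0;
    double bestGain = -1.0;
    for (int a = 0; a < NATTS; a++) {
        double g = Gain(cases, n, a);
        if (g > bestGain) { bestGain = g; best = a; }
    }

    /* branching would be no better than a leaf: keep the leaf */
    if (bestGain <= 1e-9) { t->leaf = 1; return t; }
    t->tested = best;

    /* partition the cases on the chosen attribute and recurse */
    Case *sub = malloc(n * sizeof *sub);
    for (int v = 0; v < NVALS; v++) {
        int m = 0;
        for (int i = 0; i < n; i++)
            if (cases[i].att[best] == v) sub[m++] = cases[i];
        t->branch[v] = (m > 0) ? FormTree(sub, m) : NULL;
    }
    free(sub);
    return t;
}

int main(void)
{
    /* four toy cases: the class is 1 exactly when attribute 0 has value 2 */
    Case data[] = { {{0, 1}, 0}, {{1, 2}, 0}, {{2, 0}, 1}, {{2, 1}, 1} };
    Node *tree = FormTree(data, 4);
    printf("root tests attribute %d\n", tree->tested);
    return 0;
}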
Pruning
The algorithm then tries to prune the tree and shows the result if pruning succeeds <prune.c:Prune:31>. All pruning is done by removing unnecessary subset tests on missing values <prune.c:CheckPossibleValues:260>.
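Loosely in the spirit of this step, the sketch below removes tests whose outcomes can no longer occur on the current path. It is not a transcription of prune.c (subset tests and missing values are ignored) and it reuses the toy Node, NATTS and NVALS definitions from the tree-building sketch above; possible[a][v] is non-zero when value v of attribute a can still occur on the path, and is initially non-zero everywhere.

static Node *Simplify(Node *t, int possible[NATTS][NVALS])
{
    if (!t || t->leaf) return t;

    int kept = 0;
    Node *last = NULL;

    for (int v = 0; v < NVALS; v++) {
        if (!possible[t->tested][v]) {           /* this value cannot occur here */
            t->branch[v] = NULL;
            continue;
        }
        if (!t->branch[v]) continue;

        /* below this branch, attribute t->tested can only take value v */
        int saved[NVALS];
        for (int w = 0; w < NVALS; w++) {
            saved[w] = possible[t->tested][w];
            possible[t->tested][w] = (w == v);
        }
        t->branch[v] = Simplify(t->branch[v], possible);
        for (int w = 0; w < NVALS; w++)
            possible[t->tested][w] = saved[w];

        kept++;
        last = t->branch[v];
    }

    if (kept == 0) { t->leaf = 1; return t; }    /* no branch survives: make a leaf   */
    return (kept == 1) ? last : t;               /* one branch left: the test is unnecessary */
}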
Iteration case
In the iteration case the algorithm tries to generate TRIALS trees <besttree.c:Iterate:275>, the first with the initial window and each subsequent one with a window larger by a fixed increment. Each trial constructs a classifier tree from the data items in the window (a subset of the training data), then tests whether this tree successfully classifies the remaining data items. If there are misclassified items, they are put immediately after the items in the window, the window is enlarged, and another classifier tree is built, and so on until a tree is obtained that successfully classifies all of the remaining items or no improvement is apparent.
Then the best of these trees is chosen and saved.
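A rough sketch of one trial of this windowing loop is given below, reusing the toy Case and Node types and the FormTree() function from the tree-building sketch above. Predict() is a small helper added for the illustration; the outer loop over TRIALS, the per-trial window increment and the "no improvement is apparent" test are omitted, and trees from earlier rounds are simply leaked.

/* Predict the class of a case with a tree built by the FormTree sketch above. */
static int Predict(const Node *t, const Case *c)
{
    while (!t->leaf && t->branch[c->att[t->tested]])
        t = t->branch[c->att[t->tested]];
    return t->cls;                               /* majority class if no branch fits */
}

/* One trial: grow the window until the tree built from it classifies
   every remaining case correctly. */
static Node *OneTrial(Case *data, int n, int window)
{
    for (;;) {
        Node *tree = FormTree(data, window);     /* build from the window only */

        /* move misclassified cases directly after the window */
        int missed = 0;
        for (int i = window; i < n; i++)
            if (Predict(tree, &data[i]) != data[i].cls) {
                Case tmp = data[window + missed];
                data[window + missed] = data[i];
                data[i] = tmp;
                missed++;
            }

        if (missed == 0 || window == n)
            return tree;                         /* tree explains all cases */

        window += missed;                        /* enlarge the window and rebuild */
    }
}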
Counting gain
In short:
We define the entropy of the class distribution T as:
- Entropy(T) = − Σ_t P(t) · log2 P(t)
where the sum runs over all possible values t of T. We define the conditional entropy, given a split on attribute V, as:
- Entropy(T | V) = Σ_v P(v) · Entropy(T | V = v)
We define the information gain by:
- Gain(T,V) = Entropy(T) − Entropy(T | V)
Maximizing Gain favours attributes with many possible values, so C4.5 instead maximizes the gain ratio: the gain divided by the entropy of the split itself,
- GainRatio(T,V) = Gain(T,V) / SplitInfo(V), where SplitInfo(V) = − Σ_v P(v) · log2 P(v)
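As a concrete illustration of these formulas (the function name and data layout are assumptions for this sketch, not part of C4.5's code), the gain ratio can be computed from a table of per-value class counts roughly as follows. It assumes at least one case in the table.

#include <math.h>

/* Entropy of a probability distribution p[0..n). */
static double H(const double *p, int n)
{
    double h = 0.0;
    for (int i = 0; i < n; i++)
        if (p[i] > 0.0) h -= p[i] * log2(p[i]);
    return h;
}

/* count[v][c] = number of cases with attribute value v and class c. */
double GainRatio(int nvals, int nclasses, const double count[nvals][nclasses])
{
    double total = 0.0, condH = 0.0;
    double classP[nclasses], splitP[nvals];

    for (int c = 0; c < nclasses; c++) classP[c] = 0.0;
    for (int v = 0; v < nvals; v++)    splitP[v] = 0.0;

    for (int v = 0; v < nvals; v++)
        for (int c = 0; c < nclasses; c++) {
            total     += count[v][c];
            classP[c] += count[v][c];
            splitP[v] += count[v][c];
        }

    for (int c = 0; c < nclasses; c++) classP[c] /= total;         /* P(t) */

    for (int v = 0; v < nvals; v++) {
        if (splitP[v] > 0.0) {
            double branchP[nclasses];                               /* P(t | v) */
            for (int c = 0; c < nclasses; c++)
                branchP[c] = count[v][c] / splitP[v];
            condH += (splitP[v] / total) * H(branchP, nclasses);    /* Entropy(T | V) */
        }
        splitP[v] /= total;                                         /* P(v) */
    }

    double gain      = H(classP, nclasses) - condH;                 /* Gain(T,V)    */
    double splitInfo = H(splitP, nvals);                            /* SplitInfo(V) */
    return splitInfo > 0.0 ? gain / splitInfo : 0.0;
}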
References
- Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
External links
- Original implementation on Ross Quinlan's homepage: http://www.rulequest.com/Personal/
- Source: http://www2.cs.uregina.ca