Decision and Regression Trees

(A Brief Documentation of the Programs dti / dtp / dtx / dtr)

Introduction
Determining Attribute Domains
Inducing a Decision Tree
Pruning a Decision Tree
Executing a Decision Tree
Computing a Confusion Matrix
Extracting Rules from a Decision Tree
Other Decision Tree Examples
License
Download
Contact

Introduction

I am sorry that there is no detailed documentation yet. Below you can find a brief explanation of how to grow a decision tree with the program dti, how to prune a decision tree with the program dtp, how to execute a decision tree with the program dtx, and how to extract rules from a decision tree with the program dtr. For a list of options, call the programs without any arguments.

Enjoy,
Christian Borgelt

As a simple example for the explanations below I use the dataset in the file dtree/ex/drug.tab, which lists 12 records of patient data (sex, age, and blood pressure) together with an effective drug (effective w.r.t. some unspecified disease). The contents of this file is:

   Sex    Age Blood_pressure Drug
   male   20  normal         A
   female 73  normal         B
   female 37  high           A
   male   33  low            B
   female 48  high           A
   male   29  normal         A
   female 52  normal         B
   male   42  low            B
   male   61  normal         B
   female 30  normal         A
   female 26  low            B
   male   54  high           A

back to the top

Determining Attribute Domains

To induce a decision tree for the effective drug, one first has to determine the domains of the table columns using the program dom (to be found in the table package, see below):

  dom -a drug.tab drug.dom

The program dom assumes that the first line of the table file contains the column names. (This is the case for the example file drug.tab.) If you have a table file without column names, you can let the program read the column names from another file (using the -h option) or you can let the program generate default names (using the -d option), which are simply the column numbers. The -a option tells the program to determine automatically the column data types. Thus the values of the Age column are automatically recognized as integer values.

After dom has finished, the contents of the file drug.dom should look like this:

  dom(Sex) = { male, female };
  dom(Age) = ZZ;
  dom(Blood_pressure) = { normal, high, low };
  dom(Drug) = { A, B };

The special domain ZZ represents the set of integer numbers, the special domain IR (not used here) the set of real numbers. (The double Z and the I in front of the R are intended to mimic the bold face or double stroke font used in mathematics to write the set of integer or the set of real numbers. All programs that need to read a domain description also recognize a single Z or a single R.)

back to the top

Inducing a Decision Tree

To induce a decision tree using the dti program (dti is simply an abbreviation of Decision Tree Induction), type

  dti -a drug.dom drug.tab drug.dt

You need not tell the program dti that the Drug column contains the class, since by default it uses the last column as the class column (the Drug column is the last column in the file drug.tab). If a different column contains the class, you can specify its name on the command line using the -c option, e.g. -c Drug.

At first glance it seems to be superfluous to provide the dti program with a domain description, since it is also given the table file and thus can determine the domains itself. But without a domain description, the dti program would be forced to use all columns in the table file and to use them with the automatically determined data types. But occasions may arise in which you want to induce a decision tree from a subset of the columns or in which the numbers in a column are actually coded symbolic values. In such a case the domain file provides a way to tell the dti program about the columns to use and their data types. To ignore a column, simply remove the corresponding domain definition from the domain description file (or comment it out --- C-style (/* ... */) and C++-style (// ...) comments are supported). To change the data type of a column, simply change the domain definition.

By default the program dti uses information gain ratio as the attribute selection measure. Other measures can be selected via the -e option. Call dti with option -! for a list of available attribute selection measures.

With the above command the induced decision tree is written to the file drug.dt. The contents of this file should look like this:

  dtree(Drug) =
  { (Blood_pressure)
    normal:{ (Age|41)
             <:{ A: 3 },
             >:{ B: 3 }},
    high  :{ A: 3 },
    low   :{ B: 3 }};

Since the -a option was given, the colons after the values of an attribute (here, for example, the values of the attribute Blood_pressure) are aligned. This makes a decision tree easier to read, but may result in larger than necessary output files.

back to the top

Pruning a Decision Tree

Although it is not necessary for our simple example, the induced decision tree can be pruned, i.e., simplified by removing some decisions. This is done by invoking the program dtp (dtp is simply an abbreviation for Decision Tree Pruning):

  dtp -a drug.dt drug_p.dt

The table the decision tree was induced from can be given as a third argument to the dtp program. In this case an additional way of pruning (replacing an inner node (an attribute test) by its largest child) is enabled.

By default dtp uses confidence level pruning with a confidence level of 0.5 as the pruning method. The confidence level can be changed via the -p option (pruning parameter), the pruning method via the -m option. Call dtp without arguments for a list of available pruning methods.

back to the top

Executing a Decision Tree

An induced decision tree can be used to classify new data using the program dtx (dtx is simply an abbreviation for Decision Tree Execution):

  dtx -a drug.dt drug.tab drug.cls

drug.tab is the table file (since we do not have special test data, we simply use the training data), drug.cls is the output file. After dtx has finished, drug.cls contains (in addition to the columns appearing in the decision tree, and, for preclassified data, the class column) a new column dt, which contains the class that is predicted by the decision tree. You can give this new column a different name with the -p option, e.g. -p decision_tree_class.

If the table contains preclassified data and the name of the column containing the preclassification is the same as for the training data, the error rate of the decision tree is determined and printed to the terminal.

The contents of the file drug.cls should look like this:

  Sex    Age Blood_pressure Drug dt
  male   20  normal         A    A
  female 73  normal         B    B
  female 37  high           A    A
  male   33  low            B    B
  female 48  high           A    A
  male   29  normal         A    A
  female 52  normal         B    B
  male   42  low            B    B
  male   61  normal         B    B
  female 30  normal         A    A
  female 26  low            B    B
  male   54  high           A    A

That is, the classification is perfect, which is not surprising for such a simple example. The columns are neatly aligned because of the -a option. Without it, there would only be a single space between two column values.

back to the top

Computing a Confusion Matrix

The classification quality can be inspected in more detail with the program xmat (determine a confusion matrix, to be found in the table package, see below):

  xmat drug.cls

This program reads the output of the program dtx and computes a confusion matrix from two columns of this file. It uses the last two columns by default (the last column for the x- and the semi-last for the y-direction), which is fine for our example. Other columns can be selected via the options -x and -y followed by the name of the column that is to be used for the x- or y-direction of the confusion matrix. The output of the program xmat, which by default is written to the terminal, should read like this:

  confusion matrix for Drug vs. dt:
   no | value  |      1      2 | errors
  ----+--------+---------------+-------
    1 | A      |      6      0 |      0
    2 | B      |      0      6 |      0
  ----+--------+---------------+-------
      | errors |      0      0 |      0

In this matrix the x-direction corresponds to the column dt and the y-direction to the column Drug. Since in our simple example the classification is perfect, only the fields in the diagonal differ from zero. If the classification is not perfect, the other fields show what errors are made by the decision tree classifier.

back to the top

Extracting Rules from a Decision Tree

If you like rules better than trees, you can extract rules from a learned decision tree with the program dtr (dtr is simply an abbreviation for Decision Tree Rule Extraction).

  dtr -sc drug.dt drug.rs

drug.dt is the decision tree file, drug.rs the file to which the rules shall be written. The option -s instructs the program to print the support of each rule, i.e., the number of cases in the training data set the rule applies to, and the option -c instructs it to print the confidence of each rule, i.e., the percentage of cases in which a rule is correct relative to the number of cases in which it is applicable. The file drug.rs should contain the following set of rules:

  Drug = A <- Blood_pressure = high [3/100.0%];
  Drug = B <- Blood_pressure = low [3/100.0%];
  Drug = A <- Age < 41 & Blood_pressure = normal [3/100.0%];
  Drug = B <- Age > 41 & Blood_pressure = normal [3/100.0%];

The numbers in brackets are the number of cases in the training data set the rule applies to (first number) and the confidence (second number). Note that there is one rule for each leaf of the decision tree. The program tries to simplify the rules by combining conditions on numeric attributes if possible, but it will not simplify the rule set by merging rules.

back to the top

Other Decision Tree Examples

The decision trees in the directory dtree/ex (files with extensions .dt and .pdt) were created with the following commands:

  dti -a drug.dom drug.tab drug.dt
  dti iris.dom iris.tab iris.dt
  dti -s monk1.dom monk1.tab monk1.dt
  dtp monk1.dt monk1.pdt monk1.tab
  dti vote.dom vote.tab vote.dt
  dtp vote.dt vote.pdt
  dti wine.dom wine.tab wine.dt

back to the top

License

(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory dtree/doc in the source package of the program, see also opensource.org and wikipedia.org)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

back to the top

Download

Download page with most recent version.

back to the top

Contact

E-mail:		christian@borgelt.net
Website:		www.borgelt.net

back to the top

(A Brief Documentation of the Programs dti / dtp / dtx / dtr)

Contents