数据挖掘 - 决策树归纳
决策树是一种包含根节点、分支和叶节点的结构。每个内部节点表示对属性的测试,每个分支表示测试的结果,每个叶节点保存一个类标签。树中最顶层的节点是根节点。
以下决策树针对概念 buy_computer,指示公司的客户是否可能购买计算机。每个内部节点代表对属性的测试。每个叶节点代表一个类。
拥有决策树的好处如下 -
- 它不需要任何领域知识。
- 这很容易理解。
- 决策树的学习和分类步骤简单且快速。
决策树归纳算法
1980 年,一位名叫 J. Ross Quinlan 的机器研究员开发了一种称为 ID3(迭代二分法)的决策树算法。后来,他提出了C4.5,它是ID3的继承者。ID3和C4.5采用贪心方法。在这个算法中,没有回溯;这些树是以自上而下递归分而治之的方式构建的。
Generating a decision tree form training tuples of data partition D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute selection method, a procedure to determine the
splitting criterion that best partitions that the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or splitting subset.
Output:
A Decision Tree
Method
create a node N;
if tuples in D are all of the same class, C then
return N as leaf node labeled with class C;
if attribute_list is empty then
return N as leaf node with labeled
with majority class in D;|| majority voting
apply attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
multiway splits allowed then // no restricted to binary trees
attribute_list = splitting attribute; // remove splitting attribute
for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate
decision tree(Dj, attribute list) to node N;
end for
return N;
树木修剪
执行树修剪是为了消除训练数据中由于噪声或异常值而导致的异常。修剪后的树木较小且不太复杂。
树木修剪方法
修剪树有两种方法 -
预修剪- 通过提前停止建造来修剪树。
后修剪- 这种方法从完全生长的树中删除子树。
成本复杂性
成本复杂性通过以下两个参数来衡量 -
- 树上的叶子数量,以及
- 树的错误率。