Python的决策树和随机森林
决策树模型是一种简单易用的非参数分类器。它不需要对数据有任何的先验假设,计算速度较快,结果容易解释,而且稳健性强,对噪声数据和缺失数据不敏感。下面示范用titanic中的数据集为做决策树分类,目标变量为survive。
第一步:读取数据
In [2]:
%pylab inline
import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()
#df.info()
Out[2]:
第二步:数据整理
- 只取出三个自变量
- 将将age缺失值进行补全
- 将pclass变量转为三个哑变量
- 将sex转为0-1变量
In [3]:
subdf = df[['pclass','sex','age']]
y = df.survived
# sklearn中的Imputer也可以
age = subdf['age'].fillna(value=subdf.age.mean())
# sklearn OneHotEncoder也可以
pclass = pd.get_dummies(subdf['pclass'],prefix='pclass')
sex = (subdf['sex']=='male').astype('int')
X = pd.concat([pclass,age,sex],axis=1)
X.head()
Out[3]:
第三步:建模
- 数据切分为train和test
In [4]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
- 使用决策树观察在检验集表现
In [6]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,min_samples_leaf=5)
clf = clf.fit(X_train,y_train)
print "准确率为:{:.2f}".format(clf.score(X_test,y_test))
- 观察各变量的重要性
In [7]:
clf.feature_importances_
Out[7]:
- 使用更多指标来评估模型
In [17]:
from sklearn import metrics
def measure_performance(X,y,clf, show_accuracy=True,
show_classification_report=True,
show_confusion_matrix=True):
y_pred=clf.predict(X)
if show_accuracy:
print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y,y_pred)),"\n"
if show_classification_report:
print "Classification report"
print metrics.classification_report(y,y_pred),"\n"
if show_confusion_matrix:
print "Confusion matrix"
print metrics.confusion_matrix(y,y_pred),"\n"
measure_performance(X_test,y_test,clf, show_classification_report=True, show_confusion_matrix=True)
- 使用交叉验证来评估模型
In [8]:
from sklearn import cross_validation
scores1 = cross_validation.cross_val_score(clf, X, y, cv=10)
scores1
Out[8]:
第三步:决策树画图
- 需要安装GraphViz'
In [9]:
import pydot,StringIO
dot_data = StringIO.StringIO()
In [14]:
tree.export_graphviz(clf, out_file=dot_data, feature_names=['age','sex','1st_class','2nd_class','3rd_class'])
dot_data.getvalue()
pydot.graph_from_dot_data(dot_data.getvalue())
graph = pydot.graph_from_dot_data(dot_data.getvalue())
#graph.write_png('titanic.png')
#from IPython.core.display import Image
#Image(filename='titanic.png')
第四步:使用随机森林进行比较
In [11]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier(n_estimators=1000,random_state=33)
clf2 = clf2.fit(X_train,y_train)
scores2 = cross_validation.cross_val_score(clf2,X, y, cv=10)
clf2.feature_importances_
Out[11]:
In [12]:
scores2.mean(), scores1.mean()
Out[12]: