Sunday, October 26, 2014

Decision Trees and Random Forests in Python

The decision tree is a simple, easy-to-use non-parametric classifier. It requires no prior assumptions about the data, is fast to compute, produces results that are easy to interpret, and is robust, being insensitive to noisy and missing data. Below we demonstrate decision tree classification on the Titanic dataset, with survived as the target variable.

Step 1: Read the data

In [2]:
%pylab inline
import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()
#df.info()
Populating the interactive namespace from numpy and matplotlib

Out[2]:
survived pclass name sex age sibsp parch ticket fare cabin embarked
0 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

Step 2: Data preparation

  • Keep only three predictor variables
  • Fill in the missing values of age
  • Convert pclass into three dummy variables
  • Convert sex into a 0-1 variable
In [3]:
subdf = df[['pclass','sex','age']]
y = df.survived
# sklearn's Imputer would also work
age = subdf['age'].fillna(value=subdf.age.mean())
# sklearn's OneHotEncoder would also work
pclass = pd.get_dummies(subdf['pclass'],prefix='pclass')
sex = (subdf['sex']=='male').astype('int')
X = pd.concat([pclass,age,sex],axis=1)
X.head()
Out[3]:
pclass_1 pclass_2 pclass_3 age sex
0 0 0 1 22 1
1 1 0 0 38 0
2 0 0 1 26 0
3 1 0 0 35 0
4 0 0 1 35 1
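As the comments in the cell above note, sklearn's own transformers can do the same preprocessing. A minimal sketch using the sklearn API of this era (Imputer and OneHotEncoder from sklearn.preprocessing); the variable names are illustrative:

from sklearn.preprocessing import Imputer, OneHotEncoder
# mean-impute the age column (Imputer expects a 2-D array)
age_filled = Imputer(strategy='mean').fit_transform(subdf[['age']])
# expand the integer pclass column into three dummy columns
pclass_dummies = OneHotEncoder(sparse=False).fit_transform(subdf[['pclass']])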

Step 3: Modeling

  • Split the data into train and test sets
In [4]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
  • Fit a decision tree and check its performance on the test set
In [6]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,min_samples_leaf=5)
clf = clf.fit(X_train,y_train)
print "准确率为:{:.2f}".format(clf.score(X_test,y_test))
准确率为:0.83

  • Inspect the variable importances
In [7]:
clf.feature_importances_
Out[7]:
array([ 0.08398076,  0.        ,  0.23320717,  0.10534824,  0.57746383])
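The array follows the column order of X. To make the numbers easier to read, pair each score with its column name (a small illustrative snippet):

# pair column names with importance scores, largest first
sorted(zip(X.columns, clf.feature_importances_), key=lambda t: -t[1])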
  • Evaluate the model with additional metrics
In [17]:
from sklearn import metrics
def measure_performance(X,y,clf, show_accuracy=True, 
                        show_classification_report=True, 
                        show_confusion_matrix=True):
    y_pred=clf.predict(X)   
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y,y_pred)),"\n"

    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y,y_pred),"\n"
        
    if show_confusion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y,y_pred),"\n"
        
measure_performance(X_test,y_test,clf, show_classification_report=True, show_confusion_matrix=True)
Accuracy:0.834 

Classification report
             precision    recall  f1-score   support

          0       0.85      0.88      0.86       134
          1       0.81      0.76      0.79        89

avg / total       0.83      0.83      0.83       223


Confusion matrix
[[118  16]
 [ 21  68]] 


  • Evaluate the model with cross-validation
In [8]:
from sklearn import cross_validation
scores1 = cross_validation.cross_val_score(clf, X, y, cv=10)
scores1
Out[8]:
array([ 0.82222222,  0.82222222,  0.7752809 ,  0.87640449,  0.82022472,
        0.76404494,  0.7752809 ,  0.76404494,  0.83146067,  0.78409091])

Step 4: Plot the decision tree

  • Requires GraphViz to be installed
In [9]:
import pydot,StringIO
dot_data = StringIO.StringIO() 
In [14]:
tree.export_graphviz(clf, out_file=dot_data, feature_names=['1st_class','2nd_class','3rd_class','age','sex'])
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
#graph.write_png('titanic.png') 
#from IPython.core.display import Image 
#Image(filename='titanic.png')

Step 5: Compare with a random forest

In [11]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier(n_estimators=1000,random_state=33)
clf2 = clf2.fit(X_train,y_train)
scores2 = cross_validation.cross_val_score(clf2,X, y, cv=10)
clf2.feature_importances_
Out[11]:
array([ 0.05526809,  0.02266161,  0.08156048,  0.46552672,  0.37498309])
In [12]:
scores2.mean(), scores1.mean()
Out[12]:
(0.81262938372488946, 0.80352769265690616)

Saturday, October 25, 2014

Naive Bayes Text Classification in Python: Identifying Spam SMS

1. Read the data; the type column is the message category and text is the message content

In [17]:
%pylab inline
import pandas as pd
import numpy as np
df = pd.read_csv('sms_spam.csv')
df.head()
Populating the interactive namespace from numpy and matplotlib

Out[17]:
type text
0 ham Hope you are having a good week. Just checking in
1 ham K..give back my thanks.
2 ham Am also doing in cbe only. But have to pay.
3 spam complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 spam okmail: Dear Dave this is your final notice to...

2. Use the sklearn package to convert the text into structured data

CountVectorizer converts the documents into a document-term frequency matrix. Its most important parameters are:

  • ngram_range: the range of n-gram sizes; set this if you want to capture phrases as well as single words
  • stop_words: the stop-word list
  • token_pattern: the regular expression used for tokenization; the default matches words of two or more alphanumeric characters
  • max_df: upper bound on document frequency; terms above it are dropped, which filters out overly common words
  • min_df: lower bound on document frequency; terms below it are dropped
  • max_features: keep only the given number of highest-frequency terms as features
In [18]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1),stop_words='english',lowercase=True,min_df=1)
X = vectorizer.fit_transform(df.text) 
y = (df.type == 'spam').values.astype(int)
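To get a feel for these parameters, here is a toy example on a made-up three-document corpus (the texts are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
toy = ["free prize now", "call now for a free prize", "see you at lunch"]
cv = CountVectorizer(ngram_range=(1,2), min_df=2)  # keep unigrams/bigrams seen in at least 2 documents
cv.fit(toy)
print cv.get_feature_names()   # ['free', 'free prize', 'now', 'prize']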

TfidfVectorizer instead computes tf-idf values, rather than just a document-term frequency matrix

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,1),stop_words='english',lowercase=True,min_df=1)
X = vectorizer.fit_transform(df.text)

3. Split the data into train and test sets

In [20]:
from sklearn.cross_validation import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y)

4. Train a naive Bayes classifier

  • The key parameter alpha sets the additive (Laplace) smoothing coefficient
In [21]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha =1).fit(xtrain, ytrain)

5. Check the classification performance

In [22]:
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print "训练集准确率:  {:.2f}".format(training_accuracy)
print "检验集准确率:  {:.2f}".format(test_accuracy)  
训练集准确率:  0.98
检验集准确率:  0.97
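Spam corpora are usually imbalanced, so accuracy alone can flatter the model; a quick look at the confusion matrix shows how many legitimate messages get flagged (a sketch reusing sklearn.metrics):

from sklearn.metrics import confusion_matrix
# rows are the true class (0=ham, 1=spam), columns the predicted class
print confusion_matrix(ytest, clf.predict(xtest))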

6. Use cross-validation to choose the best parameter; the optimum turns out to be 0.2

In [23]:
from sklearn import grid_search
nb = MultinomialNB()
parameters = {'alpha':np.linspace(0,10,101)}
clf = grid_search.GridSearchCV(nb, parameters)
clf.fit(X, y)
Out[23]:
GridSearchCV(cv=None,
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'alpha': array([  0. ,   0.1, ...,   9.9,  10. ])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
In [100]:
print "最佳参数:  {:.3f}".format(clf.best_params_['alpha'] )
print "最佳准确率:  {:.3f}".format(clf.best_score_)  
最佳参数:  0.200
最佳准确率:  0.984
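Because GridSearchCV refits the best model on the full data by default (refit=True), the tuned classifier is ready to use; for example, scoring a new, made-up message (the text below is hypothetical):

best_nb = clf.best_estimator_
msg = ["Free entry! Claim your prize now"]   # hypothetical new message
print best_nb.predict(vectorizer.transform(msg))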

In [94]:
import matplotlib.pylab as plt
accuracy = [t[1] for t in clf.grid_scores_]
para = [t[0]['alpha'] for t in clf.grid_scores_]
plt.plot(para,accuracy,lw=3)
Out[94]:
(line plot of accuracy against alpha)

Wednesday, October 22, 2014

ipython notebook server setup and database connections

ipython notebook is a great tool. Another of its strengths is that you can use a local browser to connect to computing resources on a remote server, much like the rstudio server product from RStudio. The configuration steps are recorded below:

Step 1: Install the ipython stack on the server; the anaconda distribution is recommended as the most convenient option. After installation, add its path to the PATH environment variable.

Step 2: Set up the notebook server. Generate a hashed password with the following commands in ipython:
from IPython.lib import passwd
passwd()
Make a note of the generated string.

Step 3: Create an ipython profile, named for example myserver:
ipython profile create myserver
vim ~/.ipython/profile_myserver/ipython_notebook_config.py
Edit the file and add the following entries:
c = get_config()
c.IPKernelApp.pylab = 'inline' # enable inline plotting
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:yourhashedpassword'  # paste the hashed password from step 2
c.NotebookApp.port = 9999   # pick a port number

Step 4: Start the server
ipython notebook --profile=myserver

Finally, log in from your local browser, enter the password, and you are inside ipython notebook.
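To keep the server alive after you close the shell session, one common approach (a convenience not part of the original steps) is to launch it with nohup:
nohup ipython notebook --profile=myserver > notebook.log 2>&1 &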

Because my company's database is Oracle, the examples below do not cover other databases, but the approach is similar.

The most basic way to connect is with the cx_Oracle package: install it with pip, import it, and call it directly; the result comes back as a list.
import cx_Oracle
conn = cx_Oracle.connect('user','password','ip/dbname')
cr = conn.cursor()
cr.execute('select * from table')
result = cr.fetchall()
cr.close()
conn.close()
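If you want the raw cx_Oracle result as a DataFrame without going through sqlalchemy, the DB-API cursor's description attribute carries the column names; a small sketch (run it before closing the cursor):

import pandas as pd
# build a DataFrame from the fetched rows plus the cursor metadata
colnames = [d[0] for d in cr.description]
df = pd.DataFrame(result, columns=colnames)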

For data analysis, the more convenient route is the sql interface wrapped by pandas, which returns a DataFrame directly. Using it has a few prerequisites:
1 Install the Oracle thin client (instant client)
2 Set the environment variables, e.g. ORACLE_HOME and LD_LIBRARY_PATH
3 Configure tnsnames.ora
4 Install the sqlalchemy package

Once everything is set up, use it as in this example:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('oracle://user:password@service_name')
df = pd.read_sql_query('select * from table', engine)

Saturday, October 18, 2014

Linear Regression in Python

The simplest and most classic statistical model is linear regression; it supports the full standard modeling workflow and has a very wide range of applications. In R you run regressions with the lm function, while in python several packages provide this functionality. Below we first introduce the regression function in the sklearn package, then the one in the statsmodels package. The former suits machine-learning prediction, where you don't need to inspect many intermediate results; the latter suits analysis, when you need to examine intermediate results such as coefficients, residuals, and goodness of fit.

  • Step 1: Load the packages
In [228]:
%pylab inline
import pandas as pd
import matplotlib.pylab as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import statsmodels.formula.api as sm
Populating the interactive namespace from numpy and matplotlib

  • Step 2: Read the data and plot it
In [222]:
df = pd.read_csv('iris.csv')
lmdf = df[['Petal_Length','Petal_Width']]
lmdf.plot(kind='scatter',x=0,y=1)
Out[222]:
<matplotlib.axes.AxesSubplot at 0x119b1f110>
  • Step 3: Regression with the sklearn functions
In [230]:
# create the regression object
linear_fit = LinearRegression()
X=lmdf[['Petal_Length']]
y=lmdf['Petal_Width']
linear_fit.fit(X,y)
linear_fit.intercept_,linear_fit.coef_  # intercept and coefficient
Out[230]:
(-0.36651404521672837, array([ 0.41641913]))
In [226]:
linear_fit.score(X,y) # R2
Out[226]:
0.92690122792200369
In [214]:
yhat = linear_fit.predict(X=lmdf[['Petal_Length']])
mean_squared_error(lmdf['Petal_Width'],yhat)  #MSE
Out[214]:
0.04228994631948424
  • Step 4: Check the regression fit
In [196]:
plt.scatter(lmdf['Petal_Length'],lmdf['Petal_Width'])
plt.plot(lmdf['Petal_Length'],yhat)
Out[196]:
[<matplotlib.lines.Line2D at 0x11991a790>]
  • With the statsmodels package, the same process and results can be produced using an R-style formula
In [229]:
linear_model = sm.ols(formula='Petal_Width ~ Petal_Length', data=lmdf)
results = linear_model.fit()
results.summary()
Out[229]:
OLS Regression Results
Dep. Variable:     Petal_Width       R-squared:           0.927
Model:             OLS               Adj. R-squared:      0.926
Method:            Least Squares     F-statistic:         1877.
Date:              Sat, 18 Oct 2014  Prob (F-statistic):  5.78e-86
Time:              18:00:43          Log-Likelihood:      24.400
No. Observations:  150               AIC:                 -44.80
Df Residuals:      148               BIC:                 -38.78
Df Model:          1

                coef    std err        t    P>|t|    [95.0% Conf. Int.]
Intercept     -0.3665     0.040   -9.188    0.000      -0.445   -0.288
Petal_Length   0.4164     0.010   43.320    0.000       0.397    0.435

Omnibus:        5.498    Durbin-Watson:     1.461
Prob(Omnibus):  0.064    Jarque-Bera (JB):  5.217
Skew:           0.353    Prob(JB):          0.0736
Kurtosis:       3.579    Cond. No.          10.3
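Besides summary(), the fitted results object exposes each piece as an attribute, which is what makes statsmodels convenient for analysis; a short sketch using the standard statsmodels results API:

results.params        # coefficients: Intercept and Petal_Length
results.rsquared      # R-squared
results.resid         # residuals as a pandas Series
results.conf_int()    # 95% confidence intervals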