星期六, 十月 25, 2014

python贝叶斯文分类识别垃圾短信

Python贝叶斯文本分类识别垃圾短信

1、读取数据,type表示短信类别,text是短信内容

In [17]:
%pylab inline
import pandas as pd
import numpy as np
df = pd.read_csv('sms_spam.csv')
df.head()
Populating the interactive namespace from numpy and matplotlib

Out[17]:
type text
0 ham Hope you are having a good week. Just checking in
1 ham K..give back my thanks.
2 ham Am also doing in cbe only. But have to pay.
3 spam complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 spam okmail: Dear Dave this is your final notice to...

2、使用sklearn包转换文本为结构化数据,将矩阵分切为训练集和检验集

CountVectorizer负责将文档转为文档词频矩阵,重要的参数有如下几个:

  • ngram_range:ngrame频率范围,如果需要识别词组的话需要设置
  • stop_words:停词列表
  • token_pattern:分词的字符模式,默认空格
  • max_df:词频上限,超过该值的词项不作为特征,即过滤常用词
  • min_df:词频下限,低于该值的词项不作为特征
  • max_features:只选择词频较高的几个作为特征
In [18]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1),stop_words='english',lowercase=True,min_df=1)
X = vectorizer.fit_transform(df.text) 
y = (df.type == 'spam').values.astype(int)

TfidfVectorizer则可以计算tfidf值,而非仅仅文档词频矩阵

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,1),stop_words='english',lowercase=True,min_df=1)
X = vectorizer.fit_transform(df.text)

3、将数据切分为train和test

In [20]:
from sklearn.cross_validation import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y)

4、使用贝叶斯分类器进行训练

  • 重要的参数alpha用于设置平滑系数
In [21]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha =1).fit(xtrain, ytrain)

5、观察分类效果

In [22]:
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print "训练集准确率:  {:.2f}".format(training_accuracy)
print "检验集准确率:  {:.2f}".format(test_accuracy)  
训练集准确率:  0.98
检验集准确率:  0.97

6、使用CV选择最优参数,参数为0.2

In [23]:
from sklearn import svm, grid_search
nb = MultinomialNB()
parameters = {'alpha':np.linspace(0,10,101)}
clf = grid_search.GridSearchCV(nb, parameters)
clf.fit(X, y)
Out[23]:
GridSearchCV(cv=None,
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'alpha': array([  0. ,   0.1, ...,   9.9,  10. ])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
In [100]:
print "最佳参数:  {:.3f}".format(clf.best_params_['alpha'] )
print "最佳准确率:  {:.3f}".format(clf.best_score_)  
最佳参数:  0.200
最佳准确率:  0.984

In [62]:
accuracy = [t[1] for t in clf.grid_scores_]
para = [t[0]['alpha'] for t in clf.grid_scores_]
In [94]:
import matplotlib.pylab as plt
accuracy = [t[1] for t in clf.grid_scores_]
para = [t[0]['alpha'] for t in clf.grid_scores_]
plt.plot(para,accuracy,lw=3)
Out[94]:
[]

1 条评论: