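Spam SMS Detection with Naive Bayes Text Classification in Python
1. Load the data. The type column holds the message class and the text column holds the message content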
In [17]:
%pylab inline
import pandas as pd
import numpy as np
df = pd.read_csv('sms_spam.csv')
df.head()
Out[17]:
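2. Use sklearn to convert the text into structured data, then split the matrix into a training set and a test set
CountVectorizer converts documents into a document-term count matrix. Its most important parameters are:
- ngram_range: the range of n-gram lengths to extract; set it if you need to recognize phrases
- stop_words: the stop word list
- token_pattern: the regular expression that defines a token; the default matches runs of two or more alphanumeric characters
- max_df: upper bound on document frequency; terms above it are not used as features, which filters out very common words
- min_df: lower bound on document frequency; terms below it are not used as features
- max_features: keep only the top max_features terms, ranked by term frequency across the corpus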
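The effect of ngram_range and min_df is easier to see on a tiny corpus. The following is a minimal sketch; the four messages are hypothetical and are not taken from sms_spam.csv.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from sms_spam.csv
docs = [
    "free prize call now",
    "call me now",
    "free prize waiting",
    "see you at lunch",
]

# Unigrams only; keep terms that appear in at least 2 documents
unigram = CountVectorizer(ngram_range=(1, 1), min_df=2)
unigram.fit(docs)
unigram_terms = sorted(unigram.vocabulary_)

# Unigrams and bigrams, so that a phrase such as "free prize" can become a feature
bigram = CountVectorizer(ngram_range=(1, 2), min_df=2)
bigram.fit(docs)
bigram_terms = sorted(bigram.vocabulary_)

print(unigram_terms)
print(bigram_terms)
```

With min_df=2, words that occur in a single message are dropped, and the only bigram that survives is the one shared by two messages.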
In [18]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1),stop_words='english',lowercase=True,min_df=1)
X = vectorizer.fit_transform(df.text)
y = (df.type == 'spam').values.astype(int)
TfidfVectorizer computes tf-idf weights instead of raw document-term counts
In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,1),stop_words='english',lowercase=True,min_df=1)
X = vectorizer.fit_transform(df.text)
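To illustrate how tf-idf differs from raw counts, here is a small sketch on a hypothetical three-message corpus (not the real data). It relies on the TfidfVectorizer defaults, which smooth the idf and L2-normalize each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from sms_spam.csv
docs = ["call now", "call me later", "call free prize now"]

# Raw counts: every occurrence of a term counts the same
counts = CountVectorizer().fit_transform(docs).toarray()

# tf-idf: terms that occur in many documents are down-weighted
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# With the default norm='l2', each document row has unit length
row_norms = np.linalg.norm(tfidf, axis=1)

# 'call' appears in every document, so it gets the smallest idf weight
idf_call = tfidf_vec.idf_[tfidf_vec.vocabulary_['call']]
idf_now = tfidf_vec.idf_[tfidf_vec.vocabulary_['now']]

print(counts)
print(np.round(tfidf, 3))
print(idf_call, idf_now)
```

A word present in every message, like "call" here, carries little information about the class, which is why its weight is the lowest.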
3. Split the data into train and test sets
In [20]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y)
4. Train a naive Bayes classifier
- The key parameter alpha sets the smoothing coefficient
In [21]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha =1).fit(xtrain, ytrain)
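What alpha does can be checked by hand. The sketch below uses a hypothetical 3x3 count matrix (not the real data) and compares the additive-smoothing estimate (count + alpha) / (total + alpha * n_terms) with the probabilities that MultinomialNB stores in feature_log_prob_.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts: rows are documents, columns are 3 vocabulary terms
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3]])
y_toy = np.array([1, 1, 0])  # 1 = spam, 0 = ham
alpha = 1.0

# Manual estimate of P(term | spam) with additive smoothing
spam_counts = X_toy[y_toy == 1].sum(axis=0)
manual = (spam_counts + alpha) / (spam_counts.sum() + alpha * X_toy.shape[1])

# The same quantity as estimated by the fitted model
model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
spam_row = list(model.classes_).index(1)
fitted = np.exp(model.feature_log_prob_[spam_row])

print(manual)
print(fitted)
```

The third term never occurs in a spam document, yet with alpha=1 it still gets a non-zero probability, so a single unseen word cannot force the class probability to zero.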
5. Check the classification performance
In [22]:
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print("Training accuracy: {:.2f}".format(training_accuracy))
print("Test accuracy: {:.2f}".format(test_accuracy))
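Accuracy alone can look good when one class is much rarer than the other, so it may be worth checking a confusion matrix as well. The sketch below uses hypothetical labels and predictions, not results from this notebook.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 8 ham (0) and 2 spam (1); one spam message is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # share of predicted spam that is spam
recall = recall_score(y_true, y_pred)        # share of real spam that was caught

print(acc, precision, recall)
print(cm)
```

In this toy case the accuracy is 0.9 even though half of the spam goes undetected. With the variables above, the same functions could be called as confusion_matrix(ytest, clf.predict(xtest)).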
6. Use cross-validation to select the best parameter; the best alpha found is 0.2
In [23]:
from sklearn.model_selection import GridSearchCV
nb = MultinomialNB()
# The grid starts at alpha=0 (no smoothing), which may trigger warnings in scikit-learn
parameters = {'alpha': np.linspace(0, 10, 101)}
clf = GridSearchCV(nb, parameters)
clf.fit(X, y)
Out[23]:
In [100]:
print("Best parameter: {:.3f}".format(clf.best_params_['alpha']))
print("Best accuracy: {:.3f}".format(clf.best_score_))
In [62]:
# grid_scores_ was removed from scikit-learn; cv_results_ holds the same information
accuracy = clf.cv_results_['mean_test_score']
para = [p['alpha'] for p in clf.cv_results_['params']]
In [94]:
import matplotlib.pyplot as plt
# Mean cross-validated accuracy for each alpha in the grid
accuracy = clf.cv_results_['mean_test_score']
para = [p['alpha'] for p in clf.cv_results_['params']]
plt.plot(para, accuracy, lw=3)
Out[94]:
Where can I find the sms_spam.csv dataset? Thanks