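tm is an R package that provides a comprehensive framework for text mining. Load the package before use; the vignette command opens the accompanying documentation.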
library(tm)
vignette("tm")
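First, read in the text. This walkthrough uses the 20 XML documents that ship with tm, stored in the library\tm\texts\crude folder. The Corpus command reads them and builds a corpus object.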
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- Corpus(DirSource(reut21578), readerControl = list(reader = readReut21578XML))
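Next, use tm_map to preprocess the corpus: convert the documents to plain text, strip extra whitespace, convert to lower case, remove common stop words, and stem words so that different forms of the same word are merged.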
reuters <- tm_map(reuters, as.PlainTextDocument)
reuters <- tm_map(reuters, stripWhitespace)
reuters <- tm_map(reuters, tolower)
reuters <- tm_map(reuters, removeWords, stopwords("english"))
reuters <- tm_map(reuters, stemDocument)
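To make the cleaning steps more concrete, here is a minimal base-R sketch of what whitespace stripping, lower-casing, and stop-word removal do to a single string. It does not use tm; the sample sentence and the two-word stop list are hypothetical stand-ins for the real corpus and for stopwords("english").

```r
# Base-R illustration only; tm applies the same ideas to every document in a corpus.
text <- "The  price of   Crude oil   fell"   # hypothetical sample text
stop_words <- c("the", "of")                 # tiny hypothetical stop-word list

clean <- tolower(text)                       # convert to lower case
clean <- gsub("[[:space:]]+", " ", clean)    # collapse runs of whitespace

tokens <- strsplit(clean, " ", fixed = TRUE)[[1]]
tokens <- tokens[!tokens %in% stop_words]    # drop stop words

result <- paste(tokens, collapse = " ")
print(result)
```

Only the content-bearing words remain, which is the form the later counting step works on.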
Use DocumentTermMatrix to tokenize the processed corpus and build a document-term matrix of term frequencies.
dtm <- DocumentTermMatrix(reuters)
Part of the matrix can be viewed with inspect:
inspect(dtm[1:5, 100:105])
Docs abdul-aziz ability able abroad, abu accept
127 0 0 0 0 0 0
144 0 2 0 0 0 0
191 0 0 0 0 0 0
194 0 0 0 0 0 0
211 0 0 0 0 0 0
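The structure of a document-term matrix can be sketched in base R with three toy documents. The documents and names below are hypothetical, and the code only mimics the idea (one row per document, one column per term, cells holding counts); it is not how tm builds the matrix internally.

```r
# Three hypothetical documents, already cleaned and lower-cased
docs <- c(d1 = "oil prices fell",
          d2 = "crude oil price rose",
          d3 = "oil oil oil")

tokens <- strsplit(docs, " ", fixed = TRUE)   # split each document into words
terms <- sort(unique(unlist(tokens)))         # vocabulary across all documents

# Count how often each vocabulary term occurs in each document
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                 integer(length(terms)))

dtm_toy <- t(counts)                          # rows = documents, columns = terms
colnames(dtm_toy) <- terms
print(dtm_toy)
```

In this toy matrix "price" and "prices" end up in separate columns; stemming is the step intended to merge such variants into one term.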
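To look at how often a specific set of terms occurs across the documents, build a dictionary by hand and pass it as a parameter when generating the matrix.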
(d <- Dictionary(c("prices", "crude", "oil")))
inspect(DocumentTermMatrix(reuters, list(dictionary = d)))
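The effect of a dictionary can be pictured as keeping only the listed columns of the count matrix. The sketch below uses a small hypothetical count matrix in base R; it illustrates the idea and is not tm's implementation.

```r
# Hypothetical term counts for two documents
counts <- matrix(c(1, 2, 0, 1,
                   0, 1, 3, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("crude", "oil", "opec", "prices")))

dict <- c("prices", "crude", "oil")           # terms of interest

# Start from all zeros so that a dictionary term missing from the
# documents would still get a column, then fill in the known counts
dict_counts <- matrix(0, nrow = nrow(counts), ncol = length(dict),
                      dimnames = list(rownames(counts), dict))
present <- intersect(dict, colnames(counts))
dict_counts[, present] <- counts[, present]
print(dict_counts)
```

The "opec" column is dropped because it is not in the dictionary, while the three listed terms keep their counts.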
Because the resulting matrix is sparse, reduce its dimensionality by dropping sparse terms, then convert it to a standard data frame.
dtm2 <- removeSparseTerms(dtm, sparse=0.95)
data <- as.data.frame(as.matrix(dtm2))
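The sparsity threshold can be illustrated in base R. The sketch assumes the rule is "keep a term only if the share of documents that lack it is below sparse", which is my reading of removeSparseTerms; the matrix and the threshold value here are hypothetical.

```r
# Hypothetical counts: 4 documents, 3 terms
m <- matrix(c(2, 0, 1,
              1, 0, 0,
              1, 1, 0,
              3, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("oil", "opec", "kuwait")))

sparse <- 0.5                                  # maximum tolerated share of empty cells

doc_freq <- colSums(m > 0)                     # number of documents containing each term
keep <- doc_freq > nrow(m) * (1 - sparse)      # term must appear in more than 2 of 4 documents
m_reduced <- m[, keep, drop = FALSE]
print(m_reduced)
```

Under the same assumed rule, sparse = 0.95 with 20 documents would keep terms that appear in at least two documents.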
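After that, any R tool can be used for the analysis. Below, hierarchical clustering is tried as an example.
First standardize the data, then compute the distance matrix, then run hierarchical clustering.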
data.scale <- scale(data)
d <- dist(data.scale, method = "euclidean")
fit <- hclust(d, method = "ward.D")  # "ward" was renamed to "ward.D" in R 3.1.0
Plot the dendrogram
plot(fit)
The dendrogram shows that, among the 20 documents, documents 489 and 502 form a cluster of their own and differ markedly from the rest.
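The scale, dist, hclust sequence can also be tried on a small hypothetical matrix where the expected grouping is obvious, and cutree can turn the tree into explicit cluster labels. The data below is made up for illustration, and "ward.D" assumes R 3.1.0 or later.

```r
# Hypothetical term counts: docA/docB are about oil, docC/docD about price and opec
toy <- matrix(c(5, 0, 1,
                4, 1, 0,
                0, 6, 5,
                1, 5, 6),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("docA", "docB", "docC", "docD"),
                              c("oil", "price", "opec")))

toy_scaled <- scale(toy)                          # centre and scale each column
toy_dist <- dist(toy_scaled, method = "euclidean")
toy_fit <- hclust(toy_dist, method = "ward.D")    # Ward linkage, as in the walkthrough

groups <- cutree(toy_fit, k = 2)                  # cut the tree into two clusters
print(groups)
```

One caveat when applying this to real term counts: scale returns NaN for any column with zero variance, so constant columns should be dropped before standardizing.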
A question: why is the tm_map(reuters, stemDocument) step needed? Comparing stemDocument(crude[[1]]) with crude[[1]], many words look wrong after stemming.
> data("crude")
> crude[[1]]
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> stemDocument(crude[[1]])
Diamond Shamrock Corp said that
effect today it had cut it contract price for crude oil by
1.50 dlrs a barrel.
The reduct bring it post price for West Texas
Intermedi to 16.00 dlrs a barrel, the copani said.
"The price reduct today was made in the light of falling
oil product price and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil compani that
hav cut it contract, or posted, price over the last two days
cit weak oil markets.
Reuter