Data Mining: Classification

Posted on 2008-07-29 13:05

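As computer hardware has developed, machines have become more powerful and cheaper, and enterprises can record more and more data. These factors have laid good groundwork for the spread of data mining. Data mining is an important technology for future information processing; it has already achieved major success and is fairly widely used.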
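Data mining covers many areas, and classification is one of them. What is classification?
Classification maps new data items to one of a given set of categories. For example, when we publish an article, it can be assigned to an article category automatically. The usual process is to derive classification rules from sample data with some classification algorithm; when new data arrives, it is assigned to a category according to those rules.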
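Classification is a very important task in data mining and has many uses. One is prediction, that is, inferring the trend of future data from historical samples; a well-known prediction example is soybean learning. Another is analyzing user behavior, often called audience analysis: this kind of classification tells us which group of users buys a given product, which is a great help to sales.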
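Classifiers can be built with statistical methods, machine learning methods, neural network methods, and so on. Common statistical methods include the kNN algorithm and instance-based learning. Machine learning methods include decision trees and rule induction; the audience analysis mentioned above can be implemented with a decision tree. The main neural network method is the BP (backpropagation) algorithm, which I am not very familiar with.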
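To make the kNN idea concrete, here is a minimal sketch of nearest-neighbour voting. It is only an illustration, not the code from this post: the `KnnSketch` class, the `Sample` type, the Euclidean distance and the numeric feature vectors are all assumptions of mine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of k-nearest-neighbour voting.
class KnnSketch {

    // A labelled training sample: a numeric feature vector plus its category.
    static final class Sample {
        final double[] features;
        final String label;

        Sample(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the majority label among the k training samples closest to the query.
    // On a tie, the label that reaches the top vote count first (nearest first) wins.
    static String classify(List<Sample> training, double[] query, int k) {
        List<Sample> sorted = new ArrayList<Sample>(training);
        sorted.sort(Comparator.comparingDouble((Sample s) -> distance(s.features, query)));

        Map<String, Integer> votes = new HashMap<String, Integer>();
        String best = null;
        int bestVotes = 0;
        for (Sample s : sorted.subList(0, Math.min(k, sorted.size()))) {
            int v = votes.merge(s.label, 1, Integer::sum);
            if (v > bestVotes) {
                bestVotes = v;
                best = s.label;
            }
        }
        return best;
    }
}
```

No rules are learned up front: all the work happens at query time, which is why kNN is usually grouped with instance-based methods.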
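Text classification means sorting texts into categories: different articles belong to different categories according to their content. Text classification depends on word segmentation. To classify a text, you first segment it into words, use the resulting term vector as the input to the computation, and then compare it against the vocabulary of the samples with some algorithm to arrive at a classification result. In this example I use the Paoding analyzer (庖丁分词器) to segment the text.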
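As a rough picture of what a "term vector" is, the sketch below turns a piece of text into a map from term to frequency. It is only an illustration under simplifying assumptions: the `TermVectorSketch` class is hypothetical, and splitting on non-letter characters stands in for a real segmenter. It will not segment Chinese text, which is exactly the job Paoding does in the real code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: text -> term-frequency vector.
class TermVectorSketch {

    // Lower-cases the text, splits on anything that is not a letter or digit
    // (a stand-in for a real word segmenter), and counts each token.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new LinkedHashMap<String, Integer>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }
}
```

For the input `"Hibernate reference: run the hibernate test code"` the resulting map counts `hibernate` twice and every other word once.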
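The following example matches term vectors using the arccosine, that is, the angle between two term vectors.
Step 1: train on the samples:
Java code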

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import net.paoding.analysis.analyzer.PaodingAnalyzer;

// Builds one term-frequency vector per category from its sample articles.
// The generic type parameters are inferred from how the maps are used.
protected Map<String, Map<String, Integer>> getClassVector(List<Category> categoryList) throws Exception {

    if (categoryList == null || categoryList.size() == 0) {
        if (logger.isDebugEnabled()) {
            logger.debug("The list of new categoryList which should be classified is null or size = 0");
        }
        return Collections.emptyMap();
    }

    Map<String, Map<String, Integer>> categoryMap = new HashMap<String, Map<String, Integer>>();

    // Index the samples in memory; Paoding segments the text into terms
    Directory ramDir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ramDir, new PaodingAnalyzer(), true);

    for (Category cRc : categoryList) {
        for (Article item : cRc.getArticleList()) {
            Document doc = new Document();
            // Tokenized but not stored; the term vector keeps the per-term frequencies
            doc.add(new Field("description", item.getContent(), Field.Store.NO,
                    Field.Index.TOKENIZED, Field.TermVector.YES));
            // Stored, unindexed category id, used later as the map key
            doc.add(new Field("category", cRc.getId().toString(), Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);
        }
    }

    if (logger.isDebugEnabled()) {
        logger.debug("Generate the index in the memory, the size of categoryList list is " + categoryList.size());
    }

    writer.close();

    // Read the term vectors back and group them by category id
    buildContentVectors(ramDir, categoryMap, "category", "description");
    return categoryMap;
}
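The `buildContentVectors` helper called at the end of the method is not shown in this post. As a rough guess at the merging it performs, the sketch below adds each document's term counts into the vector stored under that document's key (here the category id). The `VectorMergeSketch` class and `addDocument` method are hypothetical names; plain arrays stand in for the terms and frequencies that, in Lucene 2.x, would presumably be read from the index's stored term vectors (for example through `IndexReader.getTermFreqVector`).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of grouping per-document term counts by key.
class VectorMergeSketch {

    // Adds one document's term counts into the vector stored under the given key.
    static void addDocument(Map<String, Map<String, Integer>> vectors,
                            String key, String[] terms, int[] freqs) {
        Map<String, Integer> vector = vectors.get(key);
        if (vector == null) {
            vector = new HashMap<String, Integer>();
            vectors.put(key, vector);
        }
        for (int i = 0; i < terms.length; i++) {
            // Sum the counts when several documents under the same key share a term
            vector.merge(terms[i], freqs[i], Integer::sum);
        }
    }
}
```

With this shape, all sample articles of one category collapse into a single vector, while articles keyed by their own id keep one vector each.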
Step 2: segment the articles to be classified (the principle is the same as for training on the samples):
Java code

// Builds one term-frequency vector per article to be classified, keyed by article id.
// The generic type parameters are inferred from how the maps are used.
protected Map<String, Map<String, Integer>> getArticleVector(List<Article> articleList) throws Exception {
    if (articleList == null || articleList.size() == 0) {
        if (logger.isDebugEnabled()) {
            logger.debug("The list of articles which should be classified is null or size = 0");
        }
        // Nothing to index; without this return a null list would fail in the loop below
        return Collections.emptyMap();
    }

    Map<String, Map<String, Integer>> articleMap = new HashMap<String, Map<String, Integer>>();

    Directory articleRamDir = new RAMDirectory();
    // IndexWriter writer = new IndexWriter(articleRamDir, new ChineseAnalyzer(), true);
    IndexWriter writer = new IndexWriter(articleRamDir, new PaodingAnalyzer(), true);

    for (Article article : articleList) {
        Document doc = new Document();
        // Stored, unindexed article id, used later as the map key
        doc.add(new Field("articleId", article.getId(),
                Field.Store.YES, Field.Index.NO));
        // Tokenized but not stored; the term vector keeps the per-term frequencies
        doc.add(new Field("description", article.getContent(), Field.Store.NO,
                Field.Index.TOKENIZED, Field.TermVector.YES));
        writer.addDocument(doc);
    }

    writer.flush();
    writer.close();

    buildContentVectors(articleRamDir, articleMap, "articleId", "description");
    return articleMap;
}
The core classification algorithm (the idea behind the code below comes from Lucene in Action):
Java code

import java.util.Map;
import java.util.Map.Entry;

// Returns the angle in radians between the article's and the category's
// term-frequency vectors; a smaller angle means a closer match.
public double caculateVectorSpace(Map<String, Integer> articleVectorMap, Map<String, Integer> classVectorMap) {
    if (articleVectorMap == null || classVectorMap == null) {
        if (logger.isDebugEnabled()) {
            logger.debug("itemVectorMap or classVectorMap is null");
        }
        // 20 lies outside the [0, pi] range of Math.acos and serves as a "no match" sentinel
        return 20;
    }

    // Double arithmetic throughout: integer division would truncate most frequencies to 0.
    // Raw counts are used because scaling a vector by a constant does not change the angle.
    double dotItem = 0;
    double denominatorOne = 0; // squared length of the category vector
    double denominatorTwo = 0; // squared length of the article vector

    for (Entry<String, Integer> entry : articleVectorMap.entrySet()) {
        String word = entry.getKey();
        double articleWordFreq = entry.getValue().doubleValue();

        // Only words present in both vectors contribute to the dot product
        if (classVectorMap.containsKey(word)) {
            dotItem += classVectorMap.get(word).doubleValue() * articleWordFreq;
        }
        denominatorTwo += articleWordFreq * articleWordFreq;
    }

    // The category vector's length covers all of its words, shared or not
    for (Integer count : classVectorMap.values()) {
        denominatorOne += count.doubleValue() * count.doubleValue();
    }

    double denominator = Math.sqrt(denominatorOne) * Math.sqrt(denominatorTwo);
    if (denominator == 0) {
        return 20; // an empty vector has no direction to compare
    }

    // Clamp so that rounding error never pushes the cosine above 1
    double ratio = Math.min(1.0, dotItem / denominator);

    return Math.acos(ratio);
}
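For reference, the measure behind this method can be written in its standard form as follows. The symbols are my own notation: a_w is the count of word w in the article and c_w its count in the category, with the sums running over all words. The small worked example uses made-up counts, not the test data from this post.

```latex
\theta = \arccos\left(
  \frac{\sum_{w} a_w c_w}
       {\sqrt{\sum_{w} a_w^{2}}\;\sqrt{\sum_{w} c_w^{2}}}
\right)

% Worked example with invented counts:
% article  a = (hibernate: 2, sql: 1), category c = (hibernate: 3, spring: 4)
\sum_{w} a_w c_w = 2 \cdot 3 = 6, \qquad
\sqrt{\sum_{w} a_w^{2}} = \sqrt{5}, \qquad
\sqrt{\sum_{w} c_w^{2}} = \sqrt{25} = 5

\cos\theta = \frac{6}{5\sqrt{5}} \approx 0.537, \qquad
\theta \approx 1.00 \text{ rad}
```

Since the counts are never negative, the angle always falls between 0 (same direction) and π/2 ≈ 1.57 (no words in common).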
Results:
Test data:
Java code

import java.util.ArrayList;
import java.util.List;

// Training data: a single category (id "1", Hibernate) with four sample articles.
public static List<Category> prepareCategoryList() {
    List<Category> categoryList = new ArrayList<Category>();
    List<Article> articleList = new ArrayList<Article>();

    Category c = new Category();
    c.setArticleList(articleList);
    c.setId("1");
    categoryList.add(c);

    Article a1 = new Article();
    a1.setId("1");
    a1.setTitle("Hibernate初探");
    a1.setContent("开始看Hibernate reference,运行hibernate的test中的代码。 Environment是一个非常重要的类。它定义了很多常量，最重要的是hibernate的入口在这里。");

    Article a2 = new Article();
    a2.setId("2");
    a2.setTitle("Hibernate SQL方言");
    a2.setContent("PO的数据类型设置 int 还是Integer Integer 允许为 null Hibernate 既可以访问Field也可以访问Property");

    Article a3 = new Article();
    a3.setId("3");
    a3.setTitle("Hibernate 杂烩");
    a3.setContent("Hibernate 中聚合函数的使用 Criteria接口的Projections类主要用于帮助Criteria接口完成数据的分组查询和统计功能:");

    Article a4 = new Article();
    a4.setId("4");
    a4.setTitle("Hibernate映射类型");
    a4.setContent("Hibernate映射类型Hibernate映射类型,对应的基本类型及对应的标准SQL类型");

    articleList.add(a1);
    articleList.add(a2);
    articleList.add(a3);
    articleList.add(a4);

    return categoryList;
}

// Articles to classify: a mix of Hibernate and Spring posts.
public static List<Article> prepareArticleList() {
    List<Article> articleList = new ArrayList<Article>();

    Article a1 = new Article();
    a1.setId("1");
    a1.setTitle("Hibernate学习笔记(一)");
    a1.setContent("本笔记的内容: 分层体系结构 ORM介绍 Hibernate简介 Hibernate开发步骤 Hibernate核心API ");

    Article a2 = new Article();
    a2.setId("2");
    a2.setTitle("Hibernate的性能问题");
    a2.setContent("各位老大，使用hibernate做企业级别的应用，会不会有性能问题啊？比如大数据量的搜索或者客户端同时大量的请求，会不会严重影响性能啊？有没有什么好的解决办法? 谢了先!");

    Article a3 = new Article();
    a3.setId("3");
    a3.setTitle("Spring2.5全面支持JEE5的实现");
    a3.setContent("Spring 2.5 发布已经有一段时间了，一直没有时间研究一下，只是听说有很多方面的提升。有一点十分重要的就是全面支持JEE5风格的annotation。");

    Article a4 = new Article();
    a4.setId("4");
    a4.setTitle("谈谈Spring的SqlMapClientTemplate对SqlMapClientCallback");
    a4.setContent("谈谈Spring的SqlMapClientTemplate对SqlMapClientCallback的使用 ■记得以前在看SqlMapClientTemplate的源代码的时候，下面的这两段代码硬是没看懂当时我很疑惑：真的有必要用到内部匿名类这样诡异的手法么？");

    Article a5 = new Article();
    a5.setId("5");
    a5.setTitle("spring 2.0 学习笔记");
    a5.setContent("前几天学习hibernate!在mysql下都能正常跑出来.! 但是我一换成oracle就出现下面这种情况: 小弟不解呀..google了很多次也解决不了此问题::.希望老大门帮忙看一下哈.!!! 环境:MyEclipse5.5,hibernate2.0,spring2.0");

    articleList.add(a1);
    articleList.add(a2);
    articleList.add(a3);
    articleList.add(a4);
    articleList.add(a5);

    return articleList;
}
The data in the test code above comes from articles on JavaEye.
Output:
2008-02-19 11:05:42,031 DEBUG ArticleClassifierImpl:74 - articleId=2---------acos value=1.412016112149136
2008-02-19 11:05:42,031 DEBUG ArticleClassifierImpl:74 - articleId=1---------acos value=1.3258176636680326
2008-02-19 11:05:42,031 DEBUG ArticleClassifierImpl:74 - articleId=5---------acos value=1.4244090675006476
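This shows that the articles with ids 1, 2 and 5 match the hibernate category. In practice we need to go one step further: suppose there are two categories, hibernate and spring, each with 5 samples. The results then need one more step, a minimum test: the article is assigned to the category with the smallest acos value. Readers can try this experiment themselves.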
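The "smallest acos wins" step could look roughly like the sketch below. It is a self-contained illustration, not the code behind the output above: the `ClosestCategorySketch` class and its method names are hypothetical, and the angle is the standard cosine angle over raw term counts.

```java
import java.util.Map;

// Hypothetical illustration: assign an article to the category with the smallest angle.
class ClosestCategorySketch {

    // Angle in radians between two term-count vectors; NaN when either vector is empty.
    static double angle(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue().doubleValue() * other.doubleValue();
            }
            normA += e.getValue().doubleValue() * e.getValue().doubleValue();
        }
        for (Integer count : b.values()) {
            normB += count.doubleValue() * count.doubleValue();
        }
        double denominator = Math.sqrt(normA) * Math.sqrt(normB);
        if (denominator == 0) {
            return Double.NaN;
        }
        // Clamp so rounding error cannot push the cosine above 1
        return Math.acos(Math.min(1.0, dot / denominator));
    }

    // Returns the id of the category whose vector forms the smallest angle
    // with the article, or null when no category can be compared.
    static String closest(Map<String, Integer> article,
                          Map<String, Map<String, Integer>> categories) {
        String best = null;
        double bestAngle = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : categories.entrySet()) {
            double theta = angle(article, e.getValue());
            if (theta < bestAngle) { // NaN compares false, so empty vectors are skipped
                bestAngle = theta;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

An article that shares no words with any category ends up at π/2 from all of them, so in practice a cut-off angle is probably worth adding before trusting the winner.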
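Text classification has many points to watch, such as noise-word (stop-word) removal; the code above does not include even the simplest noise-word removal. Next I will rework the code above with the kNN algorithm, run it on the same test data, and compare the results.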
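The simplest form of noise-word removal is a filter over the segmented tokens, as in the sketch below. The `StopWordSketch` class and its tiny English word list are assumptions for illustration only; for the Chinese test data a list of Chinese function words would be needed instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of stop-word filtering.
class StopWordSketch {

    // Tiny illustrative list; a real one is much longer and language-specific.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "of", "and", "is"));

    // Returns the tokens that are not stop words, in their original order.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Filtering before the term vectors are built keeps very common words from dominating the dot product and the vector lengths.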

This article comes from a ChinaUnix blog; to view the original, see: http://blog.chinaunix.net/u2/70940/showart_1095745.html
