dewlily 发表于 2011-12-20 09:47

KNN算法描述

<DIV>
<H3 style="TEXT-ALIGN: center; MARGIN: 0cm 0cm 0pt" align=center><FONT face=宋体><SPAN style="FONT-SIZE: 15pt">数据挖掘之</SPAN><SPAN style="FONT-SIZE: 15pt; mso-hansi-font-family: Arial" lang=EN-US>web</SPAN><SPAN style="FONT-SIZE: 15pt">文本归类算法的设计与实现<SPAN lang=EN-US><?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></SPAN></FONT></H3>
<H3 style="TEXT-ALIGN: center; MARGIN: 0cm 0cm 0pt" align=center><SPAN style="FONT-SIZE: 15pt; mso-hansi-font-family: Arial" lang=EN-US><o:p><FONT face=宋体>&nbsp;</FONT></o:p></SPAN></H3>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><B style="mso-bidi-font-weight: normal"><SPAN style="FONT-FAMILY: 仿宋_GB2312; FONT-SIZE: 12pt; mso-hansi-font-family: 华文仿宋; mso-font-kerning: 0pt; mso-bidi-font-family: AdobeHeitiStd-Regular" lang=EN-US>1 </SPAN></B><B style="mso-bidi-font-weight: normal"><SPAN style="FONT-FAMILY: 仿宋_GB2312; FONT-SIZE: 12pt; mso-hansi-font-family: 华文仿宋; mso-font-kerning: 0pt; mso-bidi-font-family: AdobeHeitiStd-Regular">引言<SPAN lang=EN-US><o:p></o:p></SPAN></SPAN></B></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT size=3 face=宋体>自动分类器利用机器学习原理对网页进行分类。分类器的工作步骤如下:</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><BR></SPAN><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>(</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>1</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>)利用搜狐分类目录的分类样本集进行训练,获得每个分类在特征空间上的聚类中心。</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana"> <SPAN lang=EN-US><o:p></o:p></SPAN></SPAN></FONT></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>(</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>2</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>)采用用户选择的分类器对接收的文本进行自动分类,并输出结果。</FONT></SPAN></FONT><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><BR></SPAN><FONT size=3><FONT face=宋体><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana">分类器被应用在网页搜索的分类提示功能中。分类提示功能不是简单的关键字叠加。</SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><o:p></o:p></SPAN></FONT></FONT></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-pagination: widow-orphan; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-bidi-font-size: 10.5pt"><FONT size=3>下面对分类流程作个简要描述:</FONT></SPAN><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 9pt; mso-font-kerning: 0pt"> <SPAN lang=EN-US><o:p></o:p></SPAN></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-pagination: widow-orphan; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-bidi-font-size: 10.5pt"><FONT size=3>训练:训练集<SPAN lang=EN-US>——&gt;</SPAN>特征选取<SPAN lang=EN-US>——&gt;</SPAN>训练<SPAN lang=EN-US>——&gt;</SPAN>分类器</FONT></SPAN><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 9pt; mso-font-kerning: 0pt" lang=EN-US> <o:p></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-pagination: widow-orphan; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto" class=MsoNormal align=left><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-font-kerning: 0pt; mso-ascii-font-family: 'Times New Roman'">分类:新样本</SPAN><SPAN style="mso-font-kerning: 0pt" lang=EN-US><FONT face="Times New Roman">——&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-font-kerning: 0pt; mso-ascii-font-family: 'Times New Roman'">特征选取</SPAN><SPAN style="mso-font-kerning: 0pt" lang=EN-US><FONT face="Times New Roman">——&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-font-kerning: 0pt; mso-ascii-font-family: 'Times New Roman'">分类</SPAN><SPAN style="mso-font-kerning: 0pt" lang=EN-US><FONT face="Times New Roman">——&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-hansi-font-family: 'Times New Roman'; mso-font-kerning: 0pt; mso-ascii-font-family: 'Times New Roman'">判决</SPAN></FONT><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 9pt; mso-font-kerning: 0pt; mso-bidi-font-family: 'Times New Roman'"> </SPAN><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 9pt; mso-font-kerning: 0pt" lang=EN-US><o:p></o:p></SPAN></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><FONT face=宋体><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana">分类器自动判断出文本所属的分类,然后按分类下结果多少给出提示,帮助用户有效缩小结果集,更精准地找到所需的信息。</SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><o:p></o:p></SPAN></FONT></FONT></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>训练语料</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>:</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>搜狗实验室提供的分类语料精简版</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>另自己爬了些数据</FONT></SPAN></FONT><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><BR></SPAN><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>训练词典</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>:</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>搜狗实验室提供的词典</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>包含</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>IDF,</SPAN><FONT face=宋体><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana">词性等值</SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><o:p></o:p></SPAN></FONT></FONT></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>包含大类</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>:</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>房产,汽车,财经,</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>IT</SPAN><FONT face=宋体><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana">,医疗,教育,旅游,体育,招聘,政治,军事</SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><o:p></o:p></SPAN></FONT></FONT></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>实现算法</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>:<o:p></o:p></SPAN></FONT></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>训练</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>解压搜狗分类语料</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>对各个类别分别训练</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>利用我自己变通的</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>KNN</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>算法</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>得到每种语料的学习结果</FONT></SPAN></FONT><FONT size=3><SPAN style="FONT-FAMILY: Verdana" lang=EN-US> <BR></SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>分类</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>给定一篇文本</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>分词</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>得到每一个词的权重</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>然后根学习结果比较</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>,</SPAN><SPAN style="mso-hansi-font-family: Verdana; mso-ascii-font-family: Verdana"><FONT face=宋体>最相近的一类就是它所属的大类</FONT></SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US>.<o:p></o:p></SPAN></FONT></P>
<P style="LINE-HEIGHT: 150%; BACKGROUND: white"><FONT size=3><FONT face=宋体>缺点<SPAN lang=EN-US>:</SPAN>训练语料都是新闻类<SPAN lang=EN-US>,</SPAN>所以测试最好用新闻语料<SPAN lang=EN-US>.</SPAN>另<SPAN lang=EN-US>IT</SPAN>类训练语料问题<SPAN lang=EN-US>,IT</SPAN>类误差较大<SPAN lang=EN-US>.</SPAN><SPAN style="FONT-FAMILY: Verdana" lang=EN-US><o:p></o:p></SPAN></FONT></FONT></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><B style="mso-bidi-font-weight: normal"><SPAN style="FONT-FAMILY: 仿宋_GB2312; FONT-SIZE: 14pt; mso-hansi-font-family: 华文仿宋; mso-font-kerning: 0pt; mso-bidi-font-family: AdobeHeitiStd-Regular" lang=EN-US>2 KNN</SPAN></B><B style="mso-bidi-font-weight: normal"><SPAN style="FONT-FAMILY: 仿宋_GB2312; FONT-SIZE: 14pt; mso-hansi-font-family: 华文仿宋; mso-font-kerning: 0pt; mso-bidi-font-family: AdobeHeitiStd-Regular">文本分类算法描述<SPAN lang=EN-US><o:p></o:p></SPAN></SPAN></B></P>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan; mso-outline-level: 3" class=MsoNormal align=left><B><SPAN style="FONT-FAMILY: Arial; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>KNN</SPAN></B><B><SPAN style="FONT-FAMILY: 宋体; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">算法的决策过程</SPAN></B><B><SPAN style="FONT-FAMILY: Arial; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US><o:p></o:p></SPAN></B></P>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">  </SPAN><B><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>k-Nearest Neighbor algorithm</SPAN></B><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US><o:p></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">  图中,绿色圆要被决定赋予哪个类,是红色三角形还是蓝色四方形?如果</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>K=3</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">,由于红色三角形所占比例为</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>2/3</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">,绿色圆将被赋予红色三角形那个类,如果</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>K=5</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">,由于蓝色四方形比例为</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>3/5</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">,因此绿色圆被赋予蓝色四方形类。</SPAN><SPAN style="FONT-FAMILY: Arial; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US><A href="http://imgsrc.baidu.com/baike/pic/item/b74124f3c4b8a34e352accfb.jpg" target=_blank><SPAN style="COLOR: #3366cc; TEXT-DECORATION: none; text-underline: none"><?xml:namespace prefix = v ns = "urn:schemas-microsoft-com:vml" /><v:shapetype id=_x0000_t75 stroked="f" filled="f" path="m@4@5l@4@11@9@11@9@5xe" o:preferrelative="t" o:spt="75" coordsize="21600,21600">&nbsp;<v:stroke joinstyle="miter"></v:stroke><v:formulas><v:f eqn="if lineDrawn pixelLineWidth 0"></v:f><v:f eqn="sum @0 1 0"></v:f><v:f eqn="sum 0 0 @1"></v:f><v:f eqn="prod @2 1 2"></v:f><v:f eqn="prod @3 21600 pixelWidth"></v:f><v:f eqn="prod @3 21600 pixelHeight"></v:f><v:f eqn="sum @0 0 1"></v:f><v:f eqn="prod @6 1 2"></v:f><v:f eqn="prod @7 21600 pixelWidth"></v:f><v:f eqn="sum @8 21600 0"></v:f><v:f eqn="prod @7 21600 pixelHeight"></v:f><v:f eqn="sum @10 21600 0"></v:f></v:formulas><v:path o:connecttype="rect" gradientshapeok="t" o:extrusionok="f"></v:path><o:lock aspectratio="t" v:ext="edit"></o:lock></v:shapetype></SPAN></A> </P>
<DIV forimg="1"><IMG class=blogimg border=0 src="http://hiphotos.baidu.com/oop_ming/pic/item/0d06e9f3f3d5b8ef0a46e029.jpg" small="0" _fcksavedurl="http://hiphotos.baidu.com/oop%5Fming/pic/item/0d06e9f3f3d5b8ef0a46e029.jpg"></DIV>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan" class=MsoNormal align=left><SPAN style="LETTER-SPACING: 0.4pt"><o:p></o:p></SPAN></SPAN></P>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">  </SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>K</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">最近邻</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>(k-Nearest Neighbor</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">,</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>KNN)</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">分类算法,是一个理论上比较成熟的方法,也是最简单的机器学习算法之一。该方法的思路是:如果一个样本在特征空间中的</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>k</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">个最相似</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>(</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">即特征空间中最邻近</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>)</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">的样本中的大多数属于某一个类别,则该样本也属于这个类别。</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>KNN</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">算法中,所选择的邻居都是已经正确分类的对象。该方法在定类决策上只依据最邻近的一个或者几个样本的类别来决定待分样本所属的类别。</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US> KNN</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">方法虽然从原理上也依赖于极限定理,但在类别决策时,只与极少量的相邻样本有关。由于</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>KNN</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">方法主要靠周围有限的邻近的样本,而不是靠判别类域的方法来确定所属类别的,因此对于类域的交叉或重叠较多的待分样本集来说,</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>KNN</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">方法较其他方法更为适合。</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US><o:p></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">  </SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>KNN</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">算法不仅可以用于分类,还可以用于回归。通过找出一个样本的</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>k</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">个最近邻居,将这些邻居的属性的平均值赋给该样本,就可以得到该样本的属性。更有用的方法是将不同距离的邻居对该样本产生的影响给予不同的权值</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>(weight)</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">,如权值与距离成正比。</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US><o:p></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; LINE-HEIGHT: 18pt; MARGIN: 0cm 0cm 0pt; WORD-BREAK: break-all; mso-pagination: widow-orphan" class=MsoNormal align=left><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">  该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>K</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">个邻居中大容量类的样本占多数。因此可以采用权值的方法(和该样本距离小的邻居权值大)来改进。该方法的另一个不足之处是计算量较大,因为对每一个待分类的文本都要计算它到全体已知样本的距离,才能求得它的</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US>K</SPAN><SPAN style="FONT-FAMILY: 宋体; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-hansi-font-family: Arial; mso-font-kerning: 0pt; mso-bidi-font-family: Arial; mso-ascii-font-family: Arial">个最近邻点。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。</SPAN><SPAN style="FONT-FAMILY: Arial; LETTER-SPACING: 0.4pt; COLOR: black; FONT-SIZE: 12pt; mso-font-kerning: 0pt" lang=EN-US> <o:p></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><SPAN style="FONT-SIZE: 12pt" lang=EN-US><o:p><FONT face="Times New Roman">&nbsp;</FONT></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><SPAN style="FONT-SIZE: 12pt" lang=EN-US><o:p><FONT face="Times New Roman">相关资料:</FONT></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><SPAN style="FONT-SIZE: 12pt" lang=EN-US><o:p>1. k-nearest neighbor algorithm:&nbsp; <A href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm"><FONT color=#810081>http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm</FONT></A></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><SPAN style="FONT-SIZE: 12pt" lang=EN-US><o:p></o:p></SPAN>&nbsp;</P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><SPAN style="FONT-SIZE: 12pt" lang=EN-US><o:p>2. <A class="external text" href="http://paul.luminos.nl/documents/show_document.php?d=197" rel=nofollow><FONT color=#800080><I>k</I>-nearest neighbor algorithm in Visual Basic and Java</FONT></A>&nbsp;<A href="http://paul.luminos.nl/documents/show_document.php?d=197"><FONT color=#810081>http://paul.luminos.nl/documents/show_document.php?d=197</FONT></A></o:p></SPAN></P>
<P style="TEXT-ALIGN: left; MARGIN: 0cm 0cm 0pt; mso-layout-grid-align: none" class=MsoNormal align=left><SPAN style="FONT-SIZE: 12pt" lang=EN-US><o:p><A href="http://paul.luminos.nl/update/408"><FONT color=#0000ff>http://paul.luminos.nl/update/408</FONT></A></o:p></SPAN></P></DIV>
页: [1]
查看完整版本: KNN算法描述