Posted on 2011-11-15 15:49

Apache Lucene scoring: principles and code analysis

The IndexSearcher class has a method that explains how Lucene arrived at a Document's score:

public Explanation explain(Weight weight, int doc) throws IOException {
    return weight.explain(reader, doc);
}

The Explanation instance it returns describes how a Document was scored. A quick test gives a concrete feel for what an Explanation actually records about a Document.

Write a test class like this:

package org.shirdrn.lucene.learn;

import java.io.IOException;
import java.util.Date;

import net.teamhot.lucene.ThesaurusAnalyzer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.LockObtainFailedException;

public class AboutLuceneScore {

    private String path = "E:\\Lucene\\index";

    public void createIndex(){
        IndexWriter writer;
        try {
            writer = new IndexWriter(path,new ThesaurusAnalyzer(),true);

            Field fieldA = new Field("contents","一人",Field.Store.YES,Field.Index.TOKENIZED);
            Document docA = new Document();
            docA.add(fieldA);

            Field fieldB = new Field("contents","一人之交一人之交",Field.Store.YES,Field.Index.TOKENIZED);
            Document docB = new Document();
            docB.add(fieldB);

            Field fieldC = new Field("contents","一人之下一人之下",Field.Store.YES,Field.Index.TOKENIZED);
            Document docC = new Document();
            docC.add(fieldC);

            Field fieldD = new Field("contents","一人做事一人当一人做事一人当",Field.Store.YES,Field.Index.TOKENIZED);
            Document docD = new Document();
            docD.add(fieldD);

            Field fieldE = new Field("contents","一人做事一人當一人做事一人當",Field.Store.YES,Field.Index.TOKENIZED);
            Document docE = new Document();
            docE.add(fieldE);

            writer.addDocument(docA);
            writer.addDocument(docB);
            writer.addDocument(docC);
            writer.addDocument(docD);
            writer.addDocument(docE);

            writer.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        AboutLuceneScore aus = new AboutLuceneScore();
        aus.createIndex(); // build the index
        try {
            String keyword = "一人";
            Term term = new Term("contents",keyword);
            Query query = new TermQuery(term);
            IndexSearcher searcher = new IndexSearcher(aus.path);
            Date startTime = new Date();
            Hits hits = searcher.search(query);
            // Part 1: how often the term occurs in each Document
            TermDocs termDocs = searcher.getIndexReader().termDocs(term);
            while(termDocs.next()){
                System.out.print("Document "+termDocs.doc()+" contains keyword <"+keyword+"> ");
                System.out.println(termDocs.freq()+" time(s)");
            }
            termDocs.close();
            System.out.println("********************************************************************");
            // Part 2: score and Explanation of each hit
            for(int i=0;i<hits.length();i++){
                System.out.println("Internal Document number: "+hits.id(i));
                System.out.println("Document content: "+hits.doc(i));
                System.out.println("Document score: "+hits.score(i));
                Explanation e = searcher.explain(query, hits.id(i));
                System.out.println("Explanation:\n"+e);
                System.out.println("Some values from this Document's Explanation:");
                System.out.println("Explanation getValue(): "+e.getValue());
                System.out.println("Explanation getDescription(): "+e.getDescription());
                System.out.println("********************************************************************");
            }
            System.out.println("Found "+hits.length()+" matching Document(s).");
            Date finishTime = new Date();
            long timeOfSearch = finishTime.getTime() - startTime.getTime();
            System.out.println("Search took "+timeOfSearch+" ms");
            searcher.close(); // release the index files
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This test class implements a createIndex() method that builds the index, then searches for the keyword "一人" and collects information about the Documents that match it.

The first part of the output shows how many times the keyword "一人" occurs in each Document.

The second part shows the Explanation and score for each matching Document.

The test output is:

Document 0 contains keyword <一人> 1 time(s)
Document 1 contains keyword <一人> 1 time(s)
Document 2 contains keyword <一人> 1 time(s)
Document 3 contains keyword <一人> 2 time(s)
Document 4 contains keyword <一人> 2 time(s)
********************************************************************
Internal Document number: 0
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人>>
Document score: 0.81767845
Explanation:
0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  1.0 = fieldNorm(field=contents, doc=0)
Some values from this Document's Explanation:
Explanation getValue(): 0.81767845
Explanation getDescription(): fieldWeight(contents:一人 in 0), product of:
********************************************************************
Internal Document number: 3
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人做事一人当一人做事一人当>>
Document score: 0.5059127
Explanation:
0.5059127 = (MATCH) fieldWeight(contents:一人 in 3), product of:
  1.4142135 = tf(termFreq(contents:一人)=2)
  0.81767845 = idf(docFreq=5)
  0.4375 = fieldNorm(field=contents, doc=3)
Some values from this Document's Explanation:
Explanation getValue(): 0.5059127
Explanation getDescription(): fieldWeight(contents:一人 in 3), product of:
********************************************************************
Internal Document number: 4
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人做事一人當一人做事一人當>>
Document score: 0.5059127
Explanation:
0.5059127 = (MATCH) fieldWeight(contents:一人 in 4), product of:
  1.4142135 = tf(termFreq(contents:一人)=2)
  0.81767845 = idf(docFreq=5)
  0.4375 = fieldNorm(field=contents, doc=4)
Some values from this Document's Explanation:
Explanation getValue(): 0.5059127
Explanation getDescription(): fieldWeight(contents:一人 in 4), product of:
********************************************************************
Internal Document number: 1
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人之交一人之交>>
Document score: 0.40883923
Explanation:
0.40883923 = (MATCH) fieldWeight(contents:一人 in 1), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  0.5 = fieldNorm(field=contents, doc=1)
Some values from this Document's Explanation:
Explanation getValue(): 0.40883923
Explanation getDescription(): fieldWeight(contents:一人 in 1), product of:
********************************************************************
Internal Document number: 2
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人之下一人之下>>
Document score: 0.40883923
Explanation:
0.40883923 = (MATCH) fieldWeight(contents:一人 in 2), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  0.5 = fieldNorm(field=contents, doc=2)
Some values from this Document's Explanation:
Explanation getValue(): 0.40883923
Explanation getDescription(): fieldWeight(contents:一人 in 2), product of:
********************************************************************
Found 5 matching Document(s).
Search took 79 ms

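Starting from the test output, we can observe the following:

■ In the test class, hits.score(i) has the same value as the Explanation's getValue(); this is the score Lucene uses by default;

■ By default, Lucene sorts search results by Document score, highest first;

■ By default, when two Documents have the same score they are ordered by internal Document number. Above, Documents (3 and 4) and (1 and 2) are two pairs with equal scores, and within each pair the results come out in Document-number order.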
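The ordering rule described in these observations can be sketched in plain Java. This is only an illustration of the rule as observed in the output, not Lucene's own sorting code; the Hit class and the method names are made up for the example, and the scores are copied from the test output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HitOrderSketch {
    // Hypothetical holder for one hit; this is not a Lucene class.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Scores copied from the test output, listed in Document-number order.
    static List<Hit> sampleHits() {
        return Arrays.asList(
                new Hit(0, 0.81767845f),
                new Hit(1, 0.40883923f),
                new Hit(2, 0.40883923f),
                new Hit(3, 0.5059127f),
                new Hit(4, 0.5059127f));
    }

    // Returns Document numbers ordered by descending score;
    // equal scores fall back to ascending Document number.
    static int[] orderedIds(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
        });
        int[] ids = new int[sorted.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = sorted.get(i).docId;
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(orderedIds(sampleHits())));
    }
}
```

Running it prints [0, 3, 4, 1, 2], the same order in which the Documents appear in the test output.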

Look again at the explain method in the IndexSearcher class:

public Explanation explain(Weight weight, int doc) throws IOException {
    return weight.explain(reader, doc);
}

It simply calls the explain() method of the Weight interface. A Weight is associated with a Query: it records the state built for one particular search with that Query, which is what allows a Query instance to be reused.

Concretely, explain() can be traced to TermWeight, a class that implements the Weight interface. TermWeight is an inner class defined inside TermQuery. Its explain() method is shown below:

public Explanation explain(IndexReader reader, int doc)
  throws IOException {

  ComplexExplanation result = new ComplexExplanation();
  result.setDescription("weight("+getQuery()+" in "+doc+"), product of:");

  Explanation idfExpl = new Explanation(idf, "idf(docFreq=" + reader.docFreq(term) + ")");

  // explain query weight
  Explanation queryExpl = new Explanation();
  queryExpl.setDescription("queryWeight(" + getQuery() + "), product of:");

  Explanation boostExpl = new Explanation(getBoost(), "boost");
  if (getBoost() != 1.0f)
    queryExpl.addDetail(boostExpl);
  queryExpl.addDetail(idfExpl);

  Explanation queryNormExpl = new Explanation(queryNorm,"queryNorm");
  queryExpl.addDetail(queryNormExpl);

  queryExpl.setValue(boostExpl.getValue() *
                     idfExpl.getValue() *
                     queryNormExpl.getValue());

  result.addDetail(queryExpl);

  // explain field weight
  String field = term.field();
  ComplexExplanation fieldExpl = new ComplexExplanation();
  fieldExpl.setDescription("fieldWeight("+term+" in "+doc+"), product of:");

  Explanation tfExpl = scorer(reader).explain(doc);
  fieldExpl.addDetail(tfExpl);
  fieldExpl.addDetail(idfExpl);

  Explanation fieldNormExpl = new Explanation();
  byte[] fieldNorms = reader.norms(field);
  float fieldNorm =
    fieldNorms!=null ? Similarity.decodeNorm(fieldNorms[doc]) : 0.0f;
  fieldNormExpl.setValue(fieldNorm);
  fieldNormExpl.setDescription("fieldNorm(field="+field+", doc="+doc+")");
  fieldExpl.addDetail(fieldNormExpl);

  fieldExpl.setMatch(Boolean.valueOf(tfExpl.isMatch()));
  fieldExpl.setValue(tfExpl.getValue() *
                     idfExpl.getValue() *
                     fieldNormExpl.getValue());

  result.addDetail(fieldExpl);
  result.setMatch(fieldExpl.getMatch());

  // combine them
  result.setValue(queryExpl.getValue() * fieldExpl.getValue());

  if (queryExpl.getValue() == 1.0f)
    return fieldExpl;

  return result;
}

Comparing the search output with TermWeight's explain() method, the strings in the output map one-to-one onto this code, for example idf (Inverse Document Frequency), fieldNorm, and fieldWeight. Our output contains only the fieldWeight part because the method returns fieldExpl on its own whenever queryExpl.getValue() is 1.0.
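To make the fieldWeight part of explain() easier to follow, here is a minimal sketch that rebuilds the explanation tree for Document 3 without Lucene on the classpath. The Node class is a hypothetical stand-in for Explanation (it is not Lucene's class), and the tf, idf and fieldNorm factors are simply copied from the test output rather than read from an index.

```java
import java.util.ArrayList;
import java.util.List;

class FieldWeightSketch {
    // Hypothetical stand-in for Explanation: a value, a description, child details.
    static final class Node {
        final float value;
        final String description;
        final List<Node> details = new ArrayList<>();

        Node(float value, String description) {
            this.value = value;
            this.description = description;
        }

        // Renders the tree with two spaces of indentation per nesting level.
        String render(int depth) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(value).append(" = ").append(description).append('\n');
            for (Node detail : details) {
                sb.append(detail.render(depth + 1));
            }
            return sb.toString();
        }
    }

    // Mirrors the fieldExpl branch of explain(): value = tf * idf * fieldNorm.
    static Node fieldWeight(String term, int doc, float tf, int termFreq,
                            float idf, int docFreq, float fieldNorm) {
        Node result = new Node(tf * idf * fieldNorm,
                "fieldWeight(" + term + " in " + doc + "), product of:");
        result.details.add(new Node(tf, "tf(termFreq(" + term + ")=" + termFreq + ")"));
        result.details.add(new Node(idf, "idf(docFreq=" + docFreq + ")"));
        result.details.add(new Node(fieldNorm, "fieldNorm(field=contents, doc=" + doc + ")"));
        return result;
    }

    public static void main(String[] args) {
        // "\u4e00\u4eba" is the keyword searched for in the test.
        String term = "contents:\u4e00\u4eba";
        // Factors for Document 3, taken from the test output.
        Node doc3 = fieldWeight(term, 3, 1.4142135f, 2, 0.81767845f, 5, 0.4375f);
        System.out.print(doc3.render(0));
    }
}
```

The top-level value should come out very close to the 0.5059127 reported for Document 3; swapping in the factors of the other Documents gives their scores in the same way.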

Here again is the information for the first Document in the search results:

Internal Document number: 0
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人>>
Document score: 0.81767845
Explanation:
0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  1.0 = fieldNorm(field=contents, doc=0)
Some values from this Document's Explanation:
Explanation getValue(): 0.81767845
Explanation getDescription(): fieldWeight(contents:一人 in 0), product of:

Computing tf

The tf value above is the Term Frequency; it is documented in the org.apache.lucene.search.Similarity class. Lucene does not use the raw term frequency directly. What it actually uses is the square root of the frequency:

tf(t in d) = frequency^(1/2)

This is computed by a method of DefaultSimilarity, a subclass of org.apache.lucene.search.Similarity:

/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
    return (float)Math.sqrt(freq);
}

In other words, a Document's tf is the square root of freq, the number of times the search term occurs in that Document.

For example, taking the frequencies from our search output:

Document 0 contains keyword <一人> 1 time(s)
Document 1 contains keyword <一人> 1 time(s)
Document 2 contains keyword <一人> 1 time(s)
Document 3 contains keyword <一人> 2 time(s)
Document 4 contains keyword <一人> 2 time(s)

the tf of each Document is computed as follows:

Document 0: tf = (float)Math.sqrt(1) = 1.0
Document 1: tf = (float)Math.sqrt(1) = 1.0
Document 2: tf = (float)Math.sqrt(1) = 1.0
Document 3: tf = (float)Math.sqrt(2) = 1.4142135
Document 4: tf = (float)Math.sqrt(2) = 1.4142135
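The tf values can be checked with a few lines of standalone Java. The method body is the same sqrt(freq) formula quoted from DefaultSimilarity; the class name TfSketch and the extra frequencies 4 and 9 are only there for illustration.

```java
class TfSketch {
    // Same formula as DefaultSimilarity.tf: the square root of the raw frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // 1 and 2 are the frequencies seen in the test; 4 and 9 show the trend.
        int[] freqs = {1, 2, 4, 9};
        for (int freq : freqs) {
            System.out.println("freq=" + freq + " -> tf=" + tf(freq));
        }
    }
}
```

Because of the square root, a term has to occur four times to double the tf of a single occurrence, so repeating a term helps a Document's score less and less.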

Computing idf

Every Document in the search results has an associated idf. The implementation of the idf calculation can be found in the DefaultSimilarity class:

/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
public float idf(int docFreq, int numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

Here docFreq is the number of Documents matched when searching for the given keyword; in our test docFreq=5. numDocs is the total number of Documents in the index. Our test is a special case in which every Document is retrieved, so numDocs=5 as well.

The idf of each Document is computed as follows. Since idf depends only on docFreq and numDocs, not on the individual Document, all five values are identical:

Document 0: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845
Document 1: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845
Document 2: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845
Document 3: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845
Document 4: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845
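The idf formula is easy to try outside Lucene as well. The sketch below copies the expression from DefaultSimilarity into a standalone class (IdfSketch is a made-up name) and compares the test's case, docFreq=5 and numDocs=5, with an invented case where the term occurs in only one of the five Documents.

```java
class IdfSketch {
    // Same expression as DefaultSimilarity.idf. The cast to double matters:
    // with integer division 5/6 would be 0 and Math.log(0) is -Infinity.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // The case from the test: the term occurs in all 5 Documents.
        System.out.println("idf(5, 5) = " + idf(5, 5));
        // Invented case: the term occurs in only 1 of the 5 Documents.
        System.out.println("idf(1, 5) = " + idf(1, 5));
    }
}
```

The second value is noticeably larger (roughly 1.92), which is the point of idf: a term found in few Documents carries more weight than one found in every Document.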

Computing lengthNorm

The implementation of the lengthNorm calculation can be found in the DefaultSimilarity class:

public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
}
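As a rough check of this formula, the sketch below evaluates it for a few field lengths. The class name and the chosen term counts are illustrative; the post does not say how many terms the analyzer actually produced for each Document.

```java
class LengthNormSketch {
    // Same formula as DefaultSimilarity.lengthNorm: 1 / sqrt(number of terms).
    // fieldName is unused by the formula, as in the method shown in the post.
    static float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Illustrative term counts only; not read from the test index.
        int[] termCounts = {1, 4, 5};
        for (int numTerms : termCounts) {
            System.out.println("numTerms=" + numTerms
                    + " -> lengthNorm=" + lengthNorm("contents", numTerms));
        }
    }
}
```

One term gives 1.0 and four terms give 0.5, both of which appear as fieldNorm values in the test output. The 0.4375 reported for Documents 3 and 4 is not exactly 1/sqrt(n) for any whole n. A likely reason is that explain() reads fieldNorm from a one-byte-per-Document norms array through Similarity.decodeNorm, so the stored value is a coarse approximation of the computed lengthNorm.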

Reposted from: http://www.blogjava.net/ashutc/archive/2011/04/15/348339.html
