[C] Releasing my inverted index

#1
Posted on 2008-07-03 16:01
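Download: http://libibase.googlecode.com/

Main features:
  HTML parsing
  Chinese word segmentation (backward maximum matching, implemented with a trie)
  Forward document generation (in a format I defined myself; this is how it stands for now)
  Inverted index generation (stored in blocks, bytecode compression algorithm; body text and snapshots are compressed with zlib)
  Searching by query string (only the vector space model is implemented; dynamic summaries are not finished yet)
  For now there is only one command-line test tool, hibase
  The package ships with a 100,000-word Chinese dictionary (in the doc directory, gzip format; decompress it before use)
  See the README for usage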

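Next up is testing and optimization. I used a lot of macros while writing it, so compilation is still a bit slow... heh.

If you would like to learn together, add me on MSN/GTalk: sounos@gmail.com

Here is a usage example while I'm at it:
I used wget to download the chinaunix home page into /data/html; my dictionary is under /data/dict.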

  ./hibase --basedir=/tmp --dict=/data/dict/dict.txt --add --doc=/data/html/index.html --url=http://www.chinaunix.net/ --date="Thu, 03 Jul 2008 10:12:18 GMT" --charset="gbk" --query --request="chinaunix" --topN=1000
  parsing document[http://www.chinaunix.net/] time used:16825 microseconds
  adding document[http://www.chinaunix.net/] time used:47955 microseconds
  parse query time used:36
  read hits[1] posting time used:1897
  Caculated 1 documents time used:22
  read 1 documents content time used:1404
  (0) title[ChinaUnix.net = 全球最大的Linux/Unix应用与开发者社区 = IT人的网上家园]
  summary[(null)]
  url[http://www.chinaunix.net/]
  size[84892]date[Thu, 03 Jul 2008 10:12:18 GMT]

  search [chinaunix] time used:3502

[ Last edited by redor on 2008-7-4 21:08 ]

#2
Posted on 2008-07-03 16:02
Nice~

#3
Posted on 2008-07-03 17:06
Awesome

#4
Posted on 2008-07-03 17:16
It doesn't compile.
What encoding is charcode.h written in?

Under VC the strings are not recognized. It looks like a character-encoding problem: GB2312, UTF-8, and Unicode all fail.

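#5
Posted on 2008-07-03 17:57
Originally posted by benjiam on 2008-7-3 17:16
It doesn't compile.
What encoding is charcode.h written in?

Under VC the strings are not recognized. It looks like a character-encoding problem: GB2312, UTF-8, and Unicode all fail.


Mine is UTF-8 at the moment. I have never built it under VC, and I doubt it will go smoothly....
Let me paste a copy here.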

[code]
#include <stdio.h>
#include <string.h>
#ifndef _CHARCODE_H
#define _CHARCODE_H
#define CHARCODE_NUM 252
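typedef struct _CHARCODE
{
        char *dec;   /* decimal form, e.g. "&#34;" for the quotation mark */
        char *code;  /* named entity, e.g. "&quot;" */
        char *chr;   /* the character the entity stands for */
        char *desc;  /* human-readable description */
}CHARCODE;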
static CHARCODE charcodelist[] =
{
                {" ", "&nbsp;", " ", "no-break space"},
                {"¡", "&iexcl;", "¡", "inverted exclamation mark"},
                {"¢", "&cent;", "¢", "cent sign"},
                {"£", "&pound;", "£", "pound sign"},
                {"¤", "&curren;", "¤", "currency sign"},
                {"¥", "&yen;", "¥", "yen sign = yuan sign"},
                {"¦", "&brvbar;", "|", "broken bar = broken vertical bar"},
                {"§", "&sect;", "§", "section sign"},
                {"¨", "&uml;", "¨", "diaeresis = spacing diaeresis"},
                {"©", "&copy;", "©", "copyright sign"},
                {"ª", "&ordf;", "a", "feminine ordinal indicator"},
                {"«", "&laquo;", "«", "left-pointing double angle quotation mark = left pointing guillemet"},
                {"¬", "&not;", "¬", "not sign = discretionary hyphen"},
                {"­", "&shy;", "-", "soft hyphen = discretionary hyphen"},
                {"®", "&reg;", "®", "registered sign = registered trade mark sign"},
                {"¯", "&macr;", "ˉ", "macron = spacing macron = overline = APL overbar"},
                {"°", "&deg;", "°", "degree sign"},
                {"±", "&plusmn;", "±", "plus-minus sign = plus-or-minus sign"},
                {"²", "&sup2;", "2", "superscript two = superscript digit two = squared"},
                {"³", "&sup3;", "3", "superscript three = superscript digit three = cubed"},
                {"´", "&acute;", "′", "acute accent = spacing acute"},
                {"µ", "&micro;", "μ", "micro sign"},
                {"¶", "&para;", "¶", "pilcrow sign = paragraph sign"},
                {"·", "&middot;", "·", "middle dot = Georgian comma = Greek middle dot"},
                {"¸", "&cedil;", "¸", "cedilla = spacing cedilla"},
                {"¹", "&sup1;", "1", "superscript one = superscript digit one"},
                {"º", "&ordm;", "o", "masculine ordinal indicator"},
                {"»", "&raquo;", "»", "right-pointing double angle quotation mark = right pointing guillemet"},
                {"¼", "&frac14;", "¼", "vulgar fraction one quarter = fraction one quarter"},
                {"½", "&frac12;", "½", "vulgar fraction one half = fraction one half"},
                {"¾", "&frac34;", "¾", "vulgar fraction three quarters = fraction three quarters"},
                {"¿", "&iquest;", "¿", "inverted question mark = turned question mark"},
                {"À", "&Agrave;", "À", "latin capital letter A with grave = latin capital letter A grave"},
                {"Á", "&Aacute;", "Á", "latin capital letter A with acute"},
                {"Â", "&Acirc;", "Â", "latin capital letter A with circumflex"},
                {"Ã", "&Atilde;", "Ã", "latin capital letter A with tilde"},
                {"Ä", "&Auml;", "Ä", "latin capital letter A with diaeresis"},
                {"Å", "&Aring;", "Å", "latin capital letter A with ring above = latin capital letter A ring"},
                {"Æ", "&AElig;", "Æ", "latin capital letter AE = latin capital ligature AE"},
                {"Ç", "&Ccedil;", "Ç", "latin capital letter C with cedilla"},
                {"È", "&Egrave;", "È", "latin capital letter E with grave"},
                {"É", "&Eacute;", "É", "latin capital letter E with acute"},
                {"Ê", "&Ecirc;", "Ê", "latin capital letter E with circumflex"},
                {"Ë", "&Euml;", "Ë", "latin capital letter E with diaeresis"},
                {"Ì", "&Igrave;", "Ì", "latin capital letter I with grave"},
                {"Í", "&Iacute;", "Í", "latin capital letter I with acute"},
                {"Î", "&Icirc;", "Î", "latin capital letter I with circumflex"},
                {"Ï", "&Iuml;", "Ï", "latin capital letter I with diaeresis"},
                {"Ð", "&ETH;", "D", "latin capital letter ETH"},
                {"Ñ", "&Ntilde;", "Ñ", "latin capital letter N with tilde"},
                {"Ò", "&Ograve;", "Ò", "latin capital letter O with grave"},
                {"Ó", "&Oacute;", "Ó", "latin capital letter O with acute"},
                {"Ô", "&Ocirc;", "Ô", "latin capital letter O with circumflex"},
                {"Õ", "&Otilde;", "Õ", "latin capital letter O with tilde"},
                {"Ö", "&Ouml;", "Ö", "latin capital letter O with diaeresis"},
                {"×", "&times;", "×", "multiplication sign"},
                {"Ø", "&Oslash;", "Ø", "latin capital letter O with stroke = latin capital letter O slash"},
                {"Ù", "&Ugrave;", "Ù", "latin capital letter U with grave"},
                {"Ú", "&Uacute;", "Ú", "latin capital letter U with acute"},
                {"Û", "&Ucirc;", "Û", "latin capital letter U with circumflex"},
                {"Ü", "&Uuml;", "Ü", "latin capital letter U with diaeresis"},
                {"Ý", "&Yacute;", "Y", "latin capital letter Y with acute"},
                {"Þ", "&THORN;", "T", "latin capital letter THORN"},
                {"ß", "&szlig;", "ß", "latin small letter sharp s = ess-zed"},
                {"à", "&agrave;", "à", "latin small letter a with grave = latin small letter a grave"},
                {"á", "&aacute;", "á", "latin small letter a with acute"},
                {"â", "&acirc;", "a", "latin small letter a with circumflex"},
                {"ã", "&atilde;", "ã", "latin small letter a with tilde"},
                {"ä", "&auml;", "ä", "latin small letter a with diaeresis"},
                {"å", "&aring;", "å", "latin small letter a with ring above = latin small letter a ring"},
                {"æ", "&aelig;", "æ", "latin small letter ae = latin small ligature ae"},
                {"ç", "&ccedil;", "ç", "latin small letter c with cedilla"},
                {"è", "&egrave;", "è", "latin small letter e with grave"},
                {"é", "&eacute;", "é", "latin small letter e with acute"},
                {"ê", "&ecirc;", "ê", "latin small letter e with circumflex"},
                {"ë", "&euml;", "ë", "latin small letter e with diaeresis"},
                {"ì", "&igrave;", "ì", "latin small letter i with grave"},
                {"í", "&iacute;", "í", "latin small letter i with acute"},
                {"î", "&icirc;", "î", "latin small letter i with circumflex"},
                {"ï", "&iuml;", "ï", "latin small letter i with diaeresis"},
                {"ð", "&eth;", "e", "latin small letter eth"},
                {"ñ", "&ntilde;", "ñ", "latin small letter n with tilde"},
                {"ò", "&ograve;", "ò", "latin small letter o with grave"},
                {"ó", "&oacute;", "ó", "latin small letter o with acute"},
                {"ô", "&ocirc;", "ô", "latin small letter o with circumflex"},
                {"õ", "&otilde;", "õ", "latin small letter o with tilde"},
                {"ö", "&ouml;", "ö", "latin small letter o with diaeresis"},
                {"÷", "&divide;", "÷", "division sign"},
                {"ø", "&oslash;", "ø", "latin small letter o with stroke = latin small letter o slash"},
                {"ù", "&ugrave;", "ù", "latin small letter u with grave"},
                {"ú", "&uacute;", "ú", "latin small letter u with acute"},
                {"û", "&ucirc;", "û", "latin small letter u with circumflex"},
                {"ü", "&uuml;", "ü", "latin small letter u with diaeresis"},
                {"ý", "&yacute;", "y", "latin small letter y with acute"},
                {"þ", "&thorn;", "t", "latin small letter thorn"},
                {"ÿ", "&yuml;", "ÿ", "latin small letter y with diaeresis"},
                {"ƒ", "&fnof;", "ƒ ", "latin small f with hook = function = florin"},
                {"Α", "&Alpha;", "Α ", "greek capital letter alpha"},
                {"Β", "&Beta;", "Β ", "greek capital letter beta"},
                {"Γ", "&Gamma;", "Γ ", "greek capital letter gamma"},
                {"Δ", "&Delta;", "Δ ", "greek capital letter delta"},
                {"Ε", "&Epsilon;", "Ε ", "greek capital letter epsilon"},
                {"Ζ", "&Zeta;", "Ζ ", "greek capital letter zeta"},
                {"Η", "&Eta;", "Η ", "greek capital letter eta"},
                {"Θ", "&Theta;", "Θ ", "greek capital letter theta"},
                {"Ι", "&Iota;", "Ι ", "greek capital letter iota"},
                {"Κ", "&Kappa;", "Κ ", "greek capital letter kappa"},
                {"Λ", "&Lambda;", "Λ ", "greek capital letter lambda"},
                {"Μ", "&Mu;", "Μ ", "greek capital letter mu"},
                {"Ν", "&Nu;", "Ν ", "greek capital letter nu"},
                {"Ξ", "&Xi;", "Ξ ", "greek capital letter xi"},
                {"Ο", "&Omicron;", "Ο ", "greek capital letter omicron"},
                {"Π", "&Pi;", "Π ", "greek capital letter pi"},
                {"Ρ", "&Rho;", "Ρ ", "greek capital letter rho"},
                {"Σ", "&Sigma;", "Σ ", "greek capital letter sigma"},
                {"Τ", "&Tau;", "Τ ", "greek capital letter tau"},
                {"Υ", "&Upsilon;", "Υ ", "greek capital letter upsilon"},
                {"Φ", "&Phi;", "Φ ", "greek capital letter phi"},
                {"Χ", "&Chi;", "Χ ", "greek capital letter chi"},
                {"Ψ", "&Psi;", "Ψ ", "greek capital letter psi"},
                {"Ω", "&Omega;", "Ω ", "greek capital letter omega"},
                {"α", "&alpha;", "α ", "greek small letter alpha"},
                {"β", "&beta;", "β ", "greek small letter beta"},
                {"γ", "&gamma;", "γ ", "greek small letter gamma"},
                {"δ", "&delta;", "δ ", "greek small letter delta"},
                {"ε", "&epsilon;", "ε ", "greek small letter epsilon"},
                {"ζ", "&zeta;", "ζ ", "greek small letter zeta"},
                {"η", "&eta;", "η ", "greek small letter eta"},
                {"θ", "&theta;", "θ ", "greek small letter theta"},
                {"ι", "&iota;", "ι ", "greek small letter iota"},
                {"κ", "&kappa;", "κ ", "greek small letter kappa"},
                {"λ", "&lambda;", "λ ", "greek small letter lambda"},
                {"μ", "&mu;", "μ ", "greek small letter mu"},
                {"ν", "&nu;", "ν ", "greek small letter nu"},
                {"ξ", "&xi;", "ξ ", "greek small letter xi"},
                {"ο", "&omicron;", "ο ", "greek small letter omicron"},
                {"π", "&pi;", "π ", "greek small letter pi"},
                {"ρ", "&rho;", "ρ ", "greek small letter rho"},
                {"ς", "&sigmaf;", "ς ", "greek small letter final sigma"},
                {"σ", "&sigma;", "σ ", "greek small letter sigma"},
                {"τ", "&tau;", "τ ", "greek small letter tau"},
                {"υ", "&upsilon;", "υ ", "greek small letter upsilon"},
                {"φ", "&phi;", "φ ", "greek small letter phi"},
                {"χ", "&chi;", "χ ", "greek small letter chi"},
                {"ψ", "&psi;", "ψ ", "greek small letter psi"},
                {"ω", "&omega;", "ω ", "greek small letter omega"},
                {"ϑ", "&thetasym;", "ϑ ", "greek small letter theta symbol"},
                {"ϒ", "&upsih;", "ϒ ", "greek upsilon with hook symbol"},
                {"ϖ", "&piv;", "ϖ ", "greek pi symbol"},
                {"•", "&bull;", "•", "bullet = black small circle"},
                {"…", "&hellip;", "…", "horizontal ellipsis = three dot leader"},
                {"′", "&prime;", "′", "prime = minutes = feet"},
                {"″", "&Prime;", "″", "double prime = seconds = inches"},
                {"‾", "&oline;", " ̄", "overline = spacing overscore"},
                {"⁄", "&frasl;", "⁄", "fraction slash"},
                {"℘", "&weierp;", "℘", "script capital P = power set = Weierstrass p"},
                {"ℑ", "&image;", "ℑ", "blackletter capital I = imaginary part"},
                {"ℜ", "&real;", "ℜ", "blackletter capital R = real part symbol"},
                {"™", "&trade;", "™", "trade mark sign"},
                {"ℵ", "&alefsym;", "ℵ", "alef symbol = first transfinite cardinal"},
                {"←", "&larr;", "←", "leftwards arrow"},
                {"↑", "&uarr;", "↑", "upwards arrow"},
                {"→", "&rarr;", "→", "rightwards arrow"},
                {"↓", "&darr;", "↓", "downwards arrow"},
                {"↔", "&harr;", "↔", "left right arrow"},
                {"↵", "&crarr;", "↵", "downwards arrow with corner leftwards = carriage return"},
                {"⇐", "&lArr;", "⇐", "leftwards double arrow"},
                {"⇑", "&uArr;", "⇑", "upwards double arrow"},
                {"⇒", "&rArr;", "⇒", "rightwards double arrow"},
                {"⇓", "&dArr;", "⇓", "downwards double arrow"},
                {"⇔", "&hArr;", "⇔", "left right double arrow"},
                {"∀", "&forall;", "∀", "for all"},
                {"∂", "&part;", "∂", "partial differential"},
                {"∃", "&exist;", "∃", "there exists"},
                {"∅", "&empty;", "∅", "empty set = null set = diameter"},
                {"∇", "&nabla;", "∇", "nabla = backward difference"},
                {"∈", "&isin;", "∈", "element of"},
                {"∉", "&notin;", "∉", "not an element of"},
                {"∋", "&ni;", "∋", "contains as member"},
                {"∏", "&prod;", "∏", "n-ary product = product sign"},
                {"∑", "&sum;", "∑", "n-ary summation"},
                {"−", "&minus;", "−", "minus sign"},
                {"∗", "&lowast;", "∗", "asterisk operator"},
                {"√", "&radic;", "√", "square root = radical sign"},
                {"∝", "&prop;", "∝", "proportional to"},
                {"∞", "&infin;", "∞", "infinity"},
                {"∠", "&ang;", "∠", "angle"},
                {"∧", "&and;", "∧", "logical and = wedge"},
                {"∨", "&or;", "∨", "logical or = vee"},
                {"∩", "&cap;", "∩", "intersection = cap"},
                {"∪", "&cup;", "∪", "union = cup"},
                {"∫", "&int;", "∫", "integral"},
                {"∴", "&there4;", "∴", "therefore"},
                {"∼", "&sim;", "~", "tilde operator = varies with = similar to"},
                {"≅", "&cong;", "≅", "approximately equal to"},
                {"≈", "&asymp;", "≈", "almost equal to = asymptotic to"},
                {"≠", "&ne;", "≠", "not equal to"},
                {"≡", "&equiv;", "≡", "identical to"},
                {"≤", "&le;", "≤", "less-than or equal to"},
                {"≥", "&ge;", "≥", "greater-than or equal to"},
                {"⊂", "&sub;", "⊂", "subset of"},
                {"⊃", "&sup;", "⊃", "superset of"},
                {"⊄", "&nsub;", "⊄", "not a subset of"},
                {"⊆", "&sube;", "⊆", "subset of or equal to"},
                {"⊇", "&supe;", "⊇", "superset of or equal to"},
                {"⊕", "&oplus;", "⊕", "circled plus = direct sum"},
                {"⊗", "&otimes;", "⊗", "circled times = vector product"},
                {"⊥", "&perp;", "⊥", "up tack = orthogonal to = perpendicular"},
                {"⋅", "&sdot;", "⋅", "dot operator"},
                {"⌈", "&lceil;", "⌈", "left ceiling = apl upstile"},
                {"⌉", "&rceil;", "⌉", "right ceiling"},
                {"⌊", "&lfloor;", "⌊", "left floor = apl downstile"},
                {"⌋", "&rfloor;", "⌋", "right floor"},
                {"〈", "&lang;", "〈", "left-pointing angle bracket = bra"},
                {"〉", "&rang;", "〉", "right-pointing angle bracket = ket"},
                {"◊", "&loz;", "◊", "lozenge"},
                {"♠", "&spades;", "♠", "black spade suit"},
                {"♣", "&clubs;", "♣", "black club suit = shamrock"},
                {"♥", "&hearts;", "♥", "black heart suit = valentine"},
                {"♦", "&diams;", "♦", "black diamond suit"},
                {"&#34;", "&quot;", "\"", "quotation mark = APL quote"},
                {"&#38;", "&amp;", "& ", "ampersand"},
                {"&#60;", "&lt;", "< ", "less-than sign"},
                {"&#62;", "&gt;", "> ", "greater-than sign"},
                {"Œ", "&OElig;", "Œ ", "latin capital ligature OE"},
                {"œ", "&oelig;", "œ ", "latin small ligature oe"},
                {"Š", "&Scaron;", "Š ", "latin capital letter S with caron"},
                {"š", "&scaron;", "š ", "latin small letter s with caron"},
                {"Ÿ", "&Yuml;", "Ÿ ", "latin capital letter Y with diaeresis"},
                {"ˆ", "&circ;", "ˆ ", "modifier letter circumflex accent"},
                {"˜", "&tilde;", "˜ ", "small tilde"},
                {" ", "&ensp;", " ", "en space"},
                {" ", "&emsp;", " ", "em space"},
                {" ", "&thinsp;", " ", "thin space"},
                {"‌", "&zwnj;", "‌", "zero width non-joiner"},
                {"‍", "&zwj;", "‍", "zero width joiner"},
                {"‎", "&lrm;", "‎", "left-to-right mark"},
                {"‏", "&rlm;", "‏", "right-to-left mark"},
                {"–", "&ndash;", "–", "en dash"},
                {"—", "&mdash;", "—", "em dash"},
                {"‘", "&lsquo;", "‘", "left single quotation mark"},
                {"’", "&rsquo;", "’", "right single quotation mark"},
                {"‚", "&sbquo;", "‚", "single low-9 quotation mark"},
                {"“", "&ldquo;", "“", "left double quotation mark"},
                {"”", "&rdquo;", "”", "right double quotation mark"},
                {"„", "&bdquo;", "„", "double low-9 quotation mark"},
                {"†", "&dagger;", "†", "dagger"},
                {"‡", "&Dagger;", "‡", "double dagger"},
                {"‰", "&permil;", "‰", "per mille sign"},
                {"‹", "&lsaquo;", "‹", "single left-pointing angle quotation mark"},
                {"›", "&rsaquo;", "›", "single right-pointing angle quotation mark"},
                {"€", "&euro;", "
[/code]

#6
Posted on 2008-07-03 19:28
Nice.
Good stuff.

[ Last edited by wilbur8415 on 2008-7-3 19:58 ]

#7
Posted on 2008-07-03 22:20
What is an "inverted index library"? What does that mean? Could the OP explain a bit?

#8
Posted on 2008-07-03 23:03
My university graduation project was a search engine. I just used lucene directly. (Is that how it's spelled? I've forgotten.)

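#9
Posted on 2008-07-04 08:28
Originally posted by tyc611 on 2008-7-3 22:20
What is an "inverted index library"? What does that mean? Could the OP explain a bit?



It implements an inverted index and is released as a library.... It is not a complete search solution; it only handles indexing data and retrieval....
To build a complete search engine you have to develop the other parts yourself, such as data downloading, a daemon service, and so on....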