A spelling corrector implemented in Perl, compared with the Python code

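Post 1 | Posted on 2008-05-30 15:09
A few days ago I saw a link on the Python board where the author implemented a simple spelling corrector in Python, and I couldn't resist rewriting it in Perl.
I'd welcome pointers from the experts on whether it can be simplified further.
The Python code is included in the comments.

Original article:

http://norvig.com/spell-correct.html

[ Last edited by cobrawgl on 2008-5-30 21:27 ]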

Post 2 | Posted on 2008-05-30 17:18

return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)


Use map + grep for this.
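
To make the map + grep suggestion concrete, here is a small Python sketch of the same shape: the inner filter plays the role of Perl's grep, and the outer flattening step plays the role of map. The toy NWORDS dictionary and the two function names are invented for illustration; the real model is trained on big.txt.

```python
from itertools import chain

# Hypothetical toy model; the real NWORDS is trained on big.txt.
NWORDS = {"the": 5, "he": 3, "then": 2}
alphabet = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # All strings one edit away: deletion, transposition, alteration, insertion.
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])

def known_edits2_comprehension(word):
    # The form quoted above: one nested generator expression.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known_edits2_map_filter(word):
    # map + grep shape: filter (grep) each inner result, then flatten (map).
    kept = (filter(lambda e2: e2 in NWORDS, edits1(e1)) for e1 in edits1(word))
    return set(chain.from_iterable(kept))

print(known_edits2_comprehension("teh") == known_edits2_map_filter("teh"))
```

Both functions walk the same candidates; the second only makes the filter-then-flatten structure explicit.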

[ Last edited by cobrawgl on 2008-5-30 21:27 ]

Post 3 | Posted on 2008-05-30 20:57
The forum said my post was over the length limit, so I had to reply again. This time grep solved the problem.

[ Last edited by cobrawgl on 2008-5-30 21:25 ]

Post 4 | Posted on 2008-05-30 21:20
The Python version takes 21 lines and the Perl version gets it down to 20, which I'm quite pleased with.

import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):   
    model = collections.defaultdict(lambda: 1)   
    for f in features:        
        model[f] += 1   
    return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):   
    n = len(word)   
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion               
           [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition               
           [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration               
           [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
def known_edits2(word):   
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):   
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]   
    return max(candidates, key=lambda w: NWORDS[w])


use IO::File;
my $fh = IO::File->new('big.txt') or die;
my $words = join '', <$fh>;
sub words { return (lc shift) =~ /([a-z]+)/g;}
sub train {
    my %model;
    $model{$_} = ($model{$_} || 1)+1 for @_;
    return %model; }
my %nwords = train(words($words));
sub edits1 {
    my $word = shift;
    return ((map {(substr $word, 0, $_) . (substr $word, $_+1)} 0 .. (length($word)-1)),  # deletion
            (map {(substr $word, 0, $_) . (substr $word, $_+1, 1) . (substr $word, $_, 1) . (substr $word, $_+2)} 0 .. (length($word)-2)),  # transposition
            (map {my $c = $_; map {(substr $word, 0, $_) . $c . (substr $word, $_+1)} 0 .. (length($word)-1)} 'a'..'z'),  # alteration
            (map {my $c = $_; map {(substr $word, 0, $_) . $c . (substr $word, $_)} 0 .. length($word)} 'a'..'z')); }  # insertion
sub known_edits2 {return map {grep {exists $nwords{$_}} edits1($_)} edits1(shift)}
sub known {return grep {exists $nwords{$_}} @_}
sub correct {
    my @candidates = known(@_) ? known(@_) : known(edits1(@_)) ? known(edits1(@_)) : known_edits2(@_) ? known_edits2(@_) : @_;
    return (sort {$nwords{$b} <=> $nwords{$a}} @candidates)[0]; }


[ Last edited by cobrawgl on 2008-5-30 21:25 ]

Post 5 | Posted on 2008-05-31 08:56
I also wrote a test, following the original article.

sub spelltest {
    my %test = @_;

    my $start   = time;
    my $n       = 0;
    my $bad     = 0;
    my $unknown = 0;

    for my $word (keys %test) {
        for my $wrong ((split ' ', $test{$word})) {
            $n += 1;
            my $w = correct($wrong);
            if ($w ne $word) {
                $bad += 1;
                $unknown += !(exists $nwords{$word});
            }
        }
    }
    my $secs = time - $start;
    my $pct = int(100 - 100 * $bad/$n);
    return "bad= $bad, unknown= $unknown, secs= $secs, pct= $pct, n= $n\n";
}


def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.clock()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target,wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w!=target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print 'correct(%r) => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target])
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.clock()-start) )
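
The Python harness above is Python 2 code (print statement, time.clock). For readers on Python 3, where time.clock was removed in version 3.8, a minimal adaptation might look like the sketch below. Passing correct and nwords in as parameters, dropping the bias and verbose options, and the toy data are all assumptions made here so that the block runs on its own; none of them come from the original article.

```python
import time

def spelltest(tests, correct, nwords):
    """Count how many misspellings `correct` fails to map to their target."""
    n = bad = unknown = 0
    start = time.perf_counter()  # stands in for time.clock(), removed in 3.8
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            if correct(wrong) != target:
                bad += 1
                # True counts as 1: the target was never seen in training.
                unknown += target not in nwords
    secs = time.perf_counter() - start
    return dict(bad=bad, n=n, unknown=unknown,
                pct=int(100.0 - 100.0 * bad / n), secs=secs)

# Toy stand-ins (hypothetical): a two-word model and a lookup-table corrector.
toy_nwords = {"access": 3, "address": 2}
toy_fixes = {"acess": "access", "adress": "address"}

def toy_correct(word):
    return toy_fixes.get(word, word)

result = spelltest({"access": "acess",
                    "address": "adress addres",
                    "zebra": "zebrah"}, toy_correct, toy_nwords)
print(result["bad"], result["unknown"], result["pct"], result["n"])  # 2 1 50 4
```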


Results:

>perl -w spell.pl
bad= 68, unknown= 15, secs= 85, pct= 74, n= 270
bad= 130, unknown= 43, secs= 132, pct= 67, n= 400
>Exit code: 0

>pythonw -u "spell.py"
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 17, 'pct': 74, 'n': 270}
{'bad': 130, 'bias': None, 'unknown': 43, 'secs': 29, 'pct': 67, 'n': 400}
>Exit code: 0


The Perl version is much slower (85 s and 132 s against 17 s and 29 s for Python), and I don't know why.
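
Two differences visible in the posted code may contribute to the gap, though neither was measured in this thread. First, Python's edits1 returns a set, so duplicate candidates are dropped before the second round of edits, while the Perl edits1 returns a plain list. Second, the Perl correct evaluates each stage twice (once as the ternary test, once for the value), including the expensive known_edits2 stage, whereas Python's `or` chain evaluates each stage once. The Python sketch below counts candidates to show the first effect and uses a call counter to show the second. The names edits1_list, expensive_stage, correct_twice and correct_once, and the one-word dictionary, are invented for illustration.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

def edits1_list(word):
    # Same four edit types as the thread's code, but kept as a list
    # (duplicates included), the way the Perl version returns them.
    n = len(word)
    return ([word[0:i] + word[i+1:] for i in range(n)] +
            [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +
            [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
            [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])

# A word of length n yields n + (n-1) + 26n + 26(n+1) = 54n + 25 raw candidates.
as_list = edits1_list("something")
as_set = set(as_list)
print(len(as_list), len(as_set))  # 511 494

# Second effect: using a stage as both test and value doubles its cost.
calls = {"stage": 0}
toy_dictionary = {"something"}  # hypothetical one-word model

def expensive_stage(word):
    calls["stage"] += 1
    return [c for c in edits1_list(word) if c in toy_dictionary]

def correct_twice(word):
    # Shape of the Perl ternary chain: test, then recompute for the value.
    return expensive_stage(word) if expensive_stage(word) else [word]

def correct_once(word):
    # Shape of Python's `a or b`: the stage runs a single time.
    return expensive_stage(word) or [word]

correct_twice("somethin")
twice = calls["stage"]
calls["stage"] = 0
correct_once("somethin")
once = calls["stage"]
print(twice, once)  # 2 1
```

If these are indeed factors, storing each stage's result in a variable before testing it, and deduplicating the output of edits1 through a hash, would be the corresponding changes on the Perl side. Whether that closes the whole gap is untested here.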

[ Last edited by cobrawgl on 2008-5-31 09:22 ]

Post 6 | Posted on 2008-05-31 11:55
Python's word[0:2] slicing syntax looks very efficient.
Does Perl have anything similar?

Also, map seems to be fairly time-consuming.
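
For what it's worth, the Perl code earlier in the thread already uses the closest counterpart: substr($word, 0, 2) extracts the same two characters as Python's word[0:2]. As a small illustration of how such slices build one family of edits, here is a Python sketch; the function name deletions is made up for this example.

```python
def deletions(word):
    # word[0:i] is the prefix before position i; word[i+1:] is the rest
    # after skipping one character. A slice that starts at the end of the
    # string is simply empty, so no bounds checks are needed.
    return [word[0:i] + word[i+1:] for i in range(len(word))]

print("perl"[0:2])        # pe
print(deletions("perl"))  # ['erl', 'prl', 'pel', 'per']
```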

Post 7 | Posted on 2008-06-02 16:50
Bookmarked.

Post 8 | Posted on 2008-06-02 17:55
Actually... the Perl could be a single line.

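Post 9 | Posted on 2008-06-02 18:13
Originally posted by redspider on 2008-6-2 17:55:
Actually... the Perl could be a single line.


There's no need for us to pick on Python over that.

The real issue is that the Python version is genuinely fast. Do you have a solution?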

Post 10 | Posted on 2008-06-03 14:48
I haven't read the code yet; I'll come back and study it after work.