免费注册 查看新帖 |

Chinaunix

  平台 论坛 博客 文库
最近访问板块 发新帖
查看: 1930 | 回复: 0
打印 上一主题 下一主题

Spell Checking in Python [复制链接]

论坛徽章:
0
跳转到指定楼层
1 [收藏(0)] [报告]
发表于 2009-04-16 22:33 |只看该作者 |倒序浏览

I was looking into spell checking in Python.  I found
spell4py
, and downloaded the zip, but couldn’t get it to build on my system.  If I tried a bit longer maybe, but in the end my solution worked out fine.  This library was overkill for my needs too.
I found this article here:
http://code.activestate.com/recipes/117221/
This seemed to work well for my purposes, but I wanted to test out other spell checking libraries.    Mozilla Firefox , Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (as I’m testing the spelling of words on the Internet).  Here are some python snippets to get you up and running with the popular spelling checkers. I modified these to take more than 1 word, split them up, and then return a list of suggestions.  They do require each spelling checker to be installed.  I was able to do this through the openSuSE package manager.
Ispell
import popen2

class ispell:
    def __init__(self):
        self._f = popen2.Popen3("ispell")
        self._f.fromchild.readline() #skip the credit line
    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.tochild.write(word+'\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip()
            self._f.fromchild.readline() #skip the blank line
            if s[:8] == "word: ok":
                output.append(None)
            else:
                output.append((s[17:-1]).strip().split(', '))
        return output
Aspell
import popen2

class aspell:
    def __init__(self):
        self._f = popen2.Popen3("aspell -a")
        self._f.fromchild.readline() #skip the credit line
    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.tochild.write(word+'\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip()
            self._f.fromchild.readline() #skip the blank line
            if s == "*":
                output.append(None)
            elif s[0] == '#':
                output.append("No Suggestions")
            else:
                output.append(s.split(':')[1].strip().split(', '))
        return output
Hunspell
import popen2

class hunspell:
    def __init__(self):
        self._f = popen2.Popen3("hunspell")
        self._f.fromchild.readline() #skip the credit line
    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.tochild.write(word+'\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip().lower()
            self._f.fromchild.readline() #skip the blank line
            if s == "*":
                output.append(None)
            elif s[0] == '#':
                output.append("No Suggestions")
            elif s[0] == '+':
                pass
            else:
                output.append(s.split(':')[1].strip().split(', '))
        return output
Now, after doing this and seeing the suggestions. I decided a spell checker isn’t really what I was looking for.  A spelling checker always tries to make a suggestion, and I wanted to filter out things from a database.  I started this with the hope that I would be able to take misspellings and convert them into the correct word.  In the end, I just removed words that were not spelled correctly using
WordNET
through
NLTK
.  WordNET had a bigger dictionary than most of the spell checkers which also helped in the filtering task.  NLTK has a simple
how to
on how to get started using WordNET.


本文来自ChinaUnix博客,如果查看原文请点:http://blog.chinaunix.net/u/78/showart_1901954.html
您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

  

北京盛拓优讯信息技术有限公司. 版权所有 京ICP备16024965号-6 北京市公安局海淀分局网监中心备案编号:11010802020122 niuxiaotong@pcpop.com 17352615567
未成年举报专区
中国互联网协会会员  联系我们:huangweiwei@itpub.net
感谢所有关心和支持过ChinaUnix的朋友们 转载本站内容请注明原作者名及出处

清除 Cookies - ChinaUnix - Archiver - WAP - TOP