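I'm trying to use NLTK's clean_html function to strip the markup from an HTML page:
import nltk, re, pprint
import urllib2
html = ''  # URL omitted: the forum does not allow posting HTML links
h = urllib2.urlopen(html)  # open the page
c = h.read()               # read the raw HTML
raw = nltk.clean_html(c)   # this is the call that fails
But it raises the following error: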
Traceback (most recent call last):
  File "E:\python_project\test1.py", line 7, in <module>
    raw=nltk.clean_html(c)
  File "E:\python2.7\lib\site-packages\nltk\util.py", line 346, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
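
The error message itself points to the replacement: BeautifulSoup's get_text(). Below is a minimal sketch of that approach, not a definitive fix. It assumes Python 3 (the post above uses Python 2.7) and the third-party beautifulsoup4 package. The helper name strip_html, the sample markup, and the standard-library fallback used when bs4 is not installed are all illustrative assumptions, not part of NLTK or of the original script.

```python
from html.parser import HTMLParser

try:
    # Third-party package (pip install beautifulsoup4); assumed, not required.
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None


class _TextExtractor(HTMLParser):
    """Standard-library fallback: collect text outside <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(markup, "html.parser")
        # Drop script and style elements so their contents are not returned.
        for tag in soup(["script", "style"]):
            tag.decompose()
        text = soup.get_text(separator=" ")
    else:
        parser = _TextExtractor()
        parser.feed(markup)
        parser.close()
        text = " ".join(parser.parts)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


# Made-up sample markup standing in for the downloaded page.
sample = (
    "<html><body><h1>Hello</h1><p>NLTK &amp; friends</p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_html(sample))
```

In the original script, the idea would be to pass the downloaded page content to a helper like this in place of the nltk.clean_html(c) call; the resulting plain text can then be handed to NLTK's tokenizers as before.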