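1. I saved a web page as an HTML file and then parsed it with BeautifulSoup.
Calling BeautifulSoup through my wrapper function buildSoupFromStr(content) raises an error,
but calling BeautifulSoup(content,fromEncoding="GBK") directly does not.
2. Also, a question for the experts here: to extract the text portion of a web page, is there a good approach other than regular expressions and BeautifulSoup?
BeautifulSoup does not feel very convenient to me either.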
The error is as follows:

Traceback (most recent call last):
  File "C:\Program Files\Python27\code\hanhan.py", line 29, in <module>
    buildSoupFromStr(content)
  File "C:\Program Files\Python27\code\hanhan.py", line 20, in buildSoupFromStr
    soup = BeautifulSoup(content,fromEncoding)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
  File "C:\Program Files\Python27\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Program Files\Python27\lib\sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1463, in parse_declaration
    j = SGMLParser.parse_declaration(self, i)
  File "C:\Program Files\Python27\lib\markupbase.py", line 109, in parse_declaration
    self.handle_decl(data)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1448, in handle_decl
    self._toStringSubclass(data, Declaration)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1379, in _toStringSubclass
    self.endData()
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1251, in endData
    (not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
The full code is as follows:

# -*- coding: cp936 -*-
from sys import *
from BeautifulSoup import *

# Read the whole file; returns -1 if the file cannot be opened.
def getContent(filename):
    try:
        file_object = open(filename, 'r')
    except IOError:
        print 'Can not find file'
        return -1
    try:
        content = file_object.read( )
    finally:
        file_object.close( )
    return content

# Wrapper around the BeautifulSoup constructor (this is the call that fails).
def buildSoupFromStr(content,fromEncoding="GBK"):
    print type(content)
    soup = BeautifulSoup(content,fromEncoding)
    #return soup

if __name__ == '__main__':
    content = getContent('han.html')
    #print content
    if -1 == content:
        print 'error happen'
    buildSoupFromStr(content)
    #BeautifulSoup(content,fromEncoding="GBK")
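A likely explanation for question 1, judging from the traceback: in BeautifulSoup 3 the constructor's second positional parameter is parseOnlyThese, and fromEncoding is the third. The wrapper calls BeautifulSoup(content,fromEncoding) positionally, so the string "GBK" would end up in parseOnlyThese, which matches the failure at self.parseOnlyThese.text. The sketch below does not use BeautifulSoup itself; fake_soup is a hypothetical stand-in that copies only that parameter order, to show where the value lands in each style of call.

```python
# Hypothetical stand-in: it mirrors only the parameter order of the
# BeautifulSoup 3 constructor (markup, parseOnlyThese, fromEncoding, ...).
def fake_soup(markup="", parseOnlyThese=None, fromEncoding=None):
    return {"parseOnlyThese": parseOnlyThese, "fromEncoding": fromEncoding}

# Positional call, as in buildSoupFromStr: "GBK" lands in parseOnlyThese.
positional = fake_soup("<html></html>", "GBK")

# Keyword call, as in the direct call that works: "GBK" lands in fromEncoding.
keyword = fake_soup("<html></html>", fromEncoding="GBK")

print(positional)
print(keyword)
```

If this is indeed the cause, changing the call inside the wrapper to BeautifulSoup(content, fromEncoding=fromEncoding) should make it behave like the direct call.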
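On question 2, one option that needs no third-party library is the standard library's HTMLParser class. The following is only a minimal sketch under stated assumptions: TextExtractor and extract_text are hypothetical names introduced here, the sample markup is made up for illustration, and the input is assumed to be already decoded text (for a GBK page read as bytes under Python 2.7, something like content.decode('gbk') would be needed first).

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7, as used in this post


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Hypothetical helper: return the visible text, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up sample markup, only for illustration.
sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><h1>Title</h1><p>Hello <b>world</b></p>"
          "<script>var x = 1;</script></body></html>")
print(extract_text(sample))
```

This approach is event-driven rather than tree-based, so it suits plain text extraction better than navigating to specific elements; for the latter, BeautifulSoup or a similar parser is probably still the more convenient tool.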