A strange BeautifulSoup problem: asking the experts here for help
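1. I saved a web page as HTML and then parsed it with BeautifulSoup. Calling it through my wrapper function buildSoupFromStr(content) raises an error,
while calling BeautifulSoup(content,fromEncoding="GBK") directly does not.
2. Also, for extracting the text portion of a web page, is there a good approach besides regular expressions and BeautifulSoup?
BeautifulSoup doesn't feel very convenient either.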
The error is as follows:
Traceback (most recent call last):
  File "C:\Program Files\Python27\code\hanhan.py", line 29, in <module>
    buildSoupFromStr(content)
  File "C:\Program Files\Python27\code\hanhan.py", line 20, in buildSoupFromStr
    soup = BeautifulSoup(content,fromEncoding)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
  File "C:\Program Files\Python27\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Program Files\Python27\lib\sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1463, in parse_declaration
    j = SGMLParser.parse_declaration(self, i)
  File "C:\Program Files\Python27\lib\markupbase.py", line 109, in parse_declaration
    self.handle_decl(data)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1448, in handle_decl
    self._toStringSubclass(data, Declaration)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1379, in _toStringSubclass
    self.endData()
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1251, in endData
    (not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
The full code is as follows:
# -*- coding: cp936 -*-
from sys import *
from BeautifulSoup import *

def getContent(filename):
    try:
        file_object = open(filename, 'r')
    except IOError:
        print 'Can not find file'
        return -1
    try:
        content = file_object.read( )
    finally:
        file_object.close( )
    return content

def buildSoupFromStr(content,fromEncoding="GBK"):
    print type(content)
    soup = BeautifulSoup(content,fromEncoding)
    #return soup

if __name__ == '__main__':
    content = getContent('han.html')
    #print content
    if -1 == content:
        print 'error happen'
    buildSoupFromStr(content)
    #BeautifulSoup(content,fromEncoding="GBK")
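1. bs defaults to UTF-8; if the real page uses another encoding (other than ASCII), it raises an error. Pass the encoding you already know as a parameter.
2. bs supports several APIs; CSS selectors, for example, are quite convenient, provided you know the CSS selector syntax. Overall, parsing structured HTML/XML documents is tedious and complicated, and there are no real shortcuts.
timespace posted on 2014-05-31 19:31:
1. bs defaults to UTF-8; if the real page uses another encoding (other than ASCII), it raises an error. Pass the encoding you already know as a parameter.
2. b ...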
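Thank you!
Regarding point 1: the direct call raises no error, but once I pass the parameter through my function it fails. Isn't that strange?
It doesn't seem to be related to the encoding at all!
Reply to 3# qxhgd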
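I was talking about how the call is made; it indeed has nothing to do with the encoding. Do you know how positional and keyword arguments work? Change the call in buildSoupFromStr to: soup = BeautifulSoup(content,fromEncoding= fromEncoding)
Reply to 4# timespace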
Following your change it does work now, many thanks!
But I ran a test:
def func2(a,b):
    print 'a is ',a,',b is',b
def func1(a,b=5):
    func2(a,b)
func1(3,4)
Final printed output:
a is  3 ,b is 4
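No error at all!
Why would it raise an error? bs expects the encoding as a keyword argument; you passed it positionally, which triggers a logic error inside bs, not a syntax error.
Use the lxml module, then extract the data you want with XPath expressions.
Reply to 7# r2007
Thanks a lot, I'll try it when I have time!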
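For anyone who wants to get a feel for the XPath approach before installing lxml, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree, which only supports a limited XPath subset and needs well-formed markup, so it is a stand-in for lxml rather than a replacement. The sample markup, the extract_text helper, and the path expression are all made up for illustration; they are not taken from han.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the thread's han.html).
SAMPLE = """<html><body>
<div class="post"><h1>Title</h1><p>First <b>bold</b> paragraph.</p></div>
<div class="ad"><p>Ignore me.</p></div>
</body></html>"""


def extract_text(markup, path):
    """Return the text of every element matched by a limited-XPath expression."""
    root = ET.fromstring(markup)
    # itertext() walks the element and its children, so nested tags such as
    # <b> contribute their text as well.
    return ["".join(node.itertext()).strip() for node in root.findall(path)]


# Select only paragraphs inside the div whose class attribute is "post".
paragraphs = extract_text(SAMPLE, ".//div[@class='post']/p")
print(paragraphs)
```

With lxml installed, roughly the same idea would go through lxml.html.fromstring(...) and the element's xpath(...) method, which accepts full XPath 1.0 and tolerates the messy HTML found on real pages.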
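The positional-versus-keyword point from the thread can be reproduced without BeautifulSoup itself. The sketch below assumes, based on the traceback (a str ending up in parseOnlyThese), that the constructor's second parameter is parseOnlyThese and that fromEncoding comes after it. FakeSoup, build_positional, and build_keyword are hypothetical stand-ins written for this illustration, not BeautifulSoup's real code.

```python
class FakeSoup(object):
    """Hypothetical stand-in; the parameter order is an assumption taken from the traceback."""

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None):
        self.markup = markup
        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding

    def describe(self):
        return "parseOnlyThese=%r, fromEncoding=%r" % (
            self.parseOnlyThese, self.fromEncoding)


def build_positional(content, fromEncoding="GBK"):
    # The second positional argument binds to the second parameter,
    # which here is parseOnlyThese, not fromEncoding.
    return FakeSoup(content, fromEncoding)


def build_keyword(content, fromEncoding="GBK"):
    # Naming the parameter sends the value where it was meant to go.
    return FakeSoup(content, fromEncoding=fromEncoding)


wrong = build_positional("<html></html>")
right = build_keyword("<html></html>")
print(wrong.describe())  # parseOnlyThese='GBK', fromEncoding=None
print(right.describe())  # parseOnlyThese=None, fromEncoding='GBK'
```

This is also why the func1/func2 test prints normally: func2 has only two parameters, so the second positional argument lands exactly where it was intended. The call is valid Python in both cases; the difference is which parameter receives the value.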