Chinaunix

Title: Why is my regex match list empty?

Author: zchao4251500 Time: 2012-11-05 11:23
Why is the list returned by my regex match empty?

#!/usr/bin/env python
#-*- encoding: u8 -*-
import sys,re,urllib
url = str("http://www.baidu.com/s?tn=baiduhome_pg&ie=utf-8&bs=inurl%3Aaction&f=8&rsv_bp=1&rsv_spt=1&wd=%E6%95%99%E5%AD%A6&rsv_sug3=4&rsv_sug=0&rsv_sug1=4&rsv_sug4=75&inputT=4535")
oper = urllib.urlopen(url).read().replace(""," ")
urls = re.findall(r"<a.*?href=.*?</a>",oper,re.I)
print urls

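I want to fetch this Baidu search link and extract the URLs from the results.

But when I print urls, the list is empty. Could anyone who knows why point me in the right direction? Thanks.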

Author: zchao4251500 Time: 2012-11-05 11:25
Also, when I print oper directly, the output is garbled. How do I set the encoding? Thanks, everyone.

Author: zchao4251500 Time: 2012-11-05 11:32
#!/usr/bin/env python
#-*- encoding: u8 -*-
import sys,re,urllib
url = str("http://www.baidu.com/s?tn=baiduhome_pg&ie=utf-8&bs=inurl%3Aaction&f=8&rsv_bp=1&rsv_spt=1&wd=%E6%95%99%E5%AD%A6&rsv_sug3=4&rsv_sug=0&rsv_sug1=4&rsv_sug4=75&inputT=4535")
oper = urllib.urlopen(url).read().replace(""," ").decode('utf-8')
urls = re.findall(r"<a href=.*?</a>",oper,re.S)
print oper

D:\script>python test.py
Traceback (most recent call last):
  File "test.py", line 5, in <module>
    oper = urllib.urlopen(url).read().replace(""," ").decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 341: invalid continuation byte

This version raises an error too, and I don't know how to fix the encoding problem.
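
One likely cause of both the empty list and the decode error, offered as a reading of the code and traceback above rather than a confirmed diagnosis: replace("", " ") inserts a space between every byte of the page. That turns "<a" into "< a", so the pattern can never match, and it puts a space right after each UTF-8 lead byte, which would explain "invalid continuation byte". The sketch below reproduces this offline. The page bytes and the example.com URL are made up for illustration and are not real Baidu output.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for the downloaded page: UTF-8 bytes with one link.
page = u'<a href="http://example.com/">\u6559\u5b66</a>'.encode("utf-8")

# Replacing the empty string inserts a space before every byte and at the end.
spaced = page.replace(b"", b" ")

# Same pattern as in the question, applied to bytes; re.S lets "." span newlines.
link_re = re.compile(br"<a.*?href=.*?</a>", re.I | re.S)

matches_spaced = link_re.findall(spaced)  # "<a" is now "< a", so no match
matches_clean = link_re.findall(page)     # the untouched bytes still match

try:
    spaced.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # A space now follows each multi-byte lead byte, which is invalid UTF-8.
    decode_failed = True

# Decoding the untouched bytes works.
text = page.decode("utf-8")

print(len(matches_spaced), len(matches_clean), decode_failed)
```

If the replace call was meant to flatten line breaks, replace("\n", " ") would do that without touching the other bytes, and passing re.S to findall makes it unnecessary for the regex.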

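Author: xiaoding0377 Time: 2012-11-05 12:18
1 Scraping the search results is not allowed.
2 The URLs in the search results are encrypted, so they are of no use unless you can decode them.
3 <a href=.*?</a>",oper,re.S) captures the small links shown under each result on the search results page. For example, under the first result, the Baidu Baike entry for 教学 ("teaching"), it matches: Chinese term - 1. Philosophical discussion - 2. Special role - 3. School education. Scraping those is pointless.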
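
On point 3, if the goal is the href values themselves rather than whole <a>...</a> elements, an HTML parser is usually less fragile than a regex. This is a minimal sketch using the standard library's HTMLParser on a made-up snippet. LinkCollector and the sample markup are hypothetical names for illustration and do not reflect Baidu's actual page structure. Whether the collected links are usable still depends on point 2.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical sample markup, not taken from a real search results page.
sample = ('<div><a class="t" href="http://example.com/one">One</a> '
          '<A HREF="/two">Two</A></div>')

collector = LinkCollector()
collector.feed(sample)
links = collector.links
print(links)
```

The parser handles attribute order, quoting, and upper-case tags, which the <a href=.*?</a> pattern does not.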

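Author: crifan Time: 2012-11-08 12:50
Reply to 1# zchao4251500

First of all, your approach itself is wrong.
To correctly find the URLs in Baidu's search results, you first need to use a tool to work out the internal logic yourself.
For details, see:
[Summary] The logic/flow of fetching web pages, parsing page content, and simulating website login, and things to watch out for

[Tutorial] A step-by-step guide to using a tool (IE9's F12) to analyze the internal logic of simulating a login to a website (the Baidu home page)

Once you have worked out the internal logic, see:
[Tutorial] Simulating website login, Python version (includes complete, runnable code in two versions)

Then implement it yourself in Python.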

Welcome to Chinaunix (http://bbs.chinaunix.net/)