Chinaunix
Title:
How to scrape Baidu search-box suggestion data with Python
Author:
alexkh
Time:
2013-09-22 17:49
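As the title says: I know that the keyword data behind Baidu's search-box suggestion drop-down comes from a URL like this:
http://suggestion.baidu.com/su?wd={keyword}
But when I use get from the requests library, the response comes back empty, even though I have also spoofed the UA. How should I handle this?
My code is below: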
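#coding=utf-8
import requests

def get_box(word):
    url = 'http://suggestion.baidu.com/su?wd=%s&p=3&cb=window.bdsug.sug&from=superpage' % word
    headers = {
        'User-Agent': 'Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+5.1;+Trident/4.0;+GTB7.1;+.NET+CLR+2.0.50727)'
    }
    # The question describes a GET request, so call requests.get here;
    # the code as first posted called requests.post instead.
    r = requests.get(url, headers=headers)
    print(r.status_code)  # HTTP status code of the response
    print(r.content)      # raw response body

get_box('途牛')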
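Since the request is described as a GET to http://suggestion.baidu.com/su?wd={keyword}, one way to rule out encoding problems with a Chinese keyword is to build the query string explicitly and inspect it before sending. The following is a minimal sketch for Python 3, not a confirmed fix: `build_suggest_url` is a hypothetical helper name, and UTF-8 is only an assumed encoding, because the post does not say which encoding the endpoint expects.

```python
from urllib.parse import urlencode

# Endpoint and parameters are the ones quoted in the post.
BASE_URL = 'http://suggestion.baidu.com/su'


def build_suggest_url(word, encoding='utf-8'):
    """Build the suggestion URL with the keyword percent-encoded.

    The default encoding is an assumption; pass encoding='gbk'
    to try the other common choice for this kind of endpoint.
    """
    query = urlencode({
        'wd': word,
        'p': 3,
        'cb': 'window.bdsug.sug',
        'from': 'superpage',
    }, encoding=encoding)
    return BASE_URL + '?' + query


# Print the URL that would be requested for the keyword used in the post.
print(build_suggest_url('途牛'))
```

The resulting string can be passed to `requests.get` unchanged, which makes it easy to compare the exact URL against what a browser sends.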
While searching online I found a PHP version. I do not understand it, so it is here for reference only:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link type="text/css" rel="stylesheet"
href="http://zone.wooyun.org/themes/wooyun/css/style.css"/></head>
<body>
<?php
/*
another:VIP
date:2013-2-26
*/
// Default to an empty string so a missing parameter does not raise a notice.
$word = isset($_GET['word']) ? $_GET['word'] : '';
if ($word == "")
{
    echo <<<EOF
<form action="" method="get">
<p>Keyword: <input type="text" name="word" /></p>
<input type="submit" value="Scrape" />
</form>
EOF;
}
else
{
    // Percent-encode the keyword before placing it in the URL.
    $data = file_get_contents('http://suggestion.baidu.com/su?wd='.urlencode($word));
    $data = mb_convert_encoding($data, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5');
    $data_temp = strpos($data, "x");
    $data = substr_replace($data, "", $data_temp, 17);
    $data = trim($data, ");");
    $data = trim($data, "{");
    $data = preg_replace("/q:.+?.e,/", '', $data);
    $data = str_replace("[", "", $data);
    $data = str_replace("]", "", $data);
    $data = "[".$data."]";
    $data = str_replace(",", "},s:", $data);
    $data = str_replace("s:", "{\"s\":", $data); // convoluted processing to make the string valid JSON
    $dc = json_decode($data);
    for ($n = 0; $n <= 9; $n++)
    {
        $wd[$n] = $dc[$n]->s;
        echo "<br>".$wd[$n];
    }
}
?>
</body>
</html>
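The PHP page above reshapes the response with a chain of string replacements before calling json_decode. In Python the same step might be done more directly by pulling out the `s:[...]` array with a regular expression and handing it to the json module. This is only a sketch under stated assumptions: the response shape shown in the comment is inferred from the keys the PHP code manipulates (`q:` and `s:`), GBK is an assumed default encoding, and `parse_suggestions` is a hypothetical helper name.

```python
import json
import re

# Assumed response shape (inferred, not confirmed by the post):
#   window.bdsug.sug({q:"...",p:false,s:["...","..."]});
# The pattern captures the bracketed array that follows the s: key.
_S_ARRAY = re.compile(r'[,{]s:(\[.*?\])', re.S)


def parse_suggestions(body, encoding='gbk'):
    """Return the suggestion strings found in a JSONP response body.

    Accepts bytes (decoded with the assumed encoding) or str.
    Returns an empty list when no s:[...] array is present.
    """
    if isinstance(body, bytes):
        text = body.decode(encoding, errors='replace')
    else:
        text = body
    match = _S_ARRAY.search(text)
    if match is None:
        return []
    # The array itself is valid JSON even though the wrapper is not.
    return json.loads(match.group(1))
```

With requests, the function would be called as `parse_suggestions(r.content)`. The non-greedy match stops at the first `]`, so a suggestion that itself contains a closing bracket would be truncated; a stricter parser would be needed for that case.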