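This post was last edited by Galenlong on 2010-12-27 16:35.
I need to scrape this site, http://www.cquae.com/search.shtml?ob=ProjectCenter&lx=2&jg=1, but I can only ever fetch the first page; I cannot reproduce the paging request. I have analyzed how the site passes its paging parameters and tried Snoopy, curl, and other approaches, but none of them work. I would appreciate it if anyone could help look into this. Thanks!
These are the captured headers:
- FIRST REQUEST: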
- GET http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2 HTTP/1.1
- Host: www.cquae.com
- User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Accept-Language: zh-cn,zh;q=0.5
- Accept-Encoding: gzip,deflate
- Accept-Charset: GB2312,utf-8;q=0.7,*;q=0.7
- Keep-Alive: 115
- Proxy-Connection: keep-alive
- FIRST RESPONSE:
- HTTP/1.0 200 OK
- Cache-Control: private
- Content-Type: text/html; charset=utf-8
- Server: Microsoft-IIS/7.5
- X-Powered-By: UrlRewriter.NET 2.0.0
- Set-Cookie: ASP.NET_SessionId=5050pcjv24rxsejce1dtbjas; path=/; HttpOnly
- X-AspNet-Version: 2.0.50727
- X-Powered-By: ASP.NET
- Date: Mon, 27 Dec 2010 03:31:38 GMT
- Content-Length: 62646
- Connection: keep-alive
- SECOND REQUEST:
- POST http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2 HTTP/1.1
- Host: www.cquae.com
- User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Accept-Language: zh-cn,zh;q=0.5
- Accept-Encoding: gzip,deflate
- Accept-Charset: GB2312,utf-8;q=0.7,*;q=0.7
- Keep-Alive: 115
- Proxy-Connection: keep-alive
- Referer: http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2
- Cookie: ASP.NET_SessionId=5050pcjv24rxsejce1dtbjas
- Content-Type: application/x-www-form-urlencoded
- Content-Length: 33393
- SECOND RESPONSE:
- HTTP/1.0 200 OK
- Cache-Control: private
- Content-Type: text/html; charset=utf-8
- Server: Microsoft-IIS/7.5
- X-Powered-By: UrlRewriter.NET 2.0.0
- X-AspNet-Version: 2.0.50727
- X-Powered-By: ASP.NET
- Date: Mon, 27 Dec 2010 03:32:32 GMT
- Content-Length: 62133
- Connection: keep-alive
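The capture shows the server issuing `ASP.NET_SessionId` in the first response and the browser sending it back in the second request. A minimal sketch of pulling that cookie out of raw response headers, in the `name=value` form that `CURLOPT_COOKIE` expects, might look like the following. The helper name `extract_session_cookie` is hypothetical, and the sample header text is copied from the capture above.

```php
<?php
// Hypothetical helper (not part of the original code): returns
// "ASP.NET_SessionId=<value>" from raw response headers, or null when
// no matching Set-Cookie line is present.
function extract_session_cookie(string $rawHeaders): ?string
{
    // Stop at ';' so attributes such as "path=/" and "HttpOnly" are dropped.
    if (preg_match('/^Set-Cookie:\s*(ASP\.NET_SessionId=[^;\r\n]+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Sample input taken from the FIRST RESPONSE above.
$headers = "HTTP/1.0 200 OK\r\n"
         . "Set-Cookie: ASP.NET_SessionId=5050pcjv24rxsejce1dtbjas; path=/; HttpOnly\r\n"
         . "X-AspNet-Version: 2.0.50727\r\n";

$sessionId = extract_session_cookie($headers);
echo $sessionId, "\n";
```

To get the raw headers from curl in the first place, the GET request would need `CURLOPT_HEADER` enabled, or a cookie jar could be used instead.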
Part of the code is as follows:
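    // $viewstate and $eventvalidation must come from the hidden inputs of the
    // page returned by the first GET, fetched in the same session.
    $param['__EVENTTARGET']     = "_ctl11";   // was misspelled "__EVENTARGET"; ASP.NET reads __EVENTTARGET
    $param['__EVENTARGUMENT']   = "";
    $param['__EVENTVALIDATION'] = $eventvalidation;
    $param['__VIEWSTATE']       = $viewstate;
    $param['wd']   = "";
    $param['p1']   = "";
    $param['p2']   = "";
    $param['yema'] = "2";   // presumably the requested page number

    function get_html_by_url_proxy($url, $options)
    {
        // Must be in name=value form, e.g. "ASP.NET_SessionId=<value>", for CURLOPT_COOKIE.
        global $sessionId;
        $ch = curl_init();
        // curl_init() returns false on failure; is_resource() is always false for PHP 8 handles.
        if ($ch === false)
        {
            if (DEBUG == true) log_write("error url: " . $url);
            return false;
        }
        $proxy = 'proxy.jgb:8081';
        // The body comes from http_build_query(), so it is URL-encoded, not multipart/form-data.
        $headary = array('Content-Type: application/x-www-form-urlencoded');
        echo "Using proxy $proxy" . "\n";
        var_dump($options);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        //curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headary);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_REFERER, $url);   // the browser's POST sent this Referer
        curl_setopt($ch, CURLOPT_COOKIE, $sessionId);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($options));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        curl_close($ch);
        return $output;
    }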
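For reference, a rough sketch of how `$viewstate` and `$eventvalidation` could be read from the first page and turned into a POST body. It assumes the hidden inputs are rendered with `name` before `value` and with double-quoted attributes, which is the usual ASP.NET output but has not been checked against this site. The helper name `extract_hidden_field` and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns the value of a
// hidden <input> by name, or null if it is not found. Assumes the name
// attribute precedes value and both are double-quoted.
function extract_hidden_field(string $html, string $name): ?string
{
    $pattern = '/<input\b[^>]*\bname="' . preg_quote($name, '/') . '"[^>]*\bvalue="([^"]*)"/i';
    if (preg_match($pattern, $html, $m)) {
        return html_entity_decode($m[1], ENT_QUOTES);
    }
    return null;
}

// Made-up sample markup standing in for the HTML of the first GET response.
$html = '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK+abc=" />'
      . '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgK5==" />';

$viewstate       = extract_hidden_field($html, '__VIEWSTATE');
$eventvalidation = extract_hidden_field($html, '__EVENTVALIDATION');

// http_build_query() percent-encodes '+', '/' and '=' in the Base64 values.
$body = http_build_query(array(
    '__EVENTTARGET'     => '_ctl11',
    '__EVENTARGUMENT'   => '',
    '__VIEWSTATE'       => $viewstate,
    '__EVENTVALIDATION' => $eventvalidation,
    'yema'              => '2',
));
echo $body, "\n";
```

The encoding step matters because a raw `+` in a URL-encoded body is decoded as a space on the server, which would corrupt the view state.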