Simulating a website's pagination with a PHP script

Posted 2010-12-27 16:32
Last edited by Galenlong on 2010-12-27 16:35

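I need to scrape this site, http://www.cquae.com/search.shtml?ob=ProjectCenter&lx=2&jg=1, but I can only ever get the first page; I cannot simulate the paging step. I have analyzed how the site passes its paging parameters and tried Snoopy, curl and other approaches, but it still does not work. I would appreciate it if anyone could help look into this. Thanks!
Here are the captured headers: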
  FIRST REQUEST:

  GET http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2 HTTP/1.1
  Host: www.cquae.com
  User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  Accept-Language: zh-cn,zh;q=0.5
  Accept-Encoding: gzip,deflate
  Accept-Charset: GB2312,utf-8;q=0.7,*;q=0.7
  Keep-Alive: 115
  Proxy-Connection: keep-alive


  FIRST RESPONSE:

  HTTP/1.0 200 OK
  Cache-Control: private
  Content-Type: text/html; charset=utf-8
  Server: Microsoft-IIS/7.5
  X-Powered-By: UrlRewriter.NET 2.0.0
  Set-Cookie: ASP.NET_SessionId=5050pcjv24rxsejce1dtbjas; path=/; HttpOnly
  X-AspNet-Version: 2.0.50727
  X-Powered-By: ASP.NET
  Date: Mon, 27 Dec 2010 03:31:38 GMT
  Content-Length: 62646
  Connection: keep-alive


  SECOND REQUEST:

  POST http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2 HTTP/1.1
  Host: www.cquae.com
  User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  Accept-Language: zh-cn,zh;q=0.5
  Accept-Encoding: gzip,deflate
  Accept-Charset: GB2312,utf-8;q=0.7,*;q=0.7
  Keep-Alive: 115
  Proxy-Connection: keep-alive
  Referer: http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2
  Cookie: ASP.NET_SessionId=5050pcjv24rxsejce1dtbjas
  Content-Type: application/x-www-form-urlencoded
  Content-Length: 33393


  SECOND RESPONSE:

  HTTP/1.0 200 OK
  Cache-Control: private
  Content-Type: text/html; charset=utf-8
  Server: Microsoft-IIS/7.5
  X-Powered-By: UrlRewriter.NET 2.0.0
  X-AspNet-Version: 2.0.50727
  X-Powered-By: ASP.NET
  Date: Mon, 27 Dec 2010 03:32:32 GMT
  Content-Length: 62133
  Connection: keep-alive
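The second request in the capture depends on two pieces of state from the first response: the ASP.NET_SessionId cookie from the Set-Cookie header, and the hidden form fields (such as __VIEWSTATE and __EVENTVALIDATION) embedded in the returned HTML. Below is a minimal sketch of how both could be pulled out of a response. The function names extract_cookie and extract_hidden_fields are hypothetical and not from the original script, and the sketch assumes the hidden fields are ordinary <input type="hidden"> elements, which the post does not show.

```php
<?php
/**
 * Return "name=value" for the given cookie from raw response headers
 * (for example, the header text curl returns when CURLOPT_HEADER is enabled),
 * or null if the cookie was not set. Hypothetical helper.
 */
function extract_cookie($rawHeaders, $name)
{
    $pattern = '/^Set-Cookie:\s*' . preg_quote($name, '/') . '=([^;\r\n]*)/mi';
    if (preg_match($pattern, $rawHeaders, $m)) {
        return $name . '=' . $m[1];
    }
    return null;
}

/**
 * Collect name => value for every <input type="hidden"> in an HTML page.
 * Hypothetical helper; assumes the page exposes its postback state this way.
 */
function extract_hidden_fields($html)
{
    $fields = array();
    if ($html === '') {
        return $fields;
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}
```

Because these values usually change from one response to the next, it is safer to re-extract them from the most recently fetched page before each paging POST rather than reuse values captured once in a browser.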
Part of the code is as follows:

        // Form fields for the ASP.NET postback that should request page 2.
        // $eventvalidation and $viewstate are set elsewhere in the script (not shown).
        $param['__EVENTARGET']="_ctl11";
        $param['__EVENTARGUMENT']="";
        $param['__EVENTVALIDATION']=$eventvalidation;
        $param['__VIEWSTATE']= $viewstate;
        $param['wd']="";
        $param['p1']="";
        $param['p2']="";
        $param['yema']="2";   // yema = page number

// POST $options to $url through the proxy and return the response body.
function get_html_by_url_proxy($url,$options)
{
        global $sessionId;
        $ch = curl_init();
        // curl_init() returns false on failure (a resource or object otherwise).
        if ($ch === false)
        {
                if(DEBUG ==true) log_write("error url: ".$url);
                return false;
        }
        $proxy  = 'proxy.jgb:8081';

        $headary= array('Content-Type: multipart/form-data');
        echo "Using proxy $proxy"."\n";
        var_dump($options);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        //curl_setopt($ch,CURLOPT_HEADER,0);
        curl_setopt($ch,CURLOPT_HTTPHEADER,$headary);
        curl_setopt($ch,CURLOPT_URL,$url);
        curl_setopt($ch,CURLOPT_COOKIE,$sessionId);
        curl_setopt($ch,CURLOPT_POSTFIELDS,http_build_query($options));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
        $output = curl_exec($ch) ;
        curl_close($ch);
        return $output;
}
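Comparing the code with the captured headers shows a few differences. The browser's POST went to search_list.aspx with Content-Type: application/x-www-form-urlencoded plus a Referer and a User-Agent, whereas the function forces Content-Type: multipart/form-data while still sending a URL-encoded string from http_build_query, and sets neither Referer nor User-Agent. Whether any of these is what actually breaks paging is not confirmed by the post, so the following is only a sketch of a request shaped like the browser's. The names build_postback_options and post_page are hypothetical.

```php
<?php
/**
 * Build curl options for a POST that mirrors the captured browser request.
 * Hypothetical helper; $sessionCookie is expected as "name=value".
 */
function build_postback_options($url, array $fields, $sessionCookie, $proxy = null)
{
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        // With a string body and no Content-Type override, curl sends
        // application/x-www-form-urlencoded, as the browser did.
        CURLOPT_POSTFIELDS     => http_build_query($fields, '', '&'),
        CURLOPT_COOKIE         => $sessionCookie,
        // The browser's Referer was the same URL it posted to.
        CURLOPT_REFERER        => $url,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        // Empty string: accept any encoding curl supports and decode it.
        CURLOPT_ENCODING       => '',
        CURLOPT_RETURNTRANSFER => true,
    );
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;
    }
    return $options;
}

/**
 * Send the POST and return the response body, or false on failure.
 */
function post_page($url, array $fields, $sessionCookie, $proxy = null)
{
    $ch = curl_init();
    if ($ch === false) {
        return false;
    }
    curl_setopt_array($ch, build_postback_options($url, $fields, $sessionCookie, $proxy));
    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}
```

A call would look like post_page('http://www.cquae.com/search_list.aspx?ob=ProjectCenter&lx=2', $param, $sessionId, 'proxy.jgb:8081'), with $param and $sessionId as in the code above. Keeping the option building separate from the network call makes it easy to dump and compare the outgoing request against the browser capture.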

Reply #2, posted 2010-12-28 09:26
(emoticon only, no text)