- 论坛徽章:
- 3
|
#!C:\Perl\bin\perl.exe
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

# Breadth-first crawl starting from $url, collecting every link whose URL
# contains $base_url, skipping anything that matches an exclusion pattern.
# Results are written to url.txt, one URL per line.
my $url      = "http://www.chinaunix.net/";
my $base_url = "chinaunix";
my @exclude = ("thread","uid-","uid/","forum-","Start_","forummodule-","fid=","search.php","=","download/","css","js","news","shtml","peixun","page-"); # URL exclusion patterns
my @storeurl;       # every unique URL collected so far (output order)
my %seen;           # O(1) duplicate lookup, lowercased to keep the original /i semantics
my @waitfindurl;    # BFS queue of pages still to be fetched
my $temp = 0;       # count of pages processed (progress display only)

push @waitfindurl, $url;
push @storeurl,    $url;
$seen{ lc $url } = 1;

# BUG FIX: the original iterated @waitfindurl with foreach while push-ing
# and shift-ing it inside the loop body; modifying an array being iterated
# by foreach is undefined behaviour in Perl. A shift-driven while loop is
# the safe work-queue idiom and expresses the same traversal.
while (my $wfu = shift @waitfindurl) {
    #last if (@storeurl > 150);
    print scalar(@storeurl) . " " . scalar(@waitfindurl) . " $temp $wfu\n";

    my @findurl = findpageurl($wfu, $base_url, \@exclude);
    # findpageurl returns an HTTP status code on failure — ignore those
    next unless @findurl && $findurl[0] =~ /http/i;

    for my $fu (@findurl) {
        next unless $fu =~ /http/i;     # drop non-http links
        # BUG FIX: the original tested duplicates with /^$_$/i, which
        # interpolates stored URLs (containing ?, ., +) unquoted into a
        # regex — that can die ("Quantifier follows nothing") and is O(n)
        # per link. A hash lookup is exact and constant-time.
        next if $seen{ lc $fu }++;
        push @storeurl,    $fu;
        push @waitfindurl, $fu;
        print scalar(@storeurl) . " storeurl _ newurl \n";
    }
    $temp++;
}

# 3-arg open with a lexical handle, and check both open and close
# (the original used an unchecked 2-arg open on a bareword handle).
open my $out, '>', 'url.txt' or die "cannot open url.txt: $!";
print {$out} "$_\n" for @storeurl;
close $out or die "cannot close url.txt: $!";
# findpageurl($url, $base_url, \@exclude)
#
# Fetch $url and return the list of href targets that contain $base_url
# and match none of the exclusion patterns. Trailing slashes are stripped.
# On a non-200 response, returns the numeric HTTP status code instead —
# callers detect this by checking the first element against /http/.
sub findpageurl {
    my ($url, $base_url, $exclude_ref) = @_;
    my @exclude = @$exclude_ref;

    # Give the UA a timeout so one dead host cannot hang the whole crawl.
    my $ua   = LWP::UserAgent->new(timeout => 30);
    my $resp = $ua->request(HTTP::Request->new(GET => $url));

    my $return_code = $resp->code;
    print "requset return code:$return_code\n";
    return $return_code unless $return_code == 200;

    my @hrefurl;
    for my $line (split /\n/, $resp->content) {
        # BUG FIX: the original used  if (s/href="(.+?)"//g)  and then read
        # $1, which yields only the LAST href on each line. Iterating with
        # a while-/g match visits every href occurrence.
        while ($line =~ /href="(.+?)"/g) {
            my $tempurl = $1;
            # grep (not map) for a boolean membership test; \Q...\E quotes
            # pattern metacharacters so e.g. the '.' in "search.php" is
            # matched literally rather than as "any character".
            next if grep { $tempurl =~ /\Q$_\E/i } @exclude;
            $tempurl =~ s{/\z}{};    # strip a single trailing slash
            push @hrefurl, $tempurl if $tempurl =~ /\Q$base_url\E/i;
        }
    }
    print "requset return url\n";
    return @hrefurl;
}
复制代码 我拿chinaunix做了测试,@exclude数组会排除站点中的各种分页。下一步将进一步细化找到的url的筛选规则,从而逐步去掉@exclude这个数组
@waitfindurl中所有元素都检查过后脚本就结束。细细想下,这个办法应该还是有问题的 |
|