Reply to #12 grshrd49
I wrote a small crawler example with Mojo, borrowing the way 云舒 (Yunshu) uses the Bloom::Filter module.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::UserAgent;
    use Mojo::IOLoop;
    use Bloom::Filter;

    # Bloom filter that remembers which URLs have already been queued.
    my $filter = Bloom::Filter->new(capacity => 100000, error_rate => 0.0001);
    my $ua     = Mojo::UserAgent->new;

    my $callback;
    $callback = sub {
        my ($ua, $tx) = @_;
        return if $tx->error;    # skip requests that failed or returned an error status
        $tx->res->dom->find('a[href]')->each(sub {
            my $newUrl = shift->attr('href');      # link target of this <a> element
            return if $newUrl !~ /php-oa\.com/;    # leave the callback for off-site links
            if (!$filter->check($newUrl)) {
                print $filter->key_count(), " ", $newUrl, "\n";
                $filter->add($newUrl);
                $ua->get($newUrl => $callback);    # non-blocking fetch, same callback
            }
        });
    };

    # Start from the URL given on the command line and run the event loop.
    $ua->get($ARGV[0] => $callback);
    Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
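The check-then-add pattern the crawler relies on can be sketched with core Perl only. This is a toy illustration, not how Bloom::Filter is actually implemented: the helper names (`bloom_new`, `bloom_add`, `bloom_check`), the bit-array size, and the use of salted MD5 as the hash family are all assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical toy Bloom filter: $bits bits and $hashes hash functions.
# Not the Bloom::Filter implementation; it only illustrates check/add.
sub bloom_new {
    my ($bits, $hashes) = @_;
    return { bits => $bits, hashes => $hashes, vec => "\0" x int(($bits + 7) / 8) };
}

# Derive one bit position per hash function by salting MD5 with an index.
sub bloom_positions {
    my ($bf, $key) = @_;
    return map { unpack('N', md5("$_:$key")) % $bf->{bits} } 1 .. $bf->{hashes};
}

# Set every bit position that belongs to the key.
sub bloom_add {
    my ($bf, $key) = @_;
    vec($bf->{vec}, $_, 1) = 1 for bloom_positions($bf, $key);
}

# True means "possibly added"; false means "definitely never added".
sub bloom_check {
    my ($bf, $key) = @_;
    for my $pos (bloom_positions($bf, $key)) {
        return 0 unless vec($bf->{vec}, $pos, 1);
    }
    return 1;
}

my $bf = bloom_new(8192, 4);
for my $url (qw(http://www.php-oa.com/a http://www.php-oa.com/b http://www.php-oa.com/a)) {
    if (bloom_check($bf, $url)) {
        print "skip  $url\n";
    }
    else {
        print "fetch $url\n";
        bloom_add($bf, $url);
    }
}
```

The third URL repeats the first, so it is reported as skipped. A Bloom filter can also report a URL as seen when it was never added (a false positive), which for a crawler means an occasional page is missed; the `error_rate` argument in the script above bounds how often that happens.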
To use it, save the script to a file, change the php-oa.com domain in the script to the site you want to crawl, and run perl ./t11.pl http://www.php-oa.com. That is all it takes.
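Instead of editing the domain by hand, the pattern could be derived from the start URL. The following is only a sketch under that assumption; `host_pattern` is a hypothetical helper that is not part of the original script, and it only accepts absolute http/https links.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a host-matching pattern from the start URL.
sub host_pattern {
    my ($start_url) = @_;
    my ($host) = $start_url =~ m{^https?://([^/:?#]+)}i
        or die "cannot find a host in '$start_url'\n";
    $host =~ s/^www\.//i;            # treat www.example.com and example.com alike
    my $quoted = quotemeta $host;    # escape dots so "." matches only a literal dot
    # Match the host itself or any subdomain, followed by the end of the host part.
    return qr{^https?://(?:[^/:?#]+\.)?${quoted}(?=[/:?#]|\z)}i;
}

my $pattern = host_pattern('http://www.php-oa.com');
for my $url (
    'http://www.php-oa.com/page',
    'http://php-oa.com',
    'http://php-oaXcom/',
    'http://evil.example/?u=php-oa.com',
) {
    printf "%-40s %s\n", $url, ($url =~ $pattern ? 'follow' : 'ignore');
}
```

The first two URLs are followed and the last two are ignored. In the crawler, the result of `host_pattern($ARGV[0])` would take the place of the hard-coded domain test.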