#1 | Posted 2013-12-17 18:20

Posting this to ask for advice.

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use Thread::Semaphore;
use URI;
use Web::Scraper;
use Digest::MD5 qw(md5_hex);
use DBI;

my $max_threads = 15;
my $base_url    = $ARGV[0] || 'http://www.icylife.net';
my $host        = URI->new($base_url)->host;
my $queue       = Thread::Queue->new();
my $semaphore   = Thread::Semaphore->new($max_threads);
my $mutex       = Thread::Semaphore->new(1);
my %seen :shared;    # URLs already queued, so no page is crawled twice

# Use "." to concatenate; "+" would add the two values as numbers.
my $digest  = md5_hex( $base_url . rand(100000) );
my $db_file = "./db/$digest.db";    # the ./db directory must already exist
print "Database file: $db_file\n";

# Create the table and close this handle before any thread starts:
# a DBI handle must not be shared between threads.
my $dbh = ConnectDb();
$dbh->do("create table url_data(id integer primary key, url TEXT not null unique)");
$dbh->do( "insert into url_data(url) values(?)", undef, $base_url );
$dbh->commit();
$dbh->disconnect();
undef $dbh;

$seen{$base_url} = 1;
$queue->enqueue($base_url);

while (1) {
    $_->join() for threads->list(threads::joinable);

    if ( $queue->pending() == 0 ) {
        if ( threads->list(threads::running) == 0 ) {
            print "All done!\n";
            last;
        }
        sleep 1;
        next;
    }
    $semaphore->down;
    threads->create( \&ProcessUrl );
}
$_->join() for threads->list();

sub ConnectDb {
    return DBI->connect( "dbi:SQLite:dbname=$db_file", "", "",
        { RaiseError => 1, AutoCommit => 0 } );
}

sub ProcessUrl {
    my $scraper = scraper {
        process '//a', 'links[]' => '@href';
    };
    my $dbh = ConnectDb();    # one connection per thread
    my $sth = $dbh->prepare("insert into url_data(url) values(?)");

    while ( defined( my $url = $queue->dequeue_nb() ) ) {
        my $res = eval { $scraper->scrape( URI->new($url) )->{'links'} };
        if ($@) {
            warn "$@\n";
            next;
        }
        next if !defined $res;

        foreach my $href ( @{$res} ) {
            my $uri = URI->new_abs( "$href", $url );

            # Skip anything that is not http or https.
            my $scheme = $uri->scheme || '';
            next if $scheme ne 'http' && $scheme ne 'https';

            # Skip other domains.
            next if ( $uri->host || '' ) ne $host;

            # Drop the "#fragment" so the same page is not stored twice.
            $uri->fragment(undef);
            my $link = $uri->as_string;

            # Skip images, media and archives; the dot must be escaped.
            next if $link =~ /\.(jpg|png|bmp|mp3|wma|wmv|gz|zip|rar|iso|pdf)$/i;

            # Check-and-insert under the mutex so each URL is stored once.
            $mutex->down();
            eval {
                if ( !$seen{$link} ) {
                    $seen{$link} = 1;
                    print $link, "\n";
                    $sth->execute($link);
                    $dbh->commit();
                    $queue->enqueue($link);
                }
                1;
            } or warn "Failed to store $link: $@\n";
            $mutex->up();
        }
    }
    $dbh->disconnect();
    $semaphore->up();
}


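py (moderator)

#2 | Posted 2013-12-18 10:21

This crawler code has almost nothing going for it: the wrong technical choices and poorly written code...
Didn't 扶凯 already write a crawler for you? You can build on that code.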


墨迹哥

#3 | Posted 2013-12-19 09:15

In reply to #2 (py):

All right, then...


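墨迹哥

#4 | Posted 2013-12-20 09:17

In reply to #2 (py):

I still haven't figured out how a crawler actually works. When I tried writing one in Python, I found that I was confused about the basic concept.

Bear with me:

I have been stuck on this problem since last month, and I still don't have a reliable answer.

I want to crawl every link on a site, with no duplicates, and save the links to a database.

But I can't get my head around how a crawler should be designed. Most of the tutorials I have read only teach how to extract the links from a single page, which is easy to do.

The real question is how to keep crawling recursively from those links. That is the part I can't work out...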
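
One common way to answer the question above is to avoid literal recursion and use a work queue plus a "seen" set instead: take a URL off the queue, record it, collect its links, and queue only the links that have never been queued before. The sketch below is a minimal single-threaded illustration of that idea, not the code from this thread. The `%site` table, the `example.com` URLs, and the `fetch_links` and `crawl` helpers are all hypothetical stand-ins; in a real crawler `fetch_links` would download and parse the page, and the `push @found` line is where a row would be written to the database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical site: each URL maps to the links found on that page.
# A real crawler would replace this table with an HTTP fetch and a parser.
my %site = (
    'http://example.com/'  => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/',  'http://example.com/b' ],
    'http://example.com/b' => [ 'http://example.com/c', 'http://example.com/a' ],
    'http://example.com/c' => [],
);

# Stand-in for "download the page and extract its links".
sub fetch_links {
    my ($url) = @_;
    return @{ $site{$url} || [] };
}

# Breadth-first crawl: returns every reachable URL exactly once.
sub crawl {
    my ($start) = @_;
    my %seen  = ( $start => 1 );    # every URL that has ever been queued
    my @queue = ($start);           # URLs still waiting to be fetched
    my @found;                      # URLs in the order they were visited

    while (@queue) {
        my $url = shift @queue;
        push @found, $url;          # a real crawler would store the URL here
        for my $link ( fetch_links($url) ) {
            next if $seen{$link}++; # already queued or visited: skip it
            push @queue, $link;
        }
    }
    return @found;
}

print "$_\n" for crawl('http://example.com/');
```

Even though the pages link back to each other, the loop ends because a URL enters the queue only the first time it is seen, so the four URLs are each printed once. The crawl stops when the queue is empty, which is the same condition the threaded script checks with `$queue->pending()`.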