- open FH,"<","log.txt" or die "Cannot open the file: log.txt\n";
- # Use the matched part of lines like PAGE:http://ab.com/q?pl=xxx as keys of the hash %url
- while (<FH>) {
- if (m!(PAGE:http://[a-zA-Z]+\.com/q\?pl=\d{3})!) {
- if (exists $url{$1}) {
- $url{$1}->{'time'} = 2;
- }else{
- $url{$1}->{'time'} = 0;
- }
- }
- }
- close FH;
- # Remove the keys that are not duplicated; give each remaining key an empty list
- for my $key (keys %url) {
- if ($url{$key}->{'time'}) {
- $url{$key}->{'list'} = [];
- }else{
- delete $url{$key};
- }
- }
- open FH,"<","log.txt" or die "Cannot open the file: log.txt\n";
- $/ = '}'; # read '}'-terminated records instead of lines
- # Add the duplicated records to the hash
- while (<FH>) {
- if (m!(PAGE:http://[a-zA-Z]+\.com/q\?pl=\d{3})!) {
- if (exists $url{$1}) {
- push @{$url{$1}->{'list'}},$_;
- }
- }
- }
- close FH;
- # Now all the duplicated records are in the arrays @{$url{$key}->{'list'}}; process them however you like
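The advantage of my method is that it scans the file only twice (that is the best I can manage); the drawback is that the hash %url is very large at first, because it holds every matched key until the non-duplicated ones are deleted.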
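One possible way to soften that drawback, as a rough sketch only: count occurrences with plain integers in the first pass instead of creating a hash ref per key, then keep only the keys seen more than once. This should reduce, though not eliminate, the peak size of the hash. The function name `collect_duplicates` is hypothetical, and the sketch assumes the same `PAGE:` pattern and `}`-terminated records as the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref that maps each duplicated PAGE key
# to the list of '}'-terminated records containing it.
sub collect_duplicates {
    my ($path) = @_;
    my $re = qr!(PAGE:http://[a-zA-Z]+\.com/q\?pl=\d{3})!;

    # Pass 1: count occurrences line by line, storing only an integer per key.
    my %count;
    open my $fh, '<', $path or die "Cannot open the file: $path\n";
    while (my $line = <$fh>) {
        $count{$1}++ if $line =~ $re;
    }
    close $fh;

    # Keep only the keys seen more than once, each with an empty list.
    my %dup = map { $_ => [] } grep { $count{$_} > 1 } keys %count;
    %count = ();    # the counts are no longer needed

    # Pass 2: read '}'-terminated records and collect those with a duplicated key.
    open $fh, '<', $path or die "Cannot open the file: $path\n";
    {
        local $/ = '}';    # restored automatically when this block exits
        while (my $record = <$fh>) {
            push @{ $dup{$1} }, $record
                if $record =~ $re && exists $dup{$1};
        }
    }
    close $fh;

    return \%dup;
}
```

Calling `collect_duplicates('log.txt')` would return the same kind of structure as `%url` above, minus the `time` field. Using `local $/` keeps the changed record separator from leaking into later reads.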
[ Last edited by wxlfh on 2009-2-5 13:08 ]