Extracting the lines whose key is duplicated

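newfinder posted on 2018-04-05 06:10

Hi everyone. I'd like to treat the first column of a file as the key and extract every line whose key occurs more than once, as shown below.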

data1.txt
---------
a 1 2
b 1 3
a 1 4
a 1 2
b 2 3
c 2 4
c 2 2
d 2 1
e 3 2

The output I want is:
data2.txt
---------
a 1 2
b 1 3
a 1 4
a 1 2
b 2 3
c 2 4
c 2 2

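I wrote a script myself (really just pieces of different scripts stitched together). It produces the result, but it feels far too long. I'm posting it here for you to look at; any suggestions are welcome, and I'd also appreciate a shorter script that I could study.
Also, how should it be written for the two cases, output sorted by key and output left unsorted?

# Unsorted result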
a 1 2
b 1 3
a 1 4
a 1 2
b 2 3
c 2 4
c 2 2

# Sorted result
a 1 2
a 1 2
a 1 4
b 1 3
b 2 3
c 2 2
c 2 4

#!/usr/bin/perl -w

open IN1, "data1.txt";

while (<IN1>) {
chomp;
@array1=split;
push @array2,$array1[0];   # keep only the first column (the key)
}

close IN1;

my $key;
my $value;
my %hash;
foreach (@array2){
++$hash{$_};
}

while(($key,$value)=each %hash){
if ($value>=2) {
push @array3,$key;
}
}

open IN2,"data1.txt";
open OUT,">data2.txt";

my %name = map {/^(\S+)/,1 } @array3;
map { $name{ (split)[0] } and print OUT $_ } <IN2>;   # print lines whose first column is a duplicated key

close IN2;
close OUT;

Thanks, everyone.

laputa73 posted on 2018-04-05 10:55

One idea: import the data into SQLite (or MySQL),
then use COUNT, SUM, HAVING and so on.
Once it is in SQL you can slice it however you like.

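xiaomm250 posted on 2018-04-05 12:22

newfinder posted on 2018-04-05 06:10
Hi everyone. I'd like to treat the first column of a file as the key and extract every line whose key occurs more than once, as shown below.

data1.txt

Why use Perl when you have Excel?
Why write a program when a low-tech method does the job?
In Excel 2010, the Home tab has Conditional Formatting,
and under Conditional Formatting there is a Duplicate Values rule. Give the duplicate values a color,
then filter on the duplicate values, and you have what you asked for.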

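xiaomm250 posted on 2018-04-05 12:26

I'm a Perl newbie. I think the way to do it is to build a hash of counts in a loop first, then output the entries whose count is 2 or more.
That said, if I can use Excel I will never use Perl.
Excel is so much simpler.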
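
The hash-counting approach described above can be sketched in Perl roughly as follows. This is only a minimal illustration, not code posted by anyone in the thread: the sub name duplicate_key_lines is hypothetical, and it assumes columns are separated by whitespace and that the whole input fits in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose first column occurs
# at least twice, keeping the original line order.
sub duplicate_key_lines {
    my @lines = @_;
    my %count;

    # Pass 1: count how often each first-column value occurs.
    for my $line (@lines) {
        my ($key) = split ' ', $line;
        $count{$key}++ if defined $key;   # blank lines have no key
    }

    # Pass 2: keep only the lines whose key was counted two or more times.
    return grep {
        my ($key) = split ' ', $_;
        defined $key && $count{$key} >= 2;
    } @lines;
}

# Sample input copied from data1.txt in the first post.
my @sample = ("a 1 2\n", "b 1 3\n", "a 1 4\n", "a 1 2\n", "b 2 3\n",
              "c 2 4\n", "c 2 2\n", "d 2 1\n", "e 3 2\n");
print duplicate_key_lines(@sample);
```

For the sorted variant, the returned list could be passed through Perl's built-in sort before printing, which compares whole lines as strings.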

newfinder posted on 2018-04-05 15:36

Reply to #2 laputa73

That's awkward: I don't know MySQL~~

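newfinder posted on 2018-04-05 15:38

Reply to #4 xiaomm250

That's a good idea. Excel is fine for small data sets, but it is not so convenient when there is a lot of data. The data I listed is only an example; the real data is much larger.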

xiaomm250 posted on 2018-04-05 16:37

newfinder posted on 2018-04-05 15:38
Reply to #4 xiaomm250

That's a good idea. Excel is fine for small data sets, but it is not so convenient when there is a lot of data. The data I ...

use strict;
use warnings;
open(File1,'<data1.txt') or die "can not open file:$!\n";
my @data1=<File1>;
close(File1);
open(File2,'>data2.txt') or die "can not open file:$!\n";
my %firstcol=();# hash keyed on the first column
# First count how many times each first-column value occurs
foreach my $line1(@data1)
{
if($line1=~m/^(\S+)/)
{
   $firstcol{$1}++;
}else
{
   warn "skipping line without a first column: $line1";# report on STDERR so the output file stays clean
}
}
# Loop over the lines again and print those whose key count is greater than 1
foreach my $line1(@data1)
{
if($line1=~m/^(\S+)/ and $firstcol{$1}>1)
{
   print File2 $line1;
}
}
close(File2);
# Sort the text first, then print the lines whose first-column count is greater than one
open(File3,'>data3.txt') or die "can not open file:$!\n";
my @newdata1=sort(@data1);# data1 after sorting
foreach my $line1(@newdata1)
{
if($line1=~m/^(\S+)/ and $firstcol{$1}>1)
{
   print File3 $line1;
}
}
close(File3);
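This is the code I wrote. It produces the results you want and it has comments. It may not be very efficient: I tested it on 4 million lines and got the result in a dozen or so seconds, on 32-bit Windows 8.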

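xiaomm250 posted on 2018-04-05 16:41

xiaomm250 posted on 2018-04-05 16:37
This is the code I wrote. It produces the results you want and it has comments. It may not be very efficient: I tested it on 4 million lines and got the result in a dozen or so seconds, on my ...

A small improvement to the code: read the data2 text back in, sort it directly, and write it out.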

xiaomm250 posted on 2018-04-05 16:46

use strict;
use warnings;
open(File1,'<data1.txt') or die "can not open file:$!\n";
my @data1=<File1>;
close(File1);
open(File2,'>data2.txt') or die "can not open file:$!\n";
my %firstcol=();# hash keyed on the first column
# First count how many times each first-column value occurs
foreach my $line1(@data1)
{
if($line1=~m/^(\S+)/)
{
   $firstcol{$1}++;
}else
{
   warn "skipping line without a first column: $line1";# report on STDERR so the output file stays clean
}
}
# Loop over the lines again and print those whose key count is greater than 1
foreach my $line1(@data1)
{
if($line1=~m/^(\S+)/ and $firstcol{$1}>1)
{
   print File2 $line1;
}
}
close(File2);
# Read the data2 text back and sort it; this is faster because only the filtered lines get sorted
open(File2,'<data2.txt') or die "can not open file:$!\n";
my @data2=<File2>;
close(File2);
open(File3,'>data3.txt') or die "can not open file:$!\n";
my @newdata2=sort(@data2);# data2 after sorting
foreach my $line1(@newdata2)
{
print File3 $line1;
}
close(File3);

That is the improved code.

I'm a Perl newbie, and I can't follow your code yet.
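
One caveat about the sorted output: sort(@data2) compares whole lines as strings. That matches the sample, but with multi-digit values a line such as "a 1 10" would sort before "a 1 2". If numeric order within each key matters, something like the sketch below might help. It is an illustration only: the sub names sort_by_key_then_numbers and compare_numeric_tail are hypothetical, every column after the first is assumed to be a number, and the output is re-joined with single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the columns after the key numerically.
sub compare_numeric_tail {
    my ($x, $y) = @_;
    my $last = ($#$x < $#$y) ? $#$x : $#$y;   # last index both rows share
    for my $i (1 .. $last) {
        my $cmp = $x->[$i] <=> $y->[$i];
        return $cmp if $cmp;
    }
    return @$x <=> @$y;   # all shared columns tie: shorter row first
}

# Hypothetical helper: order lines by first column (string compare),
# then by the remaining columns (numeric compare).
sub sort_by_key_then_numbers {
    # Split each non-blank line into its columns.
    my @rows = map { [ split ' ', $_ ] } grep { /\S/ } @_;
    my @sorted = sort {
        $a->[0] cmp $b->[0]
            or compare_numeric_tail($a, $b)
    } @rows;
    # Re-join the columns with single spaces.
    return map { join(' ', @$_) . "\n" } @sorted;
}

print sort_by_key_then_numbers("a 1 10\n", "b 2 3\n", "a 1 2\n");
```

In the improved script above, this could stand in for the plain sort(@data2) call; the rest of the script would stay the same.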

newfinder posted on 2018-04-05 17:35

Reply to #9 xiaomm250

I'm a newbie too, so we can learn from each other.


Chinaunix's Archiver
