Post #1, posted 2014-08-18 16:23

Last edited by huang6894 on 2014-08-18 16:26

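One more question. I have N text files like the ones below. For rows that share the same columns 2, 4 and 7, I need to average column 5 and then analyze the result. How should I optimize my code?

All N files have the same row order and format, each is about 100 MB, and every line within a file is unique. Columns 2 and 4 give the position (chromosome and coordinate), column 3 is the region, and column 4 is each point inside that region; the regions are listed in the file ref. Column 7 is the strand (+/-), and column 5 is the corresponding quality value.
Column 6 is a QC value, and the weight is 200 divided by it. Before averaging, column 5 has to be multiplied by this weight to get its true quality value. The weight is constant within each file.
The goal is to compute the mean at each position across all files, and then output every line whose value in a given file is below 0.5 times that mean.
Input files: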

==> head -6 result_for_depth.13B0005773.1408345911.txt result_for_depth.13B0005805.1408342063.txt result_for_depth.13B0013031.1408342008.txt result_for_depth.13B0013194.1408342004.txt
==> result_for_depth.13B0005773.1408345911.txt <==
13B0005773 chr1 2336191-2337323 2336191 55 123 +
13B0005773 chr1 2336191-2337323 2336191 16 123 -
13B0005773 chr1 2336191-2337323 2336192 52 123 +
13B0005773 chr1 2336191-2337323 2336192 16 123 -
13B0005773 chr1 2336191-2337323 2336193 52 123 +
13B0005773 chr1 2336191-2337323 2336193 16 123 -
==> result_for_depth.13B0005805.1408342063.txt <==
13B0005805 chr1 2336191-2337323 2336191 54 119 +
13B0005805 chr1 2336191-2337323 2336191 11 119 -
13B0005805 chr1 2336191-2337323 2336192 50 119 +
13B0005805 chr1 2336191-2337323 2336192 11 119 -
13B0005805 chr1 2336191-2337323 2336193 50 119 +
13B0005805 chr1 2336191-2337323 2336193 11 119 -
==> result_for_depth.13B0013031.1408342008.txt <==
13B0013031 chr1 2336191-2337323 2336191 51 118 +
13B0013031 chr1 2336191-2337323 2336191 14 118 -
13B0013031 chr1 2336191-2337323 2336192 47 118 +
13B0013031 chr1 2336191-2337323 2336192 14 118 -
13B0013031 chr1 2336191-2337323 2336193 47 118 +
13B0013031 chr1 2336191-2337323 2336193 15 118 -
==> result_for_depth.13B0013194.1408342004.txt <==
13B0013194 chr1 2336191-2337323 2336191 49 117 +
13B0013194 chr1 2336191-2337323 2336191 13 117 -
13B0013194 chr1 2336191-2337323 2336192 43 117 +
13B0013194 chr1 2336191-2337323 2336192 13 117 -
13B0013194 chr1 2336191-2337323 2336193 43 117 +
13B0013194 chr1 2336191-2337323 2336193 13 117 -
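
As a worked example of the weighting described in the question (an editorial sketch, not part of the original post), take the '+' rows at position 2336191 from the four files shown above. The helper names `weighted_depth` and `mean` are made up for illustration.

```perl
use strict;
use warnings;

# Column 5 (depth) and column 6 (QC) of the '+' rows at chr1:2336191,
# copied from the four sample files shown above.
my %rows = (
    '13B0005773' => [55, 123],
    '13B0005805' => [54, 119],
    '13B0013031' => [51, 118],
    '13B0013194' => [49, 117],
);

# True quality value = column 5 * (200 / column 6).
sub weighted_depth {
    my ($depth, $qc) = @_;
    return $depth * 200 / $qc;
}

# Arithmetic mean; returns 0 for an empty list (illustrative choice).
sub mean {
    my @values = @_;
    return 0 unless @values;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / scalar(@values);
}

my %score;
$score{$_} = weighted_depth(@{ $rows{$_} }) for keys %rows;
my $mean = mean(values %score);

# Samples whose weighted value falls below half of the cross-file mean.
my @low = sort { $a cmp $b } grep { $score{$_} < 0.5 * $mean } keys %score;

printf "mean = %.2f\n", $mean;
printf "%s\t%.2f\n", $_, $score{$_} for sort { $a cmp $b } keys %score;
print "below half the mean: ", (@low ? "@low" : "none"), "\n";
```

With these four rows the weighted values sit close together, so no sample drops below half the mean.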

The ref file:

chr1 2336191 2337323

My code:

die "-- must set depth_dir and ref_bed --\n" if (@ARGV < 2);
my $dir = $ARGV[0];
my $bed = $ARGV[1];
my(%dep,%sam,%f,%p);
my @allfile=glob "$ARGV[0]/result_for_depth.*.txt";
foreach my $file(@allfile){
    open FILE,$file;
    while(<FILE>){
        chomp;
        my ($sample, $chr, $pos, $dep, $qc, $stand) = (split /\t/,$_)[0,1,3,4,5,6];
        my $depth = $dep * 200/$qc;
        my $yu = $pos.$stand;
        $dep{$chr}{$yu}+=$depth;
        $sam{$file}{$chr}{$yu}=$depth;
        $f{$file}{$chr}{$pos}="$sample\t$chr\t$pos";
    }
    close FILE;
}
foreach my $file(@allfile){
    open OUT,"> $file.cnv.result" || die "$!";
    open BED,"$bed" or die "$!";
    while(<BED>){
        chomp;
        my ($chr, $pos1, $pos2) = (split /\t/,$_)[0,1,2];
        for my $pos ($pos1 .. $pos2){
            my $y = $pos."+";
            my $t = $pos."-";
            my $mean1 = int(($dep{$chr}{$y}/$#allfile)+1);
            my $mean2 = int(($dep{$chr}{$t}/$#allfile)+1);
            if(($sam{$file}{$chr}{$y}<=0.5*$mean1)&&($sam{$file}{$chr}{$t}<=0.5*$mean2)){
                print OUT "$f{$file}{$chr}{$pos}\n";
            }
        }
    }
}

Best answer: chenhao392 (post #2)

chenhao392

Post #2, posted 2014-08-18 16:23

Last edited by chenhao392 on 2014-08-20 06:06

Reading everything into a hash? To save memory, just sort the data and then stream it with while(<>){}.

cat result_for_depth*txt | sort -k2,2 -k4,4n -k7,7 >working_file.bed

Then:

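#!/usr/bin/perl
use strict;
use warnings;

my $pattern = '';   # "chr\tpos" key of the group currently being collected
my %hash;           # strand => list of "sample\tchr\tpos\tscore" records
my @f_score;        # weighted scores on the '+' strand for the current group
my @r_score;        # weighted scores on the '-' strand for the current group

open FILE, "<$ARGV[0]" or die "$ARGV[0]: $!\n";
while (<FILE>) {
    chomp;
    my ($sample, $chr, undef, $pos, $dep, $qc, $strand) = split(/\s+/, $_);
    # A new chromosome/position begins: emit the finished group first.
    if ($pattern ne "$chr\t$pos") {
        $pattern = "$chr\t$pos";
        if ($. > 1) {
            &flush();
        }
    }
    my $score = $dep * 200 / $qc;   # column 5 scaled by the weight 200/QC
    if ($strand eq '+') {
        push @f_score, $score;
        push @{ $hash{'+'} }, "$sample\t$chr\t$pos\t$score";
    }
    else {
        push @r_score, $score;
        push @{ $hash{'-'} }, "$sample\t$chr\t$pos\t$score";
    }
}
close FILE;
&flush();   # emit the last group

# Print every sample whose score is at least half the group mean on both
# strands, then reset the per-group buffers.
sub flush {
    my $f_mean = mean(@f_score);
    my $r_mean = mean(@r_score);
    my %mem;
    foreach my $slide (@{ $hash{'+'} }) {
        my ($sample, $chr, $pos, $score) = split(/\t/, $slide);
        if ($score >= 0.5 * $f_mean) {
            $mem{"$sample\t$chr\t$pos"} = "";
        }
    }
    foreach my $slide (@{ $hash{'-'} }) {
        my ($sample, $chr, $pos, $score) = split(/\t/, $slide);
        if ($score >= 0.5 * $r_mean && defined $mem{"$sample\t$chr\t$pos"}) {
            print "$sample\t$chr\t$pos\n";
        }
    }
    %mem     = ();
    @f_score = ();
    @r_score = ();
    %hash    = ();
}

sub mean {
    return 0 unless @_;   # avoid dividing by zero when a strand has no rows
    my $sum = 0;
    foreach (@_) {
        $sum += $_;
    }
    return $sum / scalar(@_);
}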


perl tt.pl working_file.bed
13B0005805 chr1 2336191
13B0013194 chr1 2336191
13B0013031 chr1 2336191
13B0005773 chr1 2336191
13B0005805 chr1 2336192
13B0013194 chr1 2336192
13B0013031 chr1 2336192
13B0005773 chr1 2336192
13B0005805 chr1 2336193
13B0013194 chr1 2336193
13B0013031 chr1 2336193
13B0005773 chr1 2336193
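
The question asks for rows that fall below half the mean, whereas the script above prints the samples that are at or above it. A minimal sketch of the inverted check, keeping the same sort-then-stream idea, might look like the following. The function name `low_samples` and the fifth sample `SAMPLE_LOW` are made up for illustration; requiring a sample to be low on both strands is an assumption taken from the original poster's `&&` condition.

```perl
use strict;
use warnings;

# Return "sample\tchr\tpos" for every sample whose weighted depth is below
# half of the group mean on BOTH strands. Input lines must already be sorted
# so that rows with the same chromosome and position are adjacent.
sub low_samples {
    my @lines = @_;
    my (@out, %scores, $current);

    # Evaluate the group collected so far, then clear it.
    my $flush = sub {
        return unless defined $current;
        my %low;    # sample => number of strands on which it is low
        for my $strand (keys %scores) {
            my $group = $scores{$strand};    # sample => weighted score
            my $sum = 0;
            $sum += $_ for values %$group;
            my $mean = $sum / scalar(keys %$group);
            for my $sample (keys %$group) {
                $low{$sample}++ if $group->{$sample} < 0.5 * $mean;
            }
        }
        for my $sample (sort { $a cmp $b } grep { $low{$_} == 2 } keys %low) {
            push @out, $sample . "\t" . $current;
        }
        %scores = ();
    };

    for my $line (@lines) {
        my ($sample, $chr, undef, $pos, $dep, $qc, $strand) = split ' ', $line;
        my $key = "$chr\t$pos";
        if (!defined $current || $key ne $current) {
            $flush->();
            $current = $key;
        }
        $scores{$strand}{$sample} = $dep * 200 / $qc;
    }
    $flush->();    # last group
    return @out;
}

# Rows from the thread at two positions, plus one made-up low sample.
my @sorted = split /\n/, <<'END';
13B0005773 chr1 2336191-2337323 2336191 55 123 +
13B0005773 chr1 2336191-2337323 2336191 16 123 -
13B0005805 chr1 2336191-2337323 2336191 54 119 +
13B0005805 chr1 2336191-2337323 2336191 11 119 -
13B0013031 chr1 2336191-2337323 2336191 51 118 +
13B0013031 chr1 2336191-2337323 2336191 14 118 -
13B0013194 chr1 2336191-2337323 2336191 49 117 +
13B0013194 chr1 2336191-2337323 2336191 13 117 -
SAMPLE_LOW chr1 2336191-2337323 2336191 10 120 +
SAMPLE_LOW chr1 2336191-2337323 2336191 2 120 -
13B0005773 chr1 2336191-2337323 2336192 52 123 +
13B0005773 chr1 2336191-2337323 2336192 16 123 -
13B0005805 chr1 2336191-2337323 2336192 50 119 +
13B0005805 chr1 2336191-2337323 2336192 11 119 -
END

my @flagged = low_samples(@sorted);
print "$_\n" for @flagged;
```

Only one position's worth of rows is held in memory at a time, so the footprint depends on the number of files rather than on their size.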


pitonas

Post #3, posted 2014-08-18 18:32

A small optimization:
my $mean2 = int( ( $dep{$chr}{$t} / $#allfile ) + 1 );

$#allfile is the last index, i.e. count - 1
@allfile (in scalar context) is the count
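
To make the difference concrete, here is a small sketch with four placeholder file names and a made-up total; neither comes from the thread.

```perl
use strict;
use warnings;

# Four placeholder file names standing in for the globbed input list.
my @allfile = qw(a.txt b.txt c.txt d.txt);

my $last_index = $#allfile;          # index of the last element
my $count      = scalar(@allfile);   # number of elements

# Summed weighted depth for one position across the four files
# (a made-up number for illustration).
my $total = 400;

my $wrong_mean = $total / $last_index;   # divides by count - 1
my $right_mean = $total / $count;        # divides by the file count

print "last index = $last_index, count = $count\n";
print "mean using \$#allfile = $wrong_mean\n";
print "mean using \@allfile  = $right_mean\n";
```

Dividing by `$#allfile` overstates the mean, and with a single input file it would divide by zero.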

#!/usr/bin/perl
die "-- must set depth_dir and ref_bed --\n" if ( @ARGV < 2 );
my $dir = $ARGV[0];
my $bed = $ARGV[1];
my ( %dep, %sam, %f, %p );
my @allfile = glob "$ARGV[0]/result_for_depth.*.txt";
foreach my $file (@allfile) {
    open FILE, $file;
    while (<FILE>) {
        my ( $sample, $chr, undef, $pos, $dep, $qc, $stand ) = split;
        my $depth = $dep * 200 / $qc;
        my $yu = $pos . $stand;
        $dep{$chr}{$yu} += $depth;
        $sam{$file}{$chr}{$yu} = $depth;
        $f{$file}{$chr}{$pos} = "$sample\t$chr\t$pos";
    }
    close FILE;
}
# Read the BED regions once instead of reopening the file for every sample.
open BED, "$bed" or die "$!";
my @BED = map [split], <BED>;
foreach my $file (@allfile) {
    open OUT, "> $file.cnv.result" or die "$!";
    for (@BED) {
        my ( $chr, $pos1, $pos2 ) = @$_;
        for my $pos ( $pos1 .. $pos2 ) {
            my $y = $pos . "+";
            my $t = $pos . "-";
            # NOTE: $#allfile is the last index (count - 1);
            # scalar(@allfile) is the number of files.
            my $mean1 = int( ( $dep{$chr}{$y} / $#allfile ) + 1 );
            my $mean2 = int( ( $dep{$chr}{$t} / $#allfile ) + 1 );
            if ( ( $sam{$file}{$chr}{$y} <= 0.5 * $mean1 )
                && ( $sam{$file}{$chr}{$t} <= 0.5 * $mean2 ) )
            {
                print OUT "$f{$file}{$chr}{$pos}\n";
            }
        }
    }
}


huang6894

Post #4, posted 2014-08-18 19:54

Reply to pitonas:

Thanks, pitonas. I'll study it.

huang6894

Post #5, posted 2014-08-19 10:02

Reply to pitonas:

The program died. It can't finish with 6 GB of memory.

pitonas

Post #6, posted 2014-08-19 14:59

Reply to huang6894:

I'm not sure whether it can be rewritten.
I can't read the question well enough to understand it, so I can't take this any further.

Very sorry.

huang6894

Post #7, posted 2014-08-19 15:33

Last edited by huang6894 on 2014-08-19 15:34

Reply to pitonas:

OK, I got it wrong again.

I've revised the code as follows.
It's still being tested, so fingers crossed.

die "-- must set depth_dir and ref_bed --\n" unless ( -d $ARGV[0] );
my $dir = $ARGV[0];
my %dep;
my @allfile = glob "$dir/result_for_depth.*.txt";
foreach my $file (@allfile) {
    my $line = 0;
    open FILE, $file;
    while (<FILE>) {
        chomp;
        # The input files have changed: the weight correction is already
        # applied to the depth, so $qc is no longer read here.
        my ( $sample, $chr, undef, $pos, $depth, $stand ) = split;
        # Every file has the same key on the same line, so the line number is the key.
        push @{ $dep{$line}{R} }, [ $sample, $depth ];
        $dep{$line}{$sample} = [ $sample, $chr, $pos, $depth, $stand ];
        $line++;
    }
    close FILE;
}
open OUT, "> $dir/all_cnv_result";
while ( my ( $k, $v ) = each %dep ) {
    my @tmp = @{ $v->{R} };
    my ( $sum, @R );    # $sum instead of $a, which is reserved for sort
    for my $t (@tmp) {
        $sum += $t->[1];
        push @R, $t->[0];
    }
    my $mean = $sum / ( $#allfile + 1 );
    for my $R (@R) {
        my ( $A, $B, $C, $D, $E ) = @{ $v->{$R} };
        if ( $D <= 0.45 * $mean ) {
            print OUT "$A\t$B\t$C\t$D\t$E\t$mean\tdel\n";
        }
    }
}
close OUT;
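
Because every file has the same rows in the same order, one possible way to avoid holding all the data in a hash is to read the files in lockstep, one line from each at a time. The sketch below is an editorial illustration under that assumption, using the six-column pre-weighted format of the revised script. The function name `lockstep_low_rows`, the sample names S1 to S3 and their depth values are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Read one line at a time from every file in parallel and return the rows
# whose depth is at or below $ratio times the cross-file mean for that line.
# Assumes every file has the same number of lines in the same order.
sub lockstep_low_rows {
    my ($ratio, @files) = @_;
    my @handles;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        push @handles, $fh;
    }

    my @out;
    while (1) {
        my @rows;
        for my $fh (@handles) {
            my $line = <$fh>;
            last unless defined $line;
            chomp $line;
            # Keep sample, chr, pos, pre-weighted depth, strand (region dropped).
            push @rows, [ (split ' ', $line)[0, 1, 3, 4, 5] ];
        }
        last if @rows < @handles;    # at least one file is exhausted

        my $sum = 0;
        $sum += $_->[3] for @rows;
        my $mean = $sum / scalar(@rows);

        for my $row (@rows) {
            push @out, join("\t", @$row, $mean, 'del')
                if $row->[3] <= $ratio * $mean;
        }
    }
    close $_ for @handles;
    return @out;
}

# Demo: three tiny made-up files with two lines each.
my $dir  = tempdir(CLEANUP => 1);
my %demo = ( S1 => [100, 100], S2 => [100, 100], S3 => [20, 100] );
my @files;
for my $sample (sort { $a cmp $b } keys %demo) {
    my $path = "$dir/result_for_depth.$sample.txt";
    open my $out, '>', $path or die "Cannot write $path: $!";
    my $pos = 2336191;
    for my $depth (@{ $demo{$sample} }) {
        print {$out} "$sample chr1 2336191-2337323 $pos $depth +\n";
        $pos++;
    }
    close $out or die "Cannot close $path: $!";
    push @files, $path;
}

my @flagged = lockstep_low_rows(0.45, @files);
print "$_\n" for @flagged;
```

Only one line per file is in memory at once. For real data the flagged rows could be printed straight to the output file instead of being collected in `@out`, and the number of simultaneously open files must stay under the system limit.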