- 论坛徽章:
- 8
|
30可用积分
本帖最后由 huang6894 于 2014-08-18 16:26 编辑
还有一个问题。。。。对于N个以下的文本,我需要对第2、4、7列相同的第5列求平均值并作分析,想问一下应该怎么优化我的代码~
这N个文件的顺序格式都是一样的,不过每个文件在100M左右,每个文件里面的每一行都是唯一的。第2、第4列是位置信息,第3列则是区域信息,第四列是该区域的每个点,区域文件存放在文件ref上,第7列是正负信息,第5列则是对应的质量值。
设定第6列是质控值,用200除以该值得到权重,求平均时需要把第五列先乘以该权重才是他的真实质量值,每个文件的权重是唯一的。
目的是计算每个位置在所有文本中的平均值,然后分析假如某文件某行的数值小于0.5倍平均值的话,输出该行。。。。
文本:
- ==> head -6 result_for_depth.13B0005773.1408345911.txt result_for_depth.13B0005805.1408342063.txt result_for_depth.13B0013031.1408342008.txt result_for_depth.13B0013194.1408342004.txt
- ==> result_for_depth.13B0005773.1408345911.txt <==
- 13B0005773 chr1 2336191-2337323 2336191 55 123 +
- 13B0005773 chr1 2336191-2337323 2336191 16 123 -
- 13B0005773 chr1 2336191-2337323 2336192 52 123 +
- 13B0005773 chr1 2336191-2337323 2336192 16 123 -
- 13B0005773 chr1 2336191-2337323 2336193 52 123 +
- 13B0005773 chr1 2336191-2337323 2336193 16 123 -
- ==> result_for_depth.13B0005805.1408342063.txt <==
- 13B0005805 chr1 2336191-2337323 2336191 54 119 +
- 13B0005805 chr1 2336191-2337323 2336191 11 119 -
- 13B0005805 chr1 2336191-2337323 2336192 50 119 +
- 13B0005805 chr1 2336191-2337323 2336192 11 119 -
- 13B0005805 chr1 2336191-2337323 2336193 50 119 +
- 13B0005805 chr1 2336191-2337323 2336193 11 119 -
- ==> result_for_depth.13B0013031.1408342008.txt <==
- 13B0013031 chr1 2336191-2337323 2336191 51 118 +
- 13B0013031 chr1 2336191-2337323 2336191 14 118 -
- 13B0013031 chr1 2336191-2337323 2336192 47 118 +
- 13B0013031 chr1 2336191-2337323 2336192 14 118 -
- 13B0013031 chr1 2336191-2337323 2336193 47 118 +
- 13B0013031 chr1 2336191-2337323 2336193 15 118 -
- ==> result_for_depth.13B0013194.1408342004.txt <==
- 13B0013194 chr1 2336191-2337323 2336191 49 117 +
- 13B0013194 chr1 2336191-2337323 2336191 13 117 -
- 13B0013194 chr1 2336191-2337323 2336192 43 117 +
- 13B0013194 chr1 2336191-2337323 2336192 13 117 -
- 13B0013194 chr1 2336191-2337323 2336193 43 117 +
- 13B0013194 chr1 2336191-2337323 2336193 13 117 -
复制代码 文件ref:我的代码:
- die "-- must set depth_dir and ref_bed --\n" if (@ARGV < 2);
- my $dir = $ARGV[0];
- my $bed = $ARGV[1];
- my(%dep,%sam,%f,%p);
- my @allfile=glob "$ARGV[0]/result_for_depth.*.txt";
- foreach my $file(@allfile){
- open FILE,$file;
- while(<FILE>){
- chomp;
- my ($sample, $chr, $pos, $dep, $qc, $stand) = (split /\t/,$_)[0,1,3,4,5,6];
- my $depth = $dep * 200/$qc;
- my $yu = $pos.$stand;
- $dep{$chr}{$yu}+=$depth;
- $sam{$file}{$chr}{$yu}=$depth;
- $f{$file}{$chr}{$pos}="$sample\t$chr\t$pos";
- }
- close FILE;
- }
- foreach my $file(@allfile){
- open OUT,"> $file.cnv.result" || die "$!";
- open BED,"$bed" or die "$!";
- while(<BED>){
- chomp;
- my ($chr, $pos1, $pos2) = (split /\t/,$_)[0,1,2];
- for my $pos ($pos1 .. $pos2){
- my $y = $pos."+";
- my $t = $pos."-";
- my $mean1 = int(($dep{$chr}{$y}/$#allfile)+1);
- my $mean2 = int(($dep{$chr}{$t}/$#allfile)+1);
- if(($sam{$file}{$chr}{$y}<=0.5*$mean1)&&($sam{$file}{$chr}{$t}<=0.5*$mean2)){
- print OUT "$f{$file}{$chr}{$pos}\n";
- }
- }
- }
- }
复制代码 |
最佳答案
查看完整内容
全读入hash么....要省内存,sort然后while(){}就好。然后
|