Chinaunix

Title: Asking how to improve efficiency

Author: chenhao392 Time: 2010-02-05 07:40
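My program is too slow. I used for loops everywhere, which is probably the slowest option. The question is: would switching to foreach speed it up much?
Please point out what is causing the inefficiency and how to improve it. Thanks!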

sub get_interface{
    my ($range,@native)= @_;
    my @native_one;
    my @native_two;
    my @predict_one;
    my @predict_two;
    my @temp1;
    my $count;
    my $chain="RANDOM";
    my $num;
    my $miss;
    my $mark;

    #read in file
    @temp1=grep{/^ATOM.*?/} @native;

    #store the files as 2D matrix
    $count=0;
    $miss=0;
    for(my $i=0;$i<scalar(@temp1);$i++){
        if($temp1[$i]=~/^ATOM.*?/){
            $count++;
            #mark first res
            if($count == 1){
                $chain=substr($temp1[$i],23,3);
                $mark=1;
            }
            #Chain One
            if($chain eq substr($temp1[$i],23,3) && $count <30 && $mark==1){
                $native_one[$i-$miss][0]=substr($temp1[$i],23,3);#res number
                $native_one[$i-$miss][1]=substr($temp1[$i],21,1);#Chain ID
                $native_one[$i-$miss][2]=substr($temp1[$i],32,6);#X
                $native_one[$i-$miss][3]=substr($temp1[$i],40,6);#Y
                $native_one[$i-$miss][4]=substr($temp1[$i],48,6);#Z
                $native_one[$i-$miss][5]=$i+1;#line number
                $num=scalar(@native_one);
            }
            if($chain ne substr($temp1[$i],23,3) && $count >1 && $mark==1){
                $native_one[$i-$miss][0]=substr($temp1[$i],23,3);#res number
                $native_one[$i-$miss][1]=substr($temp1[$i],21,1);#Chain ID
                $native_one[$i-$miss][2]=substr($temp1[$i],32,6);#X
                $native_one[$i-$miss][3]=substr($temp1[$i],40,6);#Y
                $native_one[$i-$miss][4]=substr($temp1[$i],48,6);#Z
                $native_one[$i-$miss][5]=$i+1;#line number
                $num=scalar(@native_one);
            }
            #Chain Two
            if($count > 30 && $chain eq substr($temp1[$i],23,3)){
                $mark=2;
            }
            if($count >30 && $mark==2){
                $native_two[$i-$num-$miss][0]=substr($temp1[$i],23,3);#res number
                $native_two[$i-$num-$miss][1]=substr($temp1[$i],21,1);#Chain ID
                $native_two[$i-$num-$miss][2]=substr($temp1[$i],32,6);#X
                $native_two[$i-$num-$miss][3]=substr($temp1[$i],40,6);#Y
                $native_two[$i-$num-$miss][4]=substr($temp1[$i],48,6);#Z
                $native_two[$i-$num-$miss][5]=$i+1;#line number
            }
        }
        else{
            $miss++;
        }
    }

    #get the line numbers for calculation
    my $num1=scalar(@native_one);
    my $num2=scalar(@native_two);
    my @residue=residue(\@native_one,\@native_two,$num1,$num2,$range);
    return @residue;
}
#calculate the distance and find the residues
sub residue{
    my($first,$second,$num1,$num2,$range)=@_;
    my $distance;
    my @result_first;
    my @result_second;
    for(my $i=0;$i<$num1;$i++){
        for(my $j=0;$j<$num2;$j++){
            $distance=distance($$first[$i][2],$$first[$i][3],$$first[$i][4],$$second[$j][2],$$second[$j][3],$$second[$j][4]);

            if($distance <= $range){
                my $temp="$$first[$i][0]";
                push @result_first,$temp;
                my $temp2="$$second[$j][0]";
                push @result_second,$temp2;
            }
        }
    }
    #delete the redundant residues
    my %hash=();
    my @result1 = grep{$hash{$_}++ <1} @result_first;
    my %hash2=();
    my @result2 = grep{$hash2{$_}++ <1} @result_second;
    my @result_final=(@result1,"divide",@result2);
    return @result_final;
}

#Function for distance
sub distance{
    my($x1,$y1,$z1,$x2,$y2,$z2)=@_;
    my $square=($x1-$x2)**2+($y1-$y2)**2+($z1-$z2)**2;
    my $result=sqrt($square);
    return $result;
}

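Author: dugu072_cu Time: 2010-02-05 08:10
Nobody claims that for is more efficient than foreach. The two are equivalent, and how efficient they are depends on how you use them.
Besides, tweaking details like this rarely gives the speedup you are hoping for. More often, it is the algorithm that needs to change.

Pasting the code directly is straightforward, but you gave no basic description and did not point out where you think the bottleneck is. How can you expect people to respond eagerly?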
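As a rough illustration of the reply above: in Perl, for and foreach are synonyms as keywords, and what actually differs is the C-style index loop versus iteration over a list. The sketch below (the sub names are made up for this example) shows that both forms produce the same result, so the choice is mostly about readability and is unlikely to be the main source of a slowdown.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.

# C-style loop with an explicit index, as in the original post.
sub upcase_c_style {
    my @in = @_;
    my @out;
    for (my $i = 0; $i < @in; $i++) {
        push @out, uc $in[$i];
    }
    return @out;
}

# List iteration; writing "for my $item (@in)" would behave identically.
sub upcase_foreach {
    my @in = @_;
    my @out;
    foreach my $item (@in) {
        push @out, uc $item;
    }
    return @out;
}

my @demo_lines = ('atom a', 'atom b', 'atom c');
print join(',', upcase_c_style(@demo_lines)), "\n";
print join(',', upcase_foreach(@demo_lines)), "\n";
```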

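Author: 兰花仙子 Time: 2010-02-05 10:21
From a quick look, the calculations in those two functions are most likely the bottleneck.
The OP could benchmark them.

use Benchmark::Timer;   # CPAN module

my $t = Benchmark::Timer->new();

$t->start('tag');
func();                 # placeholder for the code being timed
$t->stop('tag');

printf("%.2f",$t->result('tag'));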
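If installing Benchmark::Timer from CPAN is not convenient, a similar measurement can be sketched with the core Time::HiRes module. The helper time_it and the summing workload are placeholders invented for this example; in practice the code reference would wrap a call such as get_interface(...).

```perl
use strict;
use warnings;
use Time::HiRes qw(time);   # core module; time() returns fractional seconds

# Hypothetical helper: run a code reference once and report wall-clock time.
# Returns the elapsed seconds followed by whatever the code returned.
sub time_it {
    my ($label, $code) = @_;
    my $start   = time();
    my @result  = $code->();
    my $elapsed = time() - $start;
    printf "%s: %.4f s\n", $label, $elapsed;
    return ($elapsed, @result);
}

# Placeholder workload standing in for the real function under test.
my ($demo_secs, $demo_sum) = time_it('sum', sub {
    my $s = 0;
    $s += $_ for 1 .. 100_000;
    return $s;
});
```

Timing get_interface and residue separately this way should show which of the two dominates before any rewriting starts.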

Author: cobrawgl Time: 2010-02-05 11:49
From a quick look:

1. for(my $i=0;$i<scalar(@temp1);$i++) is very un-Perlish, and it costs time too.

2. I see you use a lot of substr calls, and substr is quite expensive...

3. Or you could just rewrite it in Python.
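One way to act on points 1 and 2 is to iterate with foreach and pull every field out of a line with a single unpack call. The sketch below assumes the standard PDB ATOM layout (chain ID in column 22, residue number in columns 23-26, x, y, z in columns 31-38, 39-46, 47-54); the original post reads slightly narrower slices such as substr($line,32,6), which under this layout would appear to drop the leading character of a value like " -12.345". The subs make_atom_line and parse_atoms are hypothetical names for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: build a column-correct ATOM line for the demo.
sub make_atom_line {
    my ($serial, $res_name, $chain, $res_seq, $x, $y, $z) = @_;
    return sprintf 'ATOM  %5d  CA  %3s %1s%4d    %8.3f%8.3f%8.3f',
        $serial, $res_name, $chain, $res_seq, $x, $y, $z;
}

# Parse ATOM lines into rows of
# [res number, chain ID, X, Y, Z, position among ATOM lines].
sub parse_atoms {
    my @lines = @_;
    my @atoms;
    foreach my $line (@lines) {
        next unless $line =~ /^ATOM/ && length($line) >= 54;
        # x21 skips 21 characters, x4 skips the insertion code and padding.
        my ($chain, $res_seq, $x, $y, $z) =
            unpack 'x21 A1 A4 x4 A8 A8 A8', $line;
        push @atoms,
            [ $res_seq + 0, $chain, $x + 0, $y + 0, $z + 0, scalar(@atoms) + 1 ];
    }
    return @atoms;
}

my @demo_atoms = parse_atoms(
    'HEADER    DEMO',
    make_atom_line(1, 'ALA', 'A', 12, -12.345, 100.5, 3),
);
print join(' ', @{ $demo_atoms[0] }), "\n";
```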

Author: cobrawgl Time: 2010-02-05 11:55
There are too many redundant statements.
For example:

sub distance {
    return sqrt(($_[0] - $_[3])**2 + ($_[1] - $_[4])**2 + ($_[2] -$_[5])**2);
}
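Going one step further than the shortened distance sub: since the result is only ever compared against $range, the square root can be skipped by comparing squared values instead. This is a sketch under the assumption that $range is non-negative; within_range is a name invented here, and points are passed as [x, y, z] array references.

```perl
use strict;
use warnings;

# Hypothetical helper: true if points $p and $q (array refs [x, y, z])
# are at most $range apart. Compares squared distances, so no sqrt call.
sub within_range {
    my ($p, $q, $range) = @_;
    my $dx = $p->[0] - $q->[0];
    my $dy = $p->[1] - $q->[1];
    my $dz = $p->[2] - $q->[2];
    return ($dx * $dx + $dy * $dy + $dz * $dz) <= $range * $range;
}

print within_range([0, 0, 0], [3, 4, 0], 5) ? "close\n" : "far\n";
```

This removes one function call and one sqrt per pair, but the number of pairs examined stays the same.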

Author: cobrawgl Time: 2010-02-05 12:12
I once wrote a spell checker modeled on the Python version: http://blog.chinaunix.net/u/78/showart_720166.html

It was much slower than the Python one. So frustrating!

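Author: chenhao392 Time: 2010-02-05 19:49
To #2:
Sorry, although I registered on CU long ago, I really am a newbie. Here are some explanations. Thanks for the criticism.

distance computes the distance between two coordinates in 3D space: subtract the x, y, and z values, square the differences, add them up, and take the square root.

residue takes the protein residue information (amino acid ID, x, y, z, line number) and uses distance to go through every amino acid combination between the two protein chains, chain1 and chain2. It computes each distance and keeps the pairs whose distance is no greater than $range.

get_interface() reads a protein file in PDB format, extracts the needed information, and stores it in 2D arrays. It then calls residue to do the calculation.

PS: I used substr to cope with quirks of the PDB file format, since a regular expression might mismatch. That is, the amino acid ID field does not look the same on every line from a regular expression's point of view...

I did think about improving the algorithm, that is, no longer going through every amino acid combination. But wouldn't that mean writing something like a genetic algorithm or dynamic programming? I'd rather not write that...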
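Avoiding the all-pairs scan does not necessarily call for a genetic algorithm or dynamic programming. One common alternative, swapped in here as an illustration and not taken from the thread, is a uniform grid: bin the atoms of one chain into cubic cells whose edge equals $range, then compare each atom of the other chain only against the 27 surrounding cells. The sketch assumes $range is positive and uses a simplified atom layout [residue_id, x, y, z]; cell_of and close_residues are hypothetical names.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Atoms are array refs [residue_id, x, y, z]; this layout is an
# assumption for the sketch, not the row layout of the original post.

# Integer cell coordinates of a point, for cubic cells of edge $size.
sub cell_of {
    my ($size, @xyz) = @_;
    return map { sprintf '%d', floor($_ / $size) } @xyz;
}

sub close_residues {
    my ($first, $second, $range) = @_;
    my $range_sq = $range * $range;

    # Bin the second chain's atoms by cell.
    my %grid;
    for my $q (@$second) {
        my $key = join ',', cell_of($range, @{$q}[1 .. 3]);
        push @{ $grid{$key} }, $q;
    }

    my (%seen1, %seen2, @res1, @res2);
    for my $p (@$first) {
        my ($cx, $cy, $cz) = cell_of($range, @{$p}[1 .. 3]);
        # Points within $range of each other differ by at most one cell
        # along every axis, so only the 27 neighbouring cells matter.
        for my $dx (-1 .. 1) {
            for my $dy (-1 .. 1) {
                for my $dz (-1 .. 1) {
                    my $key = join ',', $cx + $dx, $cy + $dy, $cz + $dz;
                    my $bucket = $grid{$key} or next;
                    for my $q (@$bucket) {
                        my $d_sq = ($p->[1] - $q->[1])**2
                                 + ($p->[2] - $q->[2])**2
                                 + ($p->[3] - $q->[3])**2;
                        next if $d_sq > $range_sq;
                        push @res1, $p->[0] unless $seen1{ $p->[0] }++;
                        push @res2, $q->[0] unless $seen2{ $q->[0] }++;
                    }
                }
            }
        }
    }
    return (@res1, 'divide', @res2);
}

my @chain1 = ([1, 0, 0, 0], [2, 10, 0, 0], [3, 50, 50, 50]);
my @chain2 = ([101, 3, 4, 0], [102, 10, 0, 4.9], [103, -20, -20, -20]);
print join(' ', close_residues(\@chain1, \@chain2, 5)), "\n";
```

The returned list has the same shape as the one built by residue (first-chain residues, the marker "divide", second-chain residues), though the second half may come out in a different order than the nested-loop version produces.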

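Author: chenhao392 Time: 2010-02-05 19:52
To 仙子:

Thanks! Also, this is the first time you have replied to me.

To cobrawgl:

Your advice is the most practical. Maybe I should work out a suitable regular expression and then use $1, $2, and so on. Or find a way to keep substr's pointer (if it has one) from walking the line N times? After all, once should be enough.

Thanks!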

Author: chenhao392 Time: 2010-02-05 19:59
Er... that was dumb of me. Anything substr can do, a regular expression should be able to do as well.
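The idea of capturing everything in one pass with $1, $2, and so on can be sketched with a regular expression that counts fixed columns instead of matching field contents, which sidesteps the concern about inconsistent field text. The column widths assume the standard PDB ATOM layout, and make_atom_line and parse_atom_regex are hypothetical names for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: build a column-correct ATOM line for the demo.
sub make_atom_line {
    my ($serial, $res_name, $chain, $res_seq, $x, $y, $z) = @_;
    return sprintf 'ATOM  %5d  CA  %3s %1s%4d    %8.3f%8.3f%8.3f',
        $serial, $res_name, $chain, $res_seq, $x, $y, $z;
}

# Returns (res number, chain ID, X, Y, Z), or an empty list if the
# line is not an ATOM record of the expected width.
sub parse_atom_regex {
    my ($line) = @_;
    # 4 chars "ATOM", 17 skipped, chain (1), residue number (4),
    # 4 skipped, then three 8-character coordinate fields.
    return unless $line =~ /^ATOM.{17}(.)(.{4}).{4}(.{8})(.{8})(.{8})/;
    return ($2 + 0, $1, $3 + 0, $4 + 0, $5 + 0);
}

my @demo_fields = parse_atom_regex(make_atom_line(7, 'GLY', 'B', 42, -12.345, 1.5, 100.25));
print join(' ', @demo_fields), "\n";
```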

Author: ulmer Time: 2010-02-05 20:25
Reply to #1 chenhao392

Loops nested inside loops, and so many of them, are what make this program so slow.

The real problem lies in analyzing the data structure:
study the data structure in detail and find the best way
to match the required data.
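One possible reading of this advice is to let the data describe its own structure: group the ATOM lines by the chain ID column in a hash of arrays, in place of counting lines and comparing residue numbers. This is only a sketch and assumes the two chains carry different chain IDs in column 22, which the original post does not state; make_atom_line and split_by_chain are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: build a column-correct ATOM line for the demo.
sub make_atom_line {
    my ($serial, $res_name, $chain, $res_seq, $x, $y, $z) = @_;
    return sprintf 'ATOM  %5d  CA  %3s %1s%4d    %8.3f%8.3f%8.3f',
        $serial, $res_name, $chain, $res_seq, $x, $y, $z;
}

# Group ATOM lines by chain ID. Returns a hash ref (chain ID => array
# ref of lines) and an array ref of chain IDs in order of appearance.
sub split_by_chain {
    my @lines = @_;
    my (%atoms_of, @order);
    for my $line (@lines) {
        next unless $line =~ /^ATOM/ && length($line) >= 54;
        my $chain = substr($line, 21, 1);
        push @order, $chain unless exists $atoms_of{$chain};
        push @{ $atoms_of{$chain} }, $line;
    }
    return (\%atoms_of, \@order);
}

my ($demo_by_chain, $demo_order) = split_by_chain(
    make_atom_line(1, 'ALA', 'A', 1, 0, 0, 0),
    make_atom_line(2, 'GLY', 'A', 2, 1, 1, 1),
    make_atom_line(3, 'SER', 'B', 1, 5, 5, 5),
);
printf "%s: %d atoms\n", $_, scalar @{ $demo_by_chain->{$_} } for @$demo_order;
```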

Author: Ray001 Time: 2010-02-05 22:09
The OP's code hardly has a single comment. One look at it and my head was spinning.
