- 论坛徽章:
- 0
|
本帖最后由 perl初学者 于 2012-09-27 10:33 编辑
我想在five_prime_UTR、CDS和three_prime_UTR之间插入一行inron 的information,使得intron这一行的第1,2,8列是上一行信息,第4列是上一行的第四列的数据+1,第五列数据是下一行数据-1(如果两行之间连续则不输入intron信息),而且gene行和mRNA 这一行之间是不加intron信息行的。我的脚本是这样的:
#!/usr/bin/perl
use Data::Dumper;
my $gff=shift @ARGV;
open GFF,"<",$gff or die "$!";
my $str = do{local $/; <GFF>};
my @chunks = split /^?=<(\#\#)/m,$str;
my (%xml,$title);
while (@chunks) {
my $text = shift @chunks;
open my $fh, "<", \$text or die($!);
while (<$fh>;) {
next if (/^\#/);
next unless (/^Chr/);
my ($seqname, $source, $type, $start, $end, $score, $strand, $phase, $annotation) = split /\t/,$_;
if ($type eq 'gene') {
$title= $_;
$title.=<$fh>;
}elsif ($type eq 'mRNA'){
$title =$_;
}
push @{ $xml{$title} },<$fh>;
close($fh);
}
}
close GFF;
open OUT,">","$gff_intron";
my (%start,%end,$line);
while (my($k,$v) = each %xml) {
print OUT "$k\n";
for ($i=0; $i<@{$v} ;$i++) {
@array=(split/\t/,${$v}[$i]);
$start{$i}=$array[3];
$end{$i}=$array[4];
}
for ($t=0; $t<@{$v} ;$t++) {
@circle=split/\t/,${$v}[$t];
$seqname=$circle[0];
$source=$circle[1];
$annotation=$circle[-1];
$strand=$circle[6];
print OUT "${$v}[$t]\n";
if ($t < $#{$v}-1) {
my $start2=$end{$t}+1;
my $end2=$start{$t+1}-1;
if (($end2-$start2)>1) {
print OUT "$seqname\t$source\tintron\t$start2\t$end2\t\.\t$strand\t\.\t$annotation\n";
}
}
}
}
close OUT;
我的原始文件是这样的:
###
Chr1 MSU_osa1r6 gene 1903 9817 . + . ID=13101.t00001;Name=TBC%20domain%20containing%20protein%2C%20expressed;Alias=LOC_Os01g01010
Chr1 MSU_osa1r6 mRNA 1903 9817 . + . ID=13101.m00001;Parent=13101.t00001;Alias=LOC_Os01g01010.1
Chr1 MSU_osa1r6 five_prime_UTR 1903 2268 . + . Parent=13101.m00001
Chr1 MSU_osa1r6 five_prime_UTR 2354 2448 . + . Parent=13101.m00001
Chr1 MSU_osa1r6 CDS 2449 2616 . + 0 Parent=13101.m00001
Chr1 MSU_osa1r6 CDS 3357 3455 . + 0 Parent=13101.m00001
Chr1 MSU_osa1r6 CDS 4457 4560 . + 0 Parent=13101.m00001
Chr1 MSU_osa1r6 CDS 6136 6944 . + 2 Parent=13101.m00001
Chr1 MSU_osa1r6 CDS 7028 7150 . + 1 Parent=13101.m00001
Chr1 MSU_osa1r6 CDS 9274 9297 . + 0 Parent=13101.m00001
Chr1 MSU_osa1r6 three_prime_UTR 9298 9430 . + . Parent=13101.m00001
Chr1 MSU_osa1r6 three_prime_UTR 9504 9817 . + . Parent=13101.m00001
###
Chr1 MSU_osa1r6 mRNA 1984 9562 . + . ID=13101.m00002;Parent=13101.t00001;Alias=LOC_Os01g01010.2
Chr1 MSU_osa1r6 five_prime_UTR 1984 2255 . + . Parent=13101.m00002
Chr1 MSU_osa1r6 five_prime_UTR 2354 2448 . + . Parent=13101.m00002
Chr1 MSU_osa1r6 CDS 2449 2616 . + 0 Parent=13101.m00002
Chr1 MSU_osa1r6 CDS 9504 9562 . + 1 Parent=13101.m00002
但是输出结果是这样的:
###
intron 1 291843 . . ###
Chr1 MSU_osa1r6 gene 291844 292535 . + . ID=13101.t00059;Name=expressed%20protein;Alias=LOC_Os01g01580
Chr1 MSU_osa1r6 mRNA 291844 292535 . + . ID=13101.m00083;Parent=13101.t00059;Alias=LOC_Os01g01580.1
Chr1 MSU_osa1r6 five_prime_UTR 291844 291846 . + . Parent=13101.m00083
Chr1 MSU_osa1r6 CDS 291847 292329 . + 0 Parent=13101.m00083
Chr1 MSU_osa1r6 three_prime_UTR 292330 292535 . + . Parent=13101.m00083
###
###
intron 1 292647 . . ###
Chr1 MSU_osa1r6 gene 292648 293536 . + . ID=13101.t00060;Name=expressed20protein;Alias=LOC_Os01g01590
Chr1 MSU_osa1r6 mRNA 292648 293536 . + . ID=13101.m00084;Parent=13101.t00060;Alias=LOC_Os01g01590.1
Chr1 MSU_osa1r6 CDS 292648 292850 . + 0 Parent=13101.m00084
Chr1 MSU_osa1r6 intron 292851 293400 . + . Parent=13101.m00084
Chr1 MSU_osa1r6 CDS 293401 293536 . + 2 Parent=13101.m00084
我想得到的结果:##行按原来输出,如下:
##
Chr1 MSU_osa1r6 gene 302103 305800 . + . ID=13101.t00061;Name=ethylene%2Dresponsive%20protein%20related%2C%20putative%2C%20expressed;Alias=LOC_Os01g01600
Chr1 MSU_osa1r6 mRNA 302103 305800 . + . ID=13101.m00085arent=13101.t00061;Alias=LOC_Os01g01600.1
Chr1 MSU_osa1r6 five_prime_UTR 302103 302328 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 302329 302471 . + 0 Parent=13101.m00085
Chr1 MSU_osa1r6 intron 302472 302980 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 302981 303509 . + 2 Parent=13101.m00085
Chr1 MSU_osa1r6 intron 303510 304571 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 304572 304718 . + 0 Parent=13101.m00085
Chr1 MSU_osa1r6 intron 304719 304833 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 304834 304899 . + 0 Parent=13101.m00085
Chr1 MSU_osa1r6 intron 304900 304992 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 304993 305058 . + 0 Parent=13101.m00085
Chr1 MSU_osa1r6 intron 305059 305170 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 305171 305245 . + 0 Parent=13101.m00085
Chr1 MSU_osa1r6 intron 305246 305352 . + . Parent=13101.m00085
Chr1 MSU_osa1r6 CDS 305353 305493 . + 0 Parent=13101.m00085
Chr1 MSU_osa1r6 three_prime_UTR 305494 305800 . + . Parent=13101.m00085
##
Chr1 MSU_osa1r6 gene 305869 307842 . - . ID=13101.t00062;Name=IscA%2Dlike%20iron%2Dsulfur%20assembly%20protein%2C%20mitochondrial%20precursor%2C%20putative%2C%20expressed;Alias=LOC_Os01g01610
Chr1 MSU_osa1r6 mRNA 305869 307842 . - . ID=13101.m00086;Parent=13101.t00062;Alias=LOC_Os01g01610.1
Chr1 MSU_osa1r6 five_prime_UTR 307703 307842 . - . Parent=13101.m00086
Chr1 MSU_osa1r6 intron 305843 307601 . - . Parent=13101.m00085
Chr1 MSU_osa1r6 five_prime_UTR 307602 307626 . - . Parent=13101.m00086
我实在是不知道自己哪里错了,请教各位大侠指点,谢谢啊!~
|
|