- 论坛徽章:
- 7
|
本帖最后由 b114213903 于 2015-08-12 08:45 编辑
回复 3# MMMIX - #!/usr/bin/perl
- use strict;
- use Bio::SeqIO;
- my $fasta=shift @ARGV;
- (my $Out=$fasta)=~s/(\.[^\.]+$)/_out$1/;
- my $IN=Bio::SeqIO->new(-file=>"$fasta",-format=>'fasta');
- my $Lenths={};
- my $Seqs={};
- while(my $seq=$IN->next_seq){
- my $id=$seq->id;
- my $length=$seq->length;
- (my $new_id=$id)=~s/^([^\.]+)\..+$/$1/;
-
- if($Lenths->{$new_id}){
- if($Lenths->{$new_id}<$length){
- $Lenths->{$new_id}=$length;
- $Seqs->{$new_id}=$seq;
- }
- }else{
- $Lenths->{$new_id}=$length;
- $Seqs->{$new_id}=$seq;
- }
- }
- $IN->close();
- my $OUT=Bio::SeqIO->new(-file=>">$Out",-format=>'fasta');
- foreach my $id(sort keys %{$Seqs}){
- print "$id\n";
- $OUT->write_seq($Seqs->{$id});
- }
- $OUT->close();
复制代码- >qwer.1
- AFRTYUIFGHJKLVBNM
- >qwer.2
- BVXNVFGSFYEBCSHB
- >qwer.3
- HDFKSHFGSERYFIEURHFSUFDSHVBSJEUABFUHABFCAHFBC
- >rtyuip00.1
- AFHBVSFHUACFKUSHDBAKFHAKUFHSADKFUA
- >hhjkl.1
- JDNVKHFBAKHFAKFAFJNSADFJAS
- >hhjk.2
- HFSDHNFKANFAKFIJI
- >hhjk.3
- JNFAJNFALSDFLMAD
- >hhjk.4
- KJGSEGJOAKFFDSMFAPOKEF
复制代码- >hhjk.4
- KJGSEGJOAKFFDSMFAPOKEF
- >hhjkl.1
- JDNVKHFBAKHFAKFAFJNSADFJAS
- >qwer.3
- HDFKSHFGSERYFIEURHFSUFDSHVBSJEUABFUHABFCAHFBC
- >rtyuip00.1
- AFHBVSFHUACFKUSHDBAKFHAKUFHSADKFUA
复制代码 另外,FASTA格式的文档,有规定格式:一、ID行:“>”(大于号)开头,后面根ID名称及各种注释信息,不可换行。二、数据行:单纯的核酸或氨基酸序列,可以单行也可多行。
|
|