Chinaunix
标题:
perl新手求教...求教......文件匹配输出指定内容,非常感谢
[打印本页]
作者:
little_joe
时间:
2016-08-17 16:52
标题:
perl新手求教...求教......文件匹配输出指定内容,非常感谢
perl新手求教...求教......文件匹配输出指定内容,非常感谢
文件a:
WP_001198722.1
;YP_026238.1;
37
126
WP_000075876.1;
58
204
WP_000072067.1
;YP_026241.1;
42
50
WP_000077872.1;YP_025308.1;
128
201
文件b:
>lcl|NC_012947.1_prot_
WP_001198722.1
_11 [gene=ECBD_RS00055] [protein=/inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001198703.1; catalyzes the formation of D-glyceraldehyde 3-phosphate an>] [protein_id=WP_001198722.1] [location=10464..11081]
M
QWQTKLPLIAILRGITPDEALAHVGAVIDAGFDAV
E
IPLNSPQWEQSIPAIVDAYGDKALIGAGTVLKP
EQVDALARMGCQLIVTPNIHSEVIRRAVGYGMTVCPGCATATEAFTALEAGAQAL
K
IFPSSAFGPQYIKA
LKAVLPSDIAVFAVGGVTPENLAQWIDAGCAGAGLGSDLYRAGQSVERTAQQAAAFVKAYREAVQ
>lcl|NC_012947.1_prot_
WP_000072067.1
_4 [gene=gyrB] [protein=/inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001341774.1; negatively supercoils closed circular double-stranded DNA] [protein_id=WP_000072067.1] [location=3957..6371]
M
SNSYDSSSIKVLKGLDAVRKRPGMYIGDTDDGTGLHHMVF
E
VVDNAID
E
ALAGHCKEIIVTIHADNSVS
VQDDGRGIPTGIHPEEGVSAAEVIMTVLHAGGKFDDNSYKVSGGLHGVGVSVVNALSQKLELVIQREGKI
HRQIYEHGVPQAPLAVTGETEKTGTMVRFWPSLETFTNVTEFEYEILAKRLRELSFLNSGVSIRLRDKRD
GKEDHFHYEGGIKAFVEYLNKNKTPIHPNIFYFSTEKDGIGVEVALQWNDGFQENIYCFTNNIPQRDGGT
HLAGFRAAMTRTLNAYMDKEGYSKKAKVSATGDDAREGLIAVVSVKVPDPKFSSQTKDKLVSSEVKSAVE
QQMNELLAEYLLENPTDAKIVVGKIIDAARAREAARRAREMTRRKGALDLAGLPGKLADCQERDPALSEL
YLVEGDSAGGSAKQGRNRKNQAILPLKGKILNVEKARFDKMLSSQEVATLITALGCGIGRDEYNPDKLRY
HSIIIMTDADVDGSHIRTLLLTFFYRQMPEIVERGHVYIAQPPLYKVKKGKQEQYIKDDEAMDQYQISIA
LDGATLHTNASAPALAGEALEKLVSEYNATQKMINRMERRYPKAMLKELIYQPTLTEADLSDEQTVTRWV
NALVSELNDKEQHGSQWKFDVHTNAEQNLFEPIVRVRTHGVDTDYPLDHEFITGGEYRRICTLGEKLRGL
LEEDAFIERGERRQPVASFEQALDWLVKESRRGLSIQRYKGLGEMNPEQLWETTMDPESRRMLRVTVKDA
IAADQLFTTLMGDAVEPRRAFIEENALKAANIDI
>lcl|
NC_012947.1_prot_WP_000772941.1_8 [gene=ECBD_RS00040] [protein=/inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001176565.1] [protein_id=WP_000772941.1] [location=complement(7982..863
]
M
KLNFKGFFKAAGLFPLALMLSGCISYALVSHTAKGSSGKYQSQSDTITGLSQAKDSNGTKGYVFVGESL
DYLITDGADDIVKMLNNPALNRHNIQVADDARFVLNAGKKKFTGTISLYYYWNNEEEKALATHYGFACGV
QHCTRSLENLKGTIHEKNKNMDYSKVMAFYHPFKVRFYEYYSPRGIPDGVSAALLPVTVTLDIITAPLQF,
LVVYAVNQ
如上,文件a中的WP_001198722.1匹配到文件b中的WP_001198722.1(均用红色注明)
则用文件a:WP_001198722.1下面的数字37,126查找文件b中的序列,序列的开头为青色标明处(该处序号记为1)
文件b每个以>lcl开头的为一条完整的注释序列,
图中橙色的部分为注释信息(需要用文件a的WP_001198722.1匹配文件b注释信息中的WP_001198722.1),
下面的大写字母为序列信息,即需要进行序号匹配的内容,例如WP_001198722.1下的37,126分别对应文件b中的E,K,
输出结果:
WP_001198722.1 E:1 K:1
WP_000072067.1 E:2
即对匹配到的序号输出的字母进行计数;
谢谢您的关注及帮助;
作者:
sunzhiguolu
时间:
2016-08-17 19:48
#!/usr/bin/perl
use strict;
use warnings;
my ($id, %hData, %hCnt);
open (my $FHa, '<', './a');
while (defined (local $_ = <$FHa>)){
if (/\A(WP_[^;]+)/){
$id = $1;
next;
}
chomp;
push (@{$hData{$id}}, $_);
}
close ($FHa);
open (my $FHb, '<', './b');
local $/ = ">lcl";
while (defined (local $_ = <$FHb>)){
chomp;
next if (/\A\s*\z/);
my @aLines = split (/\n/);
if ($aLines[0] =~ /(WP_[\d.]+)/ and exists $hData{$1}){
my $context = join ("", @aLines[1 .. $#aLines]);
$hCnt{substr ($context, $_ - 1, 1)}++ for @{$hData{$1}};
my @aK2V = map {"$_:$hCnt{$_}"} keys %hCnt;
print "$1 @aK2V\n";
%hCnt = ();
}
}
close ($FHb);
复制代码
作者:
little_joe
时间:
2016-08-18 07:55
非常感谢您的帮助,我将您写的代码命名为new1.txt,用事例运行之后出现这样的提示
QQ截图20160818074940.png
(6.6 KB, 下载次数: 67)
下载附件
2016-08-18 07:52 上传
,没有看懂,希望您能帮忙解答一下,另外那个b文件实际上特别大的,这样会影响代码运行么?
回复
2#
sunzhiguolu
作者:
sunzhiguolu
时间:
2016-08-18 11:39
你的测试环境是啥呀?
作者:
sunzhiguolu
时间:
2016-08-18 12:56
再试下,
#!/usr/bin/perl
use strict;
use warnings;
my ($id, %hData, %hCnt);
open (my $FHa, '<', shift);
while (defined (local $_ = <$FHa>)){
if (/\A(WP_[^;]+)/){
$id = $1;
next;
}
s/\A\s+|\s+\z//g;
push (@{$hData{$id}}, $_);
}
close ($FHa);
open (my $FHb, '<', shift);
local $/ = ">lcl";
while (defined (local $_ = <$FHb>)){
chomp;
next if (!length);
my @aLines = split (/\r?\n/);
if ($aLines[0] =~ /(WP_[\d.]+)/ and exists $hData{$1}){
my $context = join ("", @aLines[1 .. $#aLines]);
$hCnt{substr ($context, $_ - 1, 1)}++ for @{$hData{$1}};
my @aK2V = map {"$_:$hCnt{$_}"} keys %hCnt;
%hCnt = ();
print "$1 @aK2V\n";
}
}
close ($FHb);
复制代码
perl abc.pl a.txt b.txt
---------------------------------------------------------------------------
WP_001198722.1 K:1 E:1
WP_000072067.1 E:2
作者:
sunzhiguolu
时间:
2016-08-18 13:00
open (my $FHa, '<', '
./a
');
我的上个代码将文件路径写死了, 改成 shift 就可以了!
作者:
little_joe
时间:
2016-08-18 17:11
恩恩,谢谢您的热情帮助,有问题再问您,再次感谢
回复
6#
sunzhiguolu
作者:
little_joe
时间:
2016-08-18 17:12
应该是UNIX的
回复
4#
sunzhiguolu
作者:
sunzhiguolu
时间:
2016-09-01 22:58
本帖最后由 sunzhiguolu 于 2016-09-01 23:52 编辑
还真是有点记不住题目的要求,
作者:
sunzhiguolu
时间:
2016-09-01 23:02
本帖最后由 sunzhiguolu 于 2016-09-01 23:49 编辑
试下,
#!/usr/bin/perl
use strict;
use warnings;
my ($id, %hData, %hCnt);
open (my $FHr, '<', shift);
while (local $_ = <$FHr>){
s/\s+\z//;
next if (!length);
if (/\A([a-z]+_[^;]+)/i){
$id = $1;
next;
}
push (@{$hData{$id}}, $_);
}
close ($FHr);
local $/ = '>';
open ($FHr, '<', shift);
while (local $_ = <$FHr>){
chomp;
next if (!length);
my @aSeq = split (/-\d+;\r?\n/);
next if (!exists $hData{$aSeq[0]});
print $aSeq[0];
$hCnt{substr ($aSeq[-1], $_ - 2, 1)}++ for @{$hData{$aSeq[0]}};
print "\t$_:$hCnt{$_}" for keys %hCnt;
print "\n";
%hCnt = ();
}
close ($FHr);
复制代码
perl abc.pl a b
------------------------------------
NP_414552.1 *:1
NP_414557.1 R:1
NP_414562.1 A:1 M:12
作者:
little_joe
时间:
2016-09-02 08:44
回复
10#
sunzhiguolu
又麻烦您了,O(∩_∩)O谢谢,
作者:
华小飞_Perl
时间:
2016-09-20 21:33
写一个~
#!/usr/bin/perl
use warnings;
use strict;
open DATA_1, '<', 'a.txt' or die "Can't open: $!";
open DATA_2, '<', 'b.faa' or die "Can't open: $!";
my (%pro_seqID, $seqID);
while (<DATA_1>) {
chomp;
if (/(WP.*?);/) {
$seqID = $1;
next;
}
push @{ $pro_seqID{$seqID} }, $_;
}
close DATA_1;
local $/ = "\n>";
while (<DATA_2>) {
chomp;
s/^>//;
if ( my @id = m/_(WP.*?\.\d)_/ ) {
my $id = shift @id;
s/^.*\]//;
my $pro_seq = join "", split;
if ( exists $pro_seqID{$id} ) {
my %cal_str;
my $part_str = substr( $pro_seq, ( $pro_seqID{$id}->[0] - 1 ),
( $pro_seqID{$id}->[1] - $pro_seqID{$id}->[0] ) +1 );
my @chr = $part_str =~ m/^(\w).*(\w)$/;
$cal_str{$_}++ for @chr;
print $id, " ", $_, ": ", $cal_str{$_}, "\n" for ( sort keys %cal_str );
}
}
print "\n";
}
close DATA_2;
复制代码
处理fasta文件如果大同小异的话,直接写个子程序,以后方便自己调用就好~
作者:
b114213903
时间:
2016-09-21 09:04
回复
13#
华小飞_Perl
处理fasta格式有现成的包 BioPerl中的Bio::SeqIO模块就可以了
作者:
华小飞_Perl
时间:
2016-09-21 13:49
回复
14#
b114213903
听过Bioperl,有的话就直接拿来用了,谢谢提醒。
欢迎光临 Chinaunix (http://bbs.chinaunix.net/)
Powered by Discuz! X3.2