Hi everyone,
I have a question. Suppose I have the following three files:
__________________________________________________________________________
all.txt:
refseq1 refGene mRNA 1 2 . + . NR_024540 WASH5P
refseq1 refGene exon 2 10 . + . NR_024540 WASH5P E1
refseq1 refGene intr 4 6 . + . NR_024540 WASH5P I1
refseq1 refGene exon 9 11 . + . NR_024540 WASH5P E2
refseq2 refGene exon 1 6 . + . NM_001005221 OR4F29
refseq2 refGene exon 3 6 . + . NM_001005221 OR4F29 E1
refseq2 refGene exon 7 9 . + . NM_001005221 OR4F29 E1
refseq2 refGene mRNA 1 3 . + . NM_001005224 OR4F3
refseq2 refGene exon 6 8 . + . NM_001005224 OR4F3 E1
nm.txt:
NR_024540
NM_001005221
seq.txt:
>refseq1 chr1 + 1000000 [1,1000000] ""
AGCTCGGTCCCCCCCCCCCCCCCTTTTTTT
>refseq2 chr2 + 1000000 [1,1000000] ""
AGCCCCCCCTCGGTCCCCCCCCCTTTTTTT
_________________________________________________________________________
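What I want: for each string in nm.txt, find the lines in all.txt whose 9th column matches it and whose 11th column starts with E, i.e. /^E/ (I am not sure every line has an 11th column). Then use the 1st column of each matched line to look up the sequence in seq.txt and extract the bases between the positions given in columns 4 and 5. If entries with the same NM accession have overlapping base ranges, merge them into a single output record.
In other words, the final result should be: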
_______________________________________________________________
>NR_024540_E1 WASH5P refseq1 + 2 10 9 11
GCTCGGTCCC
>NM_001005221_E1 OR4F29 refseq2 + 3 6
CCCC
>NM_001005221_E2 OR4F29 refseq2 + 7 9
CCC
But I have spent a lot of time on this and still cannot get the result. What should I do? Any help would be much appreciated!
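For reference, here is a rough Python sketch of the steps described above, not a tested solution for real data. It rests on several assumptions: columns are whitespace-separated, coordinates are 1-based and inclusive, only the "+" strand is handled (no reverse complement), and the E1/E2 suffix in the output header is simply a running number per accession after merging, since that is what reproduces the expected result shown above. The function names (`parse_fasta`, `merge_overlapping`, `extract_exons`) are made up for illustration.

```python
# Sample data copied from the post; in practice these would be read from the files.
ALL_TXT = """\
refseq1 refGene mRNA 1 2 . + . NR_024540 WASH5P
refseq1 refGene exon 2 10 . + . NR_024540 WASH5P E1
refseq1 refGene intr 4 6 . + . NR_024540 WASH5P I1
refseq1 refGene exon 9 11 . + . NR_024540 WASH5P E2
refseq2 refGene exon 1 6 . + . NM_001005221 OR4F29
refseq2 refGene exon 3 6 . + . NM_001005221 OR4F29 E1
refseq2 refGene exon 7 9 . + . NM_001005221 OR4F29 E1
refseq2 refGene mRNA 1 3 . + . NM_001005224 OR4F3
refseq2 refGene exon 6 8 . + . NM_001005224 OR4F3 E1
"""
NM_TXT = "NR_024540\nNM_001005221\n"
SEQ_TXT = '''\
>refseq1 chr1 + 1000000 [1,1000000] ""
AGCTCGGTCCCCCCCCCCCCCCCTTTTTTT
>refseq2 chr2 + 1000000 [1,1000000] ""
AGCCCCCCCTCGGTCCCCCCCCCTTTTTTT
'''


def parse_fasta(text):
    """Return {sequence_id: sequence}; the id is the first word after '>'."""
    chunks, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {k: "".join(v) for k, v in chunks.items()}


def merge_overlapping(intervals):
    """Group 1-based inclusive (start, end) pairs that overlap.

    Adjacent ranges such as 3-6 and 7-9 do not overlap and stay separate.
    """
    groups = []
    for start, end in sorted(intervals):
        if groups and start <= groups[-1]["end"]:
            groups[-1]["end"] = max(groups[-1]["end"], end)
            groups[-1]["parts"].append((start, end))
        else:
            groups.append({"start": start, "end": end, "parts": [(start, end)]})
    return groups


def extract_exons(all_text, nm_text, seq_text):
    """Return a list of (header, bases) for the wanted accessions."""
    wanted = set(nm_text.split())
    seqs = parse_fasta(seq_text)
    grouped = {}  # dicts keep insertion order in Python 3.7+
    for line in all_text.splitlines():
        f = line.split()
        # Skip lines without an 11th column, or whose 11th column is not /^E/.
        if len(f) < 11 or not f[10].startswith("E") or f[8] not in wanted:
            continue
        key = (f[8], f[9], f[0], f[6])  # accession, gene, sequence id, strand
        grouped.setdefault(key, []).append((int(f[3]), int(f[4])))
    records = []
    for (acc, gene, seqid, strand), intervals in grouped.items():
        # Assumption: output labels are renumbered E1, E2, ... after merging.
        for n, g in enumerate(merge_overlapping(intervals), start=1):
            coords = " ".join(f"{s} {e}" for s, e in g["parts"])
            header = f">{acc}_E{n} {gene} {seqid} {strand} {coords}"
            # Convert 1-based inclusive coordinates to a Python slice.
            records.append((header, seqs[seqid][g["start"] - 1:g["end"]]))
    return records


if __name__ == "__main__":
    for header, bases in extract_exons(ALL_TXT, NM_TXT, SEQ_TXT):
        print(header)
        print(bases)
```

With real files, the three strings could be replaced by something like `open("all.txt").read()`. Note that this loads each whole sequence into memory, which may need rethinking for chromosome-sized records.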