Chinaunix

Title: Help needed! File matching and output problem

Author: little_joe    Time: 2017-02-09 14:31
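I have two files. File 1 has the following format: each line that starts with letters is an id, and the lines below it are the numbers for that id that need to be processed.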
NP_415088.1-1
4
11
44
46
72
134
NP_415089.1-1
31
74
83

NP_415560.1-1
4
6
45
68
92
113
137

NP_415561.1-1
14
72
75
77
85
87
NP_415562.1-1
6
30
51
53
71
72
81
84
97
98
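
A minimal sketch of how File 1 could be read into a lookup table, assuming every non-numeric, non-blank line is an id and blank lines between records carry no meaning. The function name parse_positions is made up for illustration.

```python
def parse_positions(text):
    """Return {id: [positions]} from the text of File 1."""
    table = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are skipped
        if line.isdigit():
            if current is not None:
                table[current].append(int(line))
        else:
            # any non-numeric line is treated as a new id (assumption)
            current = line
            table.setdefault(current, [])
    return table


sample = """NP_415088.1-1
4
11
44
46
72
134
NP_415089.1-1
31
74
83

NP_415560.1-1
4
6
"""
positions = parse_positions(sample)
print(positions["NP_415088.1-1"])
```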

File 2:
BLASTP 2.2.29+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: /public/home/mgb226/alook/Find_AA_STOP/re_AA_STOP/readthrough/all.fa
sta
           185,250 sequences; 57,471,956 total letters

Query= NP_415088.1-1

Length=153
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

lcl|NC_000913.3_prot_YP_588440.1_550  [gene=rzoD] [protein=DLP12 ...    122   1e-35
lcl|CP011323.1_prot_SG47_0559_549  [gene=rzoD] [protein=DLP12 pro...    122   1e-35
lcl|CP011322.1_prot_SG46_0559_548  [gene=rzoD] [protein=DLP12 pro...    122   1e-35
lcl|CP006698.1_prot_N840_0565_549  [gene=rzoD] [protein=DLP12 pro...    122   1e-35
lcl|NC_002695.1_prot_NP_309651.1_1555  [gene=ECs1624] [protein=li...    118   6e-34
lcl|NC_000913.3_prot_YP_588452.1_1351  [gene=rzoR] [protein=Rac p...    115   9e-33
lcl|CP011323.1_prot_SG47_1388_1349  [gene=rzoR] [protein=Rac prop...    115   9e-33
lcl|CP011322.1_prot_SG46_1388_1348  [gene=rzoR] [protein=Rac prop...    115   9e-33
lcl|CP006698.1_prot_N840_1389_1364  [gene=rzoR] [protein=Rac prop...    115   9e-33
lcl|CP013029.1_prot_AKK22_02365_443  [gene=AKK22_02365] [protein=...  84.0    9e-21


>lcl|NC_000913.3_prot_YP_588440.1_550 [gene=rzoD] [protein=DLP12 prophage; putative lipoprotein] [protein_id=YP_588440.1]
[location=578327..578509]
Length=60

Score =   122 bits (307),  Expect = 1e-35, Method: Compositional matrix adjust.
Identities = 60/60 (100%), Positives = 60/60 (100%), Gaps = 0/60 (0%)

Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
                 MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  60

>lcl|CP011322.1_prot_SG46_0559_548 [gene=rzoD] [protein=DLP12 prophage, putative lipoprotein] [protein_id=AKF62758.1]
[location=572012..572194]
Length=60

Score =   122 bits (307),  Expect = 1e-35, Method: Compositional matrix adjust.
Identities = 60/60 (100%), Positives = 60/60 (100%), Gaps = 0/60 (0%)

Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
                 MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  60


>lcl|CP006698.1_prot_N840_0565_549 [gene=rzoD] [protein=DLP12 prophage; predicted lipoprotein] [protein_id=AGX32695.1]
[location=577376..577558]
Length=60

Score =   122 bits (307),  Expect = 1e-35, Method: Compositional matrix adjust.
Identities = 60/60 (100%), Positives = 60/60 (100%), Gaps = 0/60 (0%)

Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
                 MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  60

>lcl|NC_000913.3_prot_YP_588452.1_1351 [gene=rzoR] [protein=Rac prophage; putative lipoprotein] [protein_id=YP_588452.1]
[location=1423400..1423585]
Length=61

Score =   115 bits (287),  Expect = 9e-33, Method: Compositional matrix adjust.
Identities = 57/61 (93%), Positives = 57/61 (93%), Gaps = 0/61 (0%)

Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
                 MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPP PPAWIMQPPPDWQTPLNGIISPS   
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPPPPAWIMQPPPDWQTPLNGIISPSGND  60

Query  134  W  134
                  W
Sbjct    61   W  61
>lcl|CP011323.1_prot_SG47_1388_1349 [gene=rzoR] [protein=Rac prophage, putative lipoprotein] [protein_id=AKF67672.1]
[location=1415887..1416072]
Length=61

Score =   115 bits (287),  Expect = 9e-33, Method: Compositional matrix adjust.
Identities = 57/61 (93%), Positives = 57/61 (93%), Gaps = 0/61 (0%)

Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
                 MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPP PPAWIMQPPPDWQTPLNGIISPS   
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPPPPAWIMQPPPDWQTPLNGIISPSGND  60

Query  134  W  134
                  W
Sbjct    61   W  61
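
One possible way to cut File 2 into pieces before doing any matching, sketched in Python. It assumes the default pairwise text layout shown above, where each query starts with a "Query=" line and each alignment starts with a ">" line, and that the query id fits on a single line. The name split_blast and the shortened sample text are only for illustration.

```python
def split_blast(text):
    """Return {query_id: [alignment_lines, ...]} from pairwise BLASTP text."""
    result = {}
    query_id = None
    current = None
    for line in text.splitlines():
        if line.startswith("Query="):
            # start of a new query section
            query_id = line.split("=", 1)[1].strip()
            result.setdefault(query_id, [])
            current = None
        elif line.startswith(">") and query_id is not None:
            # start of a new alignment under the current query
            current = [line]
            result[query_id].append(current)
        elif current is not None:
            current.append(line)
    return result


# Abbreviated sample: header lines are shortened compared with the real file.
sample = """BLASTP 2.2.29+

Query= NP_415088.1-1

Length=153

>lcl|NC_000913.3_prot_YP_588440.1_550 [gene=rzoD]
Length=60

Identities = 60/60 (100%), Positives = 60/60 (100%), Gaps = 0/60 (0%)

>lcl|NC_000913.3_prot_YP_588452.1_1351 [gene=rzoR]
Length=61

Identities = 57/61 (93%), Positives = 57/61 (93%), Gaps = 0/61 (0%)
"""
blocks = split_blast(sample)
print(len(blocks["NP_415088.1-1"]))
```

The summary lines in the "Sequences producing significant alignments" table do not start with ">", so they are ignored by this split.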

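File 1 and File 2 look like the samples above; the full files are attached: 实验数据.zip (238.23 KB)
First, use each id in File 1 to find the matching entry in File 2; for example, NP_415088.1-1 matches the line Query= NP_415088.1-1 in File 2.
Then match the numbers listed under that id in File 1 against the position numbers on the Query lines of that entry in File 2, for example: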
Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
                 MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPP PPAWIMQPPPDWQTPLNGIISPS   
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPPPPAWIMQPPPDWQTPLNGIISPSGND  60

Query  134 W  134
                  W
Sbjct    61   W  61
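Here 74 is the position of the first M, and the final G on that line is position 133; each letter has its own position (the R after the M is 75, and so on). The numbers listed under the id in File 1 are: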
NP_415088.1-1
4
11
44
46
72
134
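
The position lookup described here can be sketched as follows. This assumes each Query/Sbjct line splits on whitespace into label, start, sequence, end, and that a "-" in the Query sequence is a gap that does not use up a query position (the sample has no gaps, so that part is a guess). The function name subject_letters_at is made up.

```python
def subject_letters_at(query_line, sbjct_line, wanted):
    """Map wanted Query positions to the Sbjct letter in the same column."""
    _, q_start, q_seq, _ = query_line.split()
    _, _, s_seq, _ = sbjct_line.split()
    found = {}
    pos = int(q_start)
    for q_char, s_char in zip(q_seq, s_seq):
        if q_char == "-":
            continue  # gap in the query: no query position consumed
        if pos in wanted:
            found[pos] = s_char
        pos += 1
    return found


pairs = [
    ("Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133",
     "Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPPPPAWIMQPPPDWQTPLNGIISPSGND  60"),
    ("Query  134  W  134",
     "Sbjct    61   W  61"),
]
wanted = {4, 11, 44, 46, 72, 134}

hits = {}
for query_line, sbjct_line in pairs:
    hits.update(subject_letters_at(query_line, sbjct_line, wanted))
print(hits)
```

If the Sbjct line has a gap in the matching column, this sketch would report "-" for that position; whether that should count is not stated in the question.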

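Comparing the two shows that only 134 falls on a letter, the highlighted W, so the W on the corresponding Sbjct line should be output. The Identities value shown above must also be used as a filter:
only alignments whose percentage after Identities (93% here) is greater than 90% are output; anything below 90% can be ignored.
The number of W's also needs to be counted. Here it is 1, so the output for the example above should be: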
NP_415088.1-1 W:1
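
A rough sketch of the filter and count for a single alignment, not a finished solution. It assumes the Identities line has the form "Identities = a/b (p%)", uses a strict "greater than 90" test (the question does not say what to do at exactly 90%), and handles only one alignment; whether several alignments of the same query should be added together is not specified. The names count_letters and format_result are made up.

```python
import re
from collections import Counter

IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def count_letters(alignment_lines, wanted, min_percent=90):
    """Count Sbjct letters at wanted Query positions for one alignment."""
    counts = Counter()
    query = None
    keep = False
    for line in alignment_lines:
        m = IDENT_RE.search(line)
        if m:
            keep = int(m.group(3)) > min_percent  # strict comparison (assumption)
        elif line.startswith("Query "):
            query = line.split()  # label, start, sequence, end
        elif line.startswith("Sbjct") and query and keep:
            pos, q_seq = int(query[1]), query[2]
            s_seq = line.split()[2]
            for q_char, s_char in zip(q_seq, s_seq):
                if q_char == "-":
                    continue  # gap in the query: no position consumed
                if pos in wanted:
                    counts[s_char] += 1
                pos += 1
            query = None
    return counts


def format_result(query_id, counts):
    """Build a line such as 'NP_415088.1-1 W:1'."""
    return query_id + " " + " ".join(f"{k}:{v}" for k, v in sorted(counts.items()))


alignment = """Score =   115 bits (287),  Expect = 9e-33, Method: Compositional matrix adjust.
Identities = 57/61 (93%), Positives = 57/61 (93%), Gaps = 0/61 (0%)

Query  74   MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPRPPAWIMQPPPDWQTPLNGIISPSERG  133
            MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPP PPAWIMQPPPDWQTPLNGIISPS
Sbjct    1    MRKLKMMLCVMMLPLVVVGCTSKQSVSQCVKPPPPPAWIMQPPPDWQTPLNGIISPSGND  60

Query  134  W  134
            W
Sbjct    61   W  61
""".splitlines()

wanted = {4, 11, 44, 46, 72, 134}
counts = count_letters(alignment, wanted)
if counts:
    print(format_result("NP_415088.1-1", counts))
```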









