- 论坛徽章:
- 0
|
两道题,问了N多人,没结果,再问一下看看
原帖由 "THEBEST" 发表:
我不太明白你的意思,我不懂AWK,我用C++写的在处理上没有优化,方法也很臭.所以我认为写的很不好.而且还不正确(有BUG的).
引用:
下载 10000 个字符的数据文件.(为显示我作了fold 处理, 你在把它恢复成单行), 然后查找所有不重叠的串长为 10 - 30 的重复串, 计算一下时间.
为什么是10-30?大于10就是了.(题目如此要求的),所以不重叠?我处理的是考虑了重叠方式,如果不重叠会处理的更少情况,所以速度也有所改善.你顺便处理一下10*10000个源字符串并最小长度为10(没有最大长度)的重复字符串,看看速度.
你可以直接运行我的程序,改一下存放文件的C:\\result.txt就可以了.我现在没有LINUX/UNIX,执行不了你的程序.
你的程序完全正确吗?我觉得很难验证.icon_smile.gif
1. 经过与出题人交流, 此题有如下条件:
1) 查找最长重复匹配串,先到先得
2) 匹配串不能重叠
3) 每个匹配串必须包括全部 ACGT 四个字符。(这是才是有效的基因片断), 所以 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 不算重复匹配串
真实的 DNA 文件, 重复匹配串的最大长度一般小于 100。
比如, 我用的 data1 (10000 字符, 正序重复匹配串的最大长度是 46, 逆序是 15)
data2 (100000 字符, 正序重复匹配串的最大长度是 43)
所以我希望你能用我的 data1 来在你的机器下测一下你的程序所花的时间。(因我没有编译器)
我的程序不能说完全正确 (最新版本), 但到目前为止没有发现问题。
前面我说的测试 10 - 30 的长度, 不用理会, 测试 >;= 10 的就可以了.
2. data1 (10000 字符) 的结果
CPU: 1.2G P3, Memory 384M, OS UWIN 3.2, AWK
正序匹配测试: 1 min 45 sec
- Repeat: TGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCC, Size: 46, Start Positions: 2679,2992
- Repeat: TTGGCTGGGCACAGTGGCTCACGCCTGTAATCC, Size: 33, Start Positions: 1086,5893
- Repeat: TGGCCAACATGGTGAAACCCCGTCTCTA, Size: 28, Start Positions: 5983,8614
- Repeat: CCTGTAATCCCAGCACTTTGGGAGGC, Size: 26, Start Positions: 1613,2948
- Repeat: CGGGCATGGTGGCTCACGCTTGTAAT, Size: 26, Start Positions: 2617,8526
- Repeat: CAGCACTTTGGGAGGCTGAGGCAGG, Size: 25, Start Positions: 5926,8554
- Repeat: GAACTCCTGACCTCAGGTGATCC, Size: 23, Start Positions: 3913,9062
- Repeat: CGTGCCTGTAATCCCAGCTACT, Size: 22, Start Positions: 1241,8671
- Repeat: TGAGGCAGGAGAATTGCTTGAA, Size: 22, Start Positions: 1271,6075
- Repeat: GGGAGGCTGAGGTGGGTGGAT, Size: 21, Start Positions: 329,2654
- Repeat: GAGGTTGTAGTGAGCCGAGAT, Size: 21, Start Positions: 1805,2832
- Repeat: GGAGGTGGAGGTTGCAGTGA, Size: 20, Start Positions: 502,8727
- Repeat: ACTCCAGCCTGGGCGACAGA, Size: 20, Start Positions: 541,1336
- Repeat: GTGCCACTGCACTCCAGCCT, Size: 20, Start Positions: 2854,6130
- Repeat: CTAAAAATACAAAAATTAG, Size: 19, Start Positions: 1708,8642
- Repeat: AGCTACTTGGGAGGCTGAG, Size: 19, Start Positions: 2784,3093
- Repeat: AAAAATACAAAAATTAGCC, Size: 19, Start Positions: 3046,6013
- Repeat: AGGAGAATCACTTGAACC, Size: 18, Start Positions: 1778,8707
- Repeat: CCCAGGCTGGAGTGCAAT, Size: 18, Start Positions: 3732,8883
- Repeat: AAAGTGCTGGGATTACAG, Size: 18, Start Positions: 4558,9101
- Repeat: ACTGCACTCCAGCCTGG, Size: 17, Start Positions: 1832,8753
- Repeat: TCGCTTGAACCCGGGAG, Size: 17, Start Positions: 2812,3121
- Repeat: TGGAGTTTTGCTCTTGT, Size: 17, Start Positions: 3713,8864
- Repeat: GCCTTGGCCTCCCAAA, Size: 16, Start Positions: 1460,3940
- Repeat: TATTTTTAGTAGAGAC, Size: 16, Start Positions: 4473,9015
- Repeat: CCACCTCGCCTGGCT, Size: 15, Start Positions: 208,8993
- Repeat: TAAACAAGGACTTTT, Size: 15, Start Positions: 1510,1556
- Repeat: GGGTTTCTCCATGTT, Size: 15, Start Positions: 4490,9032
- Repeat: AGACTCCATCTCAA, Size: 14, Start Positions: 2887,8781
- Repeat: CTGCCTCAGCCTCC, Size: 14, Start Positions: 3802,8953
- Repeat: GATTACAGGCATGC, Size: 14, Start Positions: 3827,8978
- Repeat: TGTGGTGGTGCA, Size: 12, Start Positions: 436,1732
- Repeat: ACAATGCTGTAA, Size: 12, Start Positions: 847,9825
- Repeat: ACCCTGTCTCTA, Size: 12, Start Positions: 1194,5426
- Repeat: TGAGGTCAGGAG, Size: 12, Start Positions: 5958,8588
- Repeat: GCCTGTAATCC, Size: 11, Start Positions: 309,3081
- Repeat: CAGCCTGGCCA, Size: 11, Start Positions: 374,1173
- Repeat: GGGAGGCTGAG, Size: 11, Start Positions: 469,1128
- Repeat: AGGCTGGTCTC, Size: 11, Start Positions: 1436,9051
- Repeat: GTGTTTCTAAC, Size: 11, Start Positions: 2256,7173
- Repeat: ATGAACAAGGG, Size: 11, Start Positions: 7604,9373
- Repeat: AAGCAATTCTC, Size: 11, Start Positions: 8435,8942
- Repeat: TTCTTTTTGA, Size: 10, Start Positions: 65,4308
- Repeat: CTGTGAATAT, Size: 10, Start Positions: 260,6206
- Repeat: CCCAGCTACT, Size: 10, Start Positions: 458,6057
- Repeat: GATTTTCTAT, Size: 10, Start Positions: 633,9283
- Repeat: GCTGTCATTT, Size: 10, Start Positions: 652,5261
- Repeat: ATTAGTTTTC, Size: 10, Start Positions: 738,7084
- Repeat: AAGTTTCAAG, Size: 10, Start Positions: 928,5153
- Repeat: TTAGTTCTCA, Size: 10, Start Positions: 1955,6839
- Repeat: TCAGCCAGAT, Size: 10, Start Positions: 1988,9888
- Repeat: ATTTGCTTTT, Size: 10, Start Positions: 2246,9636
- Repeat: TGAGCTCTTA, Size: 10, Start Positions: 3565,9590
- Repeat: GCCCACATTA, Size: 10, Start Positions: 7007,8318
-
复制代码
逆序匹配测试: 1 min 7 sec
- Reverted Repeat: AGTTTTTCTTTTTTT, Size: 15, Start Positions: 674,4303
- Reverted Repeat: TCATTCATGGTA, Size: 12, Start Positions: 4999,9542
- Reverted Repeat: TTCTCAGACTAA, Size: 12, Start Positions: 6843,7896
- Reverted Repeat: AGGTGGGCGGA, Size: 11, Start Positions: 1641,2976
- Reverted Repeat: GTCTCTTAAAA, Size: 11, Start Positions: 2591,8498
- Reverted Repeat: TTGAGGTGACA, Size: 11, Start Positions: 5358,5856
- Reverted Repeat: ACAGAATAAAA, Size: 11, Start Positions: 6222,6567
- Reverted Repeat: ACTAGAGCTTG, Size: 11, Start Positions: 7340,8599
- Reverted Repeat: GTTTTCTTAA, Size: 10, Start Positions: 230,8448
- Reverted Repeat: TTTACTTTAG, Size: 10, Start Positions: 611,1360
- Reverted Repeat: TACAAAGAAC, Size: 10, Start Positions: 871,2576
- Reverted Repeat: GTTCTCAACT, Size: 10, Start Positions: 882,2497
- Reverted Repeat: GTGAAACCCT, Size: 10, Start Positions: 1189,1469,4554
- Reverted Repeat: AAGGACTTTT, Size: 10, Start Positions: 1515,2225
- Reverted Repeat: CTTTTTCTGA, Size: 10, Start Positions: 1545,2382
- Reverted Repeat: CAATTTGATC, Size: 10, Start Positions: 2094,4775
- Reverted Repeat: TCAAAAAAGA, Size: 10, Start Positions: 2897,5675
- Reverted Repeat: GTGACAACAG, Size: 10, Start Positions: 3172,4730
- Reverted Repeat: ATTTAATCGT, Size: 10, Start Positions: 3597,9181
- Reverted Repeat: AATATCTTTG, Size: 10, Start Positions: 5550,7952
- Reverted Repeat: CCTGGGAAGG, Size: 10, Start Positions: 5563,9524
- Reverted Repeat: TCTCAAATAG, Size: 10, Start Positions: 5620,9293
- Reverted Repeat: AGTATTATCA, Size: 10, Start Positions: 5650,7393
- Reverted Repeat: CAATAAATGG, Size: 10, Start Positions: 8800,9810
- Reverted Repeat: TTGTACGTAT, Size: 10, Start Positions: 9563,9578
复制代码
3. data2 (100000 字符) 的结果
当长度增加到 100000 时, 速度变的很慢, 我只测了正序, 而且没有完全
完成, 但可以估计时间。
由于存在的重复匹配串的最大长度很小, 我采用二分查找首先找到重复匹配串的最大长度
然后再循环至最小长度。 下面结果前面的
9 50002 0
9 25005 0
.
就是显示查找的过程。 找到重复匹配串的最大长度为 43 用时 67 min.
然后, 从 43 循环至 10. 每个长度用时约 5 min
总时间大约为 67 + 34 * 5 = 237 min 也就是大约 4 小时。
- 9 50002 0
- 9 25005 0
- 9 12507 0
- 9 6258 0
- 9 3133 0
- 9 1571 0
- 9 790 0
- 9 399 0
- 9 204 0
- 9 106 0
- 9 57 1
- 33 57 0
- 33 45 1
- 39 45 1
- 42 45 1
- 43 45 0
- Repeat: ATTGGATCATTGATCTAATCCAACCACATAACTATAATTACAG, Size: 43, Start Positions: 25161,25204
- Repeat: TCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCG, Size: 41, Start Positions: 26357,75092
- Repeat: AGTCTTGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCG, Size: 39, Start Positions: 31148,69805
- .
- .
复制代码 |
|