|
Two problems; I have asked a lot of people without getting an answer, so I am asking once more
Rewritten to fix the overlap problem:
- #!/bin/awk -f
- #
- # Finds repeated pieces in nucleotide sequences, one sequence per input line.
- #
- # Design: lighspeed
- # Date: Dec. 14, 2004
- #
- # Repeat Match usage:                          $0 datafile
- # Reverted (i.e. reversed) Repeat Match usage: $0 -v r=1 datafile
- #
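- # Return 1 if the piece starting at p with length l overlaps any recorded match.
- # e, a, i and s are declared as locals; otherwise the loop would clobber the
- # global i that the main block relies on after calling this function.
- function is_overlap(p, l,    e, a, i, s) {
-     e = p + l - 1
-     for (i in record) {
-         s = i + 0    # array keys are strings; force a numeric comparison
-         a = s + record[i] - 1
-         if (s <= e && p <= a)    # closed intervals [s,a] and [p,e] intersect
-             return 1
-     }
-     return 0
- }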
- {
-     L = length($0)
-     STR_MIN = 10
-     # STR_MAX = int(L / 2)
-     STR_MAX = 30
-     split("", record)    # forget the matches recorded for the previous line
-     if ( r == 1 )
-         print "---------------Reverted Repeat Match Line# "NR" -----------------\n"
-     else
-         print "------------------Repeat Match Line# "NR" --------------------\n"
-     # Try the longest pieces first, so shorter repeats inside them are skipped.
-     for ( Str_Len = STR_MAX; Str_Len >= STR_MIN; Str_Len-- ) {
-         for ( Position = 1; Position <= L - 2 * Str_Len + 1; Position++ ) {
-             if ( is_overlap(Position, Str_Len) == 1 )
-                 continue
-             count = 0
-             pos = Position
-             offset = Position + Str_Len - 1    # last position before "right"
-             left = substr($0, Position, Str_Len)
-             # Skip pieces that do not contain all four bases.
-             if ( index(left,"A") == 0 || index(left,"C") == 0 || index(left,"G") == 0 || index(left,"T") == 0 )
-                 continue
-             right = substr($0, Position + Str_Len)
-             if ( r == 1 ) {
-                 # Search for the reversed piece; keep the original for the report.
-                 old_left = left
-                 rev_left = ""
-                 for ( i = length(left); i >= 1; i-- )
-                     rev_left = rev_left substr(left, i, 1)
-                 left = rev_left
-             }
-             while ( Str_Len <= length(right) ) {
-                 i = index(right, left)
-                 if ( i > 0 ) {
-                     j = offset + i    # absolute start of this occurrence
-                     if ( is_overlap(j, Str_Len) == 0 ) {
-                         count++
-                         record[Position] = Str_Len
-                         record[j] = Str_Len
-                         pos = pos","j
-                     }
-                     right = substr(right, i + Str_Len)
-                     offset += (i + Str_Len - 1)
-                 }
-                 else
-                     break
-             }
-             if ( count > 0 ) {
-                 match_number[Str_Len]++
-                 if ( r == 1 ) {
-                     left = old_left
-                     print "Reverted Repeat: " left",", "Size: "Str_Len",", "Start Positions: "pos
-                 }
-                 else
-                     print "Repeat: " left",", "Size: "Str_Len",", "Start Positions: "pos
-             }
-         }
-     }
- }
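The overlap check in `is_overlap` treats every recorded match as a closed interval of positions. As a rough sketch of that idea only: two intervals overlap when each one starts no later than the other ends. The helper name `overlaps` and the wrapper `overlap_demo` are hypothetical and not part of the script; the first pair of numbers is taken from the data1 result (positions 11 and 46, size 27), the second pair is made up.

```sh
# Hypothetical demo, not part of the original script.
overlap_demo() {
    awk '
    # Closed intervals [s1, s1+len1-1] and [s2, s2+len2-1] overlap
    # when each one starts no later than the other one ends.
    function overlaps(s1, len1, s2, len2) {
        return (s1 <= s2 + len2 - 1) && (s2 <= s1 + len1 - 1)
    }
    BEGIN {
        print overlaps(11, 27, 46, 27)   # 11..37 against 46..72
        print overlaps(11, 27, 30, 10)   # 11..37 against 30..39
    }'
}
overlap_demo
```

This prints 0 for the first pair and 1 for the second, which is why both copies of the 27-character repeat in data1 can be recorded without blocking each other.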
Testing with your file:
- # cat data1
- ACGTGCGATCACAGGCCGTGCAGAGACTGACGATCAGACGACGTGACAGGCCGTGCAGAGACTGACGATCAG
- # ./1 data1
- ------------------Repeat Match Line# 1 --------------------
- Repeat: ACAGGCCGTGCAGAGACTGACGATCAG, Size: 27, Start Positions: 11,46
Repeat test (the 10000-character file from earlier, STR_MIN=10, STR_MAX=30)
- # time ./1 datafile > report1
- real 1m45.46s
- user 1m24.95s
- sys 0m0.03s
- # cat report1
- ------------------Repeat Match Line# 1 --------------------
- Repeat: TTGGCTGGGCACAGTGGCTCACGCCTGTAA, Size: 30, Start Positions: 1086,5893
- Repeat: GGAGTTCAAGACCAGCCTGGCCAACATGGT, Size: 30, Start Positions: 1161,2687
- Repeat: TGGCCAACATGGTGAAACCCCGTCTCTA, Size: 28, Start Positions: 5983,8614
- Repeat: CCTGTAATCCCAGCACTTTGGGAGGC, Size: 26, Start Positions: 1613,2948
- Repeat: CGGGCATGGTGGCTCACGCTTGTAAT, Size: 26, Start Positions: 2617,8526
- Repeat: CCAGCACTTTGGGAGGCTGAGGCAGG, Size: 26, Start Positions: 5925,8553
- Repeat: GAACTCCTGACCTCAGGTGATCC, Size: 23, Start Positions: 3913,9062
- Repeat: CCTAGCACTTTGGGAGGCTGAG, Size: 22, Start Positions: 1117,2643
- Repeat: CGTGCCTGTAATCCCAGCTACT, Size: 22, Start Positions: 1241,8671
- Repeat: TGAGGCAGGAGAATTGCTTGAA, Size: 22, Start Positions: 1271,6075
- Repeat: GAGGTTGTAGTGAGCCGAGAT, Size: 21, Start Positions: 1805,2832
- Repeat: GGAGGTGGAGGTTGCAGTGA, Size: 20, Start Positions: 502,8727
- Repeat: ACTCCAGCCTGGGCGACAGA, Size: 20, Start Positions: 541,1336
- Repeat: GTGCCACTGCACTCCAGCCT, Size: 20, Start Positions: 2854,6130
- Repeat: CTAAAAATACAAAAATTAG, Size: 19, Start Positions: 1708,8642
- Repeat: AGCTACTTGGGAGGCTGAG, Size: 19, Start Positions: 2784,3093
- Repeat: AAAAATACAAAAATTAGCC, Size: 19, Start Positions: 3046,6013
- Repeat: AGGAGAATCACTTGAACC, Size: 18, Start Positions: 1778,8707
- Repeat: CCCAGGCTGGAGTGCAAT, Size: 18, Start Positions: 3732,8883
- Repeat: AAAGTGCTGGGATTACAG, Size: 18, Start Positions: 4558,9101
- Repeat: ACTGCACTCCAGCCTGG, Size: 17, Start Positions: 1832,8753
- Repeat: TGGATCACTTGAGGTCA, Size: 17, Start Positions: 2670,8579
- Repeat: TCGCTTGAACCCGGGAG, Size: 17, Start Positions: 2812,3121
- Repeat: TGGAGTTTTGCTCTTGT, Size: 17, Start Positions: 3713,8864
- Repeat: GCCTTGGCCTCCCAAA, Size: 16, Start Positions: 1460,3940
- Repeat: TATTTTTAGTAGAGAC, Size: 16, Start Positions: 4473,9015
- Repeat: CCACCTCGCCTGGCT, Size: 15, Start Positions: 208,8993
- Repeat: TGGGGAGGCTGAGGT, Size: 15, Start Positions: 327,467
- Repeat: TAAACAAGGACTTTT, Size: 15, Start Positions: 1510,1556
- Repeat: GGGTTTCTCCATGTT, Size: 15, Start Positions: 4490,9032
- Repeat: GAAACCCCGTCTCT, Size: 14, Start Positions: 1693,2717
- Repeat: AGACTCCATCTCAA, Size: 14, Start Positions: 2887,8781
- Repeat: CTGCCTCAGCCTCC, Size: 14, Start Positions: 3802,8953
- Repeat: GATTACAGGCATGC, Size: 14, Start Positions: 3827,8978
- Repeat: TGTGGTGGTGCA, Size: 12, Start Positions: 436,1732
- Repeat: ACAATGCTGTAA, Size: 12, Start Positions: 847,9825
- Repeat: ACCCTGTCTCTA, Size: 12, Start Positions: 1194,5426
- Repeat: TGAGGTCAGGAG, Size: 12, Start Positions: 2992,5958
- Repeat: GCCTGTAATCC, Size: 11, Start Positions: 309,3081
- Repeat: AGGCTGGTCTC, Size: 11, Start Positions: 1436,9051
- Repeat: GTGTTTCTAAC, Size: 11, Start Positions: 2256,7173
- Repeat: ATGAACAAGGG, Size: 11, Start Positions: 7604,9373
- Repeat: AAGCAATTCTC, Size: 11, Start Positions: 8435,8942
- Repeat: TTCTTTTTGA, Size: 10, Start Positions: 65,4308
- Repeat: CTGTGAATAT, Size: 10, Start Positions: 260,6206
- Repeat: GATTTTCTAT, Size: 10, Start Positions: 633,9283
- Repeat: GCTGTCATTT, Size: 10, Start Positions: 652,5261
- Repeat: ATTAGTTTTC, Size: 10, Start Positions: 738,7084
- Repeat: AAGTTTCAAG, Size: 10, Start Positions: 928,5153
- Repeat: TTAGTTCTCA, Size: 10, Start Positions: 1955,6839
- Repeat: TCAGCCAGAT, Size: 10, Start Positions: 1988,9888
- Repeat: ATTTGCTTTT, Size: 10, Start Positions: 2246,9636
- Repeat: TGAGCTCTTA, Size: 10, Start Positions: 3565,9590
- Repeat: GCCCACATTA, Size: 10, Start Positions: 7007,8318
Reverted Repeat test (the 10000-character file from earlier, STR_MIN=10, STR_MAX=30)
Note: the syntax is ./1 -v r=1 datafile
- # time ./1 -v r=1 datafile > report2
- real 1m7.45s
- user 1m2.28s
- sys 0m0.10s
- # cat report2
- ---------------Reverted Repeat Match Line# 1 -----------------
- Reverted Repeat: AGTTTTTCTTTTTTT, Size: 15, Start Positions: 674,4303
- Reverted Repeat: TCATTCATGGTA, Size: 12, Start Positions: 4999,9542
- Reverted Repeat: TTCTCAGACTAA, Size: 12, Start Positions: 6843,7896
- Reverted Repeat: AGGTGGGCGGA, Size: 11, Start Positions: 1641,2976
- Reverted Repeat: GTCTCTTAAAA, Size: 11, Start Positions: 2591,8498
- Reverted Repeat: TTGAGGTGACA, Size: 11, Start Positions: 5358,5856
- Reverted Repeat: ACAGAATAAAA, Size: 11, Start Positions: 6222,6567
- Reverted Repeat: ACTAGAGCTTG, Size: 11, Start Positions: 7340,8599
- Reverted Repeat: GTTTTCTTAA, Size: 10, Start Positions: 230,8448
- Reverted Repeat: TTTACTTTAG, Size: 10, Start Positions: 611,1360
- Reverted Repeat: TACAAAGAAC, Size: 10, Start Positions: 871,2576
- Reverted Repeat: GTTCTCAACT, Size: 10, Start Positions: 882,2497
- Reverted Repeat: GTGAAACCCT, Size: 10, Start Positions: 1189,1469,4554
- Reverted Repeat: AAGGACTTTT, Size: 10, Start Positions: 1515,2225
- Reverted Repeat: CTTTTTCTGA, Size: 10, Start Positions: 1545,2382
- Reverted Repeat: CAATTTGATC, Size: 10, Start Positions: 2094,4775
- Reverted Repeat: TCAAAAAAGA, Size: 10, Start Positions: 2897,5675
- Reverted Repeat: GTGACAACAG, Size: 10, Start Positions: 3172,4730
- Reverted Repeat: ATTTAATCGT, Size: 10, Start Positions: 3597,9181
- Reverted Repeat: AATATCTTTG, Size: 10, Start Positions: 5550,7952
- Reverted Repeat: CCTGGGAAGG, Size: 10, Start Positions: 5563,9524
- Reverted Repeat: TCTCAAATAG, Size: 10, Start Positions: 5620,9293
- Reverted Repeat: AGTATTATCA, Size: 10, Start Positions: 5650,7393
- Reverted Repeat: CAATAAATGG, Size: 10, Start Positions: 8800,9810
- Reverted Repeat: TTGTACGTAT, Size: 10, Start Positions: 9563,9578
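With `-v r=1` the script searches for the plain reversal of each piece (the characters in reverse order, not the reverse complement) and reports the start of the reversed copy. A small sketch of that step follows, assuming a made-up 12-character sequence; the names `reverse` and `reverse_demo` are hypothetical and do not appear in the script.

```sh
# Hypothetical demo, not part of the original script.
reverse_demo() {
    awk '
    # Build the reversed string one character at a time (out and k are locals).
    function reverse(s,    out, k) {
        for (k = length(s); k >= 1; k--)
            out = out substr(s, k, 1)
        return out
    }
    BEGIN {
        seq  = "ACGGTAATGGCA"       # made-up sequence for illustration
        left = substr(seq, 1, 5)    # the piece at position 1: ACGGT
        rev  = reverse(left)        # its reversal: TGGCA
        rest = substr(seq, 6)       # everything after the piece
        # index() is relative to rest, so add the 5 characters before it.
        print rev, 5 + index(rest, rev)
    }'
}
reverse_demo
```

It prints `TGGCA 8`: the reversed copy starts at position 8, which is the same offset arithmetic the script uses for `j`.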
|
|