- 论坛徽章:
- 0
|
两道题,问了N多人,没结果,再问一下看看
注意: 下面的例子中的 STR_MAX 及 STR_MIN 可根据需要设定。
- #!/bin/awk -f
- #
- # A script can be used to check any repeat pieces of nucleotide sequences.
- #
- # Design: lighspeed
- # Date: Dec. 12, 2004
- #
- # Usage:: $0 datafile
- #
- # Variables:
- #
- # left - a repeat string to be matched
- # right - the right side string used to match any left in it
- # rev_left - reverse string of left
- # rev_right - the right side string used to match any rev_left in it
- # flag[position] - an array element which will be set if the string in the position is matched
- #
- {
- L=length($0)
- STR_MIN=10
- # STR_MAX=int(L / 2)
- STR_MAX=20
- print "------------------- Line# "NR" --------------------\n"
- for ( Str_Len=STR_MIN; Str_Len <= STR_MAX; Str_Len ++ ) {
- for ( i in flag )
- delete flag[i]
- for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {
- if ( Position in flag )
- continue
- count=0
- pos=Position
- offset=Position
- rev_offset=offset
- rev_count=0
- rev_pos=""
- rev_left=""
- left=substr($0,Position,Str_Len)
- if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
- continue
- right=substr($0, Position + 1)
- for ( i=length(left); i>=1; i-- ) {
- rev_left=rev_left""substr(left,i,1)
- }
- rev_right=right
- while ( Str_Len <= length(right) ) {
- i=index(right,left)
- if ( i > 0 ) {
- count ++
- j=offset + i
- pos=pos";"j
- flag[j]=1
- right=substr(right, i + 1)
- offset+=i
- }
- else
- break
- }
- while ( Str_Len <= length(rev_right) ) {
- i=index(rev_right,rev_left)
- if ( i > 0 ) {
- rev_count ++
- j=rev_offset + i
- if ( rev_pos == "" ) {
- rev_pos=j
- }
- else
- rev_pos=rev_pos";"j
- flag[j]=1
- rev_right=substr(rev_right, i + 1)
- rev_offset+=i
- }
- else
- break
- }
- if (count > 0 )
- match_number[Str_Len] ++
- if (rev_count > 0 )
- rev_number[Str_Len] ++
- if (count > 0 || rev_count > 0) {
- print left, "Length="Str_Len, "Position="pos, (rev_count >0) ? "Rev_Position="rev_pos : ""
- }
-
- }
- }
- print ""
- print "Summary of Line# "NR
- print "------------------"
- for ( i in match_number ) {
- print "String length :: " i " Matched Strings :: " match_number[i]
- delete match_number[i]
- }
- print ""
- for ( i in rev_number ) {
- print "String length :: " i " Matched Reverse Strings :: " rev_number[i]
- delete rev_number[i]
- }
- }
-
复制代码
数据文件来自 DNA NCBI35 的片段。 处理成 10000 个字符的单行文件。
运行:
# time ./1 datafile > report
real 1m14.81s
user 0m50.65s
sys 0m0.13s
# wc -l report
3355 report
下面是 report 文件中的部分内容:
- # cat report
- ------------------- Line# 1 --------------------
- TAATCAACTA Length=10 Position=10 Rev_Position=12
- TTTATTACTA Length=10 Position=51;9703
- GGAGGCTGAG Length=10 Position=330;470;1129;1265;1633;2655;2793;3102;5936;8564
- AAAAAAAAAA Length=10 Position=1871;2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762 Rev_Position=2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762
- GTAGTGAGCCGA Length=12 Position=1811;2838
- AATAAATAAATA Length=12 Position=1889;1893 Rev_Position=1890;1894
- CCACTGCACTCCA Length=13 Position=534;1830;2857;6133
- GTTTTTCTTTTTT Length=13 Position=675 Rev_Position=4304
- GTAATCCCAGCTAC Length=14 Position=1248;2776;8678
- CCTGTAATCCTAGCA Length=15 Position=310;1109
- GGATCACTTGAGGTCA Length=16 Position=2671;8580
- CGGGCATGGTGGCTCAC Length=17 Position=2617;8526
- CACTTGAGGTCAGGAGTT Length=18 Position=2675;8584
- CCAGCCTGGCCAACATGGT Length=19 Position=1172;2698;3011
- CTTTGGGAGGCTGAGGCAGG Length=20 Position=5931;8559
- TTTTTTTTTTTTTTTTTTTT Length=20 Position=8841;8842 Rev_Position=8842
- Summary of Line# 1
- ------------------
- String length :: 10 Matched Strings :: 541
- String length :: 11 Matched Strings :: 438
- String length :: 12 Matched Strings :: 372
- String length :: 13 Matched Strings :: 329
- String length :: 14 Matched Strings :: 293
- String length :: 15 Matched Strings :: 258
- String length :: 16 Matched Strings :: 226
- String length :: 17 Matched Strings :: 200
- String length :: 18 Matched Strings :: 171
- String length :: 19 Matched Strings :: 146
- String length :: 20 Matched Strings :: 124
- String length :: 10 Matched Reverse Strings :: 142
- String length :: 11 Matched Reverse Strings :: 69
- String length :: 12 Matched Reverse Strings :: 40
- String length :: 13 Matched Reverse Strings :: 25
- String length :: 14 Matched Reverse Strings :: 15
- String length :: 15 Matched Reverse Strings :: 7
- String length :: 16 Matched Reverse Strings :: 5
- String length :: 17 Matched Reverse Strings :: 2
- String length :: 18 Matched Reverse Strings :: 1
- String length :: 19 Matched Reverse Strings :: 1
- String length :: 20 Matched Reverse Strings :: 1
复制代码 |
|